From patchwork Thu Sep 26 02:07:10 2019
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 11161797
From: Boaz Harrosh
To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox, Dan Williams
Subject: [PATCH 01/16] fs: Add the ZUF filesystem to the build + License
Date: Thu, 26 Sep 2019 05:07:10 +0300
Message-Id: <20190926020725.19601-2-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>
References: <20190926020725.19601-1-boazh@netapp.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

This adds the ZUF filesystem-in-user-mode module to the fs/ build system.
Also added:
* fs/zuf/Kconfig
* fs/zuf/module.c - contains the LICENSE of the zuf code base
* fs/zuf/Makefile - a rather empty Makefile, with only module.c above

I add fs/zuf/Makefile here to demonstrate that at every stage of the
patch-set the code still compiles and there are no external references
outside of the code already submitted. Of course only at the very last
patch do we have a working ZUF feeder.

[LICENSE]
zuf.ko is a GPLv2-licensed project. However, the ZUS user-mode Server
is a BSD-3-Clause-licensed project. Therefore you will see that:
	zus_api.h md_def.h md.h t2.h
are common files with the ZUS project, and are separately dual-licensed
as: GPL-2.0 WITH Linux-syscall-note or BSD-3-Clause. Any code
contributor to these headers should note that her/his code, to these
files only, is dual-licensed. This is for the obvious reason that these
headers define the API between the Kernel and the user-mode Server.

Signed-off-by: Boaz Harrosh
---
 fs/Kconfig       |  1 +
 fs/Makefile      |  1 +
 fs/zuf/Kconfig   | 24 ++++++++++++
 fs/zuf/Makefile  | 14 +++++++
 fs/zuf/module.c  | 28 ++++++++++++++
 fs/zuf/zus_api.h | 96 ++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 164 insertions(+)
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/zus_api.h

diff --git a/fs/Kconfig b/fs/Kconfig
index bfb1c6095c7a..452244733bb5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -261,6 +261,7 @@
 source "fs/romfs/Kconfig"
 source "fs/pstore/Kconfig"
 source "fs/sysv/Kconfig"
 source "fs/ufs/Kconfig"
+source "fs/zuf/Kconfig"
 
 endif # MISC_FILESYSTEMS

diff --git a/fs/Makefile b/fs/Makefile
index d60089fd689b..178e27ddd605 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -130,3 +130,4 @@ obj-$(CONFIG_F2FS_FS)		+= f2fs/
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-$(CONFIG_ZUFS_FS)		+= zuf/

diff --git a/fs/zuf/Kconfig b/fs/zuf/Kconfig
new file mode 100644
index 000000000000..58288f4245c2
--- /dev/null
+++ b/fs/zuf/Kconfig
@@ -0,0 +1,24 @@
+config ZUFS_FS
+	tristate "ZUF - Zero-copy User-mode Feeder"
+	depends on BLOCK
+	depends on ZONE_DEVICE
+	select CRC16
+	select MEMCG
+	help
+	  ZUFS Kernel part.
+	  To enable, say Y here.
+
+	  To compile this as a module, choose M here: the module will be
+	  called zuf.ko
+
+config ZUF_DEBUG
+	bool "ZUF: enable debug subsystems use"
+	depends on ZUFS_FS
+	default n
+	help
+	  INTERNAL QA USE ONLY!!! DO NOT USE!!!
+	  Please leave as N here.
+
+	  This option adds some extra code that helps in QA testing of
+	  the code. It may slow operation and produce bigger code.
diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
new file mode 100644
index 000000000000..452cec55f34d
--- /dev/null
+++ b/fs/zuf/Makefile
@@ -0,0 +1,14 @@
+#
+# ZUF: Zero-copy User-mode Feeder
+#
+# Copyright (c) 2018 NetApp Inc. All rights reserved.
+#
+# ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+#
+# Makefile for the Linux zufs Kernel Feeder.
+#
+
+obj-$(CONFIG_ZUFS_FS) += zuf.o
+
+# Main FS
+zuf-y += module.o
diff --git a/fs/zuf/module.c b/fs/zuf/module.c
new file mode 100644
index 000000000000..523633c1bf9d
--- /dev/null
+++ b/fs/zuf/module.c
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * zuf - Zero-copy User-mode Feeder
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#include <linux/module.h>
+
+#include "zus_api.h"
+
+MODULE_AUTHOR("Boaz Harrosh <boazh@netapp.com>");
+MODULE_AUTHOR("Sagi Manole");
+MODULE_DESCRIPTION("Zero-copy User-mode Feeder");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(__stringify(ZUFS_MAJOR_VERSION) "."
+	       __stringify(ZUFS_MINOR_VERSION));
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
new file mode 100644
index 000000000000..069153fc0b96
--- /dev/null
+++ b/fs/zuf/zus_api.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note or BSD-3-Clause */
+/*
+ * zufs_api.h:
+ * ZUFS (Zero-copy User-mode File System) is:
+ *	zuf (Zero-copy User-mode Feeder (Kernel))
+ *	  +
+ *	zus (Zero-copy User-mode Server (daemon))
+ *
+ * This file defines the API between the open source FS Server and the
+ * Kernel module.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * Authors:
+ *	Boaz Harrosh <boazh@netapp.com>
+ *	Sagi Manole
+ */
+#ifndef _LINUX_ZUFS_API_H
+#define _LINUX_ZUFS_API_H
+
+#include
+#include
+#include
+#include
+
+#ifdef __cplusplus
+#define NAMELESS(X) X
+#else
+#define NAMELESS(X)
+#endif
+
+/*
+ * Version rules:
+ * This is the zus-to-zuf API version, not the version of the
+ * Filesystem's on-disk structures. Those are left to the FS plugin to
+ * supply and check. Specifically, this covers any of the API
+ * structures and constants found in this file.
+ * If changes are made in a way backward compatible with old
+ * user-space, MINOR is incremented; else MAJOR is incremented.
+ *
+ * It is up to the Server to decide if it wants to run with this Kernel
+ * or not. The version is only passively reported.
+ */
+#define ZUFS_MINORS_PER_MAJOR	1024
+#define ZUFS_MAJOR_VERSION	1
+#define ZUFS_MINOR_VERSION	0
+
+/* Kernel versus User space compatibility definitions */
+#ifdef __KERNEL__
+
+#include
+
+#else /* !__KERNEL__ */
+
+/* verify the statfs64 definition is included */
+#if !defined(__USE_LARGEFILE64) && defined(_SYS_STATFS_H)
+#error "include of 'sys/statfs.h' must appear after 'zus_api.h'"
+#else
+#define __USE_LARGEFILE64 1
+#endif
+
+#include
+
+#include
+
+#define u8 uint8_t
+#define umode_t uint16_t
+
+#define PAGE_SHIFT 12
+#define PAGE_SIZE (1 << PAGE_SHIFT)
+
+#ifndef ALIGN
+#define ALIGN(x, a)		ALIGN_MASK(x, (typeof(x))(a) - 1)
+#define ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
+#endif

+#ifndef likely
+#define likely(x_)	__builtin_expect(!!(x_), 1)
+#define unlikely(x_)	__builtin_expect(!!(x_), 0)
+#endif
+
+#ifndef BIT
+#define BIT(b)	(1UL << (b))
+#endif
+
+/* RHEL/CentOS7 are missing these */
+#ifndef FALLOC_FL_UNSHARE_RANGE
+#define FALLOC_FL_UNSHARE_RANGE	0x40
+#endif
+#ifndef FALLOC_FL_INSERT_RANGE
+#define FALLOC_FL_INSERT_RANGE	0x20
+#endif
+
+#endif /* ndef __KERNEL__ */
+
+#endif /* _LINUX_ZUFS_API_H */

From patchwork Thu Sep 26 02:07:11 2019
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 11161799
From: Boaz Harrosh
To:
linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox, Dan Williams
Subject: [PATCH 02/16] MAINTAINERS: Add the ZUFS maintainership
Date: Thu, 26 Sep 2019 05:07:11 +0300
Message-Id: <20190926020725.19601-3-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>
References: <20190926020725.19601-1-boazh@netapp.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

ZUFS, sitting in the fs/zuf/ directory, is maintained by NetApp
(I added my email).

I keep this as a separate patch, as this file might be a source of
conflicts.

Signed-off-by: Boaz Harrosh
---
 MAINTAINERS | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index a50e97a63bc8..8703871c1505 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17867,6 +17867,12 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/zswap.c
 
+ZUFS ZERO COPY USER-MODE FILESYSTEM
+M:	Boaz Harrosh <boazh@netapp.com>
+L:	linux-fsdevel@vger.kernel.org
+S:	Maintained
+F:	fs/zuf/
+
 THE REST
 M:	Linus Torvalds
 L:	linux-kernel@vger.kernel.org

From patchwork Thu Sep 26 02:07:12 2019
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 11161801
From: Boaz Harrosh
To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox, Dan Williams
Subject: [PATCH 03/16] zuf: Preliminary Documentation
Date: Thu, 26 Sep 2019 05:07:12 +0300
Message-Id: <20190926020725.19601-4-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>
References: <20190926020725.19601-1-boazh@netapp.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Adding Documentation/filesystems/zufs.txt.

Adding some Documentation first, so as to give the reviewer of the
coming patch-set some background and an overview of the whole system.

[v2]
Incorporated Randy's few comments. Randy, please give it a harder
review?

CC: Randy Dunlap
Signed-off-by: Boaz Harrosh
---
 Documentation/filesystems/zufs.txt | 386 +++++++++++++++++++++++++++++
 1 file changed, 386 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt

diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt
new file mode 100644
index 000000000000..2a347a446aa7
--- /dev/null
+++ b/Documentation/filesystems/zufs.txt
@@ -0,0 +1,386 @@
+ZUFS - Zero-copy User-mode FileSystem
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Trees:
+	git clone https://github.com/NetApp/zufs-zuf -b upstream
+	git clone https://github.com/NetApp/zufs-zus -b upstream
+
+patches, comments, questions, requests to:
+	boazh@netapp.com
+
+Introduction:
+~~~~~~~~~~~~~
+
+ZUFS - stands for Zero-copy User-mode FS
+▪ It is geared towards true zero-copy, end to end, of both data and
+  meta data.
+▪ It is geared towards very *low latency*, very high CPU locality,
+  lock-less parallelism.
+▪ Synchronous operations
+▪ NUMA awareness
+
+ ZUFS is a from-scratch implementation of a filesystem-in-user-space
+which tries to address the above goals. It is aimed at pmem-based FSs,
+but supports any other type of FS as well.
+
+Glossary and names:
+~~~~~~~~~~~~~~~~~~~
+
+ZUF - Zero-copy User-mode Feeder
+  zuf.ko is the Kernel VFS component. Its job is to interface with the
+  Kernel VFS and dispatch commands to a User-mode application Server.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zuf -b upstream
+
+ZUS - Zero-copy User-mode Server
+  zufs utilizes a User-mode server application that takes care of the
+  detailed communication protocol and correctness with the Kernel.
+  In turn it utilizes many zusFS Filesystem plugins to implement the
+  actual on-disk Filesystem.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zus -b upstream
+
+zusFS - FS plugins
+  These are .so loadable modules that implement one or more
+  Filesystem-types (mount -t xyz).
+  The zus server communicates with the plugin via a set of function
+  vectors for the different operations, and establishes communication
+  via defined structures.
+
+Filesystem-type:
+  At startup zus registers with the Kernel one or more
+  Filesystem-type(s). Associated with the type is a unique type-name
+  (mount -t foofs), plus different info about the fs, like a magic
+  number and so on.
+  One Server can support many FS-types; in turn each FS-type can mount
+  multiple super-blocks, each supporting multiple devices.
+
+Device-Table (MDT) - A zufs FS can support multiple devices
+  ZUF in Kernel may receive, like any mount command, a block-device or
+  none. For the former, if the specified FS-type states so in a special
+  field, the mount will look for a Device table: a list of devices, in
+  a specific order, sitting at some offset on each block-device. The
+  system will then proceed to open and own all these devices and
+  associate them with the mounting super-block.
+  If the FS-type specifies a -1 at DT_offset then there is no device
+  table and a DT of a single device is created. (If no devices are
+  specified at all, we operate without any block devices; mount options
+  give some indication of the storage information.)
+  The device table has special consideration for pmem devices and will
+  present the whole linear array of devices to zus as one flat mmap
+  space. Alternatively, all non-pmem devices are also provided an
+  interface with a facility for data movement from pmem to slower
+  devices.
+  Detailed NUMA info is exported to the Server for maximum utilization.
+  Each device has an associated NUMA node, so the Server can optimize
+  IO to these devices.
+
+pmem: (Also called t1)
+  Multiple pmem devices are presented to the server as a single linear
+  file mmap, something like /dev/dax, but strictly available only to
+  the specific super-block that owns it.
+
+Shadow: (For debugging)
+  "Shadow" is used for debugging the correct persistence of pmem-based
+  filesystems. With pmem, after a modification a user must call
+  cl_flush/sfence for the data to be guaranteed persistent. This is
+  very hard to test and time consuming, so for that we invented the
+  shadow.
+  There is a special mode bit in the MDT header that denotes a shadow
+  system. In a shadow setup each pmem device is divided in half. The
+  first half is available for FS storage; the second half is a Shadow,
+  i.e. each time the FS calls cl_flush or mov_nt the data is then
+  memcopied to the shadow.
+  At mount time the Shadow is copied onto the main part, thus
+  presenting only those bits that were persisted by the FS. So a simple
+  remount can simulate a full machine reboot.
+  The Shadow is presented as the upper part of the mmapped region, i.e.
+  the whole t1 range is repeated again.
+  The zus core code facilitates zusFS implementors in accessing this
+  facility.
+
+zufs_dpp_t - Dual port pointer type
+  At some points in the protocol there are objects that return from zus
+  (the Server) to the Kernel via a dpp_t. This is a special kind of
+  pointer: it is actually an 8-byte-aligned offset, with the 3 low bits
+  specifying a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7]
+  pool == 0 means the offset is in pmem, whose management is by zuf,
+  and full easy access is provided for zus.
+
+  pool != 0 is a pre-established file (up to 6 such files per sb) where
+  the zus has an mmap on the file and the Kernel can access that data
+  via an offset into the file.
+  pool == 7 denotes an offset into the application buffers associated
+  with the current IO.
+  All dpp_t objects' life-time rules are strictly defined.
+  The primary use of dpp_t is the on-pmem inode structure. Both zus and
+  zuf can access and change this structure. On any modification the zus
+  is called, so as to be notified of any changes and their persistence.
+  More such objects are: symlinks, xattrs, data-blocks etc.
+
+Relay-wait-object:
+  Communication between the Kernel and the server is done via
+  zus-threads that sleep in the Kernel (inside an IOCTL) and wait for
+  commands. Once a command is received, the IOCTL returns; the
+  operation is executed, and the return info is delivered via a new
+  IOCTL call, which then waits for the next operation.
+  To wake up the sleeping thread we use a Relay-wait-object. Currently
+  it is two waitqueue_head(s) back to back.
+  In the future we should investigate the use of a new special
+  scheduler object that switches from thread A to a predefined thread's
+  ZT context without passing through the scheduler at all.
+  (The switching is already very fast, faster than anything currently
+  in the Kernel. But I believe I can shave another microsecond off a
+  roundtrip.)
+
+ZT-threads-array:
+  The novelty of zufs is the ZT-threads system.
+  Three threads or more are pre-created for each active core in the
+  system.
+  ▪ Each thread's AFFINITY is set for that single core only.
+  ▪ There is a special communication file per ZT
+    (O_TMPFILE + IOCTL_ZUFS_INIT).
+    At initialization the ZT thread communicates through a ZT_INIT
+    ioctl and registers as the handler of that core (Channel).
+  ▪ Also for each ZT, the Kernel allocates an IOCTL-buffer that is
+    directly accessed by the Kernel. In turn that IOCTL-buffer is
+    mmapped by zus for the Server's access to that communication
+    buffer. (This is for zero-copy operations as well as avoiding the
+    smem memory barrier.)
+  ▪ IOCTL_ZU_WAIT_OPT - the thread sleeps in the Kernel waiting for an
+    operation via the IOCTL_ZU_WAIT_OPT call.
+
+  ▪ On operation dispatch, the current CPU's free ZT channel is
+    selected. Operation info is set into the IOCTL-buffer, the ZT is
+    woken and the application thread is put to sleep.
+  ▪ After execution, the ZT returns to the kernel (IOCTL_ZU_WAIT_OPT),
+    the app is released, and the Server waits for a new operation on
+    that CPU.
+  ▪ Each ZT has a cyclic logic. Each call to IOCTL_ZU_WAIT_OPT from the
+    Server returns the results of the previous operation, before going
+    to sleep waiting to receive a new operation.
+
+	zus                   zuf-zt                   application
+	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+	---> IOCTL_ZU_WAIT_OPT
+	|                     if (app-waiting)
+	|                       wake-up-application -> return to app
+	|                     FS-WAIT
+	|                       |                      <- POSIX call
+	|                       V  <- fs-wake-up(dispatch)
+	| <- return with new command
+	|--<- do_new_operation
+
+ZUS-mount-thread:
+  The system utilizes a single mount thread. (This thread is not
+  affinitized to any core.)
+  ▪ It will first register all FS-types supported by this Server (by
+    calling all zusFS plugins to register their supported types). Once
+    done,
+  ▪ as above, the thread sleeps in the Kernel via the IOCTL_ZU_MOUNT
+    call.
+  ▪ When the Kernel receives a mount request (the VFS calls the
+    fs_type->mount op), a mount is dispatched back to zus.
+  ▪ NOTE: only on the very first mount is the above ZT-threads-array
+    created; the same ZT-array is then used for all super-blocks in the
+    system.
+  ▪ As part of the mount command, in the context of this same
+    mount-thread, a call to IOCTL_ZU_GRAB_PMEM will establish an
+    interface to the pmem associated with this super_block.
+  ▪ On return, as above, a new call to IOCTL_ZU_MOUNT will return the
+    info of the mount before sleeping in the kernel waiting for a new
+    dispatch. All SB info is provided to zuf, including the root inode
+    info. The Kernel then proceeds to complete the mount call.
+  ▪ NOTE that since there is a single mount thread, lots of the
+    FS-registration, super_block and pmem management is lockless.
+
+Philosophy of operations:
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. [zuf-root]
+
+On module load (zuf.ko) a special pseudo-FS is mounted on /sys/fs/zuf.
+This is called zuf-root.
+The zuf-root has no visible files. All communication is done via
+special-files. special-files are open(O_TMPFILE) and establish a
+special role via an IOCTL. (The ZT-thread above is one such special
+file.)
+All communications with the server are done via the zuf-root. Each root
+owns many FS-types and each FS-type owns many super-blocks of this
+type, all sharing the same communication channels.
+All FS-type Servers live in the same zus application address space. If
+the administrator wants to separate between different servers, he/she
+can mount a new zuf-root and point a new server instance at that new
+mount, registering other FS-types on that other instance. The whole
+communication array will then be duplicated as well.
+(Otherwise pointing a new server instance at a busy root will return an
+error.)
+
+2. [zus server start]
+  ▪ On load, all configured zusFS plugins are loaded.
+  ▪ The Server starts by starting a single mount thread.
+  ▪ It then proceeds to register with the Kernel all FS-types it will
+    support.
+    (This is done on the single mount thread, so FS-registration and
+    mount/umount operate in a single thread and therefore need no
+    locks.)
+  ▪ It then sleeps in the Kernel on a special-file of that zuf-root,
+    waiting for a mount command.
+
+3. [mount -t xyz]
+  [In Kernel]
+  ▪ If xyz was registered above, as part of the Server startup, the
+    regular mount command will come to the zuf module with a
+    zuf_mount() call, with the xyz-FS-info. In turn this points to a
+    zuf-root.
+  ▪ The code then proceeds to load a device-table of devices as
+    specified above. It then establishes a multi_devices object with a
+    specific sb_id.
+  ▪ It proceeds to call mount_bdev, always with the same main-device,
+    thus fully supporting automatic bind mounts, even if different
+    devices are given to the mount command.
+  ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread,
+    specifying two parameters: the FS-type to mount, and the sb_id
+    associated with this super_block.
+
+  [In zus]
+  ▪ A zus_super_block_info is allocated.
+  ▪ zus calls PMEM_GRAB(sb_id) to establish a direct mapping to its
+    pmem devices. On return we have full access to our PMEM.
+
+  ▪ ZT-threads-array
+    If this is the first mount, the ZT-threads-array is created and
+    established. The mount thread will wait until all zt-threads have
+    finished initialization and are ready to rock.
+  ▪ The root-zus_inode is loaded and returned to the kernel.
+  ▪ More info about the mount, like block sizes and so on, is returned
+    to the kernel.
+
+  [In Kernel]
+  The zuf_fill_super is finalized, vectors established, and we have a
+  new super_block ready for operations.
+
+4. An FS operation like create or WRITE/READ and so on arrives from an
+   application via the VFS. Eventually an operation is dispatched to
+   zus:
+  ▪ A special per-operation descriptor is filled with all parameters.
+  ▪ The current CPU's channel is grabbed, and the operation descriptor
+    is put on that channel (ZT), including the get_user_pages or
+    Kernel-pages associated with this OPT.
+  ▪ The ZT is awakened, and the app thread is put to sleep.
+  ▪ Optionally, in ZT context, pages are mapped to that ZT-vma. This is
+    so we are sure the map is only on a single core, and no other
+    core's TLB is affected.
+  ▪ The ZT thread is returned to user-space.
+  ▪ In ZT context the zus Server calls the appropriate
+    zusFS->operation vector. Output params are filled.
+  ▪ zus calls again with an IOCTL_ZU_WAIT_OPT with the same descriptor
+    to return the requested info.
+  ▪ At the Kernel (zuf) the app thread is awakened with the results,
+    and the ZT thread goes back to sleep waiting for a new operation.
+
+  ZT rules:
+  A ZT thread should try to minimize its sleeps. It might take locks,
+  in which case we will see that the same CPU channel is re-entered via
+  another application/thread, but now that CPU channel is taken. What
+  we do is utilize a few channels (ZTs) per core, and those threads may
+  grab another channel. But this only postpones the problem: on a busy,
+  contended system all such channels will be consumed. If all channels
+  are taken, the application thread is put on a busy scheduling wait
+  until a channel can be grabbed.
+  If the Server needs to sleep for a long time it should utilize the
+  ZUFS_ASYNC return option. The app is then kept sleeping on an
+  operation-context object and the ZT is freed for foreground
+  operation. At some point in time, when the server completes the
+  delayed operation, it will notify the Kernel with a special async
+  IO-context cookie, and the app will be awakened.
+
+5. On umount the operation is reversed and all resources are released.
+6. In case of an application or Server crash, all resources are
+   associated with files; on file_release these resources are caught
+   and freed.
+
+Objects and life-time
+~~~~~~~~~~~~~~~~~~~~~
+
+Each Kernel object type has an associated zus Server object type whose
+life-time is governed by the life-time of the Kernel object.
Therefore the Server's
+job is easy: it need not establish any object caches / hashes and so on.
+
+Inside zus all objects are allocated by the zusFS plugin, so it can
+allocate a bigger space for its own private data and access it via the
+container_of() coding pattern. So when I say below "a zus-object" I
+mean both the zus public part + the zusFS private part of the same
+object.
+
+All operations return a User-mode pointer that is opaque to the Kernel
+code; it is just a cookie which is returned back to zus when needed.
+At times, when we want the Kernel to have direct access to a zus
+object like zufs_inode, along with the cookie we also return a dpp_t,
+with a defined structure.
+
+Kernel object             | zus object      | Kernel access (via dpp_t)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+zuf_fs_type
+  file_system_type        | zus_fs_info     | no
+
+zuf_sb_info
+  super_block             | zus_sb_info     | no
+
+zuf_inode_info            |                 |
+  vfs_inode               | zus_inode_info  | no
+  zufs_inode *            | zufs_inode *    | yes
+  symlink *               | char-array      | yes
+  xattr**                 | zus_xattr       | yes
+
+When it is time for a Kernel object to die, a final call to zus is
+dispatched so the associated object can also be freed. This means that
+on memory pressure, when object caches are evicted, the zus memory
+resources are freed as well.
+
+
+How to use zufs:
+~~~~~~~~~~~~~~~~
+
+The most up-to-date documentation of how to use the latest code bases
+is the script (set of scripts) at fs/do-zu/zudo in the zus git tree.
+
+We the developers at Netapp use this script to mount and test our
+latest code. So any new secret will be found in these scripts. Please
+read them as the ultimate source of how to operate things.
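
The "zus public part + zusFS private part" embedding described in the
Objects section above is the classic container_of() pattern. A minimal
userspace sketch of the idea, with purely illustrative names
(myfs_inode_info and MYFS_I are not part of the actual zus API):

```c
#include <stddef.h>

/* A zus-style public object, as the opaque cookie handed to the
 * Kernel would reference it. Illustrative only. */
struct zus_inode_info {
	unsigned long ino;
};

/* The zusFS plugin allocates a bigger object that embeds the public
 * part first-class, so its private data always travels with the
 * cookie and no extra cache/hash lookup is ever needed. */
struct myfs_inode_info {
	struct zus_inode_info zii;	/* zus public part */
	int private_state;		/* zusFS private part */
};

/* Userspace equivalent of the kernel's container_of() macro:
 * step back from a member pointer to the enclosing object. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the full private object from the public pointer. */
static struct myfs_inode_info *MYFS_I(struct zus_inode_info *zii)
{
	return container_of(zii, struct myfs_inode_info, zii);
}
```

Because the zusFS allocates the embedding object itself, the cookie
returned to the Kernel is enough to reach both parts, which is why the
Server side needs no object caches or hashes at all.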
+
+We assume you cloned these git trees:
+[]$ mkdir zufs; cd zufs
+[]$ git clone https://github.com/NetApp/zufs-zus -b upstream
+[]$ git clone https://github.com/NetApp/zufs-zuf -b upstream
+
+This will create the following trees:
+zufs/zus - Source code for the Server
+zufs/zuf - Linux Kernel source tree to compile and install on your machine
+
+Also specifically:
+zufs/zus/fs/do-zu/zudo - script documenting how to run things
+
+[]$ cd zufs
+
+First time:
+[]$ zus/fs/do-zu/zudo
+This will create a file:
+	zus/fs/do-zu/zu.conf
+
+Edit this file for your environment: devices, mount-point and so on.
+On first run an example file will be created for you. Fill in the
+blanks. Most params can stay as-is in most cases.
+
+Now let's start running:
+
+[1]$ zus/fs/do-zu/zudo mkfs
+This will run the proper mkfs command selected in the zu.conf file
+with the proper devices.
+
+[2]$ zus/fs/do-zu/zudo zuf-insmod
+This loads the zuf.ko module.
+
+[3]$ zus/fs/do-zu/zudo zuf-root
+This mounts the zuf-root FS above on /sys/fs/zuf (automatically
+created in [2]).
+
+[4]$ zus/fs/do-zu/zudo zus-up
+This runs the zus daemon in the background.
+
+[5]$ zus/fs/do-zu/zudo mount
+This mounts the mkfs'ed FS above on the specified dir in zu.conf.
+
+To run all the 5 commands above at once do:
+[]$ zus/fs/do-zu/zudo up
+
+To undo all the above in reverse order do:
+[]$ zus/fs/do-zu/zudo down
+
+And the most magic command is:
+[]$ zus/fs/do-zu/zudo again
+This will do a "down", then update-mods, then "up".
+(update-mods is a special script to copy the latest compiled binaries)
+
+Now you are ready for some:
+[]$ zus/fs/do-zu/zudo xfstest
+xfstests is assumed to be installed in the regular /opt/xfstests dir.
+
+Again, please see inside the scripts what each command does;
+these scripts are the ultimate documentation. Do not believe
+anything I'm saying here.
(Because it is outdated by now)
+
+Have a nice day

From patchwork Thu Sep 26 02:07:13 2019
From: Boaz Harrosh
Subject: [PATCH 04/16] zuf: zuf-rootfs
Date: Thu, 26 Sep 2019 05:07:13 +0300
Message-Id: <20190926020725.19601-5-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>

zuf-root is a pseudo-FS through which the zusd Server communicates: it
registers new file-systems and receives new mount requests. In this
patch we have the bring-up of that special FS.

The principal communication with zuf-rootfs is done through temp-files
+ io-ctls.
A caller does an open(O_TMPFILE) and invokes some IOCTL_XXX on the
file. The specific ioctl establishes one of the zuf_special_file type
objects and attaches the object to the file-ptr, thereby defining
special behavior for that object.

Otherwise zuf-rootfs is not much of an FS at all. It has a few
viewable variable files exposing state and info about the system. In
this patch we can see the "state" variable-file, which tells user-mode
when the Kernel is ready for new mounts, and registered_fs, which
exposes which zufFS(s) were registered with the Kernel.

There is a one-to-one relationship between a zuf-root SB and a zusd
Server. Each zusd Server can support multiple zusFS plugins and
register multiple filesystem-types.

The zuf-rootfs (mount -t zuf) is usually mounted on /sys/fs/zuf. The
/sys/fs/zuf directory is automatically created when zuf.ko is loaded.
If an admin wants to run more zusd server applications, she/he can
mount a second instance of -t zuf on some dir and point the new zusd
Server to it (zusd has an optional path argument). Otherwise a second
instance attempting to communicate with a busy zuf-root will fail.

TODO: How to trigger a first mount on module_load.
Currently admin needs to manually "mount -t zuf none /sys/fs/zuf" Signed-off-by: Boaz Harrosh --- fs/zuf/Makefile | 4 + fs/zuf/_extern.h | 41 +++++ fs/zuf/_pr.h | 63 +++++++ fs/zuf/super.c | 53 ++++++ fs/zuf/zuf-core.c | 69 ++++++++ fs/zuf/zuf-root.c | 438 ++++++++++++++++++++++++++++++++++++++++++++++ fs/zuf/zuf.h | 116 ++++++++++++ fs/zuf/zus_api.h | 36 ++++ 8 files changed, 820 insertions(+) create mode 100644 fs/zuf/_extern.h create mode 100644 fs/zuf/_pr.h create mode 100644 fs/zuf/super.c create mode 100644 fs/zuf/zuf-core.c create mode 100644 fs/zuf/zuf-root.c create mode 100644 fs/zuf/zuf.h diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile index 452cec55f34d..b08c08e73faa 100644 --- a/fs/zuf/Makefile +++ b/fs/zuf/Makefile @@ -10,5 +10,9 @@ obj-$(CONFIG_ZUFS_FS) += zuf.o +# ZUF core +zuf-y += zuf-core.o zuf-root.o + # Main FS +zuf-y += super.o zuf-y += module.o diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h new file mode 100644 index 000000000000..0e8aa52f1259 --- /dev/null +++ b/fs/zuf/_extern.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. 
+ * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#ifndef __ZUF_EXTERN_H__ +#define __ZUF_EXTERN_H__ +/* + * DO NOT INCLUDE this file directly, it is included by zuf.h + * It is here because zuf.h got to big + */ + +/* + * extern functions declarations + */ + +/* zuf-core.c */ +int zufc_zts_init(struct zuf_root_info *zri); /* Some private types in core */ +void zufc_zts_fini(struct zuf_root_info *zri); + +long zufc_ioctl(struct file *filp, unsigned int cmd, ulong arg); +int zufc_release(struct inode *inode, struct file *file); +int zufc_mmap(struct file *file, struct vm_area_struct *vma); + +/* zuf-root.c */ +int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs); + +/* super.c */ +int zuf_init_inodecache(void); +void zuf_destroy_inodecache(void); + +struct dentry *zuf_mount(struct file_system_type *fs_type, int flags, + const char *dev_name, void *data); + +#endif /*ndef __ZUF_EXTERN_H__*/ diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h new file mode 100644 index 000000000000..51924b6bd2a5 --- /dev/null +++ b/fs/zuf/_pr.h @@ -0,0 +1,63 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#ifndef __ZUF_PR_H__ +#define __ZUF_PR_H__ + +#ifdef pr_fmt +#undef pr_fmt +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#endif + +/* + * Debug code + */ +#define zuf_err(s, args ...) pr_err("[%s:%d] " s, __func__, \ + __LINE__, ## args) +#define zuf_err_cnd(silent, s, args ...) \ + do {if (!silent) \ + pr_err("[%s:%d] " s, __func__, __LINE__, ## args); \ + } while (0) +#define zuf_warn(s, args ...) pr_warn("[%s:%d] " s, __func__, \ + __LINE__, ## args) +#define zuf_warn_cnd(silent, s, args ...) \ + do {if (!silent) \ + pr_warn("[%s:%d] " s, __func__, __LINE__, ## args); \ + } while (0) +#define zuf_info(s, args ...) 
pr_info("~info~ " s, ## args) + +#define zuf_chan_debug(c, s, args...) pr_debug(c " [%s:%d] " s, __func__, \ + __LINE__, ## args) + +/* ~~~ channel prints ~~~ */ +#define zuf_dbg_perf(s, args ...) zuf_chan_debug("perfo", s, ##args) +#define zuf_dbg_err(s, args ...) zuf_chan_debug("error", s, ##args) +#define zuf_dbg_vfs(s, args ...) zuf_chan_debug("vfs ", s, ##args) +#define zuf_dbg_rw(s, args ...) zuf_chan_debug("rw ", s, ##args) +#define zuf_dbg_t1(s, args ...) zuf_chan_debug("t1 ", s, ##args) +#define zuf_dbg_xattr(s, args ...) zuf_chan_debug("xattr", s, ##args) +#define zuf_dbg_acl(s, args ...) zuf_chan_debug("acl ", s, ##args) +#define zuf_dbg_t2(s, args ...) zuf_chan_debug("t2dbg", s, ##args) +#define zuf_dbg_t2_rw(s, args ...) zuf_chan_debug("t2grw", s, ##args) +#define zuf_dbg_core(s, args ...) zuf_chan_debug("core ", s, ##args) +#define zuf_dbg_mmap(s, args ...) zuf_chan_debug("mmap ", s, ##args) +#define zuf_dbg_zus(s, args ...) zuf_chan_debug("zusdg", s, ##args) +#define zuf_dbg_verbose(s, args ...) zuf_chan_debug("d-oto", s, ##args) + +#define md_err zuf_err +#define md_warn zuf_warn +#define md_err_cnd zuf_err_cnd +#define md_warn_cnd zuf_warn_cnd +#define md_dbg_err zuf_dbg_err +#define md_dbg_verbose zuf_dbg_verbose + + +#endif /* define __ZUF_PR_H__ */ diff --git a/fs/zuf/super.c b/fs/zuf/super.c new file mode 100644 index 000000000000..f7f7798425a9 --- /dev/null +++ b/fs/zuf/super.c @@ -0,0 +1,53 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Super block operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. 
+ * + * Authors: + * Boaz Harrosh + * Sagi Manole + */ + +#include +#include +#include +#include + +#include "zuf.h" + +static struct kmem_cache *zuf_inode_cachep; + +static void _init_once(void *foo) +{ + struct zuf_inode_info *zii = foo; + + inode_init_once(&zii->vfs_inode); +} + +int __init zuf_init_inodecache(void) +{ + zuf_inode_cachep = kmem_cache_create("zuf_inode_cache", + sizeof(struct zuf_inode_info), + 0, + (SLAB_RECLAIM_ACCOUNT | + SLAB_MEM_SPREAD | + SLAB_TYPESAFE_BY_RCU), + _init_once); + if (zuf_inode_cachep == NULL) + return -ENOMEM; + return 0; +} + +void zuf_destroy_inodecache(void) +{ + kmem_cache_destroy(zuf_inode_cachep); +} + +struct dentry *zuf_mount(struct file_system_type *fs_type, int flags, + const char *dev_name, void *data) +{ + return ERR_PTR(-ENOTSUPP); +} diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c new file mode 100644 index 000000000000..c9bb31f75bed --- /dev/null +++ b/fs/zuf/zuf-core.c @@ -0,0 +1,69 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * Ioctl operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. 
+ * + * Authors: + * Boaz Harrosh + */ + +#include +#include +#include +#include +#include +#include + +#include "zuf.h" + +int zufc_zts_init(struct zuf_root_info *zri) +{ + return 0; +} + +void zufc_zts_fini(struct zuf_root_info *zri) +{ +} + +long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg) +{ + switch (cmd) { + default: + zuf_err("%d\n", cmd); + return -ENOTTY; + } +} + +int zufc_release(struct inode *inode, struct file *file) +{ + struct zuf_special_file *zsf = file->private_data; + + if (!zsf) + return 0; + + switch (zsf->type) { + default: + return 0; + } +} + +int zufc_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct zuf_special_file *zsf = file->private_data; + + if (unlikely(!zsf)) { + zuf_err("Which mmap is that !!!!\n"); + return -ENOTTY; + } + + switch (zsf->type) { + default: + zuf_err("type=%d\n", zsf->type); + return -ENOTTY; + } +} diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c new file mode 100644 index 000000000000..ea7eb810ea9d --- /dev/null +++ b/fs/zuf/zuf-root.c @@ -0,0 +1,438 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * ZUF Root filesystem. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUS-ZUF interaction is done via a small specialized FS that + * provides the communication with the mount-thread, ZTs, pmem devices, + * and so on ... + * Subsequently all FS super_blocks are children of this root, and point + * to it. All sharing the same zuf communication channels. + * + * Authors: + * Boaz Harrosh + */ + +#include +#include +#include +#include +#include +#include + +#include "zuf.h" + +/* ~~~~ Register/Unregister FS-types ~~~~ */ +#ifdef CONFIG_LOCKDEP + +/* + * NOTE: When CONFIG_LOCKDEP is on. register_filesystem() complains when + * the fstype object is from a kmalloc. Because of some lockdep_keys not + * being const_obj something. + * + * So in this case we have maximum of 16 fstypes system wide + * (Total for all mounted zuf_root(s)). 
This way we can have them + * in const_obj memory below at g_fs_array + */ + +enum { MAX_LOCKDEP_FSs = 16 }; +static uint g_fs_next; +static struct zuf_fs_type g_fs_array[MAX_LOCKDEP_FSs]; + +static struct zuf_fs_type *_fs_type_alloc(void) +{ + struct zuf_fs_type *ret; + + if (MAX_LOCKDEP_FSs <= g_fs_next) + return NULL; + + ret = &g_fs_array[g_fs_next++]; + memset(ret, 0, sizeof(*ret)); + return ret; +} + +static void _fs_type_free(struct zuf_fs_type *zft) +{ + if (zft == &g_fs_array[0]) + g_fs_next = 0; +} + +#else /* !CONFIG_LOCKDEP*/ +static struct zuf_fs_type *_fs_type_alloc(void) +{ + return kcalloc(1, sizeof(struct zuf_fs_type), GFP_KERNEL); +} + +static void _fs_type_free(struct zuf_fs_type *zft) +{ + kfree(zft); +} +#endif /*CONFIG_LOCKDEP*/ + + +static ssize_t _state_read(struct file *file, char __user *buf, size_t len, + loff_t *ppos) +{ + struct zuf_root_info *zri = ZRI(file->f_inode->i_sb); + const char *msg; + + if (*ppos > 0) + return 0; + + switch (zri->state) { + case ZUF_ROOT_INITIALIZING: + msg = "initializing\n"; + break; + case ZUF_ROOT_REGISTERING_FS: + msg = "registering_fs\n"; + break; + case ZUF_ROOT_MOUNT_READY: + msg = "mount_ready\n"; + break; + case ZUF_ROOT_SERVER_FAILED: + msg = "server_failed\n"; + break; + default: + msg = "UNKNOWN\n"; + break; + } + + return simple_read_from_buffer(buf, len, ppos, msg, strlen(msg)); +} + +static const struct file_operations _state_ops = { + .open = nonseekable_open, + .read = _state_read, + .llseek = no_llseek, +}; + +static ssize_t _registered_fs_read(struct file *file, char __user *buf, + size_t len, loff_t *ppos) +{ + struct zuf_root_info *zri = ZRI(file->f_inode->i_sb); + size_t buff_len = 0; + struct zuf_fs_type *zft; + char *fs_buff, *p; + ssize_t ret; + size_t name_len; + + list_for_each_entry(zft, &zri->fst_list, list) + buff_len += strlen(zft->rfi.fsname) + 1; + + if (unlikely(*ppos > buff_len)) + return -EINVAL; + if (*ppos == buff_len) + return 0; + + fs_buff = kzalloc(buff_len + 1, 
GFP_KERNEL); + if (unlikely(!fs_buff)) + return -ENOMEM; + + p = fs_buff; + list_for_each_entry(zft, &zri->fst_list, list) { + if (p != fs_buff) { + *p = ' '; + ++p; + } + name_len = strlen(zft->rfi.fsname); + memcpy(p, zft->rfi.fsname, name_len); + p += name_len; + } + + p = fs_buff + *ppos; + buff_len = buff_len - *ppos; + ret = simple_read_from_buffer(buf, len, ppos, p, buff_len); + kfree(fs_buff); + + return ret; +} + +static const struct file_operations _registered_fs_ops = { + .open = nonseekable_open, + .read = _registered_fs_read, + .llseek = no_llseek, +}; + + +int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs) +{ + struct zuf_fs_type *zft = _fs_type_alloc(); + struct zuf_root_info *zri = ZRI(sb); + + if (unlikely(!zft)) + return -ENOMEM; + + if (zri->state == ZUF_ROOT_INITIALIZING) + zri->state = ZUF_ROOT_REGISTERING_FS; + + /* Original vfs file type */ + zft->vfs_fst.owner = THIS_MODULE; + zft->vfs_fst.name = kstrdup(rfs->rfi.fsname, GFP_KERNEL); + zft->vfs_fst.mount = zuf_mount; + zft->vfs_fst.kill_sb = kill_block_super; + + /* ZUS info about this FS */ + zft->rfi = rfs->rfi; + zft->zus_zfi = rfs->zus_zfi; + INIT_LIST_HEAD(&zft->list); + /* Back pointer to our communication channels */ + zft->zri = ZRI(sb); + + zuf_add_fs_type(zft->zri, zft); + zuf_info("register_filesystem [%s]\n", zft->vfs_fst.name); + return register_filesystem(&zft->vfs_fst); +} + +static void _unregister_all_fses(struct zuf_root_info *zri) +{ + struct zuf_fs_type *zft, *n; + + list_for_each_entry_safe_reverse(zft, n, &zri->fst_list, list) { + unregister_filesystem(&zft->vfs_fst); + list_del_init(&zft->list); + _fs_type_free(zft); + } +} + +static int zufr_unlink(struct inode *dir, struct dentry *dentry) +{ + struct inode *inode = dentry->d_inode; + + drop_nlink(inode); + return 0; +} + +/* Force alignment of 2M for all vma(s) + * + * This belongs to t1.c and what it does for mmap. 
But we do not mind + * that both our mmaps (grab_pmem or ZTs) will be 2M aligned so keep + * it here. And zus mappings just all match perfectly with no need for + * holes. + * FIXME: This is copy/paste from dax-device. It can be very much simplified + * for what we need. + */ +static unsigned long zufr_get_unmapped_area(struct file *filp, + unsigned long addr, unsigned long len, unsigned long pgoff, + unsigned long flags) +{ + unsigned long off, off_end, off_align, len_align, addr_align; + unsigned long align = PMD_SIZE; + + if (addr) + goto out; + + off = pgoff << PAGE_SHIFT; + off_end = off + len; + off_align = round_up(off, align); + + if ((off_end <= off_align) || ((off_end - off_align) < align)) + goto out; + + len_align = len + align; + if ((off + len_align) < off) + goto out; + + addr_align = current->mm->get_unmapped_area(filp, addr, len_align, + pgoff, flags); + if (!IS_ERR_VALUE(addr_align)) { + addr_align += (off - addr_align) & (align - 1); + return addr_align; + } + out: + return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags); +} + +static const struct inode_operations zufr_inode_operations; +static const struct file_operations zufr_file_dir_operations = { + .open = dcache_dir_open, + .release = dcache_dir_close, + .llseek = dcache_dir_lseek, + .read = generic_read_dir, + .iterate_shared = dcache_readdir, + .fsync = noop_fsync, + .unlocked_ioctl = zufc_ioctl, +}; +static const struct file_operations zufr_file_reg_operations = { + .fsync = noop_fsync, + .unlocked_ioctl = zufc_ioctl, + .get_unmapped_area = zufr_get_unmapped_area, + .mmap = zufc_mmap, + .release = zufc_release, +}; + +static int zufr_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) +{ + struct zuf_root_info *zri = ZRI(dir->i_sb); + struct inode *inode; + int err; + + inode = new_inode(dir->i_sb); + if (!inode) + return -ENOMEM; + + /* We need to impersonate device-dax (S_DAX + S_IFCHR) in order to get + * the PMD (huge) page faults and allow RDMA memory 
access via GUP + * (get_user_pages_longterm). + */ + inode->i_flags = S_DAX; + mode = (mode & ~S_IFREG) | S_IFCHR; /* change file type to char */ + + inode->i_ino = ++zri->next_ino; /* none atomic only one mount thread */ + inode->i_blocks = inode->i_size = 0; + inode->i_ctime = inode->i_mtime = current_time(inode); + inode->i_atime = inode->i_ctime; + inode_init_owner(inode, dir, mode); + + inode->i_op = &zufr_inode_operations; + inode->i_fop = &zufr_file_reg_operations; + + err = insert_inode_locked(inode); + if (unlikely(err)) { + zuf_err("[%ld] insert_inode_locked => %d\n", inode->i_ino, err); + goto fail; + } + d_tmpfile(dentry, inode); + unlock_new_inode(inode); + return 0; + +fail: + clear_nlink(inode); + make_bad_inode(inode); + iput(inode); + return err; +} + +static void zufr_put_super(struct super_block *sb) +{ + struct zuf_root_info *zri = ZRI(sb); + + zufc_zts_fini(zri); + _unregister_all_fses(zri); + + zuf_info("zuf_root umount\n"); +} + +static void zufr_evict_inode(struct inode *inode) +{ + clear_inode(inode); +} + +static const struct inode_operations zufr_inode_operations = { + .lookup = simple_lookup, + + .tmpfile = zufr_tmpfile, + .unlink = zufr_unlink, +}; +static const struct super_operations zufr_super_operations = { + .statfs = simple_statfs, + + .evict_inode = zufr_evict_inode, + .put_super = zufr_put_super, +}; + +#define ZUFR_SUPER_MAGIC 0x1717 + +static int zufr_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr zufr_files[] = { + [2] = {"state", &_state_ops, S_IFREG | 0400}, + [3] = {"registered_fs", &_registered_fs_ops, S_IFREG | 0400}, + {""}, + }; + struct zuf_root_info *zri; + struct inode *root_i; + int err; + + zri = kzalloc(sizeof(*zri), GFP_KERNEL); + if (!zri) { + zuf_err_cnd(silent, + "Not enough memory to allocate zuf_root_info\n"); + return -ENOMEM; + } + + err = simple_fill_super(sb, ZUFR_SUPER_MAGIC, zufr_files); + if (unlikely(err)) { + kfree(zri); + return err; + } + + sb->s_op = 
&zufr_super_operations; + sb->s_fs_info = zri; + zri->sb = sb; + + root_i = sb->s_root->d_inode; + root_i->i_fop = &zufr_file_dir_operations; + root_i->i_op = &zufr_inode_operations; + + mutex_init(&zri->sbl_lock); + INIT_LIST_HEAD(&zri->fst_list); + + err = zufc_zts_init(zri); + if (unlikely(err)) + return err; /* put will be called we have a root */ + + return 0; +} + +static struct dentry *zufr_mount(struct file_system_type *fs_type, + int flags, const char *dev_name, + void *data) +{ + struct dentry *ret = mount_nodev(fs_type, flags, data, zufr_fill_super); + + if (IS_ERR_OR_NULL(ret)) { + zuf_dbg_err("mount_nodev(%s, %s) => %ld\n", dev_name, + (char *)data, PTR_ERR(ret)); + return ret; + } + + zuf_info("zuf_root mount [%s]\n", dev_name); + return ret; +} + +static struct file_system_type zufr_type = { + .owner = THIS_MODULE, + .name = "zuf", + .mount = zufr_mount, + .kill_sb = kill_litter_super, +}; + +/* Create an /sys/fs/zuf/ directory. to mount on */ +static struct kset *zufr_kset; + +int __init zuf_root_init(void) +{ + int err = zuf_init_inodecache(); + + if (unlikely(err)) + return err; + + zufr_kset = kset_create_and_add("zuf", NULL, fs_kobj); + if (!zufr_kset) { + err = -ENOMEM; + goto un_inodecache; + } + + err = register_filesystem(&zufr_type); + if (unlikely(err)) + goto un_kset; + + return 0; + +un_kset: + kset_unregister(zufr_kset); +un_inodecache: + zuf_destroy_inodecache(); + return err; +} + +static void __exit zuf_root_exit(void) +{ + unregister_filesystem(&zufr_type); + kset_unregister(zufr_kset); + zuf_destroy_inodecache(); +} + +module_init(zuf_root_init) +module_exit(zuf_root_exit) diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h new file mode 100644 index 000000000000..919b84f7478f --- /dev/null +++ b/fs/zuf/zuf.h @@ -0,0 +1,116 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * BRIEF DESCRIPTION + * + * Definitions for the ZUF filesystem. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. 
See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#ifndef __ZUF_H +#define __ZUF_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "zus_api.h" + +#include "_pr.h" + +enum zlfs_e_special_file { + zlfs_e_zt = 1, + zlfs_e_mout_thread, + zlfs_e_pmem, + zlfs_e_dpp_buff, + zlfs_e_private_mount, +}; + +struct zuf_special_file { + enum zlfs_e_special_file type; + struct file *file; +}; + +struct zuf_private_mount_info { + struct zuf_special_file zsf; + struct super_block *sb; +}; + +enum { + ZUF_ROOT_INITIALIZING = 0, + ZUF_ROOT_REGISTERING_FS = 1, + ZUF_ROOT_MOUNT_READY = 2, + ZUF_ROOT_SERVER_FAILED = 3, /* server crashed unexpectedly */ +}; + +/* This is the zuf-root.c mini filesystem */ +struct zuf_root_info { + #define SBL_INC 64 + struct sb_is_list { + uint num; + uint max; + struct super_block **array; + } sbl; + struct mutex sbl_lock; + + ulong next_ino; + + /* The definition of _ztp is private to zuf-core.c */ + struct zuf_threads_pool *_ztp; + + struct super_block *sb; + struct list_head fst_list; + int state; +}; + +static inline struct zuf_root_info *ZRI(struct super_block *sb) +{ + struct zuf_root_info *zri = sb->s_fs_info; + + WARN_ON(zri->sb != sb); + return zri; +} + +struct zuf_fs_type { + struct file_system_type vfs_fst; + struct zus_fs_info *zus_zfi; + struct register_fs_info rfi; + struct zuf_root_info *zri; + + struct list_head list; +}; + +static inline void zuf_add_fs_type(struct zuf_root_info *zri, + struct zuf_fs_type *zft) +{ + /* Unlocked for now only one mount-thread with zus */ + list_add(&zft->list, &zri->fst_list); +} + +/* + * ZUF per-inode data in memory + */ +struct zuf_inode_info { + struct inode vfs_inode; +}; + +static inline struct zuf_inode_info *ZUII(struct inode *inode) +{ + return container_of(inode, struct zuf_inode_info, vfs_inode); +} + +/* Keep this include last thing in file */ +#include "_extern.h" + +#endif /* __ZUF_H */ diff --git 
a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index 069153fc0b96..f293e03460be 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -93,4 +93,40 @@ #endif /* ndef __KERNEL__ */ +struct zufs_ioc_hdr { + __s32 err; /* IN/OUT must be first */ + __u16 in_len; /* How much to be copied *to* zus */ + __u16 out_max; /* Max receive buffer at dispatch caller */ + __u16 out_start;/* Start of output parameters (to caller) */ + __u16 out_len; /* How much to be copied *from* zus to caller */ + /* can be modified by zus */ + __u16 operation;/* One of e_zufs_operation */ + __u16 flags; /* e_zufs_hdr_flags bit flags */ + __u32 offset; /* Start of user buffer in ZT mmap */ + __u32 len; /* Len of user buffer in ZT mmap */ +}; + +struct register_fs_info { + char fsname[16]; /* Only 4 chars and a NUL please */ + __u32 FS_magic; /* This is the FS's version && magic */ + __u32 FS_ver_major; /* on disk, not the zuf-to-zus version*/ + __u32 FS_ver_minor; /* (See also struct md_dev_table) */ + __u32 notused; + + __u64 dt_offset; + __u64 s_maxbytes; + __u32 s_time_gran; + __u32 def_mode; +}; + +/* Register FS */ +/* A cookie from user-mode given in register_fs_info */ +struct zus_fs_info; +struct zufs_ioc_register_fs { + struct zufs_ioc_hdr hdr; + struct zus_fs_info *zus_zfi; + struct register_fs_info rfi; +}; +#define ZU_IOC_REGISTER_FS _IOWR('Z', 10, struct zufs_ioc_register_fs) + #endif /* _LINUX_ZUFS_API_H */ From patchwork Thu Sep 26 02:07:14 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boaz Harrosh X-Patchwork-Id: 11161805 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 129BB924 for ; Thu, 26 Sep 2019 02:11:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id BC276222BD for ; Thu, 26 Sep 2019 02:11:18 +0000 (UTC) 
Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=plexistor-com.20150623.gappssmtp.com header.i=@plexistor-com.20150623.gappssmtp.com header.b="BcPswcjb" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730204AbfIZCLS (ORCPT ); Wed, 25 Sep 2019 22:11:18 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:35157 "EHLO mail-wm1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727403AbfIZCLS (ORCPT ); Wed, 25 Sep 2019 22:11:18 -0400 Received: by mail-wm1-f66.google.com with SMTP id y21so748340wmi.0 for ; Wed, 25 Sep 2019 19:11:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=plexistor-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=sxjAHwEwDNiMy6S90d8rh9EFaxyVFbJePkiNhcl+5HA=; b=BcPswcjb50Br5OlfxRdtNpv1QIQkjG8k/U2/vFNClUe8uBOGIKJ3QtPK4Fm8gGfYrT RYcGPL6GbFACaD51AcAMXRxoFrp4y7n+m+9/7unvkbiy+XF1AY3gNw+zeOfs7hqzOLvA rJTPzGP+Im4WnEZEG3oHP4xfFIp24B3GNJEQ+rLHayEEtqkSrzFcBYrbRGqh76bI4wvz M698wm0plKVBbrLMAJIgbwV24o3Vn/COpDaf/Y4ZprHweDYIZqR2LRV0PCZYqvQsmMeF eZuw7fHDlA+N6TE7KSjxNx7KRUQ1k/LzM5JmZr+geHeZxmHUU8FfKQqAzKSg+XLWrKTD yxKw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=sxjAHwEwDNiMy6S90d8rh9EFaxyVFbJePkiNhcl+5HA=; b=WvWxx5h8e/FlmeTMaS12NOB9+lablxKd7ZK14cDHFYlPyU5Xs3iduKutk3LuGL3XEV d9ayrrYCbR/pwwoWyPb6/09/zdQ6Pi/g8L+Sd66tL8EgMZFHIUE0kYgkcWHX92iNclUa 1KA3yVIp5jfOCj36kIglQ0A+fdxkjux9zaw8yKsyJmZqrGMIBYHCbOyjhekiV72jrZas tnHqAk2wnxQXZhNgSWOd+h3VxLLBB1p/Z5dn3FSDJYXTblTvGnNGRlF6a/5ArTdvYff2 3v/Iy5aZKCF+49GO5nVDzMEngoWcYEixHCiYbxn988D3IMUv9f4sMeeJQB1OgnYwDClO /L+A== X-Gm-Message-State: APjAAAXyLxOuahbC3kn5y4gfw1xI/lvKdmslwbh2NZmOMSQ3RWWdpSXg 3bxeLE/W7L/9sF6+eGn+vGvcZkBGEGw= X-Google-Smtp-Source: 
From: Boaz Harrosh To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams Subject: [PATCH 05/16] zuf: zuf-core The ZTs Date: Thu, 26 Sep 2019 05:07:14 +0300 Message-Id: <20190926020725.19601-6-boazh@netapp.com> In-Reply-To: <20190926020725.19601-1-boazh@netapp.com> References: <20190926020725.19601-1-boazh@netapp.com> X-Mailing-List: linux-fsdevel@vger.kernel.org zuf-core establishes the communication channels with the ZUS User Mode Server. This patch contains the core communication mechanics, which is the novelty of this project. (See the previously submitted documentation for more info.) Users of this mechanism come later in the patchset. NOTE: The file relay.h defines an object called "relay". "Relay" here is in the sense of a relay race, where runners pass the baton from one runner to the next. Here, a thread of an application passes execution to the server thread and back. TODO: In the future we might define a new scheduler object that does the same without passing through the scheduler at all, instead relinquishing the remainder of its time slice to the next thread.
Maybe we can cut another 1/2 a micro off the latency of an IOP (By avoiding locks and atomics) [v2 for Linux v5.3] lin-jump5.3: task_struct cpus_allowed => cpus_mask Signed-off-by: Boaz Harrosh --- fs/zuf/_extern.h | 16 + fs/zuf/_pr.h | 5 + fs/zuf/relay.h | 104 +++++ fs/zuf/zuf-core.c | 1077 ++++++++++++++++++++++++++++++++++++++++++++- fs/zuf/zuf.h | 41 ++ fs/zuf/zus_api.h | 291 ++++++++++++ 6 files changed, 1533 insertions(+), 1 deletion(-) create mode 100644 fs/zuf/relay.h diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index 0e8aa52f1259..1f786fc24b85 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -27,6 +27,22 @@ void zufc_zts_fini(struct zuf_root_info *zri); long zufc_ioctl(struct file *filp, unsigned int cmd, ulong arg); int zufc_release(struct inode *inode, struct file *file); int zufc_mmap(struct file *file, struct vm_area_struct *vma); +const char *zuf_op_name(enum e_zufs_operation op); + +int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi, + enum e_mount_operation operation, + struct zufs_ioc_mount *zim); + +int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo); +static inline +int zufc_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr, + struct page **pages, uint nump) +{ + struct zuf_dispatch_op zdo; + + zuf_dispatch_init(&zdo, hdr, pages, nump); + return __zufc_dispatch(zri, &zdo); +} /* zuf-root.c */ int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs); diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h index 51924b6bd2a5..2cdb0806687b 100644 --- a/fs/zuf/_pr.h +++ b/fs/zuf/_pr.h @@ -34,6 +34,11 @@ } while (0) #define zuf_info(s, args ...) pr_info("~info~ " s, ## args) +#define zuf_err_dispatch(sb, s, args ...) \ + do { if (zuf_fst(sb)->zri->state != ZUF_ROOT_SERVER_FAILED) \ + pr_err("[%s:%d] " s, __func__, __LINE__, ## args); \ + } while (0) + #define zuf_chan_debug(c, s, args...) 
pr_debug(c " [%s:%d] " s, __func__, \ __LINE__, ## args) diff --git a/fs/zuf/relay.h b/fs/zuf/relay.h new file mode 100644 index 000000000000..4cf642e177cd --- /dev/null +++ b/fs/zuf/relay.h @@ -0,0 +1,104 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Relay scheduler-object Header file. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + */ + +#ifndef __RELAY_H__ +#define __RELAY_H__ + +/* ~~~~ Relay ~~~~ */ +struct relay { + wait_queue_head_t fss_wq; + bool fss_wakeup; + bool fss_waiting; + + wait_queue_head_t app_wq; + bool app_wakeup; + bool app_waiting; + + cpumask_t cpus_allowed; +}; + +static inline void relay_init(struct relay *relay) +{ + init_waitqueue_head(&relay->fss_wq); + init_waitqueue_head(&relay->app_wq); +} + +static inline bool relay_is_app_waiting(struct relay *relay) +{ + return relay->app_waiting; +} + +static inline void relay_app_wakeup(struct relay *relay) +{ + relay->app_waiting = false; + + relay->app_wakeup = true; + wake_up(&relay->app_wq); +} + +static inline int __relay_fss_wait(struct relay *relay, bool keep_locked) +{ + relay->fss_waiting = !keep_locked; + relay->fss_wakeup = false; + return wait_event_interruptible(relay->fss_wq, relay->fss_wakeup); +} + +static inline int relay_fss_wait(struct relay *relay) +{ + return __relay_fss_wait(relay, false); +} + +static inline bool relay_is_fss_waiting_grab(struct relay *relay) +{ + if (relay->fss_waiting) { + relay->fss_waiting = false; + return true; + } + return false; +} + +static inline void relay_fss_wakeup(struct relay *relay) +{ + relay->fss_wakeup = true; + wake_up(&relay->fss_wq); +} + +static inline int relay_fss_wakeup_app_wait(struct relay *relay) +{ + relay->app_waiting = true; + + relay_fss_wakeup(relay); + + relay->app_wakeup = false; + + return wait_event_interruptible(relay->app_wq, relay->app_wakeup); +} + +static inline +void 
relay_fss_wakeup_app_wait_spin(struct relay *relay, spinlock_t *spinlock) +{ + relay->app_waiting = true; + + relay_fss_wakeup(relay); + + relay->app_wakeup = false; + spin_unlock(spinlock); + + wait_event(relay->app_wq, relay->app_wakeup); +} + +static inline void relay_fss_wakeup_app_wait_cont(struct relay *relay) +{ + wait_event(relay->app_wq, relay->app_wakeup); +} + +#endif /* ifndef __RELAY_H__ */ diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index c9bb31f75bed..60f0d3ffe562 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -18,23 +18,884 @@ #include #include #include +#include +#include #include "zuf.h" +#include "relay.h" + +enum { INITIAL_ZT_CHANNELS = 3 }; + +struct zufc_thread { + struct zuf_special_file hdr; + struct relay relay; + struct vm_area_struct *vma; + int no; + int chan; + + /* Kernel side allocated IOCTL buffer */ + struct vm_area_struct *opt_buff_vma; + void *opt_buff; + ulong max_zt_command; + + /* Next operation*/ + struct zuf_dispatch_op *zdo; +}; + +struct zuf_threads_pool { + struct __mount_thread_info { + struct zuf_special_file zsf; + spinlock_t lock; + struct relay relay; + struct zufs_ioc_mount *zim; + } mount; + + uint _max_zts; + uint _max_channels; + /* array of pcp_arrays */ + struct zufc_thread *_all_zt[ZUFS_MAX_ZT_CHANNELS]; +}; + +/* ~~~~ some helpers ~~~~ */ +const char *zuf_op_name(enum e_zufs_operation op) +{ +#define CASE_ENUM_NAME(e) case e: return #e + switch (op) { + CASE_ENUM_NAME(ZUFS_OP_NULL); + CASE_ENUM_NAME(ZUFS_OP_BREAK); + case ZUFS_OP_MAX_OPT: + default: + return "UNKNOWN"; + } +} + +static inline ulong _zt_pr_no(struct zufc_thread *zt) +{ + /* So in hex it will be channel as first nibble and cpu as 3rd and on */ + return ((ulong)zt->no << 8) | zt->chan; +} + +static struct zufc_thread *_zt_from_cpu(struct zuf_root_info *zri, + int cpu, uint chan) +{ + return per_cpu_ptr(zri->_ztp->_all_zt[chan], cpu); +} + +static struct zufc_thread *_zt_from_f_private(struct file *file) +{ + struct 
zuf_special_file *zsf = file->private_data; + + WARN_ON(zsf->type != zlfs_e_zt); + return container_of(zsf, struct zufc_thread, hdr); +} + +/* ~~~~ init/ fini ~~~~ */ +static int _alloc_zts_channel(struct zuf_root_info *zri, int channel) +{ + zri->_ztp->_all_zt[channel] = alloc_percpu_gfp(struct zufc_thread, + GFP_KERNEL | __GFP_ZERO); + if (unlikely(!zri->_ztp->_all_zt[channel])) { + zuf_err("!!! alloc_percpu channel=%d failed\n", channel); + return -ENOMEM; + } + return 0; +} int zufc_zts_init(struct zuf_root_info *zri) { + int c; + + zri->_ztp = kcalloc(1, sizeof(struct zuf_threads_pool), GFP_KERNEL); + if (unlikely(!zri->_ztp)) + return -ENOMEM; + + spin_lock_init(&zri->_ztp->mount.lock); + relay_init(&zri->_ztp->mount.relay); + + zri->_ztp->_max_zts = num_possible_cpus(); + zri->_ztp->_max_channels = INITIAL_ZT_CHANNELS; + + for (c = 0; c < INITIAL_ZT_CHANNELS; ++c) { + int err = _alloc_zts_channel(zri, c); + + if (unlikely(err)) + return err; + } + return 0; } void zufc_zts_fini(struct zuf_root_info *zri) { + int c; + + /* Always safe/must call zufc_zts_fini */ + if (!zri->_ztp) + return; + + for (c = 0; c < zri->_ztp->_max_channels; ++c) { + if (zri->_ztp->_all_zt[c]) + free_percpu(zri->_ztp->_all_zt[c]); + } + kfree(zri->_ztp); + zri->_ztp = NULL; +} + +/* ~~~~ mounting ~~~~*/ +int __zufc_dispatch_mount(struct zuf_root_info *zri, + enum e_mount_operation operation, + struct zufs_ioc_mount *zim) +{ + struct __mount_thread_info *zmt = &zri->_ztp->mount; + + zim->hdr.operation = operation; + for (;;) { + bool fss_waiting; + + spin_lock(&zmt->lock); + + if (unlikely(!zmt->zsf.file)) { + spin_unlock(&zmt->lock); + zuf_err("Server not up\n"); + zim->hdr.err = -EIO; + return zim->hdr.err; + } + + fss_waiting = relay_is_fss_waiting_grab(&zmt->relay); + if (fss_waiting) + break; + /* in case of break above spin_unlock is done inside + * relay_fss_wakeup_app_wait + */ + + spin_unlock(&zmt->lock); + + /* It is OK to wait if user storms mounts */ + 
zuf_dbg_verbose("waiting\n"); + msleep(100); + } + + zmt->zim = zim; + relay_fss_wakeup_app_wait_spin(&zmt->relay, &zmt->lock); + + if (zim->hdr.err > 0) { + zuf_err("[%s] Bad Server RC not negative => %d\n", + zuf_op_name(zim->hdr.operation), zim->hdr.err); + zim->hdr.err = -EBADRQC; + } + return zim->hdr.err; +} + +int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi, + enum e_mount_operation operation, + struct zufs_ioc_mount *zim) +{ + zim->hdr.out_len = sizeof(*zim); + zim->hdr.in_len = sizeof(*zim); + if (operation == ZUFS_M_MOUNT || operation == ZUFS_M_REMOUNT) + zim->hdr.in_len += zim->zmi.po.mount_options_len; + zim->zmi.zus_zfi = zus_zfi; + zim->zmi.num_cpu = zri->_ztp->_max_zts; + zim->zmi.num_channels = zri->_ztp->_max_channels; + + return __zufc_dispatch_mount(zri, operation, zim); +} + +static int _zu_mount(struct file *file, void *parg) +{ + struct super_block *sb = file->f_inode->i_sb; + struct zuf_root_info *zri = ZRI(sb); + struct __mount_thread_info *zmt = &zri->_ztp->mount; + bool waiting_for_reply; + struct zufs_ioc_mount *zim; + ulong cp_ret; + int err; + + spin_lock(&zmt->lock); + + if (unlikely(!file->private_data)) { + /* First time register this file as the mount-thread owner */ + zmt->zsf.type = zlfs_e_mout_thread; + zmt->zsf.file = file; + file->private_data = &zmt->zsf; + zri->state = ZUF_ROOT_MOUNT_READY; + } else if (unlikely(file->private_data != zmt)) { + spin_unlock(&zmt->lock); + zuf_err("Say what?? 
%p != %p\n", + file->private_data, zmt); + return -EIO; + } + + zim = zmt->zim; + zmt->zim = NULL; + waiting_for_reply = zim && relay_is_app_waiting(&zmt->relay); + + spin_unlock(&zmt->lock); + + if (waiting_for_reply) { + cp_ret = copy_from_user(zim, parg, zim->hdr.out_len); + if (unlikely(cp_ret)) { + zuf_err("copy_from_user => %ld\n", cp_ret); + zim->hdr.err = -EFAULT; + } + + relay_app_wakeup(&zmt->relay); + } + + /* This gets to sleep until a mount comes */ + err = relay_fss_wait(&zmt->relay); + if (unlikely(err || !zmt->zim)) { + struct zufs_ioc_hdr *hdr = parg; + + /* Released by _zu_break INTER or crash */ + zuf_dbg_zus("_zu_break? %p => %d\n", zmt->zim, err); + put_user(ZUFS_OP_BREAK, &hdr->operation); + put_user(EIO, &hdr->err); + return err; + } + + zim = zmt->zim; + cp_ret = copy_to_user(parg, zim, zim->hdr.in_len); + if (unlikely(cp_ret)) { + err = -EFAULT; + zuf_err("copy_to_user =>%ld\n", cp_ret); + } + return err; +} + +static void zufc_mounter_release(struct file *file) +{ + struct zuf_root_info *zri = ZRI(file->f_inode->i_sb); + struct __mount_thread_info *zmt = &zri->_ztp->mount; + + zuf_dbg_zus("closed fu=%d au=%d fw=%d aw=%d\n", + zmt->relay.fss_wakeup, zmt->relay.app_wakeup, + zmt->relay.fss_waiting, zmt->relay.app_waiting); + + spin_lock(&zmt->lock); + zmt->zsf.file = NULL; + if (relay_is_app_waiting(&zmt->relay)) { + zri->state = ZUF_ROOT_SERVER_FAILED; + zuf_err("server emergency exit while IO\n"); + if (zmt->zim) + zmt->zim->hdr.err = -EIO; + spin_unlock(&zmt->lock); + + relay_app_wakeup(&zmt->relay); + msleep(1000); /* crap */ + } else { + if (zmt->zim) + zmt->zim->hdr.err = 0; + spin_unlock(&zmt->lock); + } +} + +/* ~~~~ ZU_IOC_NUMA_MAP ~~~~ */ +static int _zu_numa_map(struct file *file, void *parg) +{ + struct zufs_ioc_numa_map *numa_map; + int n_nodes = num_possible_nodes(); + uint *nodes_cpu_count; + uint max_cpu_per_node = 0; + uint alloc_size; + int cpu, i, err; + + alloc_size = sizeof(*numa_map) + + (n_nodes * 
sizeof(numa_map->cpu_set_per_node[0])); + + if ((n_nodes > 255) || (alloc_size > PAGE_SIZE)) { + zuf_warn("!!!unexpected big machine with %d nodes alloc_size=0x%x\n", + n_nodes, alloc_size); + return -ENOTSUPP; + } + + nodes_cpu_count = kcalloc(n_nodes, sizeof(uint), GFP_KERNEL); + if (unlikely(!nodes_cpu_count)) + return -ENOMEM; + + numa_map = kzalloc(alloc_size, GFP_KERNEL); + if (unlikely(!numa_map)) { + err = -ENOMEM; + goto out; + } + + numa_map->possible_nodes = num_possible_nodes(); + numa_map->possible_cpus = num_possible_cpus(); + + numa_map->online_nodes = num_online_nodes(); + numa_map->online_cpus = num_online_cpus(); + + for_each_online_cpu(cpu) + set_bit(cpu, numa_map->cpu_set_per_node[cpu_to_node(cpu)].bits); + + for_each_cpu(cpu, cpu_online_mask) { + uint ctn = cpu_to_node(cpu); + uint ncc = ++nodes_cpu_count[ctn]; + + max_cpu_per_node = max(max_cpu_per_node, ncc); + } + + for (i = 1; i < n_nodes; ++i) { + if (nodes_cpu_count[i] && + (nodes_cpu_count[i] != nodes_cpu_count[0])) { + zuf_info("@[%d]=%d Unbalanced CPU sockets @[0]=%d\n", + i, nodes_cpu_count[i], nodes_cpu_count[0]); + numa_map->nodes_not_symmetrical = true; + break; + } + } + + numa_map->max_cpu_per_node = max_cpu_per_node; + + zuf_dbg_verbose( + "possible_nodes=%d possible_cpus=%d online_nodes=%d online_cpus=%d\n", + numa_map->possible_nodes, numa_map->possible_cpus, + numa_map->online_nodes, numa_map->online_cpus); + + err = copy_to_user(parg, numa_map, alloc_size); + kfree(numa_map); +out: + kfree(nodes_cpu_count); + return err; +} + +static void _prep_header_size_op(struct zufs_ioc_hdr *hdr, + enum e_zufs_operation op, int err) +{ + memset(hdr, 0, sizeof(*hdr)); + hdr->operation = op; + hdr->in_len = sizeof(*hdr); + hdr->err = err; +} + +/* ~~~~~ ZT thread operations ~~~~~ */ + +static int _zu_init(struct file *file, void *parg) +{ + struct zufc_thread *zt; + int cpu = smp_processor_id(); + struct zufs_ioc_init zi_init; + int err; + + err = copy_from_user(&zi_init, parg, 
sizeof(zi_init)); + if (unlikely(err)) { + zuf_err("=>%d\n", err); + return err; + } + if (unlikely(zi_init.channel_no >= ZUFS_MAX_ZT_CHANNELS)) { + zuf_err("[%d] channel_no=%d\n", cpu, zi_init.channel_no); + return -EINVAL; + } + + zuf_dbg_zus("[%d] channel=%d\n", cpu, zi_init.channel_no); + + zt = _zt_from_cpu(ZRI(file->f_inode->i_sb), cpu, zi_init.channel_no); + if (unlikely(!zt)) { + zi_init.hdr.err = -ERANGE; + zuf_err("_zt_from_cpu(%d, %d) => %d\n", + cpu, zi_init.channel_no, err); + goto out; + } + + if (unlikely(zt->hdr.file)) { + zi_init.hdr.err = -EINVAL; + zuf_err("[%d] !!! thread already set\n", cpu); + goto out; + } + + relay_init(&zt->relay); + zt->hdr.type = zlfs_e_zt; + zt->hdr.file = file; + zt->no = cpu; + zt->chan = zi_init.channel_no; + + zt->max_zt_command = zi_init.max_command; + zt->opt_buff = vmalloc(zi_init.max_command); + if (unlikely(!zt->opt_buff)) { + zi_init.hdr.err = -ENOMEM; + goto out; + } + + file->private_data = &zt->hdr; +out: + err = copy_to_user(parg, &zi_init, sizeof(zi_init)); + if (err) + zuf_err("=>%d\n", err); + return err; +} + +/* Caller checks that file->private_data != NULL */ +static void zufc_zt_release(struct file *file) +{ + struct zuf_root_info *zri = ZRI(file->f_inode->i_sb); + struct zufc_thread *zt = _zt_from_f_private(file); + + if (unlikely(zt->hdr.file != file)) + zuf_err("What happened zt->file(%p) != file(%p)\n", + zt->hdr.file, file); + + zuf_dbg_zus("[%d] closed fu=%d au=%d fw=%d aw=%d\n", + zt->no, zt->relay.fss_wakeup, zt->relay.app_wakeup, + zt->relay.fss_waiting, zt->relay.app_waiting); + + if (relay_is_app_waiting(&zt->relay)) { + zuf_err("server emergency exit while IO\n"); + + /* NOTE: Do not call _unmap_pages the vma is gone */ + zt->hdr.file = NULL; + + zri->state = ZUF_ROOT_SERVER_FAILED; + + relay_app_wakeup(&zt->relay); + msleep(1000); /* crap */ + } + + vfree(zt->opt_buff); + memset(zt, 0, sizeof(*zt)); +} + +static int _map_pages(struct zufc_thread *zt, struct page **pages, uint nump, + 
bool map_readonly) +{ + int p, err; + + if (!(zt->vma && pages && nump)) + return 0; + + for (p = 0; p < nump; ++p) { + ulong zt_addr = zt->vma->vm_start + p * PAGE_SIZE; + ulong pfn = page_to_pfn(pages[p]); + pfn_t pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV); + vm_fault_t flt; + + if (map_readonly) + flt = vmf_insert_mixed(zt->vma, zt_addr, pfnt); + else + flt = vmf_insert_mixed_mkwrite(zt->vma, zt_addr, pfnt); + err = zuf_flt_to_err(flt); + if (unlikely(err)) { + zuf_err("zuf: remap_pfn_range => %d p=0x%x start=0x%lx\n", + err, p, zt->vma->vm_start); + return err; + } + } + return 0; +} + +static void _unmap_pages(struct zufc_thread *zt, struct page **pages, uint nump) +{ + if (!(zt->vma && zt->zdo && pages && nump)) + return; + + zt->zdo->pages = NULL; + zt->zdo->nump = 0; + + zap_vma_ptes(zt->vma, zt->vma->vm_start, nump * PAGE_SIZE); +} + +static int _copy_outputs(struct zufc_thread *zt, void *arg) +{ + struct zufs_ioc_hdr *hdr = zt->zdo->hdr; + struct zufs_ioc_hdr *user_hdr = zt->opt_buff; + + if (zt->opt_buff_vma->vm_start != (ulong)arg) { + zuf_err("malicious Server\n"); + return -EINVAL; + } + + /* Update on the user out_len and return-code */ + hdr->err = user_hdr->err; + hdr->out_len = user_hdr->out_len; + + if (!hdr->out_len) + return 0; + + if ((hdr->err == -EZUFS_RETRY && zt->zdo->oh) || + (hdr->out_max < hdr->out_len)) { + if (WARN_ON(!zt->zdo->oh)) { + zuf_err("Trouble op(%s) out_max=%d out_len=%d\n", + zuf_op_name(hdr->operation), + hdr->out_max, hdr->out_len); + return -EFAULT; + } + zuf_dbg_zus("[%s] %d %d => %d\n", + zuf_op_name(hdr->operation), + hdr->out_max, hdr->out_len, hdr->err); + return zt->zdo->oh(zt->zdo, zt->opt_buff, zt->max_zt_command); + } else { + void *rply = (void *)hdr + hdr->out_start; + void *from = zt->opt_buff + hdr->out_start; + + memcpy(rply, from, hdr->out_len); + return 0; + } +} + +static int _zu_wait(struct file *file, void *parg) +{ + struct zufc_thread *zt; + bool __chan_is_locked = false; + int err; + + 
zt = _zt_from_f_private(file); + if (unlikely(!zt)) { + zuf_err("Unexpected ZT state\n"); + err = -ERANGE; + goto err; + } + + if (!zt->hdr.file || file != zt->hdr.file) { + zuf_err("fatal\n"); + err = -E2BIG; + goto err; + } + if (unlikely((ulong)parg != zt->opt_buff_vma->vm_start)) { + zuf_err("fatal 2\n"); + err = -EINVAL; + goto err; + } + + if (relay_is_app_waiting(&zt->relay)) { + if (unlikely(!zt->zdo)) { + zuf_err("User has gone...\n"); + err = -E2BIG; + goto err; + } + + /* overflow_handler might decide to execute the parg here at + * zus context and return to server. + * If it also has an error to report to zus it will set + * zdo->hdr->err. EZUS_RETRY_DONE is when that happens. + * In this case pages stay mapped in zt->vma. + */ + err = _copy_outputs(zt, parg); + if (err == EZUF_RETRY_DONE) { + put_user(zt->zdo->hdr->err, (int *)parg); + return 0; + } + + _unmap_pages(zt, zt->zdo->pages, zt->zdo->nump); + + zt->zdo = NULL; + if (unlikely(err)) /* _copy_outputs returned an err */ + goto err; + + relay_app_wakeup(&zt->relay); + } + + err = __relay_fss_wait(&zt->relay, __chan_is_locked); + if (err) + zuf_dbg_err("[%d] relay error: %d\n", zt->no, err); + + if (zt->zdo && zt->zdo->hdr && + zt->zdo->hdr->operation != ZUFS_OP_BREAK && + zt->zdo->hdr->operation < ZUFS_OP_MAX_OPT) { + /* call map here at the zuf thread so we need no locks + * TODO: Currently only ZUFS_OP_WRITE protects user-buffers + * we should have a bit set in zt->zdo->hdr set per operation. + * TODO: Why this does not work? + */ + _map_pages(zt, zt->zdo->pages, zt->zdo->nump, 0); + } else { + /* This Means we were released by _zu_break */ + zuf_dbg_zus("_zu_break? 
=> %d\n", err); + _prep_header_size_op(zt->opt_buff, ZUFS_OP_BREAK, err); + } + + return err; + +err: + put_user(err, (int *)parg); + return err; +} + +static int _try_grab_zt_channel(struct zuf_root_info *zri, int cpu, + struct zufc_thread **ztp) +{ + struct zufc_thread *zt; + int c; + + for (c = 0; ; ++c) { + zt = _zt_from_cpu(zri, cpu, c); + if (unlikely(!zt || !zt->hdr.file)) + break; + + if (relay_is_fss_waiting_grab(&zt->relay)) { + *ztp = zt; + return true; + } + } + + *ztp = _zt_from_cpu(zri, cpu, 0); + return false; +} + +#ifdef CONFIG_ZUF_DEBUG +#define DEBUG_CPU_SWITCH(cpu) \ + do { \ + int cpu2 = smp_processor_id(); \ + if (cpu2 != cpu) \ + zuf_warn("App switched cpu1=%u cpu2=%u\n", \ + cpu, cpu2); \ + } while (0) + +static +int _r_zufs_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo) + +#else /* !CONFIG_ZUF_DEBUG */ +#define DEBUG_CPU_SWITCH(cpu) + +int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo) +#endif /* CONFIG_ZUF_DEBUG */ +{ + struct task_struct *app = get_current(); + struct zufs_ioc_hdr *hdr = zdo->hdr; + int cpu; + struct zufc_thread *zt; + + if (unlikely(zri->state == ZUF_ROOT_SERVER_FAILED)) + return -EIO; + + if (unlikely(hdr->out_len && !hdr->out_max)) { + /* TODO: Complain here and let caller code do this proper */ + hdr->out_max = hdr->out_len; + } + + if (unlikely(zdo->__locked_zt)) { + zt = zdo->__locked_zt; + zdo->__locked_zt = NULL; + + cpu = get_cpu(); + /* FIXME: Very Pedantic need it stay */ + if (unlikely((zt->zdo != zdo) || cpu != zt->no)) { + zuf_warn("[%ld] __locked_zt but zdo(%p != %p) || cpu(%d != %d)\n", + _zt_pr_no(zt), zt->zdo, zdo, cpu, zt->no); + put_cpu(); + goto channel_busy; + } + goto has_channel; + } +channel_busy: + cpu = get_cpu(); + + if (!_try_grab_zt_channel(zri, cpu, &zt)) { + put_cpu(); + + /* If channel was grabbed then maybe a break_all is in progress + * on a different CPU make sure zt->file on this core is + * updated + */ + mb(); + if (unlikely(!zt->hdr.file)) 
{ + zuf_err("[%d] !zt->file\n", cpu); + return -EIO; + } + zuf_dbg_err("[%d] can this be\n", cpu); + /* FIXME: Do something much smarter */ + msleep(10); + if (signal_pending(get_current())) { + zuf_dbg_err("[%d] => EINTR\n", cpu); + return -EINTR; + } + goto channel_busy; + } + + /* lock app to this cpu while waiting */ + cpumask_copy(&zt->relay.cpus_allowed, &app->cpus_mask); + cpumask_copy(&app->cpus_mask, cpumask_of(smp_processor_id())); + + zt->zdo = zdo; + +has_channel: + if (zdo->dh) + zdo->dh(zdo, zt, zt->opt_buff); + else + memcpy(zt->opt_buff, zt->zdo->hdr, zt->zdo->hdr->in_len); + + put_cpu(); + + if (relay_fss_wakeup_app_wait(&zt->relay) == -ERESTARTSYS) { + struct zufs_ioc_hdr *opt_hdr = zt->opt_buff; + + opt_hdr->flags |= ZUFS_H_INTR; + + relay_fss_wakeup_app_wait_cont(&zt->relay); + } + + /* __locked_zt must be kept on same cpu */ + if (!zdo->__locked_zt) + /* restore cpu affinity after wakeup */ + cpumask_copy(&app->cpus_mask, &zt->relay.cpus_allowed); + + DEBUG_CPU_SWITCH(cpu); + + return zt->hdr.file ? 
hdr->err : -EIO; +} + +#ifdef CONFIG_ZUF_DEBUG +#define MAX_ZT_SEC 7 +int __zufc_dispatch(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo) +{ + u64 t1, t2; + int err; + + t1 = ktime_get_ns(); + err = _r_zufs_dispatch(zri, zdo); + t2 = ktime_get_ns(); + + if ((t2 - t1) > MAX_ZT_SEC * NSEC_PER_SEC) + zuf_err("zufc_dispatch(%s, [0x%x-0x%x]) took %lld sec\n", + zuf_op_name(zdo->hdr->operation), zdo->hdr->offset, + zdo->hdr->len, + (t2 - t1) / NSEC_PER_SEC); + + return err; +} +#endif /* def CONFIG_ZUF_DEBUG */ + +/* ~~~ iomap_exec && exec_buffer allocation ~~~ */ + +struct zu_exec_buff { + struct zuf_special_file hdr; + struct vm_area_struct *vma; + void *opt_buff; + ulong alloc_size; +}; + +/* Do some common checks and conversions */ +static inline struct zu_exec_buff *_ebuff_from_file(struct file *file) +{ + struct zu_exec_buff *ebuff = file->private_data; + + if (WARN_ON_ONCE(ebuff->hdr.type != zlfs_e_dpp_buff)) { + zuf_err("Must call ZU_IOC_ALLOC_BUFFER first\n"); + return NULL; + } + + if (WARN_ON_ONCE(ebuff->hdr.file != file)) + return NULL; + + return ebuff; +} + +static int _zu_ebuff_alloc(struct file *file, void *arg) +{ + struct zufs_ioc_alloc_buffer ioc_alloc; + struct zu_exec_buff *ebuff; + int err; + + err = copy_from_user(&ioc_alloc, arg, sizeof(ioc_alloc)); + if (unlikely(err)) { + zuf_err("=>%d\n", err); + return err; + } + + if (ioc_alloc.init_size > ioc_alloc.max_size) + return -EINVAL; + + /* TODO: Easily Support growing */ + /* TODO: Support global pools, also easy */ + if (ioc_alloc.pool_no || ioc_alloc.init_size != ioc_alloc.max_size) + return -ENOTSUPP; + + ebuff = kzalloc(sizeof(*ebuff), GFP_KERNEL); + if (unlikely(!ebuff)) + return -ENOMEM; + + ebuff->hdr.type = zlfs_e_dpp_buff; + ebuff->hdr.file = file; + i_size_write(file->f_inode, ioc_alloc.max_size); + ebuff->alloc_size = ioc_alloc.init_size; + ebuff->opt_buff = vmalloc(ioc_alloc.init_size); + if (unlikely(!ebuff->opt_buff)) { + kfree(ebuff); + return -ENOMEM; + } + + 
file->private_data = &ebuff->hdr; + return 0; +} + +static void zufc_ebuff_release(struct file *file) +{ + struct zu_exec_buff *ebuff = _ebuff_from_file(file); + + if (unlikely(!ebuff)) + return; + + vfree(ebuff->opt_buff); + ebuff->hdr.type = 0; + ebuff->hdr.file = NULL; /* for none-dbg Kernels && use-after-free */ + kfree(ebuff); +} + +/* ~~~~ ioctl & release handlers ~~~~ */ +static int _zu_register_fs(struct file *file, void *parg) +{ + struct zufs_ioc_register_fs rfs; + int err; + + err = copy_from_user(&rfs, parg, sizeof(rfs)); + if (unlikely(err)) { + zuf_err("=>%d\n", err); + return err; + } + + err = zufr_register_fs(file->f_inode->i_sb, &rfs); + if (err) + zuf_err("=>%d\n", err); + err = put_user(err, (int *)parg); + return err; +} + +static int _zu_break(struct file *filp, void *parg) +{ + struct zuf_root_info *zri = ZRI(filp->f_inode->i_sb); + int i, c; + + zuf_dbg_core("enter\n"); + mb(); /* TODO how to schedule on all CPU's */ + + for (i = 0; i < zri->_ztp->_max_zts; ++i) { + if (unlikely(!cpu_active(i))) + continue; + for (c = 0; c < zri->_ztp->_max_channels; ++c) { + struct zufc_thread *zt = _zt_from_cpu(zri, i, c); + + if (unlikely(!(zt && zt->hdr.file))) + continue; + relay_fss_wakeup(&zt->relay); + } + } + + if (zri->_ztp->mount.zsf.file) + relay_fss_wakeup(&zri->_ztp->mount.relay); + + zuf_dbg_core("exit\n"); + return 0; } long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg) { + void __user *parg = (void __user *)arg; + switch (cmd) { + case ZU_IOC_REGISTER_FS: + return _zu_register_fs(file, parg); + case ZU_IOC_MOUNT: + return _zu_mount(file, parg); + case ZU_IOC_NUMA_MAP: + return _zu_numa_map(file, parg); + case ZU_IOC_INIT_THREAD: + return _zu_init(file, parg); + case ZU_IOC_WAIT_OPT: + return _zu_wait(file, parg); + case ZU_IOC_ALLOC_BUFFER: + return _zu_ebuff_alloc(file, parg); + case ZU_IOC_BREAK_ALL: + return _zu_break(file, parg); default: - zuf_err("%d\n", cmd); + zuf_err("%d %ld\n", cmd, ZU_IOC_WAIT_OPT); return -ENOTTY; } 
} @@ -47,11 +908,221 @@ int zufc_release(struct inode *inode, struct file *file) return 0; switch (zsf->type) { + case zlfs_e_zt: + zufc_zt_release(file); + return 0; + case zlfs_e_mout_thread: + zufc_mounter_release(file); + return 0; + case zlfs_e_pmem: + /* NOTHING to clean for pmem file yet */ + /* zuf_pmem_release(file);*/ + return 0; + case zlfs_e_dpp_buff: + zufc_ebuff_release(file); + return 0; default: return 0; } } +/* ~~~~ mmap area of app buffers into server ~~~~ */ + +static vm_fault_t zuf_zt_fault(struct vm_fault *vmf) +{ + zuf_err("should not fault pgoff=0x%lx\n", vmf->pgoff); + return VM_FAULT_SIGBUS; +} + +static const struct vm_operations_struct zuf_vm_ops = { + .fault = zuf_zt_fault, +}; + +static int _zufc_zt_mmap(struct file *file, struct vm_area_struct *vma, + struct zufc_thread *zt) +{ + /* VM_PFNMAP for zap_vma_ptes() Careful! */ + vma->vm_flags |= VM_PFNMAP; + vma->vm_ops = &zuf_vm_ops; + + zt->vma = vma; + + zuf_dbg_core( + "[0x%lx] start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n", + _zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_flags, + vma->vm_pgoff); + + return 0; +} + +/* ~~~~ mmap the Kernel allocated IOCTL buffer per ZT ~~~~ */ +static int _opt_buff_mmap(struct vm_area_struct *vma, void *opt_buff, + ulong opt_size) +{ + ulong offset; + + if (!opt_buff) + return -ENOMEM; + + for (offset = 0; offset < opt_size; offset += PAGE_SIZE) { + ulong addr = vma->vm_start + offset; + ulong pfn = vmalloc_to_pfn(opt_buff + offset); + pfn_t pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV); + int err; + + zuf_dbg_verbose("[0x%lx] pfn-0x%lx addr=0x%lx buff=0x%lx\n", + offset, pfn, addr, (ulong)opt_buff + offset); + + err = zuf_flt_to_err(vmf_insert_mixed_mkwrite(vma, addr, pfnt)); + if (unlikely(err)) { + zuf_err("zuf: zuf_insert_mixed_mkwrite => %d offset=0x%lx addr=0x%lx\n", + err, offset, addr); + return err; + } + } + + return 0; +} + +static vm_fault_t zuf_obuff_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = 
vmf->vma; + struct zufc_thread *zt = _zt_from_f_private(vma->vm_file); + long offset = (vmf->pgoff << PAGE_SHIFT) - ZUS_API_MAP_MAX_SIZE; + int err; + + zuf_dbg_core( + "[0x%lx] start=0x%lx end=0x%lx file-start=0x%lx offset=0x%lx\n", + _zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_pgoff, + offset); + + /* if Server overruns its buffer crash it dead */ + if (unlikely((offset < 0) || (zt->max_zt_command < offset))) { + zuf_err("[0x%lx] start=0x%lx end=0x%lx file-start=0x%lx offset=0x%lx\n", + _zt_pr_no(zt), vma->vm_start, + vma->vm_end, vma->vm_pgoff, offset); + return VM_FAULT_SIGBUS; + } + + /* We never released a zus-core.c that does not fault the + * first page first. I want to see if this happens + */ + if (unlikely(offset)) + zuf_warn("Suspicious server activity\n"); + + /* This faults only once at very first access */ + err = _opt_buff_mmap(vma, zt->opt_buff, zt->max_zt_command); + if (unlikely(err)) + return VM_FAULT_SIGBUS; + + return VM_FAULT_NOPAGE; +} + +static const struct vm_operations_struct zuf_obuff_ops = { + .fault = zuf_obuff_fault, +}; + +static int _zufc_obuff_mmap(struct file *file, struct vm_area_struct *vma, + struct zufc_thread *zt) +{ + vma->vm_flags |= VM_PFNMAP; + vma->vm_ops = &zuf_obuff_ops; + + zt->opt_buff_vma = vma; + + zuf_dbg_core( + "[0x%lx] start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n", + _zt_pr_no(zt), vma->vm_start, vma->vm_end, vma->vm_flags, + vma->vm_pgoff); + + return 0; +} + +/* ~~~ */ + +static int zufc_zt_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct zufc_thread *zt = _zt_from_f_private(file); + + /* We have two areas of mmap in this special file. + * 0 to ZUS_API_MAP_MAX_SIZE: + * The first part where app pages are mapped + * into server per operation. 
+ * ZUS_API_MAP_MAX_SIZE of size zuf_root_info->max_zt_command + * Is where we map the per ZT ioctl-buffer, later passed + * to the zus_ioc_wait IOCTL call + */ + if (vma->vm_pgoff == ZUS_API_MAP_MAX_SIZE / PAGE_SIZE) + return _zufc_obuff_mmap(file, vma, zt); + + /* zuf ZT API is very particular about where in its + * special file we communicate + */ + if (unlikely(vma->vm_pgoff)) + return -EINVAL; + + return _zufc_zt_mmap(file, vma, zt); +} + +/* ~~~~ Implementation of the ZU_IOC_ALLOC_BUFFER mmap facility ~~~~ */ + +static vm_fault_t zuf_ebuff_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct zu_exec_buff *ebuff = _ebuff_from_file(vma->vm_file); + long offset = (vmf->pgoff << PAGE_SHIFT); + int err; + + zuf_dbg_core("start=0x%lx end=0x%lx file-start=0x%lx file-off=0x%lx\n", + vma->vm_start, vma->vm_end, vma->vm_pgoff, offset); + + if (unlikely(!ebuff)) + return VM_FAULT_SIGBUS; + + /* if Server overruns its buffer crash it dead */ + if (unlikely((offset < 0) || (ebuff->alloc_size < offset))) { + zuf_err("start=0x%lx end=0x%lx file-start=0x%lx file-off=0x%lx\n", + vma->vm_start, vma->vm_end, vma->vm_pgoff, + offset); + return VM_FAULT_SIGBUS; + } + + /* We never released a zus-core.c that does not fault the + * first page first. 
I want to see if this happens + */ + if (unlikely(offset)) + zuf_warn("Suspicious server activity\n"); + + /* This faults only once at very first access */ + err = _opt_buff_mmap(vma, ebuff->opt_buff, ebuff->alloc_size); + if (unlikely(err)) + return VM_FAULT_SIGBUS; + + return VM_FAULT_NOPAGE; +} + +static const struct vm_operations_struct zuf_ebuff_ops = { + .fault = zuf_ebuff_fault, +}; + +static int zufc_ebuff_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct zu_exec_buff *ebuff = _ebuff_from_file(vma->vm_file); + + if (unlikely(!ebuff)) + return -EINVAL; + + vma->vm_flags |= VM_PFNMAP; + vma->vm_ops = &zuf_ebuff_ops; + + ebuff->vma = vma; + + zuf_dbg_core("start=0x%lx end=0x%lx flags=0x%lx file-start=0x%lx\n", + vma->vm_start, vma->vm_end, vma->vm_flags, vma->vm_pgoff); + + return 0; +} + int zufc_mmap(struct file *file, struct vm_area_struct *vma) { struct zuf_special_file *zsf = file->private_data; @@ -62,6 +1133,10 @@ int zufc_mmap(struct file *file, struct vm_area_struct *vma) } switch (zsf->type) { + case zlfs_e_zt: + return zufc_zt_mmap(file, vma); + case zlfs_e_dpp_buff: + return zufc_ebuff_mmap(file, vma); default: zuf_err("type=%d\n", zsf->type); return -ENOTTY; diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h index 919b84f7478f..05ec08d17d69 100644 --- a/fs/zuf/zuf.h +++ b/fs/zuf/zuf.h @@ -110,6 +110,47 @@ static inline struct zuf_inode_info *ZUII(struct inode *inode) return container_of(inode, struct zuf_inode_info, vfs_inode); } +struct zuf_dispatch_op; +typedef int (*overflow_handler)(struct zuf_dispatch_op *zdo, void *parg, + ulong zt_max_bytes); +typedef void (*dispatch_handler)(struct zuf_dispatch_op *zdo, void *pzt, + void *parg); +struct zuf_dispatch_op { + struct zufs_ioc_hdr *hdr; + union { + struct page **pages; + ulong *bns; + }; + uint nump; + overflow_handler oh; + dispatch_handler dh; + struct super_block *sb; + struct inode *inode; + + /* Don't touch zuf-core only!!! 
*/ + struct zufc_thread *__locked_zt; +}; + +static inline void +zuf_dispatch_init(struct zuf_dispatch_op *zdo, struct zufs_ioc_hdr *hdr, + struct page **pages, uint nump) +{ + memset(zdo, 0, sizeof(*zdo)); + zdo->hdr = hdr; + zdo->pages = pages; zdo->nump = nump; +} + +static inline int zuf_flt_to_err(vm_fault_t flt) +{ + if (likely(flt == VM_FAULT_NOPAGE)) + return 0; + + if (flt == VM_FAULT_OOM) + return -ENOMEM; + + return -EACCES; +} + /* Keep this include last thing in file */ #include "_extern.h" diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index f293e03460be..6b1fbaf24222 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -93,6 +93,123 @@ #endif /* ndef __KERNEL__ */ +/* first available error code after include/linux/errno.h */ +#define EZUFS_RETRY 531 + +/* The below is private to zuf Kernel only. Is not exposed to VFS nor zus + * (defined here to allocate the constant) + */ +#define EZUF_RETRY_DONE 540 + +/* TODO: Someone forgot i_flags & i_version for STATX_ attrs should send a patch + * to add them + */ +#define ZUFS_STATX_FLAGS 0x20000000U +#define ZUFS_STATX_VERSION 0x40000000U + +/* + * Maximal count of links to a file + */ +#define ZUFS_LINK_MAX 32000 +#define ZUFS_MAX_SYMLINK PAGE_SIZE +#define ZUFS_NAME_LEN 255 +#define ZUFS_READAHEAD_PAGES 8 + +/* All device sizes offsets must align on 2M */ +#define ZUFS_ALLOC_MASK (1024 * 1024 * 2 - 1) + +/** + * zufs dual port memory + * This is a special type of offset to either memory or persistent-memory, + * that is designed to be used in the interface mechanism between userspace + * and kernel, and can be accessed by both. 
+ * 3 first bits denote a mem-pool: + * 0 - pmem pool + * 1-6 - established shared pool by a call to zufs_ioc_create_mempool (below) + * 7 - offset into app memory + */ +typedef __u64 __bitwise zu_dpp_t; + +static inline uint zu_dpp_t_pool(zu_dpp_t t) +{ + return t & 0x7; +} + +static inline ulong zu_dpp_t_val(zu_dpp_t t) +{ + return t & ~0x7; +} + +static inline zu_dpp_t zu_enc_dpp_t(ulong v, uint pool) +{ + return v | pool; +} + +static inline ulong zu_dpp_t_bn(zu_dpp_t t) +{ + return t >> 3; +} + +static inline zu_dpp_t zu_enc_dpp_t_bn(ulong v, uint pool) +{ + return zu_enc_dpp_t(v << 3, pool); +} + +/* + * Structure of a ZUS inode. + * This is all the inode fields + */ + +/* See VFS inode flags at fs.h. As ZUFS support flags up to the 7th bit, we + * use higher bits for ZUFS specific flags + */ +#define ZUFS_S_IMMUTABLE 04000 + +/* zus_inode size */ +#define ZUFS_INODE_SIZE 128 /* must be power of two */ + +struct zus_inode { + __le16 i_flags; /* Inode flags */ + __le16 i_mode; /* File mode */ + __le32 i_nlink; /* Links count */ + __le64 i_size; /* Size of data in bytes */ +/* 16*/ struct __zi_on_disk_desc { + __le64 a[2]; + } i_on_disk; /* FS-specific on disc placement */ +/* 32*/ __le64 i_blocks; + __le64 i_mtime; /* Inode/data Modification time */ + __le64 i_ctime; /* Inode/data Changed time */ + __le64 i_atime; /* Data Access time */ +/* 64 - cache-line boundary */ + __le64 i_ino; /* Inode number */ + __le32 i_uid; /* Owner Uid */ + __le32 i_gid; /* Group Id */ + __le64 i_xattr; /* FS-specific Extended attribute block */ + __le64 i_generation; /* File version (for NFS) */ +/* 96*/ union NAMELESS(_I_U) { + __le32 i_rdev; /* special-inode major/minor etc ...*/ + u8 i_symlink[32]; /* if i_size < sizeof(i_symlink) */ + __le64 i_sym_dpp; /* Link location if long symlink */ + struct _zu_dir { + __le64 dir_root; + __le64 parent; + } i_dir; + }; + /* Total ZUFS_INODE_SIZE bytes always */ +}; + +/* ~~~~~ ZUFS API ioctl commands ~~~~~ */ +enum { + 
ZUS_API_MAP_MAX_PAGES = 1024, + ZUS_API_MAP_MAX_SIZE = ZUS_API_MAP_MAX_PAGES * PAGE_SIZE, +}; + +/* These go on zufs_ioc_hdr->flags */ +enum e_zufs_hdr_flags { + ZUFS_H_INTR = (1 << 0), + ZUFS_H_HAS_PIGY_PUT = (1 << 1), +}; + struct zufs_ioc_hdr { __s32 err; /* IN/OUT must be first */ __u16 in_len; /* How much to be copied *to* zus */ @@ -129,4 +246,178 @@ struct zufs_ioc_register_fs { }; #define ZU_IOC_REGISTER_FS _IOWR('Z', 10, struct zufs_ioc_register_fs) +/* A cookie from user-mode returned by mount */ +struct zus_sb_info; + +/* zus cookie per inode */ +struct zus_inode_info; + +enum ZUFS_M_FLAGS { + ZUFS_M_PEDANTIC = 0x00000001, + ZUFS_M_EPHEMERAL = 0x00000002, + ZUFS_M_SILENT = 0x00000004, + ZUFS_M_PRIVATE = 0x00000008, +}; + +struct zufs_parse_options { + __u64 mount_flags; + __u32 pedantic; + __u32 mount_options_len; + char mount_options[0]; +}; + +/* These go on zufs_ioc_mount->hdr->operation */ +enum e_mount_operation { + ZUFS_M_MOUNT = 1, + ZUFS_M_UMOUNT, + ZUFS_M_REMOUNT, + ZUFS_M_DDBG_RD, + ZUFS_M_DDBG_WR, +}; + +/* For zufs_mount_info->remount_flags */ +enum e_remount_flags { + ZUFS_REM_WAS_RO = 0x00000001, + ZUFS_REM_WILL_RO = 0x00000002, +}; + +/* FS specific capabilities @zufs_mount_info->fs_caps */ +enum { + ZUFS_FSC_ACL_ON = 0x0001, + ZUFS_FSC_NIO_READS = 0x0002, + ZUFS_FSC_NIO_WRITES = 0x0004, +}; + +struct zufs_mount_info { + /* IN */ + struct zus_fs_info *zus_zfi; + __u64 remount_flags; + __u64 sb_id; + __u16 num_cpu; + __u16 num_channels; + __u32 __pad; + + /* OUT */ + struct zus_sb_info *zus_sbi; + /* mount is also iget of root */ + struct zus_inode_info *zus_ii; + zu_dpp_t _zi; + + /* FS specific info */ + __u32 fs_caps; + __u32 s_blocksize_bits; + + /* IN - mount options, var len must be last */ + struct zufs_parse_options po; +}; + +struct zufs_ddbg_info { + __u64 id; /* IN where to start from, OUT last ID */ + /* IN size of buffer, OUT size of dynamic debug message */ + __u64 len; + char msg[0]; +}; + +/* mount / umount */ +struct 
zufs_ioc_mount { + struct zufs_ioc_hdr hdr; + union { + struct zufs_mount_info zmi; + struct zufs_ddbg_info zdi; + }; +}; +#define ZU_IOC_MOUNT _IOWR('Z', 11, struct zufs_ioc_mount) + +/* pmem */ +struct zufs_cpu_set { + ulong bits[16]; +}; + +struct zufs_ioc_numa_map { + /* Set by zus */ + struct zufs_ioc_hdr hdr; + + __u32 possible_nodes; + __u32 possible_cpus; + __u32 online_nodes; + __u32 online_cpus; + + __u32 max_cpu_per_node; + + /* This indicates that NOT all nodes have @max_cpu_per_node cpus */ + bool nodes_not_symmetrical; + __u8 __pad[19]; /* align cpu_set_per_node to next cache-line */ + + /* Variable size must keep last + * size @possible_nodes + */ + struct zufs_cpu_set cpu_set_per_node[]; +}; +#define ZU_IOC_NUMA_MAP _IOWR('Z', 13, struct zufs_ioc_numa_map) + +/* ZT init */ +enum { ZUFS_MAX_ZT_CHANNELS = 4 }; + +struct zufs_ioc_init { + struct zufs_ioc_hdr hdr; + __u32 channel_no; + __u32 max_command; +}; +#define ZU_IOC_INIT_THREAD _IOWR('Z', 15, struct zufs_ioc_init) + +/* break_all (Server telling kernel to clean) */ +struct zufs_ioc_break_all { + struct zufs_ioc_hdr hdr; +}; +#define ZU_IOC_BREAK_ALL _IOWR('Z', 16, struct zufs_ioc_break_all) + +/* Allocate a special_file that will be a dual-port communication buffer with + * user mode. + * Server will access the buffer via the mmap of this file. + * Kernel will access the file via the valloc() pointer + * + * Some IOCTLs below demand use of this kind of buffer for communication + * TODO: + * pool_no is if we want to associate this buffer onto the 6 possible + * mem-pools per zuf_sbi. So anywhere we have a zu_dpp_t it will mean + * access from this pool. + * If pool_no is zero then it is private to only this file. In this case + * sb_id && zus_sbi are ignored / not needed. 
+ */
+struct zufs_ioc_alloc_buffer {
+	struct zufs_ioc_hdr hdr;
+	/* The ID of the super block received in mount */
+	__u64 sb_id;
+	/* We verify the sb_id validity against zus_sbi */
+	struct zus_sb_info *zus_sbi;
+	/* max size of buffer allowed (size of mmap) */
+	__u32 max_size;
+	/* allocate this much on initial call and set into vma */
+	__u32 init_size;
+
+	/* TODO: These below are now set to ZERO. Need implementation */
+	__u16 pool_no;
+	__u16 flags;
+	__u32 reserved;
+};
+#define ZU_IOC_ALLOC_BUFFER _IOWR('Z', 17, struct zufs_ioc_alloc_buffer)
+
+/* ~~~ zufs_ioc_wait_operation ~~~ */
+struct zufs_ioc_wait_operation {
+	struct zufs_ioc_hdr hdr;
+	/* maximum size is governed by zufs_ioc_init->max_command */
+	char opt_buff[];
+};
+#define ZU_IOC_WAIT_OPT _IOWR('Z', 18, struct zufs_ioc_wait_operation)
+
+/* These are the possible operations sent from Kernel to the Server in the
+ * return of the ZU_IOC_WAIT_OPT.
+ */
+enum e_zufs_operation {
+	ZUFS_OP_NULL = 0,
+	ZUFS_OP_BREAK = 1,	/* Kernel telling Server to exit */
+
+	ZUFS_OP_MAX_OPT,
+};
+
 #endif /* _LINUX_ZUFS_API_H */

From patchwork Thu Sep 26 02:07:15 2019
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 11161807
From: Boaz Harrosh
To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin
Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams
Subject: [PATCH 06/16] zuf: Multi Devices
Date: Thu, 26 Sep 2019 05:07:15 +0300
Message-Id: <20190926020725.19601-7-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>
References: <20190926020725.19601-1-boazh@netapp.com>

ZUFS supports multiple block devices per super_block. This is the
device-handling code. At the end, a single multi_devices (md.h) object
is associated with the mounting super_block.

There are three modes of operation:

* Mount without a device (mount -t FOO none /somepath)

* A single device - The FS stated register_fs_info->dt_offset==-1.
  No checks are made by the Kernel; the single bdev is registered with
  the Kernel's mount_bdev. It is up to the zusFS to check validity.

* Multi devices - The FS stated register_fs_info->dt_offset==X.
  This mode is the main subject of this patch. A single device is given
  on the mount command line. At register_fs_info->dt_offset of this
  device we look for a zufs_dev_table structure. After all the checks we
  look there at the device list and open all devices. Any one of the
  devices may be given on the command line, but they will always be
  opened in DT (Device Table) order.

The Device Table has the notion of two types of bdevs:
T1 devices - pmem devices capable of direct_access
T2 devices - non-direct_access devices

All t1 devices are presented as one linear array, in DT order.
In t1.c we mmap this space for the server to directly access pmem.
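The "one linear array in DT order" works with a gcd-based segment map (see _map_setup() in md.c below): there is one map entry per gcd-of-all-device-sizes worth of blocks, so resolving a global block number to its owning device is a single table lookup. A minimal user-space sketch of that idea, for illustration only (toy_dev, build_map and gcd_ul are made-up names, not part of the patch):

```c
#include <assert.h>

/* Toy model of the DT-ordered linear array: each device contributes
 * `blocks` blocks; map[i] names the device that owns block i * bn_gcd,
 * where bn_gcd is the gcd of all device sizes.
 */
struct toy_dev { unsigned long blocks; };

static unsigned long gcd_ul(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Returns bn_gcd; fills map[] with one device index per segment,
 * walking the devices in DT order like _map_setup() does.
 */
static unsigned long build_map(const struct toy_dev *devs, int ndevs, int *map)
{
	unsigned long bn_gcd = 0, total = 0, bn_end;
	unsigned long i;
	int d = 0;

	for (i = 0; i < (unsigned long)ndevs; ++i) {
		bn_gcd = gcd_ul(bn_gcd, devs[i].blocks); /* gcd(0,x) == x */
		total += devs[i].blocks;
	}

	bn_end = devs[0].blocks;
	for (i = 0; i < total / bn_gcd; ++i) {
		if (i * bn_gcd >= bn_end)
			bn_end += devs[++d].blocks;
		map[i] = d;
	}
	return bn_gcd;
}
```

Looking up the owner of global block bn is then map[bn / bn_gcd]; e.g. with two devices of 4 and 8 blocks, bn_gcd is 4 and the map is {0, 1, 1}.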
(In the proper persistent way.)
[We do not support just any direct_access device; we only support
pmem(s) where the whole device can be addressed by a single
physical/virtual address. This is checked before mount.]

The T2 devices are also grabbed and owned by the super_block. A later
API will enable the Server to write or transfer buffers from T1 to T2
in a very efficient manner. They are also presented as a single linear
array in DT order.

Both kinds of devices are NUMA aware, and the NUMA info is presented
to the zusFS for optimal allocation and access.

[v2] The new gcc compiler does not like the case /* fall through */
comment coming with other text, so split the comment into two lines to
silence the compiler.

[v3] Do not use __packed on interface structures

Signed-off-by: Boaz Harrosh
---
 fs/zuf/Makefile   |   3 +
 fs/zuf/_extern.h  |   6 +
 fs/zuf/md.c       | 742 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/md.h       | 332 +++++++++++++++++++++
 fs/zuf/md_def.h   | 141 +++++++++
 fs/zuf/super.c    |   6 +
 fs/zuf/t1.c       | 136 +++++++++
 fs/zuf/t2.c       | 356 ++++++++++++++++++++++
 fs/zuf/t2.h       |  68 +++++
 fs/zuf/zuf-core.c |  76 +++++
 fs/zuf/zuf.h      |  54 ++++
 fs/zuf/zus_api.h  |  15 +
 12 files changed, 1935 insertions(+)
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/md_def.h
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index b08c08e73faa..a247bd85d9aa 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -10,6 +10,9 @@
 obj-$(CONFIG_ZUFS_FS) += zuf.o
 
+# Infrastructure
+zuf-y += md.o t1.o t2.o
+
 # ZUF core
 zuf-y += zuf-core.o zuf-root.o
 
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 1f786fc24b85..a5929d3d165c 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -54,4 +54,10 @@ void zuf_destroy_inodecache(void);
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
			 const char *dev_name, void *data);
 
+struct super_block *zuf_sb_from_id(struct
zuf_root_info *zri, __u64 sb_id,
+				   struct zus_sb_info *zus_sbi);
+
+/* t1.c */
+int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
+
 #endif /*ndef __ZUF_EXTERN_H__*/

diff --git a/fs/zuf/md.c b/fs/zuf/md.c
new file mode 100644
index 000000000000..c4778b4fdff8
--- /dev/null
+++ b/fs/zuf/md.c
@@ -0,0 +1,742 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Multi-Device operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh
+ *	Sagi Manole
+ */
+
+#include <linux/blkdev.h>
+#include <linux/gcd.h>
+#include <linux/crc16.h>
+#include <linux/slab.h>
+
+#include <linux/uuid.h>
+
+#include "_pr.h"
+#include "md.h"
+#include "t2.h"
+
+const fmode_t _g_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL;
+
+static int _bdev_get_by_path(const char *path, struct block_device **bdev,
+			     void *holder)
+{
+	*bdev = blkdev_get_by_path(path, _g_mode, holder);
+	if (IS_ERR(*bdev)) {
+		int err = PTR_ERR(*bdev);
+
+		*bdev = NULL;
+		return err;
+	}
+	return 0;
+}
+
+static void _bdev_put(struct block_device **bdev)
+{
+	if (*bdev) {
+		blkdev_put(*bdev, _g_mode);
+		*bdev = NULL;
+	}
+}
+
+/* convert uuid to a /dev/ path */
+static char *_uuid_path(uuid_le *uuid, char path[PATH_UUID])
+{
+	sprintf(path, "/dev/disk/by-uuid/%pUb", uuid);
+	return path;
+}
+
+static int _bdev_get_by_uuid(struct block_device **bdev, uuid_le *uuid,
+			     void *holder, bool silent)
+{
+	char path[PATH_UUID];
+	int err;
+
+	_uuid_path(uuid, path);
+	err = _bdev_get_by_path(path, bdev, holder);
+	if (unlikely(err))
+		md_err_cnd(silent, "failed to get device path=%s =>%d\n",
+			   path, err);
+
+	return err;
+}
+
+short md_calc_csum(struct md_dev_table *mdt)
+{
+	uint n = MDT_STATIC_SIZE(mdt) - sizeof(mdt->s_sum);
+
+	return crc16(~0, (__u8 *)&mdt->s_version, n);
+}
+
+/* ~~~~~~~ mdt related functions ~~~~~~~ */
+
+int md_t2_mdt_read(struct multi_devices *md, int index,
+		   struct md_dev_table *mdt)
+{
+	int err = t2_readpage(md, index, virt_to_page(mdt));
+
+	if (err)
md_dbg_verbose("!!! t2_readpage err=%d\n", err); + + return err; +} + +static int _t2_mdt_read(struct block_device *bdev, struct md_dev_table *mdt) +{ + int err; + /* t2 interface works for all block devices */ + struct multi_devices *md; + struct md_dev_info *mdi; + + md = kzalloc(sizeof(*md), GFP_KERNEL); + if (unlikely(!md)) + return -ENOMEM; + + md->t2_count = 1; + md->devs[0].bdev = bdev; + mdi = &md->devs[0]; + md->t2a.map = &mdi; + md->t2a.bn_gcd = 1; /*Does not matter only must not be zero */ + + err = md_t2_mdt_read(md, 0, mdt); + + kfree(md); + return err; +} + +int md_t2_mdt_write(struct multi_devices *md, struct md_dev_table *mdt) +{ + int i, err = 0; + + for (i = 0; i < md->t2_count; ++i) { + ulong bn = md_o2p(md_t2_dev(md, i)->offset); + + mdt->s_dev_list.id_index = mdt->s_dev_list.t1_count + i; + mdt->s_sum = cpu_to_le16(md_calc_csum(mdt)); + + err = t2_writepage(md, bn, virt_to_page(mdt)); + if (err) + md_dbg_verbose("!!! t2_writepage err=%d\n", err); + } + + return err; +} + +static bool _csum_mismatch(struct md_dev_table *mdt, int silent) +{ + ushort crc = md_calc_csum(mdt); + + if (mdt->s_sum == cpu_to_le16(crc)) + return false; + + md_warn_cnd(silent, "expected(0x%x) != s_sum(0x%x)\n", + cpu_to_le16(crc), mdt->s_sum); + return true; +} + +static bool _uuid_le_equal(uuid_le *uuid1, uuid_le *uuid2) +{ + return (memcmp(uuid1, uuid2, sizeof(uuid_le)) == 0); +} + +static bool _mdt_compare_uuids(struct md_dev_table *mdt, + struct md_dev_table *main_mdt, int silent) +{ + int i, dev_count; + + if (!_uuid_le_equal(&mdt->s_uuid, &main_mdt->s_uuid)) { + md_warn_cnd(silent, "mdt uuid (%pUb != %pUb) mismatch\n", + &mdt->s_uuid, &main_mdt->s_uuid); + return false; + } + + dev_count = mdt->s_dev_list.t1_count + mdt->s_dev_list.t2_count + + mdt->s_dev_list.rmem_count; + for (i = 0; i < dev_count; ++i) { + struct md_dev_id *dev_id1 = &mdt->s_dev_list.dev_ids[i]; + struct md_dev_id *dev_id2 = &main_mdt->s_dev_list.dev_ids[i]; + + if 
(!_uuid_le_equal(&dev_id1->uuid, &dev_id2->uuid)) { + md_warn_cnd(silent, + "mdt dev %d uuid (%pUb != %pUb) mismatch\n", + i, &dev_id1->uuid, &dev_id2->uuid); + return false; + } + + if (dev_id1->blocks != dev_id2->blocks) { + md_warn_cnd(silent, + "mdt dev %d blocks (0x%llx != 0x%llx) mismatch\n", + i, le64_to_cpu(dev_id1->blocks), + le64_to_cpu(dev_id2->blocks)); + return false; + } + } + + return true; +} + +bool md_mdt_check(struct md_dev_table *mdt, + struct md_dev_table *main_mdt, struct block_device *bdev, + struct mdt_check *mc) +{ + struct md_dev_id *dev_id; + ulong bdev_size, super_size; + + BUILD_BUG_ON(MDT_STATIC_SIZE(mdt) & (SMP_CACHE_BYTES - 1)); + + /* Do sanity checks on the superblock */ + if (le32_to_cpu(mdt->s_magic) != mc->magic) { + md_warn_cnd(mc->silent, + "Magic error in super block: please run fsck\n"); + return false; + } + + if ((mc->major_ver != mdt_major_version(mdt)) || + (mc->minor_ver < mdt_minor_version(mdt))) { + md_warn_cnd(mc->silent, + "mkfs-mount versions mismatch! 
%d.%d != %d.%d\n", + mdt_major_version(mdt), mdt_minor_version(mdt), + mc->major_ver, mc->minor_ver); + return false; + } + + if (_csum_mismatch(mdt, mc->silent)) { + md_warn_cnd(mc->silent, + "crc16 error in super block: please run fsck\n"); + return false; + } + + if (main_mdt) { + if (mdt->s_dev_list.t1_count != main_mdt->s_dev_list.t1_count) { + md_warn_cnd(mc->silent, "mdt t1 count mismatch\n"); + return false; + } + + if (mdt->s_dev_list.t2_count != main_mdt->s_dev_list.t2_count) { + md_warn_cnd(mc->silent, "mdt t2 count mismatch\n"); + return false; + } + + if (mdt->s_dev_list.rmem_count != + main_mdt->s_dev_list.rmem_count) { + md_warn_cnd(mc->silent, + "mdt rmem dev count mismatch\n"); + return false; + } + + if (!_mdt_compare_uuids(mdt, main_mdt, mc->silent)) + return false; + } + + /* check alignment */ + dev_id = &mdt->s_dev_list.dev_ids[mdt->s_dev_list.id_index]; + super_size = md_p2o(__dev_id_blocks(dev_id)); + if (unlikely(!super_size || super_size & mc->alloc_mask)) { + md_warn_cnd(mc->silent, "super_size(0x%lx) ! 2_M aligned\n", + super_size); + return false; + } + + if (!bdev) + return true; + + /* check t1 device size */ + bdev_size = i_size_read(bdev->bd_inode); + if (unlikely(super_size > bdev_size)) { + md_warn_cnd(mc->silent, + "bdev_size(0x%lx) too small expected 0x%lx\n", + bdev_size, super_size); + return false; + } else if (unlikely(super_size < bdev_size)) { + md_dbg_err("Note mdt->size=(0x%lx) < bdev_size(0x%lx)\n", + super_size, bdev_size); + } + + return true; +} + +int md_set_sb(struct multi_devices *md, struct block_device *s_bdev, + void *sb, int silent) +{ + struct md_dev_info *main_mdi = md_dev_info(md, md->dev_index); + int i; + + main_mdi->bdev = s_bdev; + + for (i = 0; i < md->t1_count + md->t2_count; ++i) { + struct md_dev_info *mdi; + + if (i == md->dev_index) + continue; + + mdi = md_dev_info(md, i); + if (mdi->bdev->bd_super && (mdi->bdev->bd_super != sb)) { + md_warn_cnd(silent, + "!!! 
%s already mounted on a different FS => -EBUSY\n", + _bdev_name(mdi->bdev)); + return -EBUSY; + } + + mdi->bdev->bd_super = sb; + } + + return 0; +} + +void md_fini(struct multi_devices *md, bool put_all) +{ + struct md_dev_info *main_mdi; + int i; + + if (unlikely(!md)) + return; + + main_mdi = md_dev_info(md, md->dev_index); + kfree(md->t2a.map); + kfree(md->t1a.map); + + for (i = 0; i < md->t1_count + md->t2_count; ++i) { + struct md_dev_info *mdi = md_dev_info(md, i); + + if (i < md->t1_count) + md_t1_info_fini(mdi); + if (!mdi->bdev || i == md->dev_index) + continue; + mdi->bdev->bd_super = NULL; + _bdev_put(&mdi->bdev); + } + + if (put_all) + _bdev_put(&main_mdi->bdev); + else + /* Main dev is GET && PUT by VFS. Only stop pointing to it */ + main_mdi->bdev = NULL; + + kfree(md); +} + + +/* ~~~~~~~ Pre-mount operations ~~~~~~~ */ + +static int _get_device(struct block_device **bdev, const char *dev_name, + uuid_le *uuid, void *holder, int silent, + bool *bind_mount) +{ + int err; + + if (dev_name) + err = _bdev_get_by_path(dev_name, bdev, holder); + else + err = _bdev_get_by_uuid(bdev, uuid, holder, silent); + + if (unlikely(err)) { + md_err_cnd(silent, + "failed to get device dev_name=%s uuid=%pUb err=%d\n", + dev_name, uuid, err); + return err; + } + + if (bind_mount && (*bdev)->bd_super && + (*bdev)->bd_super->s_bdev == *bdev) + *bind_mount = true; + + return 0; +} + +static int _init_dev_info(struct md_dev_info *mdi, struct md_dev_id *id, + int index, u64 offset, + struct md_dev_table *main_mdt, + struct mdt_check *mc, bool t1_dev, + int silent) +{ + struct md_dev_table *mdt = NULL; + bool mdt_alloc = false; + int err = 0; + + if (mdi->bdev == NULL) { + err = _get_device(&mdi->bdev, NULL, &id->uuid, mc->holder, + silent, NULL); + if (unlikely(err)) + return err; + } + + mdi->offset = offset; + mdi->size = md_p2o(__dev_id_blocks(id)); + mdi->index = index; + + if (t1_dev) { + struct page *dev_page; + int end_of_dev_nid; + + err = md_t1_info_init(mdi, 
silent); + if (unlikely(err)) + return err; + + if ((ulong)mdi->t1i.virt_addr & mc->alloc_mask) { + md_warn_cnd(silent, "!!! unaligned device %s\n", + _bdev_name(mdi->bdev)); + return -EINVAL; + } + + if (!__pfn_to_section(mdi->t1i.phys_pfn)) { + md_err_cnd(silent, "Intel does not like pages...\n"); + return -EINVAL; + } + + mdt = mdi->t1i.virt_addr; + + mdi->t1i.pgmap = virt_to_page(mdt)->pgmap; + dev_page = pfn_to_page(mdi->t1i.phys_pfn); + mdi->nid = page_to_nid(dev_page); + end_of_dev_nid = page_to_nid(dev_page + md_o2p(mdi->size - 1)); + + if (mdi->nid != end_of_dev_nid) + md_warn("pmem crosses NUMA boundaries"); + } else { + mdt = (void *)__get_free_page(GFP_KERNEL); + if (unlikely(!mdt)) { + md_dbg_err("!!! failed to alloc page\n"); + return -ENOMEM; + } + + mdt_alloc = true; + err = _t2_mdt_read(mdi->bdev, mdt); + if (unlikely(err)) { + md_err_cnd(silent, "failed to read mdt from t2 => %d\n", + err); + goto out; + } + mdi->nid = __dev_id_nid(id); + } + + if (!md_mdt_check(mdt, main_mdt, mdi->bdev, mc)) { + md_err_cnd(silent, "device %s failed integrity check\n", + _bdev_name(mdi->bdev)); + err = -EINVAL; + goto out; + } + + return 0; + +out: + if (mdt_alloc) + free_page((ulong)mdt); + return err; +} + +static int _map_setup(struct multi_devices *md, ulong blocks, int dev_start, + struct md_dev_larray *larray) +{ + ulong map_size, bn_end; + int i, dev_index = dev_start; + + map_size = blocks / larray->bn_gcd; + larray->map = kcalloc(map_size, sizeof(*larray->map), GFP_KERNEL); + if (!larray->map) { + md_dbg_err("failed to allocate dev map\n"); + return -ENOMEM; + } + + bn_end = md_o2p(md->devs[dev_index].size); + for (i = 0; i < map_size; ++i) { + if ((i * larray->bn_gcd) >= bn_end) + bn_end += md_o2p(md->devs[++dev_index].size); + larray->map[i] = &md->devs[dev_index]; + } + + return 0; +} + +static int _md_init(struct multi_devices *md, struct mdt_check *mc, + struct md_dev_list *dev_list, int silent) +{ + struct md_dev_table *main_mdt = NULL; + u64 
total_size = 0; + int i, err; + + for (i = 0; i < md->t1_count; ++i) { + struct md_dev_info *mdi = md_t1_dev(md, i); + struct md_dev_table *dev_mdt; + + err = _init_dev_info(mdi, &dev_list->dev_ids[i], i, total_size, + main_mdt, mc, true, silent); + if (unlikely(err)) + return err; + + /* apparently gcd(0,X)=X which is nice */ + md->t1a.bn_gcd = gcd(md->t1a.bn_gcd, md_o2p(mdi->size)); + total_size += mdi->size; + + dev_mdt = md_t1_addr(md, i); + if (!main_mdt) + main_mdt = dev_mdt; + + if (mdt_test_option(dev_mdt, MDT_F_SHADOW)) + memcpy(mdi->t1i.virt_addr, + mdi->t1i.virt_addr + mdi->size, mdi->size); + + md_dbg_verbose("dev=%d %pUb %s v=%p pfn=%lu off=%lu size=%lu\n", + i, &dev_list->dev_ids[i].uuid, + _bdev_name(mdi->bdev), dev_mdt, + mdi->t1i.phys_pfn, mdi->offset, mdi->size); + } + + md->t1_blocks = le64_to_cpu(main_mdt->s_t1_blocks); + if (unlikely(md->t1_blocks != md_o2p(total_size))) { + md_err_cnd(silent, + "FS corrupted md->t1_blocks(0x%lx) != total_size(0x%llx)\n", + md->t1_blocks, total_size); + return -EIO; + } + + err = _map_setup(md, le64_to_cpu(main_mdt->s_t1_blocks), 0, &md->t1a); + if (unlikely(err)) + return err; + + md_dbg_verbose("t1 devices=%d total_size=0x%llx segment_map=0x%lx\n", + md->t1_count, total_size, + md_o2p(total_size) / md->t1a.bn_gcd); + + if (md->t2_count == 0) + return 0; + + /* Done with t1. 
Counting t2s */ + total_size = 0; + for (i = 0; i < md->t2_count; ++i) { + struct md_dev_info *mdi = md_t2_dev(md, i); + + err = _init_dev_info(mdi, &dev_list->dev_ids[md->t1_count + i], + md->t1_count + i, total_size, main_mdt, + mc, false, silent); + if (unlikely(err)) + return err; + + /* apparently gcd(0,X)=X which is nice */ + md->t2a.bn_gcd = gcd(md->t2a.bn_gcd, md_o2p(mdi->size)); + total_size += mdi->size; + + md_dbg_verbose("dev=%d %s off=%lu size=%lu\n", i, + _bdev_name(mdi->bdev), mdi->offset, mdi->size); + } + + md->t2_blocks = le64_to_cpu(main_mdt->s_t2_blocks); + if (unlikely(md->t2_blocks != md_o2p(total_size))) { + md_err_cnd(silent, + "FS corrupted md->t2_blocks(0x%lx) != total_size(0x%llx)\n", + md->t2_blocks, total_size); + return -EIO; + } + + err = _map_setup(md, le64_to_cpu(main_mdt->s_t2_blocks), md->t1_count, + &md->t2a); + if (unlikely(err)) + return err; + + md_dbg_verbose("t2 devices=%d total_size=%llu segment_map=%lu\n", + md->t2_count, total_size, + md_o2p(total_size) / md->t2a.bn_gcd); + + return 0; +} + +static int _load_dev_list(struct md_dev_list *dev_list, struct mdt_check *mc, + struct block_device *bdev, const char *dev_name, + int silent) +{ + struct md_dev_table *mdt; + int err; + + mdt = (void *)__get_free_page(GFP_KERNEL); + if (unlikely(!mdt)) { + md_dbg_err("!!! 
failed to alloc page\n"); + return -ENOMEM; + } + + err = _t2_mdt_read(bdev, mdt); + if (unlikely(err)) { + md_err_cnd(silent, "failed to read super block from %s => %d\n", + dev_name, err); + goto out; + } + + if (!md_mdt_check(mdt, NULL, bdev, mc)) { + md_err_cnd(silent, "bad mdt in %s\n", dev_name); + err = -EINVAL; + goto out; + } + + *dev_list = mdt->s_dev_list; + +out: + free_page((ulong)mdt); + return err; +} + +/* md_init - allocates and initializes ready to go multi_devices object + * + * The rule is that if md_init returns error caller must call md_fini always + */ +int md_init(struct multi_devices **ret_md, const char *dev_name, + struct mdt_check *mc, char path[PATH_UUID], const char **dev_path) +{ + struct md_dev_list *dev_list; + struct block_device *bdev; + struct multi_devices *md; + short id_index; + bool bind_mount = false; + int err; + + md = kzalloc(sizeof(*md), GFP_KERNEL); + *ret_md = md; + if (unlikely(!md)) + return -ENOMEM; + + dev_list = kmalloc(sizeof(*dev_list), GFP_KERNEL); + if (unlikely(!dev_list)) + return -ENOMEM; + + err = _get_device(&bdev, dev_name, NULL, mc->holder, mc->silent, + &bind_mount); + if (unlikely(err)) + goto out2; + + err = _load_dev_list(dev_list, mc, bdev, dev_name, mc->silent); + if (unlikely(err)) { + _bdev_put(&bdev); + goto out2; + } + + id_index = le16_to_cpu(dev_list->id_index); + if (bind_mount) { + _bdev_put(&bdev); + md->dev_index = id_index; + goto out; + } + + md->t1_count = le16_to_cpu(dev_list->t1_count); + md->t2_count = le16_to_cpu(dev_list->t2_count); + md->devs[id_index].bdev = bdev; + + if ((id_index != 0)) { + err = _get_device(&md_t1_dev(md, 0)->bdev, NULL, + &dev_list->dev_ids[0].uuid, mc->holder, + mc->silent, &bind_mount); + if (unlikely(err)) + goto out2; + + if (bind_mount) + goto out; + } + + if (md->t2_count) { + int t2_index = md->t1_count; + + /* t2 is the primary device if given in mount, or the first + * mount specified it as primary device + */ + if (id_index != md->t1_count) { + 
err = _get_device(&md_t2_dev(md, 0)->bdev, NULL, + &dev_list->dev_ids[t2_index].uuid, + mc->holder, mc->silent, &bind_mount); + if (unlikely(err)) + goto out2; + + if (bind_mount) + md->dev_index = t2_index; + } + + if (t2_index <= id_index) + md->dev_index = t2_index; + } + +out: + if (md->dev_index != id_index) + *dev_path = _uuid_path(&dev_list->dev_ids[md->dev_index].uuid, + path); + else + *dev_path = dev_name; + + if (!bind_mount) { + err = _md_init(md, mc, dev_list, mc->silent); + if (unlikely(err)) + goto out2; + if (!(mc->private_mnt)) + _bdev_put(&md_dev_info(md, md->dev_index)->bdev); + } else { + md_fini(md, true); + } + +out2: + kfree(dev_list); + + return err; +} + +/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + * PORTING SECTION: + * Below are members that are done differently in different Linux versions. + * So keep separate from code + */ +static int _check_da_ret(struct md_dev_info *mdi, long avail, bool silent) +{ + if (unlikely(avail < (long)mdi->size)) { + if (0 < avail) { + md_warn_cnd(silent, + "Unsupported DAX device %s (range mismatch) => 0x%lx < 0x%lx\n", + _bdev_name(mdi->bdev), avail, mdi->size); + return -ERANGE; + } + md_warn_cnd(silent, "!!! 
%s direct_access return => %ld\n", + _bdev_name(mdi->bdev), avail); + return avail; + } + return 0; +} + +#include + +int md_t1_info_init(struct md_dev_info *mdi, bool silent) +{ + pfn_t a_pfn_t; + void *addr; + long nrpages, avail, pgoff; + int id; + + mdi->t1i.dax_dev = fs_dax_get_by_bdev(mdi->bdev); + if (unlikely(!mdi->t1i.dax_dev)) + return -EOPNOTSUPP; + + id = dax_read_lock(); + + bdev_dax_pgoff(mdi->bdev, 0, PAGE_SIZE, &pgoff); + nrpages = dax_direct_access(mdi->t1i.dax_dev, pgoff, md_o2p(mdi->size), + &addr, &a_pfn_t); + dax_read_unlock(id); + if (unlikely(nrpages <= 0)) { + if (!nrpages) + nrpages = -ERANGE; + avail = nrpages; + } else { + avail = md_p2o(nrpages); + } + + mdi->t1i.virt_addr = addr; + mdi->t1i.phys_pfn = pfn_t_to_pfn(a_pfn_t); + + md_dbg_verbose("0x%lx 0x%lx pgoff=0x%lx\n", + (ulong)mdi->t1i.virt_addr, mdi->t1i.phys_pfn, pgoff); + + return _check_da_ret(mdi, avail, silent); +} + +void md_t1_info_fini(struct md_dev_info *mdi) +{ + fs_put_dax(mdi->t1i.dax_dev); + mdi->t1i.dax_dev = NULL; + mdi->t1i.virt_addr = NULL; +} diff --git a/fs/zuf/md.h b/fs/zuf/md.h new file mode 100644 index 000000000000..15ba7d646544 --- /dev/null +++ b/fs/zuf/md.h @@ -0,0 +1,332 @@ +/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */ +/* + * Multi-Device operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. 
+ * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#ifndef __MD_H__ +#define __MD_H__ + +#include + +#include "md_def.h" + +#ifndef __KERNEL__ +struct page; +struct block_device; +#else +# include +#endif /* ndef __KERNEL__ */ + +struct md_t1_info { + void *virt_addr; +#ifdef __KERNEL__ + ulong phys_pfn; + struct dax_device *dax_dev; + struct dev_pagemap *pgmap; +#endif /*def __KERNEL__*/ +}; + +struct md_t2_info { +#ifndef __KERNEL__ + bool err_read_reported; + bool err_write_reported; +#endif +}; + +struct md_dev_info { + struct block_device *bdev; + ulong size; + ulong offset; + union { + struct md_t1_info t1i; + struct md_t2_info t2i; + }; + int index; + int nid; +}; + +struct md_dev_larray { + ulong bn_gcd; + struct md_dev_info **map; +}; + +#ifndef __KERNEL__ +struct fba { + int fd; void *ptr; + size_t size; + void *orig_ptr; +}; +#endif /*! __KERNEL__*/ + +struct zus_sb_info; +struct multi_devices { + int dev_index; + int t1_count; + int t2_count; + struct md_dev_info devs[MD_DEV_MAX]; + struct md_dev_larray t1a; + struct md_dev_larray t2a; +#ifndef __KERNEL__ + struct zufs_ioc_pmem pmem_info; /* As received from Kernel */ + + void *p_pmem_addr; + int fd; + uint user_page_size; + struct fba pages; + struct zus_sb_info *sbi; +#else + ulong t1_blocks; + ulong t2_blocks; +#endif /*! 
__KERNEL__*/ +}; + +enum md_init_flags { + MD_I_F_PRIVATE = (1UL << 0), +}; + +static inline __u64 md_p2o(ulong bn) +{ + return (__u64)bn << PAGE_SHIFT; +} + +static inline ulong md_o2p(__u64 offset) +{ + return offset >> PAGE_SHIFT; +} + +static inline ulong md_o2p_up(__u64 offset) +{ + return md_o2p(offset + PAGE_SIZE - 1); +} + +static inline struct md_dev_info *md_t1_dev(struct multi_devices *md, int i) +{ + return &md->devs[i]; +} + +static inline struct md_dev_info *md_t2_dev(struct multi_devices *md, int i) +{ + return &md->devs[md->t1_count + i]; +} + +static inline struct md_dev_info *md_dev_info(struct multi_devices *md, int i) +{ + return &md->devs[i]; +} + +static inline void *md_t1_addr(struct multi_devices *md, int i) +{ + struct md_dev_info *mdi = md_t1_dev(md, i); + + return mdi->t1i.virt_addr; +} + +static inline ulong md_t1_blocks(struct multi_devices *md) +{ +#ifdef __KERNEL__ + return md->t1_blocks; +#else + return md->pmem_info.mdt.s_t1_blocks; +#endif +} + +static inline ulong md_t2_blocks(struct multi_devices *md) +{ +#ifdef __KERNEL__ + return md->t2_blocks; +#else + return md->pmem_info.mdt.s_t2_blocks; +#endif +} + +static inline struct md_dev_table *md_zdt(struct multi_devices *md) +{ + return md_t1_addr(md, 0); +} + +static inline struct md_dev_info *md_bn_t1_dev(struct multi_devices *md, + ulong bn) +{ + return md->t1a.map[bn / md->t1a.bn_gcd]; +} + +static inline uuid_le *md_main_uuid(struct multi_devices *md) +{ + return &md_zdt(md)->s_dev_list.dev_ids[md->dev_index].uuid; +} + +#ifdef __KERNEL__ +static inline ulong md_pfn(struct multi_devices *md, ulong block) +{ + struct md_dev_info *mdi; + bool add_pfn = false; + ulong base_pfn; + + if (unlikely(md_t1_blocks(md) <= block)) { + if (WARN_ON(!mdt_test_option(md_zdt(md), MDT_F_SHADOW))) + return 0; + block -= md_t1_blocks(md); + add_pfn = true; + } + + mdi = md_bn_t1_dev(md, block); + if (add_pfn) + base_pfn = mdi->t1i.phys_pfn + md_o2p(mdi->size); + else + base_pfn = 
mdi->t1i.phys_pfn; + return base_pfn + (block - md_o2p(mdi->offset)); +} +#endif /* def __KERNEL__ */ + +static inline void *md_addr(struct multi_devices *md, ulong offset) +{ +#ifdef __KERNEL__ + struct md_dev_info *mdi = md_bn_t1_dev(md, md_o2p(offset)); + + return offset ? mdi->t1i.virt_addr + (offset - mdi->offset) : NULL; +#else + return offset ? md->p_pmem_addr + offset : NULL; +#endif +} + +static inline void *md_baddr(struct multi_devices *md, ulong bn) +{ + return md_addr(md, md_p2o(bn)); +} + +static inline struct md_dev_info *md_bn_t2_dev(struct multi_devices *md, + ulong bn) +{ + return md->t2a.map[bn / md->t2a.bn_gcd]; +} + +static inline int md_t2_bn_nid(struct multi_devices *md, ulong bn) +{ + struct md_dev_info *mdi = md_bn_t2_dev(md, bn); + + return mdi->nid; +} + +static inline ulong md_t2_local_bn(struct multi_devices *md, ulong bn) +{ +#ifdef __KERNEL__ + struct md_dev_info *mdi = md_bn_t2_dev(md, bn); + + return bn - md_o2p(mdi->offset); +#else + return bn; /* In zus we just let Kernel worry about it */ +#endif +} + +static inline ulong md_t2_gcd(struct multi_devices *md) +{ + return md->t2a.bn_gcd; +} + +static inline void *md_addr_verify(struct multi_devices *md, ulong offset) +{ + if (unlikely(offset > md_p2o(md_t1_blocks(md)))) { + md_dbg_err("offset=0x%lx > max=0x%llx\n", + offset, md_p2o(md_t1_blocks(md))); + return NULL; + } + + return md_addr(md, offset); +} + +static inline struct page *md_bn_to_page(struct multi_devices *md, ulong bn) +{ +#ifdef __KERNEL__ + return pfn_to_page(md_pfn(md, bn)); +#else + return md->pages.ptr + bn * md->user_page_size; +#endif +} + +static inline ulong md_addr_to_offset(struct multi_devices *md, void *addr) +{ +#ifdef __KERNEL__ + /* TODO: Keep the device index in page-flags we need to fix the + * page-ref right? 
for now with pages untouched we need this loop + */ + int dev_index; + + for (dev_index = 0; dev_index < md->t1_count; ++dev_index) { + struct md_dev_info *mdi = md_t1_dev(md, dev_index); + + if ((mdi->t1i.virt_addr <= addr) && + (addr < (mdi->t1i.virt_addr + mdi->size))) + return mdi->offset + (addr - mdi->t1i.virt_addr); + } + + return 0; +#else /* !__KERNEL__ */ + return addr - md->p_pmem_addr; +#endif +} + +static inline ulong md_addr_to_bn(struct multi_devices *md, void *addr) +{ + return md_o2p(md_addr_to_offset(md, addr)); +} + +static inline ulong md_page_to_bn(struct multi_devices *md, struct page *page) +{ +#ifdef __KERNEL__ + return md_addr_to_bn(md, page_address(page)); +#else + ulong bytes = (void *)page - md->pages.ptr; + + return bytes / md->user_page_size; +#endif +} + +#ifdef __KERNEL__ +/* TODO: Change API to take mdi and also support in um */ +static inline const char *_bdev_name(struct block_device *bdev) +{ + return dev_name(&bdev->bd_part->__dev); +} +#endif /*def __KERNEL__*/ + +struct mdt_check { + ulong alloc_mask; + uint major_ver; + uint minor_ver; + __u32 magic; + + void *holder; + bool silent; + bool private_mnt; +}; + +/* md.c */ +bool md_mdt_check(struct md_dev_table *mdt, struct md_dev_table *main_mdt, + struct block_device *bdev, struct mdt_check *mc); +int md_t2_mdt_read(struct multi_devices *md, int dev_index, + struct md_dev_table *mdt); +int md_t2_mdt_write(struct multi_devices *md, struct md_dev_table *mdt); +short md_calc_csum(struct md_dev_table *mdt); +void md_fini(struct multi_devices *md, bool put_all); + +#ifdef __KERNEL__ +/* length of uuid dev path /dev/disk/by-uuid/ */ +#define PATH_UUID 64 +int md_init(struct multi_devices **md, const char *dev_name, + struct mdt_check *mc, char path[PATH_UUID], const char **dp); +int md_set_sb(struct multi_devices *md, struct block_device *s_bdev, void *sb, + int silent); +int md_t1_info_init(struct md_dev_info *mdi, bool silent); +void md_t1_info_fini(struct md_dev_info *mdi); + 
+#else /* libzus */ +int md_init_from_pmem_info(struct multi_devices *md); +#endif + +#endif diff --git a/fs/zuf/md_def.h b/fs/zuf/md_def.h new file mode 100644 index 000000000000..72eda8516754 --- /dev/null +++ b/fs/zuf/md_def.h @@ -0,0 +1,141 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note or BSD-3-Clause */ +/* + * Multi-Device operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ +#ifndef _LINUX_MD_DEF_H +#define _LINUX_MD_DEF_H + +#include +#include + +#ifndef __KERNEL__ + +#include +#include +#include +#include + +#ifndef le16_to_cpu + +#define le16_to_cpu(x) ((__u16)le16toh(x)) +#define le32_to_cpu(x) ((__u32)le32toh(x)) +#define le64_to_cpu(x) ((__u64)le64toh(x)) +#define cpu_to_le16(x) ((__le16)htole16(x)) +#define cpu_to_le32(x) ((__le32)htole32(x)) +#define cpu_to_le64(x) ((__le64)htole64(x)) + +#endif + +#ifndef __aligned +#define __aligned(x) __attribute__((aligned(x))) +#endif + +#endif /* ndef __KERNEL__ */ + +#define MDT_SIZE 4096 + +#define MD_DEV_NUMA_SHIFT 60 +#define MD_DEV_BLOCKS_MASK 0x0FFFFFFFFFFFFFFF + +struct md_dev_id { + uuid_le uuid; + __le64 blocks; +} __aligned(8); + +static inline __u64 __dev_id_blocks(struct md_dev_id *dev) +{ + return le64_to_cpu(dev->blocks) & MD_DEV_BLOCKS_MASK; +} + +static inline void __dev_id_blocks_set(struct md_dev_id *dev, __u64 blocks) +{ + dev->blocks &= ~MD_DEV_BLOCKS_MASK; + dev->blocks |= blocks; +} + +static inline int __dev_id_nid(struct md_dev_id *dev) +{ + return (int)(le64_to_cpu(dev->blocks) >> MD_DEV_NUMA_SHIFT); +} + +static inline void __dev_id_nid_set(struct md_dev_id *dev, int nid) +{ + dev->blocks &= MD_DEV_BLOCKS_MASK; + dev->blocks |= (__le64)nid << MD_DEV_NUMA_SHIFT; +} + +/* 64 is the nicest number to still fit when the ZDT is 2048 and 6 bits can + * fit in page struct for address to block translation. 
+ */ +#define MD_DEV_MAX 64 + +struct md_dev_list { + __le16 id_index; /* index of current dev in list */ + __le16 t1_count; /* # of t1 devs */ + __le16 t2_count; /* # of t2 devs (after t1_count) */ + __le16 rmem_count; /* align to 64 bit */ + struct md_dev_id dev_ids[MD_DEV_MAX]; +} __aligned(64); + +/* + * Structure of the on disk multy device table + * NOTE: md_dev_table is always of size MDT_SIZE. These below are the + * currently defined/used members in this version. + * TODO: remove the s_ from all the fields + */ +struct md_dev_table { + /* static fields. they never change after file system creation. + * checksum only validates up to s_start_dynamic field below + */ + __le16 s_sum; /* checksum of this sb */ + __le16 s_version; /* zdt-version */ + __le32 s_magic; /* magic signature */ + uuid_le s_uuid; /* 128-bit uuid */ + __le64 s_flags; + __le64 s_t1_blocks; + __le64 s_t2_blocks; + + struct md_dev_list s_dev_list; + + char s_start_dynamic[0]; + + /* all the dynamic fields should go here */ + __le64 s_mtime; /* mount time */ + __le64 s_wtime; /* write time */ +}; + +/* device table s_flags */ +enum enum_mdt_flags { + MDT_F_SHADOW = (1UL << 0), /* simulate cpu cache */ + MDT_F_POSIXACL = (1UL << 1), /* enable acls */ + + MDT_F_USER_START = 8, /* first 8 bit reserved for mdt */ +}; + +static inline bool mdt_test_option(struct md_dev_table *mdt, + enum enum_mdt_flags flag) +{ + return (mdt->s_flags & flag) != 0; +} + +#define MD_MINORS_PER_MAJOR 1024 + +static inline int mdt_major_version(struct md_dev_table *mdt) +{ + return le16_to_cpu(mdt->s_version) / MD_MINORS_PER_MAJOR; +} + +static inline int mdt_minor_version(struct md_dev_table *mdt) +{ + return le16_to_cpu(mdt->s_version) % MD_MINORS_PER_MAJOR; +} + +#define MDT_STATIC_SIZE(mdt) ((__u64)&mdt->s_start_dynamic - (__u64)mdt) + +#endif /* _LINUX_MD_DEF_H */ diff --git a/fs/zuf/super.c b/fs/zuf/super.c index f7f7798425a9..2248ee74e4c2 100644 --- a/fs/zuf/super.c +++ b/fs/zuf/super.c @@ -20,6 +20,12 @@ 
static struct kmem_cache *zuf_inode_cachep; +struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id, + struct zus_sb_info *zus_sbi) +{ + return NULL; +} + static void _init_once(void *foo) { struct zuf_inode_info *zii = foo; diff --git a/fs/zuf/t1.c b/fs/zuf/t1.c new file mode 100644 index 000000000000..46ea7f6181fc --- /dev/null +++ b/fs/zuf/t1.c @@ -0,0 +1,136 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * Just the special mmap of the all t1 array to the ZUS Server + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + */ + +#include +#include +#include +#include + +#include "_pr.h" +#include "zuf.h" + +/* ~~~ Functions for mmap a t1-array and page faults ~~~ */ +static struct zuf_pmem_file *_pmem_from_f_private(struct file *file) +{ + struct zuf_special_file *zsf = file->private_data; + + WARN_ON(zsf->type != zlfs_e_pmem); + return container_of(zsf, struct zuf_pmem_file, hdr); +} + +static vm_fault_t t1_fault(struct vm_fault *vmf, enum page_entry_size pe_size) +{ + struct vm_area_struct *vma = vmf->vma; + struct inode *inode = vma->vm_file->f_mapping->host; + ulong addr = vmf->address; + struct zuf_pmem_file *z_pmem; + pgoff_t size; + ulong bn; + pfn_t pfnt; + ulong pfn = 0; + vm_fault_t flt; + + zuf_dbg_t1("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx " + "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p pe_size=%d\n", + inode->i_ino, vma->vm_start, vma->vm_end, addr, vmf->pgoff, + vmf->flags, vmf->cow_page, vmf->page, pe_size); + + if (unlikely(vmf->page)) { + zuf_err("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx " + "pgoff=0x%lx vmf_flags=0x%x page=%p cow_page=%p\n", + inode->i_ino, vma->vm_start, vma->vm_end, addr, + vmf->pgoff, vmf->flags, vmf->page, vmf->cow_page); + return VM_FAULT_SIGBUS; + } + + size = md_o2p_up(i_size_read(inode)); + if (unlikely(vmf->pgoff >= size)) { + ulong pgoff = vma->vm_pgoff + 
md_o2p(addr - vma->vm_start); + + zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n", + inode->i_ino, vmf->pgoff, pgoff, size); + + return VM_FAULT_SIGBUS; + } + + if (vmf->cow_page) + /* HOWTO: prevent private mmaps */ + return VM_FAULT_SIGBUS; + + z_pmem = _pmem_from_f_private(vma->vm_file); + + switch (pe_size) { + case PE_SIZE_PTE: + zuf_err("[%ld] PTE fault not expected pgoff=0x%lx addr=0x%lx\n", + inode->i_ino, vmf->pgoff, addr); + /* Always PMD insert 2M chunks */ + /* fall through */ + case PE_SIZE_PMD: + bn = linear_page_index(vma, addr & PMD_MASK); + pfn = md_pfn(z_pmem->md, bn); + pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV); + flt = vmf_insert_pfn_pmd(vmf, pfnt, true); + zuf_dbg_t1("[%ld] PMD pfn-0x%lx addr=0x%lx bn=0x%lx pgoff=0x%lx => %d\n", + inode->i_ino, pfn, addr, bn, vmf->pgoff, flt); + break; + default: + /* FIXME: Easily support PE_SIZE_PUD Just needs to align to + * PUD_MASK at zufr_get_unmapped_area(). But this is hard today + * because of the 2M nvdimm lib takes for its page flag + * information with NFIT. (That need not be there in any which + * case.) + * Which means zufr_get_unmapped_area needs to return + * a align1G+2M address start. and first 1G is map PMD size. + * Very ugly, sigh. + * One thing I do not understand why when the vma->vm_start is + * not PUD aligned and faults requests index zero. Then system + * asks for PE_SIZE_PUD anyway. say my 0 index is 1G aligned + * vmf_insert_pfn_pud() will always fail because the aligned + * vm_addr is outside the vma. + */ + flt = VM_FAULT_FALLBACK; + zuf_dbg_t1("[%ld] default? 
pgoff=0x%lx addr=0x%lx pe_size=0x%x => %d\n", + inode->i_ino, vmf->pgoff, addr, pe_size, flt); + } + + return flt; +} + +static vm_fault_t t1_fault_pte(struct vm_fault *vmf) +{ + return t1_fault(vmf, PE_SIZE_PTE); +} + +static const struct vm_operations_struct t1_vm_ops = { + .huge_fault = t1_fault, + .fault = t1_fault_pte, +}; + +int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct zuf_special_file *zsf = file->private_data; + + if (!zsf || zsf->type != zlfs_e_pmem) + return -EPERM; + + vma->vm_flags |= VM_HUGEPAGE; + vma->vm_ops = &t1_vm_ops; + + zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n", + file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end, + vma->vm_flags, pgprot_val(vma->vm_page_prot)); + + return 0; +} + diff --git a/fs/zuf/t2.c b/fs/zuf/t2.c new file mode 100644 index 000000000000..d293ce0ac249 --- /dev/null +++ b/fs/zuf/t2.c @@ -0,0 +1,356 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Tier-2 operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + */ + + +#include +#include + +#include "zuf.h" + +#define t2_warn zuf_warn + +static const char *_pr_rw(int rw) +{ + return (rw & WRITE) ? "WRITE" : "READ"; +} +#define t2_tis_dbg(tis, fmt, args ...) \ + zuf_dbg_t2("%s: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags), \ + atomic_read(&tis->refcount), tis->rw_flags, ##args) + +#define t2_tis_dbg_rw(tis, fmt, args ...) \ + zuf_dbg_t2_rw("%s<%p>: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags), \ + tis->priv, atomic_read(&tis->refcount), tis->rw_flags,\ + ##args) + +/* ~~~~~~~~~~~~ Async read/write ~~~~~~~~~~ */ +void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done, + void *priv, uint n_vects, struct t2_io_state *tis) +{ + atomic_set(&tis->refcount, 1); + tis->md = md; + tis->done = done; + tis->priv = priv; + tis->n_vects = min(n_vects ? 
n_vects : 1, (uint)BIO_MAX_PAGES); + tis->rw_flags = rw; + tis->last_t2 = -1; + tis->cur_bio = NULL; + tis->index = ~0; + bio_list_init(&tis->delayed_bios); + tis->err = 0; + blk_start_plug(&tis->plug); + t2_tis_dbg_rw(tis, "done=%pS n_vects=%d\n", done, n_vects); +} + +static void _tis_put(struct t2_io_state *tis) +{ + t2_tis_dbg_rw(tis, "done=%pS\n", tis->done); + + if (test_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags)) + wake_up_var(&tis->refcount); + else if (tis->done) + /* last - done may free the tis */ + tis->done(tis, NULL, true); +} + +static inline void tis_get(struct t2_io_state *tis) +{ + atomic_inc(&tis->refcount); +} + +static inline int tis_put(struct t2_io_state *tis) +{ + if (atomic_dec_and_test(&tis->refcount)) { + _tis_put(tis); + return 1; + } + return 0; +} + +static int _status_to_errno(blk_status_t status) +{ + return blk_status_to_errno(status); +} + +void t2_io_done(struct t2_io_state *tis, struct bio *bio, bool last) +{ + struct bio_vec *bv; + struct bvec_iter_all i; + + if (!bio) + return; + + bio_for_each_segment_all(bv, bio, i) + put_page(bv->bv_page); +} + +static void _tis_bio_done(struct bio *bio) +{ + struct t2_io_state *tis = bio->bi_private; + + t2_tis_dbg(tis, "done=%pS err=%d\n", tis->done, bio->bi_status); + + if (unlikely(bio->bi_status)) { + zuf_dbg_err("%s: err=%d last-err=%d\n", + _pr_rw(tis->rw_flags), bio->bi_status, tis->err); + /* Store the last one */ + tis->err = _status_to_errno(bio->bi_status); + } + + if (tis->done) + tis->done(tis, bio, false); + else + t2_io_done(tis, bio, false); + + bio_put(bio); + tis_put(tis); +} + +static bool _tis_delay(struct t2_io_state *tis) +{ + return 0 != (tis->rw_flags & TIS_DELAY_SUBMIT); +} + +#define bio_list_for_each_safe(bio, btmp, bl) \ + for (bio = (bl)->head, btmp = bio ? bio->bi_next : NULL; \ + bio; bio = btmp, btmp = bio ? 
bio->bi_next : NULL) + +static void _tis_submit_bio(struct t2_io_state *tis, bool flush, bool done) +{ + if (flush || done) { + if (_tis_delay(tis)) { + struct bio *btmp, *bio; + + bio_list_for_each_safe(bio, btmp, &tis->delayed_bios) { + bio->bi_next = NULL; + if (bio->bi_iter.bi_sector == -1) { + t2_warn("!!!!!!!!!!!!!\n"); + bio_put(bio); + continue; + } + t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n", + bio->bi_vcnt, tis->n_vects); + submit_bio(bio); + } + bio_list_init(&tis->delayed_bios); + } + + if (!tis->cur_bio) + return; + + if (tis->cur_bio->bi_iter.bi_sector != -1) { + t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n", + tis->cur_bio->bi_vcnt, tis->n_vects); + submit_bio(tis->cur_bio); + tis->cur_bio = NULL; + tis->index = ~0; + } else if (done) { + t2_tis_dbg(tis, "put cur_bio=%p\n", tis->cur_bio); + bio_put(tis->cur_bio); + WARN_ON(tis_put(tis)); + } + } else if (tis->cur_bio && (tis->cur_bio->bi_iter.bi_sector != -1)) { + /* Not flushing regular progress */ + if (_tis_delay(tis)) { + t2_tis_dbg(tis, "list_add cur_bio=%p\n", tis->cur_bio); + bio_list_add(&tis->delayed_bios, tis->cur_bio); + } else { + t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n", + tis->cur_bio->bi_vcnt, tis->n_vects); + submit_bio(tis->cur_bio); + } + tis->cur_bio = NULL; + tis->index = ~0; + } +} + +/* tis->cur_bio MUST be NULL, checked by caller */ +static void _tis_alloc(struct t2_io_state *tis, struct md_dev_info *mdi, + gfp_t gfp) +{ + struct bio *bio = bio_alloc(gfp, tis->n_vects); + int bio_op; + + if (unlikely(!bio)) { + if (!_tis_delay(tis)) + t2_warn("!!! failed to alloc bio"); + tis->err = -ENOMEM; + return; + } + + if (WARN_ON(!tis || !tis->md)) { + tis->err = -ENOMEM; + return; + } + + /* FIXME: bio_set_op_attrs macro has a BUG which does not allow this + * question inline. + */ + bio_op = (tis->rw_flags & WRITE) ? 
REQ_OP_WRITE : REQ_OP_READ;
+	bio_set_op_attrs(bio, bio_op, 0);
+
+	bio->bi_iter.bi_sector = -1;
+	bio->bi_end_io = _tis_bio_done;
+	bio->bi_private = tis;
+
+	if (mdi) {
+		bio_set_dev(bio, mdi->bdev);
+		tis->index = mdi->index;
+	} else {
+		tis->index = ~0;
+	}
+	tis->last_t2 = -1;
+	tis->cur_bio = bio;
+	tis_get(tis);
+	t2_tis_dbg(tis, "New bio n_vects=%d\n", tis->n_vects);
+}
+
+int t2_io_prealloc(struct t2_io_state *tis, uint n_vects)
+{
+	tis->err = 0; /* reset any -ENOMEM from a previous t2_io_add */
+
+	_tis_submit_bio(tis, true, false);
+	tis->n_vects = min(n_vects ? n_vects : 1, (uint)BIO_MAX_PAGES);
+
+	t2_tis_dbg(tis, "n_vects=%d cur_bio=%p\n", tis->n_vects, tis->cur_bio);
+
+	if (!tis->cur_bio)
+		_tis_alloc(tis, NULL, GFP_NOFS);
+	return tis->err;
+}
+
+int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page)
+{
+	struct md_dev_info *mdi;
+	ulong local_t2;
+	int ret;
+
+	/* valid t2 block numbers are 0 .. md_t2_blocks() - 1 */
+	if (t2 >= md_t2_blocks(tis->md)) {
+		zuf_err("bad t2 (0x%lx) offset\n", t2);
+		return -EFAULT;
+	}
+	get_page(page);
+
+	mdi = md_bn_t2_dev(tis->md, t2);
+	WARN_ON(!mdi);
+
+	if (unlikely(!mdi->bdev)) {
+		zuf_err("mdi->bdev == NULL!! t2=0x%lx\n", t2);
+		put_page(page);
+		return -EFAULT;
+	}
+
+	local_t2 = md_t2_local_bn(tis->md, t2);
+	if (((local_t2 != (tis->last_t2 + 1)) && (tis->last_t2 != -1)) ||
+	    ((0 < tis->index) && (tis->index != mdi->index)))
+		_tis_submit_bio(tis, false, false);
+
+start:
+	if (!tis->cur_bio) {
+		_tis_alloc(tis, mdi, _tis_delay(tis) ?
GFP_ATOMIC : GFP_NOFS); + if (unlikely(tis->err)) { + put_page(page); + return tis->err; + } + } else if (tis->index == ~0) { + /* the bio was allocated during t2_io_prealloc */ + tis->index = mdi->index; + bio_set_dev(tis->cur_bio, mdi->bdev); + } + + if (tis->last_t2 == -1) + tis->cur_bio->bi_iter.bi_sector = + local_t2 * T2_SECTORS_PER_PAGE; + + ret = bio_add_page(tis->cur_bio, page, PAGE_SIZE, 0); + if (unlikely(ret != PAGE_SIZE)) { + t2_tis_dbg(tis, "bio_add_page=>%d bi_vcnt=%d n_vects=%d\n", + ret, tis->cur_bio->bi_vcnt, tis->n_vects); + _tis_submit_bio(tis, false, false); + goto start; /* device does not support tis->n_vects */ + } + + if ((tis->cur_bio->bi_vcnt == tis->n_vects) && (tis->n_vects != 1)) + _tis_submit_bio(tis, false, false); + + t2_tis_dbg(tis, "t2=0x%lx last_t2=0x%lx local_t2=0x%lx t1=0x%lx\n", + t2, tis->last_t2, local_t2, md_page_to_bn(tis->md, page)); + + tis->last_t2 = local_t2; + return 0; +} + +int t2_io_end(struct t2_io_state *tis, bool wait) +{ + if (unlikely(!tis || !tis->md)) + return 0; /* never initialized nothing to do */ + + t2_tis_dbg_rw(tis, "wait=%d\n", wait); + + _tis_submit_bio(tis, true, true); + blk_finish_plug(&tis->plug); + + if (wait) + set_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags); + tis_put(tis); + + if (wait) { + wait_var_event(&tis->refcount, !atomic_read(&tis->refcount)); + if (tis->done) + tis->done(tis, NULL, true); + } + + return tis->err; +} + +/* ~~~~~~~ Sync read/write ~~~~~~~ TODO: Remove soon */ +static int _sync_io_page(struct multi_devices *md, int rw, ulong bn, + struct page *page) +{ + struct t2_io_state tis; + int err; + + t2_io_begin(md, rw, NULL, NULL, 1, &tis); + + t2_tis_dbg((&tis), "bn=0x%lx p-i=0x%lx\n", bn, page->index); + + err = t2_io_add(&tis, bn, page); + if (unlikely(err)) + return err; + + err = submit_bio_wait(tis.cur_bio); + if (unlikely(err)) { + SetPageError(page); + /* + * We failed to write the page out to tier-2. 
+ * Print a dire warning that things will go BAD (tm) + * very quickly. + */ + zuf_err("io-error bn=0x%lx => %d\n", bn, err); + } + + /* Same as t2_io_end+_tis_bio_done but without the kref stuff */ + blk_finish_plug(&tis.plug); + put_page(page); + if (likely(tis.cur_bio)) + bio_put(tis.cur_bio); + + return err; +} + +int t2_writepage(struct multi_devices *md, ulong bn, struct page *page) +{ + return _sync_io_page(md, WRITE, bn, page); +} + +int t2_readpage(struct multi_devices *md, ulong bn, struct page *page) +{ + return _sync_io_page(md, READ, bn, page); +} diff --git a/fs/zuf/t2.h b/fs/zuf/t2.h new file mode 100644 index 000000000000..cbd23dd409eb --- /dev/null +++ b/fs/zuf/t2.h @@ -0,0 +1,68 @@ +/* SPDX-License-Identifier: GPL-2.0 or BSD-3-Clause */ +/* + * Tier-2 Header file. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * Authors: + * Boaz Harrosh + */ + +#ifndef __T2_H__ +#define __T2_H__ + +#include +#include +#include +#include +#include "md.h" + +#define T2_SECTORS_PER_PAGE (PAGE_SIZE / 512) + +/* t2.c */ + +/* Sync read/write */ +int t2_writepage(struct multi_devices *md, ulong bn, struct page *page); +int t2_readpage(struct multi_devices *md, ulong bn, struct page *page); + +/* Async read/write */ +struct t2_io_state; +typedef void (*t2_io_done_fn)(struct t2_io_state *tis, struct bio *bio, + bool last); + +struct t2_io_state { + atomic_t refcount; /* counts in-flight bios */ + struct blk_plug plug; + + struct multi_devices *md; + int index; + t2_io_done_fn done; + void *priv; + + uint n_vects; + ulong rw_flags; + ulong last_t2; + struct bio *cur_bio; + struct bio_list delayed_bios; + int err; +}; + +/* For rw_flags above */ +/* From Kernel: WRITE (1U << 0) */ +#define TIS_DELAY_SUBMIT (1U << 2) +enum {B_TIS_FREE_AFTER_WAIT = 3}; +#define TIS_FREE_AFTER_WAIT (1U << B_TIS_FREE_AFTER_WAIT) +#define TIS_USER_DEF_FIRST (1U << 8) + +void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done, + void *priv, uint n_vects, struct 
t2_io_state *tis); +int t2_io_prealloc(struct t2_io_state *tis, uint n_vects); +int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page); +int t2_io_end(struct t2_io_state *tis, bool wait); + +/* This is done by default if t2_io_done_fn above is NULL + * Can also be chain-called by users. + */ +void t2_io_done(struct t2_io_state *tis, struct bio *bio, bool last); + +#endif /*def __T2_H__*/ diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index 60f0d3ffe562..cc49cfa95244 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -359,6 +359,78 @@ static int _zu_numa_map(struct file *file, void *parg) return err; } +/* ~~~~ PMEM GRAB ~~~~ */ +/*FIXME: At pmem the struct md_dev_list for t1(s) is not properly set + * For now we do not fix it and re-write the mdt. So just fix the one + * we are about to send to Server + */ +static void _fix_numa_ids(struct multi_devices *md, struct md_dev_list *mdl) +{ + int i; + + for (i = 0; i < md->t1_count; ++i) + if (md->devs[i].nid != __dev_id_nid(&mdl->dev_ids[i])) + __dev_id_nid_set(&mdl->dev_ids[i], md->devs[i].nid); +} + +static int _zu_grab_pmem(struct file *file, void *parg) +{ + struct zuf_root_info *zri = ZRI(file->f_inode->i_sb); + struct zufs_ioc_pmem __user *arg_pmem = parg; + struct zufs_ioc_pmem *zi_pmem = kzalloc(sizeof(*zi_pmem), GFP_KERNEL); + struct super_block *sb; + struct zuf_sb_info *sbi; + size_t pmem_size; + int err; + + if (unlikely(!zi_pmem)) + return -ENOMEM; + + err = get_user(zi_pmem->sb_id, &arg_pmem->sb_id); + if (err) { + zuf_err("\n"); + goto out; + } + + sb = zuf_sb_from_id(zri, zi_pmem->sb_id, NULL); + if (unlikely(!sb)) { + err = -ENODEV; + zuf_err("!!! 
pmem_kern_id=%llu not found\n", zi_pmem->sb_id); + goto out; + } + sbi = SBI(sb); + + if (sbi->pmem.hdr.file) { + zuf_err("[%llu] pmem already taken\n", zi_pmem->sb_id); + err = -EIO; + goto out; + } + + memcpy(&zi_pmem->mdt, md_zdt(sbi->md), sizeof(zi_pmem->mdt)); + zi_pmem->dev_index = sbi->md->dev_index; + _fix_numa_ids(sbi->md, &zi_pmem->mdt.s_dev_list); + + pmem_size = md_p2o(md_t1_blocks(sbi->md)); + if (mdt_test_option(md_zdt(sbi->md), MDT_F_SHADOW)) + pmem_size += pmem_size; + i_size_write(file->f_inode, pmem_size); + sbi->pmem.hdr.type = zlfs_e_pmem; + sbi->pmem.hdr.file = file; + sbi->pmem.md = sbi->md; /* FIXME: Use container_of in t1.c */ + file->private_data = &sbi->pmem.hdr; + zuf_dbg_core("pmem %llu i_size=0x%llx GRABED %s\n", + zi_pmem->sb_id, i_size_read(file->f_inode), + _bdev_name(md_t1_dev(sbi->md, 0)->bdev)); + +out: + zi_pmem->hdr.err = err; + err = copy_to_user(parg, zi_pmem, sizeof(*zi_pmem)); + if (err) + zuf_err("=>%d\n", err); + kfree(zi_pmem); + return err; +} + static void _prep_header_size_op(struct zufs_ioc_hdr *hdr, enum e_zufs_operation op, int err) { @@ -886,6 +958,8 @@ long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg) return _zu_mount(file, parg); case ZU_IOC_NUMA_MAP: return _zu_numa_map(file, parg); + case ZU_IOC_GRAB_PMEM: + return _zu_grab_pmem(file, parg); case ZU_IOC_INIT_THREAD: return _zu_init(file, parg); case ZU_IOC_WAIT_OPT: @@ -1135,6 +1209,8 @@ int zufc_mmap(struct file *file, struct vm_area_struct *vma) switch (zsf->type) { case zlfs_e_zt: return zufc_zt_mmap(file, vma); + case zlfs_e_pmem: + return zuf_pmem_mmap(file, vma); case zlfs_e_dpp_buff: return zufc_ebuff_mmap(file, vma); default: diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h index 05ec08d17d69..d0cb762f50ec 100644 --- a/fs/zuf/zuf.h +++ b/fs/zuf/zuf.h @@ -28,6 +28,8 @@ #include "zus_api.h" #include "_pr.h" +#include "md.h" +#include "t2.h" enum zlfs_e_special_file { zlfs_e_zt = 1, @@ -98,6 +100,13 @@ static inline void zuf_add_fs_type(struct 
zuf_root_info *zri, list_add(&zft->list, &zri->fst_list); } +/* t1.c special file to mmap our pmem */ +struct zuf_pmem_file { + struct zuf_special_file hdr; + struct multi_devices *md; +}; + + /* * ZUF per-inode data in memory */ @@ -110,6 +119,51 @@ static inline struct zuf_inode_info *ZUII(struct inode *inode) return container_of(inode, struct zuf_inode_info, vfs_inode); } +/* + * ZUF super-block data in memory + */ +struct zuf_sb_info { + struct super_block *sb; + struct multi_devices *md; + struct zuf_pmem_file pmem; + + /* zus cookie*/ + struct zus_sb_info *zus_sbi; + + /* Mount options */ + unsigned long s_mount_opt; + ulong fs_caps; + char *pmount_dev; /* for private mount */ + + spinlock_t s_mmap_dirty_lock; + struct list_head s_mmap_dirty; +}; + +static inline struct zuf_sb_info *SBI(struct super_block *sb) +{ + return sb->s_fs_info; +} + +static inline struct zuf_fs_type *ZUF_FST(struct file_system_type *fs_type) +{ + return container_of(fs_type, struct zuf_fs_type, vfs_fst); +} + +static inline struct zuf_fs_type *zuf_fst(struct super_block *sb) +{ + return ZUF_FST(sb->s_type); +} + +static inline struct zuf_root_info *ZUF_ROOT(struct zuf_sb_info *sbi) +{ + return zuf_fst(sbi->sb)->zri; +} + +static inline bool zuf_rdonly(struct super_block *sb) +{ + return sb_rdonly(sb); +} + struct zuf_dispatch_op; typedef int (*overflow_handler)(struct zuf_dispatch_op *zdo, void *parg, ulong zt_max_bytes); diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index 6b1fbaf24222..4292a4fa5f1a 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -22,6 +22,8 @@ #include #include +#include "md_def.h" + #ifdef __cplusplus #define NAMELESS(X) X #else @@ -355,6 +357,19 @@ struct zufs_ioc_numa_map { }; #define ZU_IOC_NUMA_MAP _IOWR('Z', 13, struct zufs_ioc_numa_map) +struct zufs_ioc_pmem { + /* Set by zus */ + struct zufs_ioc_hdr hdr; + __u64 sb_id; + + /* Returned to zus */ + struct md_dev_table mdt; + __u32 dev_index; + __u32 ___pad; +}; +/* GRAB is never ungrabed umount or 
file close cleans it all */ +#define ZU_IOC_GRAB_PMEM _IOWR('Z', 14, struct zufs_ioc_pmem) + /* ZT init */ enum { ZUFS_MAX_ZT_CHANNELS = 4 };

From patchwork Thu Sep 26 02:07:16 2019
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 11161811
From: Boaz Harrosh
To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox, Dan Williams
Subject: [PATCH 07/16] zuf: mounting
Date: Thu, 26 Sep 2019 05:07:16 +0300
Message-Id: <20190926020725.19601-8-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>
References: <20190926020725.19601-1-boazh@netapp.com>

In this patch we already establish a mounted filesystem. These are the steps for mounting a zufs filesystem:

* All devices (single or multiple) are opened and established in an md object.
* mount_bdev is called with the main (first) device; in turn fill_super is called.
* fill_super dispatches a mount_operation(register_fs_info) to the server with the sb_id of the newly created super_block.
* The Server, in its zus mount routine, will first do a GRAB_PMEM(sb_id) ioctl call to establish a special filehandle through which it will have full access to all of its pmem space. With that it calls the zusFS to continue inspecting the content of the devices and mount the FS.
* On return from mount the zusFS returns the root inode info.
* fill_super continues to create a root vfs-inode and returns successfully.
* We now have a mounted super_block, with corresponding super_block objects in the Server.
* This patch also adds global sb operations like statfs, show_options and remount, as well as the umount/destruction of a super_block.
* There is special support for "private mounting" of devices. A private mount is usually used by zusFS fsck/mkfs type applications that want full access to, and lock-down of, their multi-devices, but otherwise need exclusive access to these devices. The private mount exposes all the same services to the Server application, but there is no registered/mounted super_block in VFS. This is a very powerful tool for zusFS development because the exact same code that is used in a running FS is also used for the FS-utils; the code behaves exactly the same as a live mount. (See the zus project for more info.)

[v2] big_alloc now uses a dedicated 8k kmem_cache pool instead of kmalloc because it is used in the IO fast path and we want guaranteed forward progress.
Signed-off-by: Boaz Harrosh --- fs/zuf/Makefile | 2 +- fs/zuf/_extern.h | 10 +- fs/zuf/inode.c | 23 ++ fs/zuf/super.c | 806 +++++++++++++++++++++++++++++++++++++++++++++- fs/zuf/zuf-core.c | 96 ++++++ fs/zuf/zuf-root.c | 5 + fs/zuf/zuf.h | 134 ++++++++ fs/zuf/zus_api.h | 35 ++ 8 files changed, 1107 insertions(+), 4 deletions(-) create mode 100644 fs/zuf/inode.c diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile index a247bd85d9aa..a5800cad73fd 100644 --- a/fs/zuf/Makefile +++ b/fs/zuf/Makefile @@ -17,5 +17,5 @@ zuf-y += md.o t1.o t2.o zuf-y += zuf-core.o zuf-root.o # Main FS -zuf-y += super.o +zuf-y += super.o inode.o zuf-y += module.o diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index a5929d3d165c..b1514e5821a2 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -51,12 +51,20 @@ int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs); int zuf_init_inodecache(void); void zuf_destroy_inodecache(void); +int zuf_8k_cache_init(void); +void zuf_8k_cache_fini(void); + struct dentry *zuf_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data); - +int zuf_private_mount(struct zuf_root_info *zri, struct register_fs_info *rfi, + struct zufs_mount_info *zmi, struct super_block **sb_out); +int zuf_private_umount(struct zuf_root_info *zri, struct super_block *sb); struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id, struct zus_sb_info *zus_sbi); +/* inode.c */ +struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii, + zu_dpp_t _zi, bool *exist); /* t1.c */ int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma); diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c new file mode 100644 index 000000000000..a6115289dcda --- /dev/null +++ b/fs/zuf/inode.c @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * Inode methods (allocate/free/read/write). + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. 
See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#include "zuf.h" + +struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii, + zu_dpp_t _zi, bool *exist) +{ + return ERR_PTR(-ENOTSUPP); +} + diff --git a/fs/zuf/super.c b/fs/zuf/super.c index 2248ee74e4c2..01927deb5013 100644 --- a/fs/zuf/super.c +++ b/fs/zuf/super.c @@ -18,12 +18,740 @@ #include "zuf.h" +static struct super_operations zuf_sops; static struct kmem_cache *zuf_inode_cachep; +enum { + Opt_uid, + Opt_gid, + Opt_pedantic, + Opt_ephemeral, + Opt_dax, + Opt_zpmdev, + Opt_err +}; + +static const match_table_t tokens = { + { Opt_pedantic, "pedantic" }, + { Opt_pedantic, "pedantic=%d" }, + { Opt_ephemeral, "ephemeral" }, + { Opt_dax, "dax" }, + { Opt_zpmdev, ZUFS_PMDEV_OPT"=%s" }, + { Opt_err, NULL }, +}; + +static int _parse_options(struct zuf_sb_info *sbi, const char *data, + bool remount, struct zufs_parse_options *po) +{ + char *orig_options, *options; + char *p; + substring_t args[MAX_OPT_ARGS]; + int err = 0; + bool ephemeral = false; + bool silent = test_opt(sbi, SILENT); + size_t mount_options_len = 0; + + /* no options given */ + if (!data) + return 0; + + options = orig_options = kstrdup(data, GFP_KERNEL); + if (!options) { + zuf_err_cnd(silent, "kstrdup => -ENOMEM\n"); + return -ENOMEM; + } + + while ((p = strsep(&options, ",")) != NULL) { + int token; + + if (!*p) + continue; + + /* Initialize args struct so we know whether arg was found */ + args[0].to = args[0].from = NULL; + token = match_token(p, tokens, args); + switch (token) { + case Opt_pedantic: + if (!args[0].from) { + po->mount_flags |= ZUFS_M_PEDANTIC; + set_opt(sbi, PEDANTIC); + continue; + } + if (match_int(&args[0], &po->pedantic)) + goto bad_opt; + break; + case Opt_ephemeral: + po->mount_flags |= ZUFS_M_EPHEMERAL; + set_opt(sbi, EPHEMERAL); + ephemeral = true; + break; + case Opt_dax: + set_opt(sbi, DAX); + break; + case Opt_zpmdev: + if (unlikely(!test_opt(sbi, 
PRIVATE))) + goto bad_opt; + sbi->pmount_dev = match_strdup(&args[0]); + if (sbi->pmount_dev == NULL) + goto no_mem; + break; + default: { + if (mount_options_len != 0) { + po->mount_options[mount_options_len] = ','; + mount_options_len++; + } + strcat(po->mount_options, p); + mount_options_len += strlen(p); + } + } + } + + if (remount && test_opt(sbi, EPHEMERAL) && (ephemeral == false)) + clear_opt(sbi, EPHEMERAL); +out: + kfree(orig_options); + return err; + +bad_opt: + zuf_warn_cnd(silent, "Bad mount option: \"%s\"\n", p); + err = -EINVAL; + goto out; +no_mem: + zuf_warn_cnd(silent, "Not enough memory to parse options"); + err = -ENOMEM; + goto out; +} + +static int _print_tier_info(struct multi_devices *md, char **buff, int start, + int count, int *_space, char *str) +{ + int space = *_space; + char *b = *buff; + int printed; + int i; + + printed = snprintf(b, space, str); + if (unlikely(printed > space)) + return -ENOSPC; + + b += printed; + space -= printed; + + for (i = start; i < start + count; ++i) { + printed = snprintf(b, space, "%s%s", i == start ? 
"" : ",", + _bdev_name(md_dev_info(md, i)->bdev)); + + if (unlikely(printed > space)) + return -ENOSPC; + + b += printed; + space -= printed; + } + *_space = space; + *buff = b; + + return 0; +} + +static void _print_mount_info(struct zuf_sb_info *sbi, char *mount_options) +{ + struct multi_devices *md = sbi->md; + char buff[992]; + int space = sizeof(buff); + char *b = buff; + int err; + + err = _print_tier_info(md, &b, 0, md->t1_count, &space, "t1="); + if (unlikely(err)) + goto no_space; + + if (md->t2_count == 0) + goto print_options; + + err = _print_tier_info(md, &b, md->t1_count, md->t2_count, &space, + " t2="); + if (unlikely(err)) + goto no_space; + +print_options: + if (mount_options) { + int printed = snprintf(b, space, " -o %s", mount_options); + + if (unlikely(printed > space)) + goto no_space; + } + +print: + zuf_info("mounted %s (0x%lx/0x%lx)\n", buff, + md_t1_blocks(sbi->md), md_t2_blocks(sbi->md)); + return; + +no_space: + snprintf(buff + sizeof(buff) - 4, 4, "..."); + goto print; +} + +static void _sb_mwtime_now(struct super_block *sb, struct md_dev_table *zdt) +{ + struct timespec64 now = current_time(sb->s_root->d_inode); + + timespec_to_mt(&zdt->s_mtime, &now); + zdt->s_wtime = zdt->s_mtime; + /* TOZO _persist_md(sb, &zdt->s_mtime, 2*sizeof(zdt->s_mtime)); */ +} + +static void _clean_bdi(struct super_block *sb) +{ + if (sb->s_bdi != &noop_backing_dev_info) { + bdi_put(sb->s_bdi); + sb->s_bdi = &noop_backing_dev_info; + } +} + +static int _setup_bdi(struct super_block *sb, const char *device_name) +{ + const char *n = sb->s_type->name; + int err; + + if (sb->s_bdi) + _clean_bdi(sb); + + err = super_setup_bdi_name(sb, "%s-%s", n, device_name); + if (unlikely(err)) { + zuf_err("Failed to super_setup_bdi\n"); + return err; + } + + sb->s_bdi->ra_pages = ZUFS_READAHEAD_PAGES; + sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK; + return 0; +} + +static int _sb_add(struct zuf_root_info *zri, struct super_block *sb, + __u64 *sb_id) +{ + uint i; + 
int err; + + mutex_lock(&zri->sbl_lock); + + if (zri->sbl.num == zri->sbl.max) { + struct super_block **new_array; + + new_array = krealloc(zri->sbl.array, + (zri->sbl.max + SBL_INC) * sizeof(*new_array), + GFP_KERNEL | __GFP_ZERO); + if (unlikely(!new_array)) { + err = -ENOMEM; + goto out; + } + zri->sbl.max += SBL_INC; + zri->sbl.array = new_array; + } + + for (i = 0; i < zri->sbl.max; ++i) + if (!zri->sbl.array[i]) + break; + + if (unlikely(i == zri->sbl.max)) { + zuf_err("!!!!! can't be! i=%d g_sbl.num=%d g_sbl.max=%d\n", + i, zri->sbl.num, zri->sbl.max); + err = -EFAULT; + goto out; + } + + ++zri->sbl.num; + zri->sbl.array[i] = sb; + *sb_id = i + 1; + err = 0; + + zuf_dbg_vfs("sb_id=%lld\n", *sb_id); +out: + mutex_unlock(&zri->sbl_lock); + return err; +} + +static void _sb_remove(struct zuf_root_info *zri, struct super_block *sb) +{ + uint i; + + mutex_lock(&zri->sbl_lock); + + for (i = 0; i < zri->sbl.max; ++i) + if (zri->sbl.array[i] == sb) + break; + if (unlikely(i == zri->sbl.max)) { + zuf_err("!!!!! can't be! i=%d g_sbl.num=%d g_sbl.max=%d\n", + i, zri->sbl.num, zri->sbl.max); + goto out; + } + + zri->sbl.array[i] = NULL; + --zri->sbl.num; +out: + mutex_unlock(&zri->sbl_lock); +} + struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id, struct zus_sb_info *zus_sbi) { - return NULL; + struct super_block *sb; + + --sb_id; + + if (zri->sbl.max <= sb_id) { + zuf_err("Invalid SB_ID 0x%llx\n", sb_id); + return NULL; + } + + sb = zri->sbl.array[sb_id]; + if (!sb) { + zuf_err("Stale SB_ID 0x%llx\n", sb_id); + return NULL; + } + + return sb; +} + +static void zuf_put_super(struct super_block *sb) +{ + struct zuf_sb_info *sbi = SBI(sb); + + /* FIXME: This is because of a Kernel BUG (in v4.20) which + * sometimes complains in _setup_bdi() on a recycle_mount that sysfs + * bdi already exists. Cleaning here solves it. + * Calling synchronize_rcu in zuf_kill_sb() after the call to + * kill_block_super() does NOT solve it. 
+ */ + _clean_bdi(sb); + + if (sbi->zus_sbi) { + struct zufs_ioc_mount zim = { + .zmi.zus_sbi = sbi->zus_sbi, + }; + + zufc_dispatch_mount(ZUF_ROOT(sbi), NULL, ZUFS_M_UMOUNT, &zim); + sbi->zus_sbi = NULL; + } + + /* NOTE!!! this is a HACK! we should not touch the s_umount + * lock but to make lockdep happy we do that since our devices + * are held exclusivly. Need to revisit every kernel version + * change. + */ + if (sbi->md) { + up_write(&sb->s_umount); + md_fini(sbi->md, false); + down_write(&sb->s_umount); + } + + _sb_remove(ZUF_ROOT(sbi), sb); + sb->s_fs_info = NULL; + if (!test_opt(sbi, FAILED)) + zuf_info("unmounted /dev/%s\n", _bdev_name(sb->s_bdev)); + kfree(sbi); +} + +struct __fill_super_params { + struct multi_devices *md; + char *mount_options; +}; + +int zuf_private_mount(struct zuf_root_info *zri, struct register_fs_info *rfi, + struct zufs_mount_info *zmi, struct super_block **sb_out) +{ + bool silent = zmi->po.mount_flags & ZUFS_M_SILENT; + char path[PATH_UUID]; + const char *dev_path = NULL; + struct zuf_sb_info *sbi; + struct super_block *sb; + char *mount_options; + struct mdt_check mc = { + .alloc_mask = ZUFS_ALLOC_MASK, + .major_ver = rfi->FS_ver_major, + .minor_ver = rfi->FS_ver_minor, + .magic = rfi->FS_magic, + + .silent = silent, + .private_mnt = true, + }; + int err; + + sb = kzalloc(sizeof(struct super_block), GFP_KERNEL); + if (unlikely(!sb)) { + zuf_err_cnd(silent, "Not enough memory to allocate sb\n"); + return -ENOMEM; + } + + sbi = kzalloc(sizeof(struct zuf_sb_info), GFP_KERNEL); + if (unlikely(!sbi)) { + zuf_err_cnd(silent, "Not enough memory to allocate sbi\n"); + kfree(sb); + return -ENOMEM; + } + + sb->s_fs_info = sbi; + sbi->sb = sb; + + zmi->po.mount_flags |= ZUFS_M_PRIVATE; + set_opt(sbi, PRIVATE); + + mount_options = kstrndup(zmi->po.mount_options, + zmi->po.mount_options_len, GFP_KERNEL); + if (unlikely(!mount_options)) { + zuf_err_cnd(silent, "Not enough memory\n"); + err = -ENOMEM; + goto fail; + } + + 
memset(zmi->po.mount_options, 0, zmi->po.mount_options_len); + + err = _parse_options(sbi, mount_options, 0, &zmi->po); + if (unlikely(err)) { + zuf_err_cnd(silent, "option parsing failed => %d\n", err); + goto fail; + } + + if (unlikely(!sbi->pmount_dev)) { + zuf_err_cnd(silent, "private mount missing mountdev option\n"); + err = -EINVAL; + goto fail; + } + + zmi->po.mount_options_len = strlen(zmi->po.mount_options); + + mc.holder = sbi; + err = md_init(&sbi->md, sbi->pmount_dev, &mc, path, &dev_path); + if (unlikely(err)) { + zuf_err_cnd(silent, "md_init failed! => %d\n", err); + goto fail; + } + + zuf_dbg_verbose("private mount of %s\n", dev_path); + + err = _sb_add(zri, sb, &zmi->sb_id); + if (unlikely(err)) { + zuf_err_cnd(silent, "_sb_add failed => %d\n", err); + goto fail; + } + + *sb_out = sb; + return 0; + +fail: + if (sbi->md) + md_fini(sbi->md, true); + kfree(mount_options); + kfree(sbi->pmount_dev); + kfree(sbi); + kfree(sb); + + return err; +} + +int zuf_private_umount(struct zuf_root_info *zri, struct super_block *sb) +{ + struct zuf_sb_info *sbi = SBI(sb); + + _sb_remove(zri, sb); + md_fini(sbi->md, true); + kfree(sbi->pmount_dev); + kfree(sbi); + kfree(sb); + + return 0; +} + +static int zuf_fill_super(struct super_block *sb, void *data, int silent) +{ + struct zuf_sb_info *sbi = NULL; + struct __fill_super_params *fsp = data; + struct zufs_ioc_mount zim = {}; + struct zufs_ioc_mount *ioc_mount; + enum big_alloc_type bat; + struct register_fs_info *rfi; + struct inode *root_i; + size_t zim_size, mount_options_len; + bool exist; + int err; + + BUILD_BUG_ON(sizeof(struct md_dev_table) > MDT_SIZE); + BUILD_BUG_ON(sizeof(struct zus_inode) != ZUFS_INODE_SIZE); + + mount_options_len = (fsp->mount_options ? 
+ strlen(fsp->mount_options) : 0) + 1; + zim_size = sizeof(zim) + mount_options_len; + ioc_mount = big_alloc(zim_size, sizeof(zim), &zim, + GFP_KERNEL | __GFP_ZERO, &bat); + if (unlikely(!ioc_mount)) { + zuf_err_cnd(silent, "big_alloc(%ld) => -ENOMEM\n", zim_size); + return -ENOMEM; + } + + ioc_mount->zmi.po.mount_options_len = mount_options_len; + + err = _sb_add(zuf_fst(sb)->zri, sb, &ioc_mount->zmi.sb_id); + if (unlikely(err)) { + zuf_err_cnd(silent, "_sb_add failed => %d\n", err); + goto error; + } + + sbi = kzalloc(sizeof(struct zuf_sb_info), GFP_KERNEL); + if (!sbi) { + zuf_err_cnd(silent, "Not enough memory to allocate sbi\n"); + err = -ENOMEM; + goto error; + } + sb->s_fs_info = sbi; + sbi->sb = sb; + + /* Initialize embedded objects */ + spin_lock_init(&sbi->s_mmap_dirty_lock); + INIT_LIST_HEAD(&sbi->s_mmap_dirty); + if (silent) { + ioc_mount->zmi.po.mount_flags |= ZUFS_M_SILENT; + set_opt(sbi, SILENT); + } + + sbi->md = fsp->md; + err = md_set_sb(sbi->md, sb->s_bdev, sb, silent); + if (unlikely(err)) + goto error; + + err = _parse_options(sbi, fsp->mount_options, 0, &ioc_mount->zmi.po); + if (err) + goto error; + + err = _setup_bdi(sb, _bdev_name(sb->s_bdev)); + if (err) { + zuf_err_cnd(silent, "Failed to setup bdi => %d\n", err); + goto error; + } + + /* Tell ZUS to mount an FS for us */ + err = zufc_dispatch_mount(ZUF_ROOT(sbi), zuf_fst(sb)->zus_zfi, + ZUFS_M_MOUNT, ioc_mount); + if (unlikely(err)) { + zuf_err_cnd(silent, "zufc_dispatch_mount failed => %d\n", err); + goto error; + } + sbi->zus_sbi = ioc_mount->zmi.zus_sbi; + + /* Init with default values */ + sb->s_blocksize_bits = ioc_mount->zmi.s_blocksize_bits; + sb->s_blocksize = 1 << ioc_mount->zmi.s_blocksize_bits; + + rfi = &zuf_fst(sb)->rfi; + + sb->s_magic = rfi->FS_magic; + sb->s_time_gran = rfi->s_time_gran; + sb->s_maxbytes = rfi->s_maxbytes; + sb->s_flags |= SB_NOSEC; + + sbi->fs_caps = ioc_mount->zmi.fs_caps; + if (sbi->fs_caps & ZUFS_FSC_ACL_ON) + sb->s_flags |= SB_POSIXACL; + + sb->s_op 
= &zuf_sops; + + root_i = zuf_iget(sb, ioc_mount->zmi.zus_ii, ioc_mount->zmi._zi, + &exist); + if (IS_ERR(root_i)) { + err = PTR_ERR(root_i); + zuf_err_cnd(silent, "zuf_iget failed => %d\n", err); + goto error; + } + WARN_ON(exist); + + sb->s_root = d_make_root(root_i); + if (!sb->s_root) { + zuf_err_cnd(silent, "d_make_root root inode failed\n"); + iput(root_i); /* undo zuf_iget */ + err = -ENOMEM; + goto error; + } + + if (!zuf_rdonly(sb)) + _sb_mwtime_now(sb, md_zdt(sbi->md)); + + mt_to_timespec(&root_i->i_ctime, &zus_zi(root_i)->i_ctime); + mt_to_timespec(&root_i->i_mtime, &zus_zi(root_i)->i_mtime); + + _print_mount_info(sbi, fsp->mount_options); + clear_opt(sbi, SILENT); + big_free(ioc_mount, bat); + return 0; + +error: + zuf_warn("NOT mounting => %d\n", err); + if (sbi) { + set_opt(sbi, FAILED); + zuf_put_super(sb); + } + big_free(ioc_mount, bat); + return err; +} + +static void _zst_to_kst(const struct statfs64 *zst, struct kstatfs *kst) +{ + kst->f_type = zst->f_type; + kst->f_bsize = zst->f_bsize; + kst->f_blocks = zst->f_blocks; + kst->f_bfree = zst->f_bfree; + kst->f_bavail = zst->f_bavail; + kst->f_files = zst->f_files; + kst->f_ffree = zst->f_ffree; + kst->f_fsid = zst->f_fsid; + kst->f_namelen = zst->f_namelen; + kst->f_frsize = zst->f_frsize; + kst->f_flags = zst->f_flags; +} + +static int zuf_statfs(struct dentry *d, struct kstatfs *buf) +{ + struct zuf_sb_info *sbi = SBI(d->d_sb); + struct zufs_ioc_statfs ioc_statfs = { + .hdr.in_len = offsetof(struct zufs_ioc_statfs, statfs_out), + .hdr.out_len = sizeof(ioc_statfs), + .hdr.operation = ZUFS_OP_STATFS, + .zus_sbi = sbi->zus_sbi, + }; + int err; + + err = zufc_dispatch(ZUF_ROOT(sbi), &ioc_statfs.hdr, NULL, 0); + if (unlikely(err && err != -EINTR)) { + zuf_err_dispatch(d->d_sb, + "zufc_dispatch failed op=ZUFS_OP_STATFS => %d\n", + err); + return err; + } + + _zst_to_kst(&ioc_statfs.statfs_out, buf); + return 0; +} + +struct __mount_options { + struct zufs_ioc_mount_options imo; + char 
buf[ZUFS_MO_MAX]; +}; + +static int zuf_show_options(struct seq_file *seq, struct dentry *root) +{ + struct zuf_sb_info *sbi = SBI(root->d_sb); + struct __mount_options mo = { + .imo.hdr.in_len = sizeof(mo.imo), + .imo.hdr.out_start = offsetof(typeof(mo.imo), buf), + .imo.hdr.out_len = 0, + .imo.hdr.out_max = sizeof(mo.buf), + .imo.hdr.operation = ZUFS_OP_SHOW_OPTIONS, + .imo.zus_sbi = sbi->zus_sbi, + }; + int err; + + if (test_opt(sbi, EPHEMERAL)) + seq_puts(seq, ",ephemeral"); + if (test_opt(sbi, DAX)) + seq_puts(seq, ",dax"); + + err = zufc_dispatch(ZUF_ROOT(sbi), &mo.imo.hdr, NULL, 0); + if (unlikely(err)) { + zuf_err_dispatch(root->d_sb, + "zufs_dispatch failed op=ZUS_OP_SHOW_OPTIONS => %d\n", + err); + /* NOTE: if zusd crashed and we try to run 'umount', it will + * SEGFAULT because zufc_dispatch will return -EFAULT. + * Just return 0 as if the FS has no specific mount options. + */ + return 0; + } + seq_puts(seq, mo.buf); + + return 0; +} + +static int zuf_show_devname(struct seq_file *seq, struct dentry *root) +{ + seq_printf(seq, "/dev/%s", _bdev_name(root->d_sb->s_bdev)); + + return 0; +} + +static int zuf_remount(struct super_block *sb, int *mntflags, char *data) +{ + struct zuf_sb_info *sbi = SBI(sb); + struct zufs_ioc_mount zim = {}; + struct zufs_ioc_mount *ioc_mount; + size_t remount_options_len, zim_size; + enum big_alloc_type bat; + ulong old_mount_opt = sbi->s_mount_opt; + int err; + + zuf_info("remount... -o %s\n", data); + + remount_options_len = data ? (strlen(data) + 1) : 0; + zim_size = sizeof(zim) + remount_options_len; + ioc_mount = big_alloc(zim_size, sizeof(zim), &zim, + GFP_KERNEL | __GFP_ZERO, &bat); + if (unlikely(!ioc_mount)) + return -ENOMEM; + + ioc_mount->zmi.zus_sbi = sbi->zus_sbi, + ioc_mount->zmi.remount_flags = zuf_rdonly(sb) ? 
ZUFS_REM_WAS_RO : 0; + ioc_mount->zmi.po.mount_options_len = remount_options_len; + + err = _parse_options(sbi, data, 1, &ioc_mount->zmi.po); + if (unlikely(err)) + goto fail; + + if (*mntflags & SB_RDONLY) { + ioc_mount->zmi.remount_flags |= ZUFS_REM_WILL_RO; + + if (!zuf_rdonly(sb)) + _sb_mwtime_now(sb, md_zdt(sbi->md)); + } else if (zuf_rdonly(sb)) { + _sb_mwtime_now(sb, md_zdt(sbi->md)); + } + + err = zufc_dispatch_mount(ZUF_ROOT(sbi), zuf_fst(sb)->zus_zfi, + ZUFS_M_REMOUNT, ioc_mount); + if (unlikely(err)) + goto fail; + + big_free(ioc_mount, bat); + return 0; + +fail: + sbi->s_mount_opt = old_mount_opt; + big_free(ioc_mount, bat); + zuf_dbg_err("remount failed restore option\n"); + return err; +} + +static int zuf_update_s_wtime(struct super_block *sb) +{ + if (!(zuf_rdonly(sb))) { + struct timespec64 now = current_time(sb->s_root->d_inode); + + timespec_to_mt(&md_zdt(SBI(sb)->md)->s_wtime, &now); + } + return 0; +} + +static struct inode *zuf_alloc_inode(struct super_block *sb) +{ + struct zuf_inode_info *zii; + + zii = kmem_cache_alloc(zuf_inode_cachep, GFP_NOFS); + if (!zii) + return NULL; + + zii->vfs_inode.i_version.counter = 1; + return &zii->vfs_inode; +} + +static void zuf_destroy_inode(struct inode *inode) +{ + kmem_cache_free(zuf_inode_cachep, ZUII(inode)); } static void _init_once(void *foo) @@ -31,6 +759,7 @@ static void _init_once(void *foo) struct zuf_inode_info *zii = foo; inode_init_once(&zii->vfs_inode); + zii->zi = NULL; } int __init zuf_init_inodecache(void) @@ -52,8 +781,81 @@ void zuf_destroy_inodecache(void) kmem_cache_destroy(zuf_inode_cachep); } +static struct super_operations zuf_sops = { + .alloc_inode = zuf_alloc_inode, + .destroy_inode = zuf_destroy_inode, + .put_super = zuf_put_super, + .freeze_fs = zuf_update_s_wtime, + .unfreeze_fs = zuf_update_s_wtime, + .statfs = zuf_statfs, + .remount_fs = zuf_remount, + .show_options = zuf_show_options, + .show_devname = zuf_show_devname, +}; + struct dentry *zuf_mount(struct 
file_system_type *fs_type, int flags, const char *dev_name, void *data) { - return ERR_PTR(-ENOTSUPP); + int silent = flags & SB_SILENT ? 1 : 0; + struct __fill_super_params fsp = { + .mount_options = data, + }; + struct zuf_fs_type *fst = ZUF_FST(fs_type); + struct register_fs_info *rfi = &fst->rfi; + struct mdt_check mc = { + .alloc_mask = ZUFS_ALLOC_MASK, + .major_ver = rfi->FS_ver_major, + .minor_ver = rfi->FS_ver_minor, + .magic = rfi->FS_magic, + + .holder = fs_type, + .silent = silent, + }; + struct dentry *ret = NULL; + char path[PATH_UUID]; + const char *dev_path = NULL; + int err; + + zuf_dbg_vfs("dev_name=%s, data=%s\n", dev_name, (const char *)data); + + err = md_init(&fsp.md, dev_name, &mc, path, &dev_path); + if (unlikely(err)) { + zuf_err_cnd(silent, "md_init failed! => %d\n", err); + goto out; + } + + zuf_dbg_vfs("mounting with dev_path=%s\n", dev_path); + ret = mount_bdev(fs_type, flags, dev_path, &fsp, zuf_fill_super); + +out: + if (unlikely(err) && fsp.md) + md_fini(fsp.md, true); + + return err ? 
ERR_PTR(err) : ret; +} + +// ==== 8k fast_alloc ==== +static struct kmem_cache *zuf_8k_cachep; + +void *zuf_8k_alloc(gfp_t gfp) +{ + return kmem_cache_alloc(zuf_8k_cachep, gfp); +} + +void zuf_8k_free(void *ptr) +{ + kmem_cache_free(zuf_8k_cachep, ptr); +} + +int __init zuf_8k_cache_init(void) +{ + zuf_8k_cachep = kmem_cache_create("zuf_8k_cache", S_8K, 0, 0, NULL); + if (unlikely(!zuf_8k_cachep)) + return -ENOMEM; + return 0; +} + +void zuf_8k_cache_fini(void) +{ + kmem_cache_destroy(zuf_8k_cachep); } diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index cc49cfa95244..a417f9463682 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -63,6 +63,8 @@ const char *zuf_op_name(enum e_zufs_operation op) switch (op) { CASE_ENUM_NAME(ZUFS_OP_NULL); CASE_ENUM_NAME(ZUFS_OP_BREAK); + CASE_ENUM_NAME(ZUFS_OP_STATFS); + CASE_ENUM_NAME(ZUFS_OP_SHOW_OPTIONS); case ZUFS_OP_MAX_OPT: default: return "UNKNOWN"; @@ -290,6 +292,95 @@ static void zufc_mounter_release(struct file *file) } } +static int _zu_private_mounter_release(struct file *file) +{ + struct zuf_root_info *zri = ZRI(file->f_inode->i_sb); + struct zuf_special_file *zsf = file->private_data; + struct zuf_private_mount_info *zpmi; + int err; + + zpmi = container_of(zsf, struct zuf_private_mount_info, zsf); + + err = zuf_private_umount(zri, zpmi->sb); + + kfree(zpmi); + + return err; +} + +static int _zu_private_mounter(struct file *file, void *parg) +{ + struct super_block *sb = file->f_inode->i_sb; + struct zufs_ioc_mount_private *zip = NULL; + struct zuf_private_mount_info *zpmi; + struct zuf_root_info *zri = ZRI(sb); + struct zufs_ioc_hdr hdr; + __u32 is_umount; + ulong cp_ret; + int err = 0; + + get_user(is_umount, + &((struct zufs_ioc_mount_private *)parg)->is_umount); + if (is_umount) + return _zu_private_mounter_release(file); + + if (unlikely(file->private_data)) { + zuf_err("One mount per runner please..\n"); + return -EINVAL; + } + + zpmi = kzalloc(sizeof(*zpmi), GFP_KERNEL); + if (unlikely(!zpmi)) { + 
zuf_err("alloc failed\n"); + return -ENOMEM; + } + + zpmi->zsf.type = zlfs_e_private_mount; + zpmi->zsf.file = file; + + cp_ret = copy_from_user(&hdr, parg, sizeof(hdr)); + if (unlikely(cp_ret)) { + zuf_err("copy_from_user(hdr) => %ld\n", cp_ret); + err = -EFAULT; + goto fail; + } + + zip = kmalloc(hdr.in_len, GFP_KERNEL); + if (unlikely(!zip)) { + zuf_err("alloc failed\n"); + err = -ENOMEM; + goto fail; + } + + cp_ret = copy_from_user(zip, parg, hdr.in_len); + if (unlikely(cp_ret)) { + zuf_err("copy_from_user => %ld\n", cp_ret); + err = -EFAULT; + goto fail; + } + + err = zuf_private_mount(zri, &zip->rfi, &zip->zmi, &zpmi->sb); + if (unlikely(err)) + goto fail; + + cp_ret = copy_to_user(parg, zip, hdr.in_len); + if (unlikely(cp_ret)) { + zuf_err("copy_to_user =>%ld\n", cp_ret); + err = -EFAULT; + goto fail; + } + + file->private_data = &zpmi->zsf; + +out: + kfree(zip); + return err; + +fail: + kfree(zpmi); + goto out; +} + /* ~~~~ ZU_IOC_NUMA_MAP ~~~~ */ static int _zu_numa_map(struct file *file, void *parg) { @@ -966,6 +1057,8 @@ long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg) return _zu_wait(file, parg); case ZU_IOC_ALLOC_BUFFER: return _zu_ebuff_alloc(file, parg); + case ZU_IOC_PRIVATE_MOUNT: + return _zu_private_mounter(file, parg); case ZU_IOC_BREAK_ALL: return _zu_break(file, parg); default: @@ -988,6 +1081,9 @@ int zufc_release(struct inode *inode, struct file *file) case zlfs_e_mout_thread: zufc_mounter_release(file); return 0; + case zlfs_e_private_mount: + _zu_private_mounter_release(file); + return 0; case zlfs_e_pmem: /* NOTHING to clean for pmem file yet */ /* zuf_pmem_release(file);*/ diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c index ea7eb810ea9d..ecf240bd3e3f 100644 --- a/fs/zuf/zuf-root.c +++ b/fs/zuf/zuf-root.c @@ -405,6 +405,10 @@ int __init zuf_root_init(void) { int err = zuf_init_inodecache(); + if (unlikely(err)) + return err; + + err = zuf_8k_cache_init(); if (unlikely(err)) return err; @@ -431,6 +435,7 @@ static void 
__exit zuf_root_exit(void) { unregister_filesystem(&zufr_type); kset_unregister(zufr_kset); + zuf_8k_cache_fini(); zuf_destroy_inodecache(); } diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h index d0cb762f50ec..18cbc376cfa6 100644 --- a/fs/zuf/zuf.h +++ b/fs/zuf/zuf.h @@ -106,12 +106,33 @@ struct zuf_pmem_file { struct multi_devices *md; }; +/* + * Private Super-block flags + */ +enum { + ZUF_MOUNT_PEDANTIC = 0x000001, /* Check for memory leaks */ + ZUF_MOUNT_PEDANTIC_SHADOW = 0x00002, /* */ + ZUF_MOUNT_SILENT = 0x000004, /* verbosity is silent */ + ZUF_MOUNT_EPHEMERAL = 0x000008, /* Don't persist the data */ + ZUF_MOUNT_FAILED = 0x000010, /* mark a failed-mount */ + ZUF_MOUNT_DAX = 0x000020, /* mounted with dax option */ + ZUF_MOUNT_POSIXACL = 0x000040, /* mounted with posix acls */ + ZUF_MOUNT_PRIVATE = 0x000080, /* private mount from runner */ +}; + +#define clear_opt(sbi, opt) (sbi->s_mount_opt &= ~ZUF_MOUNT_ ## opt) +#define set_opt(sbi, opt) (sbi->s_mount_opt |= ZUF_MOUNT_ ## opt) +#define test_opt(sbi, opt) (sbi->s_mount_opt & ZUF_MOUNT_ ## opt) /* * ZUF per-inode data in memory */ struct zuf_inode_info { struct inode vfs_inode; + + /* cookies from Server */ + struct zus_inode *zi; + struct zus_inode_info *zus_ii; }; static inline struct zuf_inode_info *ZUII(struct inode *inode) @@ -164,6 +185,119 @@ static inline bool zuf_rdonly(struct super_block *sb) return sb_rdonly(sb); } +static inline bool zuf_is_nio_reads(struct inode *inode) +{ + return SBI(inode->i_sb)->fs_caps & ZUFS_FSC_NIO_READS; +} + +static inline bool zuf_is_nio_writes(struct inode *inode) +{ + return SBI(inode->i_sb)->fs_caps & ZUFS_FSC_NIO_WRITES; +} + +static inline struct zus_inode *zus_zi(struct inode *inode) +{ + return ZUII(inode)->zi; +} + +/* An accessor because of the frequent use in prints */ +static inline ulong _zi_ino(struct zus_inode *zi) +{ + return le64_to_cpu(zi->i_ino); +} + +static inline bool _zi_active(struct zus_inode *zi) +{ + return (zi->i_nlink || zi->i_mode); +} + +static 
inline void mt_to_timespec(struct timespec64 *t, __le64 *mt) +{ + u32 nsec; + + t->tv_sec = div_s64_rem(le64_to_cpu(*mt), NSEC_PER_SEC, &nsec); + t->tv_nsec = nsec; +} + +static inline void timespec_to_mt(__le64 *mt, struct timespec64 *t) +{ + *mt = cpu_to_le64(t->tv_sec * NSEC_PER_SEC + t->tv_nsec); +} + +static inline +void zus_inode_cmtime_now(struct inode *inode, struct zus_inode *zi) +{ + inode->i_mtime = inode->i_ctime = current_time(inode); + timespec_to_mt(&zi->i_ctime, &inode->i_ctime); + zi->i_mtime = zi->i_ctime; +} + +static inline +void zus_inode_ctime_now(struct inode *inode, struct zus_inode *zi) +{ + inode->i_ctime = current_time(inode); + timespec_to_mt(&zi->i_ctime, &inode->i_ctime); +} + +static inline void *zuf_dpp_t_addr(struct super_block *sb, zu_dpp_t v) +{ + /* TODO: Implement zufs_ioc_create_mempool already */ + if (WARN_ON(zu_dpp_t_pool(v))) + return NULL; + + return md_addr_verify(SBI(sb)->md, zu_dpp_t_val(v)); +} + +enum big_alloc_type { ba_stack, ba_8k, ba_vmalloc }; +#define S_8K (1024UL * 8) + +void *zuf_8k_alloc(gfp_t gfp); +void zuf_8k_free(void *ptr); + +static inline +void *big_alloc(uint bytes, uint local_size, void *local, gfp_t gfp, + enum big_alloc_type *bat) +{ + void *ptr; + + if (bytes <= local_size) { + *bat = ba_stack; + ptr = local; + } else if (bytes <= S_8K) { + *bat = ba_8k; + ptr = zuf_8k_alloc(gfp); + } else { + *bat = ba_vmalloc; + ptr = vmalloc(bytes); + } + + return ptr; +} + +static inline void big_free(void *ptr, enum big_alloc_type bat) +{ + if (unlikely(!ptr)) + return; + + switch (bat) { + case ba_stack: + break; + case ba_8k: + zuf_8k_free(ptr); + break; + case ba_vmalloc: + vfree(ptr); + } +} + +#if (CONFIG_FRAME_WARN == 0) +# define ZUF_MAX_STACK(minus) (THREAD_SIZE / 2 - minus) +#elif (CONFIG_FRAME_WARN < (S_8K + 8)) +# define ZUF_MAX_STACK(minus) (CONFIG_FRAME_WARN - minus) +#else +# define ZUF_MAX_STACK(minus) ((S_8K + 8) - minus) +#endif + struct zuf_dispatch_op; typedef int (*overflow_handler)(struct 
zuf_dispatch_op *zdo, void *parg, ulong zt_max_bytes); diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index 4292a4fa5f1a..1af3bd016453 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -330,6 +330,17 @@ struct zufs_ioc_mount { }; #define ZU_IOC_MOUNT _IOWR('Z', 11, struct zufs_ioc_mount) +/* Mount locally with a zus-runner process */ +#define ZUFS_PMDEV_OPT "zpmdev" +struct zufs_ioc_mount_private { + struct zufs_ioc_hdr hdr; + __u32 mount_fd; /* kernel cookie */ + __u32 is_umount; /* true or false */ + struct register_fs_info rfi; + struct zufs_mount_info zmi; /* must be last */ +}; +#define ZU_IOC_PRIVATE_MOUNT _IOWR('Z', 12, struct zufs_ioc_mount_private) + /* pmem */ struct zufs_cpu_set { ulong bits[16]; @@ -432,7 +443,31 @@ enum e_zufs_operation { ZUFS_OP_NULL = 0, ZUFS_OP_BREAK = 1, /* Kernel telling Server to exit */ + ZUFS_OP_STATFS = 2, + ZUFS_OP_SHOW_OPTIONS = 3, + ZUFS_OP_MAX_OPT, }; #define ZUFS_MO_MAX 512 + +struct zufs_ioc_mount_options { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_sb_info *zus_sbi; + + /* OUT */ + char buf[0]; +}; + +/* ZUFS_OP_STATFS */ +struct zufs_ioc_statfs { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_sb_info *zus_sbi; + + /* OUT */ + struct statfs64 statfs_out; +}; + #endif /* _LINUX_ZUFS_API_H */ From patchwork Thu Sep 26 02:07:17 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boaz Harrosh X-Patchwork-Id: 11161815
From: Boaz Harrosh
To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin
Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams
Subject: [PATCH 08/16] zuf: Namei and directory operations
Date: Thu, 26 Sep 2019 05:07:17 +0300
Message-Id: <20190926020725.19601-9-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>
References: <20190926020725.19601-1-boazh@netapp.com>

Introducing:
 - Creation/deletion of files
 - Directory add/remove
 - Other namei operations

This all follows the standard Kernel way of doing things: each VFS operation is packed and dispatched to the Server, and when the dispatch returns, the results are pushed into the Kernel structures.

NOTE: A zus_inode communication structure is returned as a zu_dpp_t (dual-port pointer). Both Kernel and Server can read/write this object. When the Kernel modifies it, it always does so before the dispatch, so the Server can persist the changes. The Server also uses it to return new info to be updated into the vfs_inode. On a pmem system this object can point directly into storage.
Signed-off-by: Boaz Harrosh --- fs/zuf/Makefile | 2 +- fs/zuf/_extern.h | 41 ++++ fs/zuf/directory.c | 100 ++++++++ fs/zuf/file.c | 31 +++ fs/zuf/inode.c | 561 ++++++++++++++++++++++++++++++++++++++++++++- fs/zuf/namei.c | 402 ++++++++++++++++++++++++++++++++ fs/zuf/super.c | 2 + fs/zuf/zuf-core.c | 10 + fs/zuf/zuf.h | 63 +++++ fs/zuf/zus_api.h | 94 ++++++++ 10 files changed, 1304 insertions(+), 2 deletions(-) create mode 100644 fs/zuf/directory.c create mode 100644 fs/zuf/file.c create mode 100644 fs/zuf/namei.c diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile index a5800cad73fd..2bfed45723e3 100644 --- a/fs/zuf/Makefile +++ b/fs/zuf/Makefile @@ -17,5 +17,5 @@ zuf-y += md.o t1.o t2.o zuf-y += zuf-core.o zuf-root.o # Main FS -zuf-y += super.o inode.o +zuf-y += super.o inode.o directory.o namei.o file.o zuf-y += module.o diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index b1514e5821a2..50887792bf42 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -62,10 +62,51 @@ int zuf_private_umount(struct zuf_root_info *zri, struct super_block *sb); struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id, struct zus_sb_info *zus_sbi); +/* file.c */ +long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len); + +/* namei.c */ +void zuf_zii_sync(struct inode *inode, bool sync_nlink); + /* inode.c */ +int zuf_evict_dispatch(struct super_block *sb, struct zus_inode_info *zus_ii, + int operation, uint flags); struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii, zu_dpp_t _zi, bool *exist); +void zuf_evict_inode(struct inode *inode); +struct inode *zuf_new_inode(struct inode *dir, umode_t mode, + const struct qstr *qstr, const char *symname, + ulong rdev_or_isize, bool tmpfile); +int zuf_write_inode(struct inode *inode, struct writeback_control *wbc); +int zuf_update_time(struct inode *inode, struct timespec64 *time, int flags); +int zuf_setattr(struct dentry *dentry, struct iattr *attr); +int zuf_getattr(const 
struct path *path, struct kstat *stat, + u32 request_mask, unsigned int flags); +void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi); + +/* directory.c */ +int zuf_add_dentry(struct inode *dir, struct qstr *str, struct inode *inode); +int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode); + /* t1.c */ int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma); +/* + * Inode and files operations + */ + +/* file.c */ +extern const struct inode_operations zuf_file_inode_operations; +extern const struct file_operations zuf_file_operations; + +/* inode.c */ +extern const struct address_space_operations zuf_aops; + +/* namei.c */ +extern const struct inode_operations zuf_dir_inode_operations; +extern const struct inode_operations zuf_special_inode_operations; + +/* directory.c */ +extern const struct file_operations zuf_dir_operations; + #endif /*ndef __ZUF_EXTERN_H__*/ diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c new file mode 100644 index 000000000000..5624e05f96e5 --- /dev/null +++ b/fs/zuf/directory.c @@ -0,0 +1,100 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * File operations for directories. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + * Sagi Manole + */ + +#include +#include +#include "zuf.h" + +static int zuf_readdir(struct file *file, struct dir_context *ctx) +{ + return -EOPNOTSUPP; +} + +/* + * FIXME: comment to full git diff + */ + +static int _dentry_dispatch(struct inode *dir, struct inode *inode, + struct qstr *str, int operation) +{ + struct zufs_ioc_dentry ioc_dentry = { + .hdr.operation = operation, + .hdr.in_len = sizeof(ioc_dentry), + .hdr.out_len = sizeof(ioc_dentry), + .zus_ii = inode ?
ZUII(inode)->zus_ii : NULL, + .zus_dir_ii = ZUII(dir)->zus_ii, + .str.len = str->len, + }; + int err; + + memcpy(&ioc_dentry.str.name, str->name, str->len); + + err = zufc_dispatch(ZUF_ROOT(SBI(dir->i_sb)), &ioc_dentry.hdr, NULL, 0); + if (unlikely(err)) { + zuf_dbg_err("[%ld] op=%d zufc_dispatch failed => %d\n", + dir->i_ino, operation, err); + return err; + } + + return 0; +} + +/* returns 0 on success, err-code on failure */ +int zuf_add_dentry(struct inode *dir, struct qstr *str, struct inode *inode) +{ + struct zuf_inode_info *zii = ZUII(dir); + int err; + + if (!str->len || !zii->zi) + return -EINVAL; + + zus_inode_cmtime_now(dir, zii->zi); + err = _dentry_dispatch(dir, inode, str, ZUFS_OP_ADD_DENTRY); + if (unlikely(err)) { + zuf_dbg_err("[%ld] _dentry_dispatch failed => %d\n", + dir->i_ino, err); + return err; + } + zuf_zii_sync(dir, false); + + return 0; +} + +int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode) +{ + struct zuf_inode_info *zii = ZUII(dir); + int err; + + if (!str->len) + return -EINVAL; + + zus_inode_cmtime_now(dir, zii->zi); + err = _dentry_dispatch(dir, inode, str, ZUFS_OP_REMOVE_DENTRY); + if (unlikely(err)) { + zuf_dbg_err("[%ld] _dentry_dispatch failed => %d\n", + dir->i_ino, err); + return err; + } + zuf_zii_sync(dir, false); + + return 0; +} + +const struct file_operations zuf_dir_operations = { + .llseek = generic_file_llseek, + .read = generic_read_dir, + .iterate_shared = zuf_readdir, + .fsync = noop_fsync, +}; diff --git a/fs/zuf/file.c b/fs/zuf/file.c new file mode 100644 index 000000000000..619dada43666 --- /dev/null +++ b/fs/zuf/file.c @@ -0,0 +1,31 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * File operations for files. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ * + * Authors: + * Boaz Harrosh + * Sagi Manole + */ + +#include "zuf.h" + +long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) +{ + return -EOPNOTSUPP; +} + +const struct file_operations zuf_file_operations = { + .open = generic_file_open, +}; + +const struct inode_operations zuf_file_inode_operations = { + .setattr = zuf_setattr, + .getattr = zuf_getattr, + .update_time = zuf_update_time, +}; diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c index a6115289dcda..88cb1937c223 100644 --- a/fs/zuf/inode.c +++ b/fs/zuf/inode.c @@ -13,11 +13,570 @@ * Sagi Manole */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + #include "zuf.h" +/* Flags that should be inherited by new inodes from their parent. */ +#define ZUFS_FL_INHERITED (S_SYNC | S_NOATIME | S_DIRSYNC) + +/* Flags that are appropriate for regular files (all but dir-specific ones). */ +#define ZUFS_FL_REG_MASK (~S_DIRSYNC) + +/* Flags that are appropriate for non-dir/non-regular files.
*/ +#define ZUFS_FL_OTHER_MASK (S_NOATIME) + +static bool _zi_valid(struct zus_inode *zi) +{ + if (!_zi_active(zi)) + return false; + + switch (le16_to_cpu(zi->i_mode) & S_IFMT) { + case S_IFREG: + case S_IFDIR: + case S_IFLNK: + case S_IFBLK: + case S_IFCHR: + case S_IFIFO: + case S_IFSOCK: + return true; + default: + zuf_err("unknown file type ino=%lu mode=%d\n", _zi_ino(zi), + le16_to_cpu(zi->i_mode)); + return false; + } +} + +static void _set_inode_from_zi(struct inode *inode, struct zus_inode *zi) +{ + inode->i_mode = le16_to_cpu(zi->i_mode); + inode->i_uid = KUIDT_INIT(le32_to_cpu(zi->i_uid)); + inode->i_gid = KGIDT_INIT(le32_to_cpu(zi->i_gid)); + set_nlink(inode, le16_to_cpu(zi->i_nlink)); + inode->i_size = le64_to_cpu(zi->i_size); + inode->i_blocks = le64_to_cpu(zi->i_blocks); + mt_to_timespec(&inode->i_atime, &zi->i_atime); + mt_to_timespec(&inode->i_ctime, &zi->i_ctime); + mt_to_timespec(&inode->i_mtime, &zi->i_mtime); + inode->i_generation = le64_to_cpu(zi->i_generation); + zuf_set_inode_flags(inode, zi); + + inode->i_mapping->a_ops = &zuf_aops; + + switch (inode->i_mode & S_IFMT) { + case S_IFREG: + inode->i_op = &zuf_file_inode_operations; + inode->i_fop = &zuf_file_operations; + break; + case S_IFDIR: + inode->i_op = &zuf_dir_inode_operations; + inode->i_fop = &zuf_dir_operations; + break; + case S_IFBLK: + case S_IFCHR: + case S_IFIFO: + case S_IFSOCK: + inode->i_size = 0; + inode->i_op = &zuf_special_inode_operations; + init_special_inode(inode, inode->i_mode, + le32_to_cpu(zi->i_rdev)); + break; + default: + zuf_err("unknown file type ino=%lu mode=%d\n", _zi_ino(zi), + le16_to_cpu(zi->i_mode)); + break; + } + + inode->i_ino = le64_to_cpu(zi->i_ino); +} + +/* Mask out flags that are inappropriate for the given type of inode.
*/ +static uint _calc_flags(umode_t mode, uint dir_flags, uint flags) +{ + uint zufs_flags = dir_flags & ZUFS_FL_INHERITED; + + if (S_ISREG(mode)) + zufs_flags &= ZUFS_FL_REG_MASK; + else if (!S_ISDIR(mode)) + zufs_flags &= ZUFS_FL_OTHER_MASK; + + return zufs_flags; +} + +static int _set_zi_from_inode(struct inode *dir, struct zus_inode *zi, + struct inode *inode) +{ + struct zus_inode *zidir = zus_zi(dir); + + if (unlikely(!zidir)) + return -EACCES; + + zi->i_mode = cpu_to_le16(inode->i_mode); + zi->i_uid = cpu_to_le32(__kuid_val(inode->i_uid)); + zi->i_gid = cpu_to_le32(__kgid_val(inode->i_gid)); + /* NOTE: zus is boss of i_nlink (but let it know what we think) */ + zi->i_nlink = cpu_to_le16(inode->i_nlink); + zi->i_size = cpu_to_le64(inode->i_size); + zi->i_blocks = cpu_to_le64(inode->i_blocks); + timespec_to_mt(&zi->i_atime, &inode->i_atime); + timespec_to_mt(&zi->i_mtime, &inode->i_mtime); + timespec_to_mt(&zi->i_ctime, &inode->i_ctime); + zi->i_generation = cpu_to_le64(inode->i_generation); + + if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) + zi->i_rdev = cpu_to_le32(inode->i_rdev); + + zi->i_flags = cpu_to_le16(_calc_flags(inode->i_mode, + le16_to_cpu(zidir->i_flags), + inode->i_flags)); + return 0; +} + +static bool _times_equal(struct timespec64 *t, __le64 *mt) +{ + __le64 time; + + timespec_to_mt(&time, t); + return time == *mt; +} + +/* This function checks if VFS's inode and zus_inode are in sync */ +static void _warn_inode_dirty(struct inode *inode, struct zus_inode *zi) +{ +#define __MISMACH_INT(inode, X, Y) \ + if (X != Y) \ + zuf_warn("[%ld] " #X"=0x%lx " #Y"=0x%lx""\n", \ + inode->i_ino, (ulong)(X), (ulong)(Y)) +#define __MISMACH_TIME(inode, X, Y) \ + if (!_times_equal(X, Y)) { \ + struct timespec64 t; \ + mt_to_timespec(&t, (Y));\ + zuf_warn("[%ld] " #X"=%lld:%ld " #Y"=%lld:%ld""\n", \ + inode->i_ino, (X)->tv_sec, (X)->tv_nsec, \ + t.tv_sec, t.tv_nsec); \ + } + + if (!_times_equal(&inode->i_ctime, &zi->i_ctime) || +
!_times_equal(&inode->i_mtime, &zi->i_mtime) || + !_times_equal(&inode->i_atime, &zi->i_atime) || + inode->i_size != le64_to_cpu(zi->i_size) || + inode->i_mode != le16_to_cpu(zi->i_mode) || + __kuid_val(inode->i_uid) != le32_to_cpu(zi->i_uid) || + __kgid_val(inode->i_gid) != le32_to_cpu(zi->i_gid) || + inode->i_nlink != le16_to_cpu(zi->i_nlink) || + inode->i_ino != _zi_ino(zi) || + inode->i_blocks != le64_to_cpu(zi->i_blocks)) { + __MISMACH_TIME(inode, &inode->i_ctime, &zi->i_ctime); + __MISMACH_TIME(inode, &inode->i_mtime, &zi->i_mtime); + __MISMACH_TIME(inode, &inode->i_atime, &zi->i_atime); + __MISMACH_INT(inode, inode->i_size, le64_to_cpu(zi->i_size)); + __MISMACH_INT(inode, inode->i_mode, le16_to_cpu(zi->i_mode)); + __MISMACH_INT(inode, __kuid_val(inode->i_uid), + le32_to_cpu(zi->i_uid)); + __MISMACH_INT(inode, __kgid_val(inode->i_gid), + le32_to_cpu(zi->i_gid)); + __MISMACH_INT(inode, inode->i_nlink, le16_to_cpu(zi->i_nlink)); + __MISMACH_INT(inode, inode->i_ino, _zi_ino(zi)); + __MISMACH_INT(inode, inode->i_blocks, + le64_to_cpu(zi->i_blocks)); + } +} + +static void _zii_connect(struct inode *inode, struct zus_inode *zi, + struct zus_inode_info *zus_ii) +{ + struct zuf_inode_info *zii = ZUII(inode); + + zii->zi = zi; + zii->zus_ii = zus_ii; +} + struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii, zu_dpp_t _zi, bool *exist) { - return ERR_PTR(-ENOTSUPP); + struct zus_inode *zi = zuf_dpp_t_addr(sb, _zi); + struct inode *inode; + + *exist = false; + if (unlikely(!zi)) { + /* Don't trust ZUS pointers */ + zuf_err("Bad zus_inode 0x%llx\n", _zi); + return ERR_PTR(-EIO); + } + if (unlikely(!zus_ii)) { + zuf_err("zus_ii NULL\n"); + return ERR_PTR(-EIO); + } + + if (!_zi_valid(zi)) { + zuf_err("inactive node ino=%lld links=%d mode=%d\n", zi->i_ino, + zi->i_nlink, zi->i_mode); + return ERR_PTR(-ESTALE); + } + + zuf_dbg_zus("[%lld] size=0x%llx, blocks=0x%llx ct=0x%llx mt=0x%llx link=0x%x mode=0x%x xattr=0x%llx\n", + zi->i_ino, zi->i_size, 
zi->i_blocks, zi->i_ctime, + zi->i_mtime, zi->i_nlink, zi->i_mode, zi->i_xattr); + + inode = iget_locked(sb, _zi_ino(zi)); + if (unlikely(!inode)) + return ERR_PTR(-ENOMEM); + + if (!(inode->i_state & I_NEW)) { + *exist = true; + return inode; + } + + _set_inode_from_zi(inode, zi); + _zii_connect(inode, zi, zus_ii); + + unlock_new_inode(inode); + return inode; +} + +int zuf_evict_dispatch(struct super_block *sb, struct zus_inode_info *zus_ii, + int operation, uint flags) +{ + struct zufs_ioc_evict_inode ioc_evict_inode = { + .hdr.in_len = sizeof(ioc_evict_inode), + .hdr.out_len = sizeof(ioc_evict_inode), + .hdr.operation = operation, + .zus_ii = zus_ii, + .flags = flags, + }; + int err; + + err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_evict_inode.hdr, NULL, 0); + if (unlikely(err && err != -EINTR)) + zuf_err_dispatch(sb, "zufc_dispatch failed op=%s => %d\n", + zuf_op_name(operation), err); + return err; +} + +void zuf_evict_inode(struct inode *inode) +{ + struct super_block *sb = inode->i_sb; + struct zuf_inode_info *zii = ZUII(inode); + + if (!inode->i_nlink) { + if (unlikely(!zii->zi)) { + zuf_dbg_err("[%ld] inode without zi mode=0x%x size=0x%llx\n", + inode->i_ino, inode->i_mode, inode->i_size); + goto out; + } + + if (unlikely(is_bad_inode(inode))) + zuf_dbg_err("[%ld] inode is bad mode=0x%x zi=%p\n", + inode->i_ino, inode->i_mode, zii->zi); + else + _warn_inode_dirty(inode, zii->zi); + + zuf_w_lock(zii); + + zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_FREE_INODE, 0); + + inode->i_mtime = inode->i_ctime = current_time(inode); + inode->i_size = 0; + + zuf_w_unlock(zii); + } else { + zuf_dbg_vfs("[%ld] inode is going down?\n", inode->i_ino); + + zuf_smw_lock(zii); + + zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_EVICT_INODE, 0); + + zuf_smw_unlock(zii); + } + +out: + zii->zus_ii = NULL; + zii->zi = NULL; + + clear_inode(inode); +} + +/* @rdev_or_isize is i_size in the case of a symlink + * and rdev in the case of special-files + */ +struct inode 
*zuf_new_inode(struct inode *dir, umode_t mode, + const struct qstr *qstr, const char *symname, + ulong rdev_or_isize, bool tmpfile) +{ + struct super_block *sb = dir->i_sb; + struct zuf_sb_info *sbi = SBI(sb); + struct zufs_ioc_new_inode ioc_new_inode = { + .hdr.in_len = sizeof(ioc_new_inode), + .hdr.out_len = sizeof(ioc_new_inode), + .hdr.operation = ZUFS_OP_NEW_INODE, + .dir_ii = ZUII(dir)->zus_ii, + .flags = tmpfile ? ZI_TMPFILE : 0, + .str.len = qstr->len, + }; + struct inode *inode; + struct zus_inode *zi = NULL; + struct page *pages[2]; + uint nump = 0; + int err; + + memcpy(&ioc_new_inode.str.name, qstr->name, qstr->len); + + inode = new_inode(sb); + if (!inode) + return ERR_PTR(-ENOMEM); + + inode_init_owner(inode, dir, mode); + inode->i_blocks = inode->i_size = 0; + inode->i_ctime = inode->i_mtime = current_time(dir); + inode->i_atime = inode->i_ctime; + + zuf_dbg_verbose("inode=%p name=%s\n", inode, qstr->name); + + zuf_set_inode_flags(inode, &ioc_new_inode.zi); + + if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) || + S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) { + init_special_inode(inode, mode, rdev_or_isize); + } + + err = _set_zi_from_inode(dir, &ioc_new_inode.zi, inode); + if (unlikely(err)) + goto fail; + + zus_inode_cmtime_now(dir, zus_zi(dir)); + + err = zufc_dispatch(ZUF_ROOT(sbi), &ioc_new_inode.hdr, pages, nump); + if (unlikely(err)) { + zuf_dbg_err("zufc_dispatch failed => %d\n", err); + goto fail; + } + zi = zuf_dpp_t_addr(sb, ioc_new_inode._zi); + if (unlikely(!zi)) { + /* Don't trust ZUS pointers */ + zuf_err("Bad zus_inode 0x%llx\n", ioc_new_inode._zi); + err = -EIO; + goto fail; + } + + _zii_connect(inode, zi, ioc_new_inode.zus_ii); + + /* update inode fields from filesystem inode */ + inode->i_ino = le64_to_cpu(zi->i_ino); + inode->i_size = le64_to_cpu(zi->i_size); + inode->i_generation = le64_to_cpu(zi->i_generation); + inode->i_blocks = le64_to_cpu(zi->i_blocks); + set_nlink(inode, le16_to_cpu(zi->i_nlink)); + zuf_zii_sync(dir, false); + + zuf_dbg_zus("[%lld] size=0x%llx, blocks=0x%llx ct=0x%llx mt=0x%llx link=0x%x mode=0x%x xattr=0x%llx\n", + zi->i_ino,
zi->i_size, zi->i_blocks, zi->i_ctime, + zi->i_mtime, zi->i_nlink, zi->i_mode, zi->i_xattr); + + zuf_dbg_verbose("allocating inode %ld (zi=%p)\n", _zi_ino(zi), zi); + + err = insert_inode_locked(inode); + if (unlikely(err)) { + zuf_dbg_err("[%ld:%s] generation=%lld insert_inode_locked => %d\n", + inode->i_ino, qstr->name, zi->i_generation, err); + goto fail; + } + + return inode; + +fail: + clear_nlink(inode); + if (zi) + zi->i_nlink = 0; + make_bad_inode(inode); + iput(inode); + return ERR_PTR(err); +} + +int zuf_write_inode(struct inode *inode, struct writeback_control *wbc) +{ + /* write_inode should never be called because we always keep our inodes + * clean. So let us know if write_inode ever gets called. + */ + + /* d_tmpfile() does a mark_inode_dirty so only complain on regular files + * TODO: How? Everything off for now + * WARN_ON(inode->i_nlink); + */ + + return 0; +} + +/* + * Mostly supporting file_accessed() for now. Which is the only one we use. + * + * But also file_update_time is used by fifo code. + */ +int zuf_update_time(struct inode *inode, struct timespec64 *time, int flags) +{ + struct zus_inode *zi = zus_zi(inode); + + if (flags & S_ATIME) { + inode->i_atime = *time; + timespec_to_mt(&zi->i_atime, &inode->i_atime); + /* FIXME: Set a flag that zi needs flushing + * for now every read needs zi-flushing. + */ + } + + /* File_update_time() is not used by zuf. + * FIXME: One exception is O_TMPFILE: the vfs calls file_update_time + * internally, bypassing the FS. So just do it and stay silent.
+ * The zus O_TMPFILE create protocol knows it needs flushing + */ + if ((flags & S_CTIME) || (flags & S_MTIME)) { + if (flags & S_CTIME) { + inode->i_ctime = *time; + timespec_to_mt(&zi->i_ctime, &inode->i_ctime); + } + if (flags & S_MTIME) { + inode->i_mtime = *time; + timespec_to_mt(&zi->i_mtime, &inode->i_mtime); + } + zuf_dbg_vfs("called for S_CTIME | S_MTIME 0x%x\n", flags); + } + + if (flags & ~(S_CTIME | S_MTIME | S_ATIME)) + zuf_err("called for 0x%x\n", flags); + + return 0; +} + +int zuf_getattr(const struct path *path, struct kstat *stat, u32 request_mask, + unsigned int flags) +{ + struct dentry *dentry = path->dentry; + struct inode *inode = d_inode(dentry); + + if (inode->i_flags & S_APPEND) + stat->attributes |= STATX_ATTR_APPEND; + if (inode->i_flags & S_IMMUTABLE) + stat->attributes |= STATX_ATTR_IMMUTABLE; + + stat->attributes_mask |= (STATX_ATTR_APPEND | + STATX_ATTR_IMMUTABLE); + generic_fillattr(inode, stat); + /* stat->blocks should be the number of 512B blocks */ + stat->blocks = inode->i_blocks << (inode->i_sb->s_blocksize_bits - 9); + + return 0; +} + +int zuf_setattr(struct dentry *dentry, struct iattr *attr) +{ + struct inode *inode = dentry->d_inode; + struct zuf_inode_info *zii = ZUII(inode); + struct zus_inode *zi = zii->zi; + struct zufs_ioc_attr ioc_attr = { + .hdr.in_len = sizeof(ioc_attr), + .hdr.out_len = sizeof(ioc_attr), + .hdr.operation = ZUFS_OP_SETATTR, + .zus_ii = zii->zus_ii, + }; + int err; + + if (!zi) + return -EACCES; + + /* Truncate is implemented via fallocate(punch_hole) which means we + * are not atomic with the other ATTRs. 
I think someone said that + * some Kernel FSs don't even support truncate to come together with + * other ATTRs + */ + if ((attr->ia_valid & ATTR_SIZE)) { + ZUF_CHECK_I_W_LOCK(inode); + zuf_smw_lock(zii); + err = __zuf_fallocate(inode, ZUFS_FL_TRUNCATE, attr->ia_size, + ~0ULL); + zuf_smw_unlock(zii); + if (unlikely(err)) + return err; + attr->ia_valid &= ~ATTR_SIZE; + } + + err = setattr_prepare(dentry, attr); + if (unlikely(err)) + return err; + + if (attr->ia_valid & ATTR_MODE) { + zuf_dbg_vfs("[%ld] ATTR_MODE=0x%x\n", + inode->i_ino, attr->ia_mode); + ioc_attr.zuf_attr |= STATX_MODE; + inode->i_mode = attr->ia_mode; + zi->i_mode = cpu_to_le16(inode->i_mode); + if (test_opt(SBI(inode->i_sb), POSIXACL)) { + err = posix_acl_chmod(inode, inode->i_mode); + if (unlikely(err)) + return err; + } + } + + if (attr->ia_valid & ATTR_UID) { + zuf_dbg_vfs("[%ld] ATTR_UID=0x%x\n", + inode->i_ino, __kuid_val(attr->ia_uid)); + ioc_attr.zuf_attr |= STATX_UID; + inode->i_uid = attr->ia_uid; + zi->i_uid = cpu_to_le32(__kuid_val(inode->i_uid)); + } + if (attr->ia_valid & ATTR_GID) { + zuf_dbg_vfs("[%ld] ATTR_GID=0x%x\n", + inode->i_ino, __kgid_val(attr->ia_gid)); + ioc_attr.zuf_attr |= STATX_GID; + inode->i_gid = attr->ia_gid; + zi->i_gid = cpu_to_le32(__kgid_val(inode->i_gid)); + } + + if (attr->ia_valid & ATTR_ATIME) { + ioc_attr.zuf_attr |= STATX_ATIME; + inode->i_atime = attr->ia_atime; + timespec_to_mt(&zi->i_atime, &inode->i_atime); + zuf_dbg_vfs("[%ld] ATTR_ATIME=0x%llx\n", + inode->i_ino, zi->i_atime); + } + if (attr->ia_valid & ATTR_CTIME) { + ioc_attr.zuf_attr |= STATX_CTIME; + inode->i_ctime = attr->ia_ctime; + timespec_to_mt(&zi->i_ctime, &inode->i_ctime); + zuf_dbg_vfs("[%ld] ATTR_CTIME=0x%llx\n", + inode->i_ino, zi->i_ctime); + } + if (attr->ia_valid & ATTR_MTIME) { + ioc_attr.zuf_attr |= STATX_MTIME; + inode->i_mtime = attr->ia_mtime; + timespec_to_mt(&zi->i_mtime, &inode->i_mtime); + zuf_dbg_vfs("[%ld] ATTR_MTIME=0x%llx\n", + inode->i_ino, zi->i_mtime); + } + + err = 
zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_attr.hdr, NULL, 0); + if (unlikely(err)) + zuf_dbg_err("[%ld] set_attr=0x%x failed => %d\n", + inode->i_ino, ioc_attr.zuf_attr, err); + + return err; +} + +void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi) +{ + unsigned int flags = le16_to_cpu(zi->i_flags) & ~ZUFS_S_IMMUTABLE; + + inode->i_flags &= + ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC); + inode->i_flags |= flags; + if (zi->i_flags & ZUFS_S_IMMUTABLE) + inode->i_flags |= S_IMMUTABLE | S_NOATIME; + if (!zi->i_xattr) + inode_has_no_xattr(inode); } +const struct address_space_operations zuf_aops = { +}; diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c new file mode 100644 index 000000000000..299134ca7c07 --- /dev/null +++ b/fs/zuf/namei.c @@ -0,0 +1,402 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * Inode operations for directories. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. 
+ * + * Authors: + * Boaz Harrosh + * Sagi Manole + */ +#include +#include "zuf.h" + + +static struct inode *d_parent(struct dentry *dentry) +{ + return dentry->d_parent->d_inode; +} + +static void _set_nlink(struct inode *inode, struct zus_inode *zi) +{ + set_nlink(inode, le16_to_cpu(zi->i_nlink)); +} + +void zuf_zii_sync(struct inode *inode, bool sync_nlink) +{ + struct zus_inode *zi = zus_zi(inode); + + if (inode->i_size != le64_to_cpu(zi->i_size) || + inode->i_blocks != le64_to_cpu(zi->i_blocks)) { + i_size_write(inode, le64_to_cpu(zi->i_size)); + inode->i_blocks = le64_to_cpu(zi->i_blocks); + } + + if (sync_nlink) + _set_nlink(inode, zi); +} + +static void _instantiate_unlock(struct dentry *dentry, struct inode *inode) +{ + d_instantiate(dentry, inode); + unlock_new_inode(inode); +} + +static struct dentry *zuf_lookup(struct inode *dir, struct dentry *dentry, + uint flags) +{ + struct super_block *sb = dir->i_sb; + struct qstr *str = &dentry->d_name; + uint in_len = offsetof(struct zufs_ioc_lookup, _zi); + struct zufs_ioc_lookup ioc_lu = { + .hdr.in_len = in_len, + .hdr.out_start = in_len, + .hdr.out_len = sizeof(ioc_lu) - in_len, + .hdr.operation = ZUFS_OP_LOOKUP, + .dir_ii = ZUII(dir)->zus_ii, + .str.len = str->len, + }; + struct inode *inode = NULL; + bool exist; + int err; + + zuf_dbg_vfs("[%ld] dentry-name=%s\n", dir->i_ino, dentry->d_name.name); + + if (dentry->d_name.len > ZUFS_NAME_LEN) + return ERR_PTR(-ENAMETOOLONG); + + memcpy(&ioc_lu.str.name, str->name, str->len); + + err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_lu.hdr, NULL, 0); + if (unlikely(err)) { + zuf_dbg_err("zufc_dispatch failed => %d\n", err); + goto out; + } + + inode = zuf_iget(dir->i_sb, ioc_lu.zus_ii, ioc_lu._zi, &exist); + if (exist) { + zuf_dbg_err("race in lookup\n"); + zuf_evict_dispatch(sb, ioc_lu.zus_ii, ZUFS_OP_EVICT_INODE, + ZI_LOOKUP_RACE); + } + +out: + return d_splice_alias(inode, dentry); +} + +/* + * By the time this is called, we already have created + * the directory
cache entry for the new file, but it + * is so far negative - it has no inode. + * + * If the create succeeds, we fill in the inode information + * with d_instantiate(). + */ +static int zuf_create(struct inode *dir, struct dentry *dentry, umode_t mode, + bool excl) +{ + struct inode *inode; + + zuf_dbg_vfs("[%ld] dentry-name=%s mode=0x%x\n", + dir->i_ino, dentry->d_name.name, mode); + + inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, 0, false); + if (IS_ERR(inode)) + return PTR_ERR(inode); + + inode->i_op = &zuf_file_inode_operations; + inode->i_mapping->a_ops = &zuf_aops; + inode->i_fop = &zuf_file_operations; + + _instantiate_unlock(dentry, inode); + + return 0; +} + +static int zuf_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, + dev_t rdev) +{ + struct inode *inode; + + zuf_dbg_vfs("[%ld] mode=0x%x rdev=0x%x\n", dir->i_ino, mode, rdev); + + inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, rdev, false); + if (IS_ERR(inode)) + return PTR_ERR(inode); + + inode->i_op = &zuf_special_inode_operations; + + _instantiate_unlock(dentry, inode); + + return 0; +} + +static int zuf_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) +{ + struct inode *inode; + + inode = zuf_new_inode(dir, mode, &dentry->d_name, NULL, 0, true); + if (IS_ERR(inode)) + return PTR_ERR(inode); + + /* TODO: See about more ephemeral operations on this file, around + * mmap and such. + * Must see about that tmpfile mode that is later link_at + * (probably the !O_EXCL flag) + */ + inode->i_op = &zuf_file_inode_operations; + inode->i_mapping->a_ops = &zuf_aops; + inode->i_fop = &zuf_file_operations; + + set_nlink(inode, 1); /* user_mode knows nothing */ + d_tmpfile(dentry, inode); + /* tmpfile operate on nlink=0. Since this is a tmp file we do not care + * about cl_flushing. If later this file will be linked to a dir. the + * add_dentry will flush the zi. 
+ */ + zus_zi(inode)->i_nlink = inode->i_nlink; + + unlock_new_inode(inode); + return 0; +} + +static int zuf_link(struct dentry *dest_dentry, struct inode *dir, + struct dentry *dentry) +{ + struct inode *inode = dest_dentry->d_inode; + int err; + + zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld dest_d-ino=%ld dest_d-name=%s\n", + dir->i_ino, inode->i_ino, dentry->d_name.name, + d_parent(dentry)->i_ino, + dest_dentry->d_inode->i_ino, dest_dentry->d_name.name); + + if (inode->i_nlink >= ZUFS_LINK_MAX) + return -EMLINK; + + ihold(inode); + + zus_inode_cmtime_now(dir, zus_zi(dir)); + zus_inode_ctime_now(inode, zus_zi(inode)); + + err = zuf_add_dentry(dir, &dentry->d_name, inode); + if (unlikely(err)) { + iput(inode); + return err; + } + + _set_nlink(inode, zus_zi(inode)); + + d_instantiate(dentry, inode); + + return 0; +} + +static int zuf_unlink(struct inode *dir, struct dentry *dentry) +{ + struct inode *inode = dentry->d_inode; + int err; + + zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld\n", + dir->i_ino, inode->i_ino, dentry->d_name.name, + d_parent(dentry)->i_ino); + + inode->i_ctime = dir->i_ctime; + timespec_to_mt(&zus_zi(inode)->i_ctime, &inode->i_ctime); + + err = zuf_remove_dentry(dir, &dentry->d_name, inode); + if (unlikely(err)) + return err; + + zuf_zii_sync(inode, true); + zuf_zii_sync(dir, true); + + return 0; +} + +static int zuf_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) +{ + struct inode *inode; + + zuf_dbg_vfs("[%ld] dentry-name=%s dentry-parent=%ld mode=0x%x\n", + dir->i_ino, dentry->d_name.name, d_parent(dentry)->i_ino, + mode); + + if (dir->i_nlink >= ZUFS_LINK_MAX) + return -EMLINK; + + inode = zuf_new_inode(dir, S_IFDIR | mode, &dentry->d_name, NULL, 0, + false); + if (IS_ERR(inode)) + return PTR_ERR(inode); + + inode->i_op = &zuf_dir_inode_operations; + inode->i_fop = &zuf_dir_operations; + inode->i_mapping->a_ops = &zuf_aops; + + zuf_zii_sync(dir, true); + + 
_instantiate_unlock(dentry, inode); + + return 0; +} + +static bool _empty_dir(struct inode *dir) +{ + if (dir->i_nlink != 2) { + zuf_dbg_verbose("[%ld] directory has nlink(%d) != 2\n", + dir->i_ino, dir->i_nlink); + return false; + } + /* NOTE: Above is not the only -ENOTEMPTY the zus-fs will need to check + * for the "only-files" no subdirs case. And return -ENOTEMPTY below + */ + return true; +} + +static int zuf_rmdir(struct inode *dir, struct dentry *dentry) +{ + struct inode *inode = dentry->d_inode; + int err; + + if (!inode) + return -ENOENT; + + zuf_dbg_vfs("[%ld] dentry-ino=%ld dentry-name=%s dentry-parent=%ld\n", + dir->i_ino, inode->i_ino, dentry->d_name.name, + d_parent(dentry)->i_ino); + + if (!_empty_dir(inode)) + return -ENOTEMPTY; + + zus_inode_cmtime_now(dir, zus_zi(dir)); + inode->i_ctime = dir->i_ctime; + timespec_to_mt(&zus_zi(inode)->i_ctime, &inode->i_ctime); + + err = zuf_remove_dentry(dir, &dentry->d_name, inode); + if (unlikely(err)) + return err; + + zuf_zii_sync(inode, true); + zuf_zii_sync(dir, true); + + return 0; +} + +/* Structure of a directory element */ +struct zuf_dir_element { + __le64 ino; + char name[254]; +}; + +static int zuf_rename(struct inode *old_dir, struct dentry *old_dentry, + struct inode *new_dir, struct dentry *new_dentry, + uint flags) +{ + struct inode *old_inode = d_inode(old_dentry); + struct inode *new_inode = d_inode(new_dentry); + struct zuf_sb_info *sbi = SBI(old_inode->i_sb); + struct zufs_ioc_rename ioc_rename = { + .hdr.in_len = sizeof(ioc_rename), + .hdr.out_len = sizeof(ioc_rename), + .hdr.operation = ZUFS_OP_RENAME, + .old_dir_ii = ZUII(old_dir)->zus_ii, + .new_dir_ii = ZUII(new_dir)->zus_ii, + .old_zus_ii = ZUII(old_inode)->zus_ii, + .new_zus_ii = new_inode ?
ZUII(new_inode)->zus_ii : NULL, + .old_d_str.len = old_dentry->d_name.len, + .new_d_str.len = new_dentry->d_name.len, + .flags = flags, + }; + struct timespec64 time = current_time(old_dir); + int err; + + zuf_dbg_vfs( + "old_inode=%ld new_inode=%ld old_name=%s new_name=%s f=0x%x\n", + old_inode->i_ino, new_inode ? new_inode->i_ino : 0, + old_dentry->d_name.name, new_dentry->d_name.name, flags); + + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE /*| RENAME_WHITEOUT*/)) + return -EINVAL; + + if (flags & RENAME_EXCHANGE) { + /* A subdir holds a ref on parent, see if we need to + * exchange refs + */ + if (unlikely(!new_inode)) + return -EINVAL; + + if ((S_ISDIR(old_inode->i_mode) != S_ISDIR(new_inode->i_mode)) + && (old_dir != new_dir)) { + if (S_ISDIR(old_inode->i_mode)) { + if (ZUFS_LINK_MAX <= new_dir->i_nlink) + return -EMLINK; + } else { + if (ZUFS_LINK_MAX <= old_dir->i_nlink) + return -EMLINK; + } + } + } else if (S_ISDIR(old_inode->i_mode)) { + if (new_inode) { + if (!_empty_dir(new_inode)) + return -ENOTEMPTY; + } else if (ZUFS_LINK_MAX <= new_dir->i_nlink) { + return -EMLINK; + } + } + + memcpy(&ioc_rename.old_d_str.name, old_dentry->d_name.name, + old_dentry->d_name.len); + memcpy(&ioc_rename.new_d_str.name, new_dentry->d_name.name, + new_dentry->d_name.len); + timespec_to_mt(&ioc_rename.time, &time); + + zus_inode_cmtime_now(old_dir, zus_zi(old_dir)); + if (old_dir != new_dir) + zus_inode_cmtime_now(new_dir, zus_zi(new_dir)); + + if (new_inode) + zus_inode_ctime_now(new_inode, zus_zi(new_inode)); + else + zus_inode_ctime_now(old_inode, zus_zi(old_inode)); + + err = zufc_dispatch(ZUF_ROOT(sbi), &ioc_rename.hdr, NULL, 0); + + zuf_zii_sync(old_dir, true); + zuf_zii_sync(new_dir, true); + + if (unlikely(err)) { + zuf_dbg_err("zufc_dispatch failed => %d\n", err); + return err; + } + + if (new_inode) + _set_nlink(new_inode, zus_zi(new_inode)); + + return 0; +} + +const struct inode_operations zuf_dir_inode_operations = { + .create = zuf_create, + .lookup = 
zuf_lookup, + .link = zuf_link, + .unlink = zuf_unlink, + .mkdir = zuf_mkdir, + .rmdir = zuf_rmdir, + .mknod = zuf_mknod, + .tmpfile = zuf_tmpfile, + .rename = zuf_rename, + .setattr = zuf_setattr, + .getattr = zuf_getattr, + .update_time = zuf_update_time, +}; + +const struct inode_operations zuf_special_inode_operations = { + .setattr = zuf_setattr, + .getattr = zuf_getattr, + .update_time = zuf_update_time, +}; diff --git a/fs/zuf/super.c b/fs/zuf/super.c index 01927deb5013..abd7e6cb2a4a 100644 --- a/fs/zuf/super.c +++ b/fs/zuf/super.c @@ -784,6 +784,8 @@ void zuf_destroy_inodecache(void) static struct super_operations zuf_sops = { .alloc_inode = zuf_alloc_inode, .destroy_inode = zuf_destroy_inode, + .write_inode = zuf_write_inode, + .evict_inode = zuf_evict_inode, .put_super = zuf_put_super, .freeze_fs = zuf_update_s_wtime, .unfreeze_fs = zuf_update_s_wtime, diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index a417f9463682..48dd7b665064 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -65,6 +65,16 @@ const char *zuf_op_name(enum e_zufs_operation op) CASE_ENUM_NAME(ZUFS_OP_BREAK); CASE_ENUM_NAME(ZUFS_OP_STATFS); CASE_ENUM_NAME(ZUFS_OP_SHOW_OPTIONS); + + CASE_ENUM_NAME(ZUFS_OP_NEW_INODE); + CASE_ENUM_NAME(ZUFS_OP_FREE_INODE); + CASE_ENUM_NAME(ZUFS_OP_EVICT_INODE); + + CASE_ENUM_NAME(ZUFS_OP_LOOKUP); + CASE_ENUM_NAME(ZUFS_OP_ADD_DENTRY); + CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY); + CASE_ENUM_NAME(ZUFS_OP_RENAME); + CASE_ENUM_NAME(ZUFS_OP_SETATTR); case ZUFS_OP_MAX_OPT: default: return "UNKNOWN"; diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h index 18cbc376cfa6..2d5327e1d2b1 100644 --- a/fs/zuf/zuf.h +++ b/fs/zuf/zuf.h @@ -130,6 +130,9 @@ enum { struct zuf_inode_info { struct inode vfs_inode; + /* Stuff for mmap write */ + struct rw_semaphore in_sync; + /* cookies from Server */ struct zus_inode *zi; struct zus_inode_info *zus_ii; @@ -248,6 +251,66 @@ static inline void *zuf_dpp_t_addr(struct super_block *sb, zu_dpp_t v) return md_addr_verify(SBI(sb)->md, 
zu_dpp_t_val(v)); } +/* ~~~~ inode locking ~~~~ */ +static inline void zuf_r_lock(struct zuf_inode_info *zii) +{ + inode_lock_shared(&zii->vfs_inode); +} +static inline void zuf_r_unlock(struct zuf_inode_info *zii) +{ + inode_unlock_shared(&zii->vfs_inode); +} + +static inline void zuf_smr_lock(struct zuf_inode_info *zii) +{ + down_read_nested(&zii->in_sync, 1); +} +static inline void zuf_smr_lock_pagefault(struct zuf_inode_info *zii) +{ + down_read_nested(&zii->in_sync, 2); +} +static inline void zuf_smr_unlock(struct zuf_inode_info *zii) +{ + up_read(&zii->in_sync); +} + +static inline void zuf_smw_lock(struct zuf_inode_info *zii) +{ + down_write(&zii->in_sync); +} +static inline void zuf_smw_lock_nested(struct zuf_inode_info *zii) +{ + down_write_nested(&zii->in_sync, 1); +} +static inline void zuf_smw_unlock(struct zuf_inode_info *zii) +{ + up_write(&zii->in_sync); +} + +static inline void zuf_w_lock(struct zuf_inode_info *zii) +{ + inode_lock(&zii->vfs_inode); + zuf_smw_lock(zii); +} +static inline void zuf_w_lock_nested(struct zuf_inode_info *zii) +{ + inode_lock_nested(&zii->vfs_inode, 2); + zuf_smw_lock_nested(zii); +} +static inline void zuf_w_unlock(struct zuf_inode_info *zii) +{ + zuf_smw_unlock(zii); + inode_unlock(&zii->vfs_inode); +} + +static inline void ZUF_CHECK_I_W_LOCK(struct inode *inode) +{ +#ifdef CONFIG_ZUF_DEBUG + if (WARN_ON(down_write_trylock(&inode->i_rwsem))) + up_write(&inode->i_rwsem); +#endif +} + enum big_alloc_type { ba_stack, ba_8k, ba_vmalloc }; #define S_8K (1024UL * 8) diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index 1af3bd016453..9b9e97fe844e 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -446,6 +446,17 @@ enum e_zufs_operation { ZUFS_OP_STATFS = 2, ZUFS_OP_SHOW_OPTIONS = 3, + ZUFS_OP_NEW_INODE = 4, + ZUFS_OP_FREE_INODE = 5, + ZUFS_OP_EVICT_INODE = 6, + + ZUFS_OP_LOOKUP = 7, + ZUFS_OP_ADD_DENTRY = 8, + ZUFS_OP_REMOVE_DENTRY = 9, + ZUFS_OP_RENAME = 10, + + ZUFS_OP_SETATTR = 19, + ZUFS_OP_MAX_OPT, }; @@ -470,4 
+481,87 @@ struct zufs_ioc_statfs { struct statfs64 statfs_out; }; +/* zufs_ioc_new_inode flags: */ +enum zi_flags { + ZI_TMPFILE = 1, /* for new_inode */ + ZI_LOOKUP_RACE = 1, /* for evict */ +}; + +struct zufs_str { + __u8 len; + char name[ZUFS_NAME_LEN]; +}; + +/* ZUFS_OP_NEW_INODE */ +struct zufs_ioc_new_inode { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode zi; + struct zus_inode_info *dir_ii; /* If mktmp this is the root */ + struct zufs_str str; + __u64 flags; + + /* OUT */ + zu_dpp_t _zi; + struct zus_inode_info *zus_ii; +}; + +/* ZUFS_OP_FREE_INODE, ZUFS_OP_EVICT_INODE */ +struct zufs_ioc_evict_inode { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode_info *zus_ii; + __u64 flags; +}; + +/* ZUFS_OP_LOOKUP */ +struct zufs_ioc_lookup { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode_info *dir_ii; + struct zufs_str str; + + /* OUT */ + zu_dpp_t _zi; + struct zus_inode_info *zus_ii; +}; + +/* ZUFS_OP_ADD_DENTRY, ZUFS_OP_REMOVE_DENTRY */ +struct zufs_ioc_dentry { + struct zufs_ioc_hdr hdr; + struct zus_inode_info *zus_ii; /* IN */ + struct zus_inode_info *zus_dir_ii; /* IN */ + struct zufs_str str; /* IN */ + __u64 ino; /* OUT - only for lookup */ +}; + +/* ZUFS_OP_RENAME */ +struct zufs_ioc_rename { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode_info *old_dir_ii; + struct zus_inode_info *new_dir_ii; + struct zus_inode_info *old_zus_ii; + struct zus_inode_info *new_zus_ii; + struct zufs_str old_d_str; + struct zufs_str new_d_str; + __u64 time; + __u64 flags; +}; + +/* ZUFS_OP_SETATTR */ +struct zufs_ioc_attr { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode_info *zus_ii; + __u32 zuf_attr; + __u32 pad; +}; + +/* Special flag for ZUFS_OP_FALLOCATE to specify a setattr(SIZE) + * IE. same as punch hole but set_i_size to be @filepos. 
In this + * case @last_pos == ~0ULL + */ +#define ZUFS_FL_TRUNCATE 0x80000000 + #endif /* _LINUX_ZUFS_API_H */ From patchwork Thu Sep 26 02:07:18 2019 X-Patchwork-Submitter: Boaz Harrosh X-Patchwork-Id: 11161817
From: Boaz Harrosh To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams Subject: [PATCH 09/16] zuf: readdir operation Date: Thu, 26 Sep 2019 05:07:18 +0300 Message-Id: <20190926020725.19601-10-boazh@netapp.com> In-Reply-To: <20190926020725.19601-1-boazh@netapp.com> Implements file_operations->iterate_shared via information returned from the Server, establishing the readdir protocol with the Server. The Server fills a zuf-allocated buffer (up to 4M at a time) with zufs-encoded directory entries.
It will then call the proper emit vector to fill the caller buffer. The buffer is passed to Server not as part of the zufs_ioc_readdir struct but maps this buffer directly into Server space via the zt_map_pages facility. [v2] Fix the gcc warning: directory.c:86:1: warning: the frame size of 8576 bytes is larger than 8192 bytes Fix it by allocating the pages array, which was on stack as part of the allocation we already do for the readdir buffer Reported-by: kbuild test robot Signed-off-by: Boaz Harrosh --- fs/zuf/directory.c | 69 +++++++++++++++++++++++++++++++++++- fs/zuf/zuf-core.c | 2 ++ fs/zuf/zus_api.h | 88 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 158 insertions(+), 1 deletion(-) diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c index 5624e05f96e5..7417aeb77773 100644 --- a/fs/zuf/directory.c +++ b/fs/zuf/directory.c @@ -19,7 +19,74 @@ static int zuf_readdir(struct file *file, struct dir_context *ctx) { - return -ENOTSUPP; + struct inode *inode = file_inode(file); + struct super_block *sb = inode->i_sb; + loff_t i_size = i_size_read(inode); + struct zufs_ioc_readdir ioc_readdir = { + .hdr.in_len = sizeof(ioc_readdir), + .hdr.out_len = sizeof(ioc_readdir), + .hdr.operation = ZUFS_OP_READDIR, + .dir_ii = ZUII(inode)->zus_ii, + }; + struct zufs_readdir_iter rdi; + struct page **pages; + struct zufs_dir_entry *zde; + void *addr, *__a; + uint nump, i; + int err; + + if (ctx->pos && i_size <= ctx->pos) + return 0; + if (!i_size) + i_size = PAGE_SIZE; /* Just for the . && .. */ + if (i_size - ctx->pos < PAGE_SIZE) + ioc_readdir.hdr.len = PAGE_SIZE; + else + ioc_readdir.hdr.len = min_t(loff_t, i_size - ctx->pos, + ZUS_API_MAP_MAX_SIZE); + nump = md_o2p_up(ioc_readdir.hdr.len); + /* Allocating both readdir buffer and the pages-array. 
+ * Pages array is at end + */ + addr = vzalloc(md_p2o(nump) + nump * sizeof(*pages)); + if (unlikely(!addr)) + return -ENOMEM; + + WARN_ON((ulong)addr & (PAGE_SIZE - 1)); + + pages = addr + md_p2o(nump); + __a = addr; + for (i = 0; i < nump; ++i) { + pages[i] = vmalloc_to_page(__a); + __a += PAGE_SIZE; + } + +more: + ioc_readdir.pos = ctx->pos; + + err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_readdir.hdr, pages, nump); + if (unlikely(err && err != -EINTR)) { + zuf_err_dispatch(sb, "zufc_dispatch failed => %d\n", err); + goto out; + } + + zufs_readdir_iter_init(&rdi, &ioc_readdir, addr); + while ((zde = zufs_next_zde(&rdi)) != NULL) { + zuf_dbg_verbose("%s pos=0x%lx\n", + zde->zstr.name, (ulong)zde->pos); + ctx->pos = zde->pos; + if (!dir_emit(ctx, zde->zstr.name, zde->zstr.len, zde->ino, + zde->type)) + goto out; + } + ctx->pos = ioc_readdir.pos; + if (ioc_readdir.more) { + zuf_dbg_err("more\n"); + goto more; + } +out: + vfree(addr); + return err; } /* diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index 48dd7b665064..c0049c1d5ba3 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -74,6 +74,8 @@ const char *zuf_op_name(enum e_zufs_operation op) CASE_ENUM_NAME(ZUFS_OP_ADD_DENTRY); CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY); CASE_ENUM_NAME(ZUFS_OP_RENAME); + CASE_ENUM_NAME(ZUFS_OP_READDIR); + CASE_ENUM_NAME(ZUFS_OP_SETATTR); case ZUFS_OP_MAX_OPT: default: diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index 9b9e97fe844e..2bdf047282e8 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -454,6 +454,7 @@ enum e_zufs_operation { ZUFS_OP_ADD_DENTRY = 8, ZUFS_OP_REMOVE_DENTRY = 9, ZUFS_OP_RENAME = 10, + ZUFS_OP_READDIR = 11, ZUFS_OP_SETATTR = 19, @@ -549,6 +550,93 @@ struct zufs_ioc_rename { __u64 flags; }; +/* ZUFS_OP_READDIR */ +struct zufs_ioc_readdir { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode_info *dir_ii; + __u64 pos; + + /* OUT */ + __u8 more; +}; + +struct zufs_dir_entry { + __le64 ino; + struct { + unsigned type : 8; + ulong pos : 56; + 
}; + struct zufs_str zstr; +}; + +struct zufs_readdir_iter { + void *__zde, *last; + struct zufs_ioc_readdir *ioc_readdir; +}; + +enum {E_ZDE_HDR_SIZE = + offsetof(struct zufs_dir_entry, zstr) + offsetof(struct zufs_str, name), +}; + +#ifndef __cplusplus +static inline void zufs_readdir_iter_init(struct zufs_readdir_iter *rdi, + struct zufs_ioc_readdir *ioc_readdir, + void *app_ptr) +{ + rdi->__zde = app_ptr; + rdi->last = app_ptr + ioc_readdir->hdr.len; + rdi->ioc_readdir = ioc_readdir; + ioc_readdir->more = false; +} + +static inline uint zufs_dir_entry_len(__u8 name_len) +{ + return ALIGN(E_ZDE_HDR_SIZE + name_len, sizeof(__u64)); +} + +static inline +struct zufs_dir_entry *zufs_next_zde(struct zufs_readdir_iter *rdi) +{ + struct zufs_dir_entry *zde = rdi->__zde; + uint len; + + if (rdi->last <= rdi->__zde + E_ZDE_HDR_SIZE) + return NULL; + if (zde->zstr.len == 0) + return NULL; + len = zufs_dir_entry_len(zde->zstr.len); + if (rdi->last <= rdi->__zde + len) + return NULL; + + rdi->__zde += len; + return zde; +} + +static inline bool zufs_zde_emit(struct zufs_readdir_iter *rdi, __u64 ino, + __u8 type, __u64 pos, const char *name, + __u8 len) +{ + struct zufs_dir_entry *zde = rdi->__zde; + + if (rdi->last <= rdi->__zde + zufs_dir_entry_len(len)) { + rdi->ioc_readdir->more = true; + return false; + } + + rdi->ioc_readdir->more = 0; + zde->ino = ino; + zde->type = type; + /*ASSERT(0 == (pos && (1 << 56 - 1)));*/ + zde->pos = pos; + strncpy(zde->zstr.name, name, len); + zde->zstr.len = len; + zufs_next_zde(rdi); + + return true; +} +#endif /* ndef __cplusplus */ + /* ZUFS_OP_SETATTR */ struct zufs_ioc_attr { struct zufs_ioc_hdr hdr; From patchwork Thu Sep 26 02:07:19 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boaz Harrosh X-Patchwork-Id: 11161819 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org 
From: Boaz Harrosh To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams Subject: [PATCH 10/16] zuf: symlink Date: Thu, 26 Sep 2019 05:07:19 +0300 Message-Id: <20190926020725.19601-11-boazh@netapp.com> In-Reply-To: <20190926020725.19601-1-boazh@netapp.com> The symlink support is all hidden within the creation/open of the inode. As part of ZUFS_OP_NEW_INODE we also send the requested content of the symlink for storage. On an open of an existing symlink the link information is returned within the zufs_inode structure via a zufs_dpp_t pointer.
(See Documentation about zufs_dpp_t pointers) Signed-off-by: Boaz Harrosh --- fs/zuf/Makefile | 2 +- fs/zuf/_extern.h | 7 +++++ fs/zuf/inode.c | 7 +++++ fs/zuf/namei.c | 27 ++++++++++++++++++ fs/zuf/symlink.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 115 insertions(+), 1 deletion(-) create mode 100644 fs/zuf/symlink.c diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile index 2bfed45723e3..04c31b7bb9ff 100644 --- a/fs/zuf/Makefile +++ b/fs/zuf/Makefile @@ -17,5 +17,5 @@ zuf-y += md.o t1.o t2.o zuf-y += zuf-core.o zuf-root.o # Main FS -zuf-y += super.o inode.o directory.o namei.o file.o +zuf-y += super.o inode.o directory.o namei.o file.o symlink.o zuf-y += module.o diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index 50887792bf42..95413f65c47f 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -88,6 +88,10 @@ void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi); int zuf_add_dentry(struct inode *dir, struct qstr *str, struct inode *inode); int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode); +/* symlink.c */ +uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode, + const char *symname, ulong len, struct page *pages[2]); + /* t1.c */ int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma); @@ -109,4 +113,7 @@ extern const struct inode_operations zuf_special_inode_operations; /* dir.c */ extern const struct file_operations zuf_dir_operations; +/* symlink.c */ +extern const struct inode_operations zuf_symlink_inode_operations; + #endif /*ndef __ZUF_EXTERN_H__*/ diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c index 88cb1937c223..bf3f8b27f918 100644 --- a/fs/zuf/inode.c +++ b/fs/zuf/inode.c @@ -83,6 +83,9 @@ static void _set_inode_from_zi(struct inode *inode, struct zus_inode *zi) inode->i_op = &zuf_dir_inode_operations; inode->i_fop = &zuf_dir_operations; break; + case S_IFLNK: + inode->i_op = &zuf_symlink_inode_operations; + break; case S_IFBLK: case S_IFCHR: case S_IFIFO: @@ 
-348,6 +351,10 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode, if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) || S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) { init_special_inode(inode, mode, rdev_or_isize); + } else if (symname) { + inode->i_size = rdev_or_isize; + nump = zuf_prepare_symname(&ioc_new_inode, symname, + rdev_or_isize, pages); } err = _set_zi_from_inode(dir, &ioc_new_inode.zi, inode); diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c index 299134ca7c07..e78aa04f10d5 100644 --- a/fs/zuf/namei.c +++ b/fs/zuf/namei.c @@ -164,6 +164,32 @@ static int zuf_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) return 0; } +static int zuf_symlink(struct inode *dir, struct dentry *dentry, + const char *symname) +{ + struct inode *inode; + ulong len; + + zuf_dbg_vfs("[%ld] de->name=%s symname=%s\n", + dir->i_ino, dentry->d_name.name, symname); + + len = strlen(symname); + if (len + 1 > ZUFS_MAX_SYMLINK) + return -ENAMETOOLONG; + + inode = zuf_new_inode(dir, S_IFLNK|S_IRWXUGO, &dentry->d_name, + symname, len, false); + if (IS_ERR(inode)) + return PTR_ERR(inode); + + inode->i_op = &zuf_symlink_inode_operations; + inode->i_mapping->a_ops = &zuf_aops; + + _instantiate_unlock(dentry, inode); + + return 0; +} + static int zuf_link(struct dentry *dest_dentry, struct inode *dir, struct dentry *dentry) { @@ -385,6 +411,7 @@ const struct inode_operations zuf_dir_inode_operations = { .lookup = zuf_lookup, .link = zuf_link, .unlink = zuf_unlink, + .symlink = zuf_symlink, .mkdir = zuf_mkdir, .rmdir = zuf_rmdir, .mknod = zuf_mknod, diff --git a/fs/zuf/symlink.c b/fs/zuf/symlink.c new file mode 100644 index 000000000000..1446bdf60cb9 --- /dev/null +++ b/fs/zuf/symlink.c @@ -0,0 +1,73 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * Symlink operations + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. 
+ * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#include "zuf.h" + +/* Can never fail all checks already made before. + * Returns: The number of pages stored @pages + */ +uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode, + const char *symname, ulong len, + struct page *pages[2]) +{ + uint nump; + + ioc_new_inode->zi.i_size = cpu_to_le64(len); + if (len < sizeof(ioc_new_inode->zi.i_symlink)) { + memcpy(&ioc_new_inode->zi.i_symlink, symname, len); + return 0; + } + + pages[0] = virt_to_page(symname); + nump = 1; + + ioc_new_inode->hdr.len = len; + ioc_new_inode->hdr.offset = (ulong)symname & (PAGE_SIZE - 1); + + if (PAGE_SIZE < ioc_new_inode->hdr.offset + len) { + pages[1] = virt_to_page(symname + PAGE_SIZE); + ++nump; + } + + return nump; +} + +/* + * In case of short symlink, we serve it directly from zi; otherwise, read + * symlink value directly from pmem using dpp mapping. + */ +static const char *zuf_get_link(struct dentry *dentry, struct inode *inode, + struct delayed_call *notused) +{ + const char *link; + struct zuf_inode_info *zii = ZUII(inode); + + if (inode->i_size < sizeof(zii->zi->i_symlink)) + return zii->zi->i_symlink; + + link = zuf_dpp_t_addr(inode->i_sb, le64_to_cpu(zii->zi->i_sym_dpp)); + if (!link) { + zuf_err("bad symlink: i_sym_dpp=0x%llx\n", zii->zi->i_sym_dpp); + return ERR_PTR(-EIO); + } + return link; +} + +const struct inode_operations zuf_symlink_inode_operations = { + .get_link = zuf_get_link, + .update_time = zuf_update_time, + .setattr = zuf_setattr, + .getattr = zuf_getattr, +}; From patchwork Thu Sep 26 02:07:20 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boaz Harrosh X-Patchwork-Id: 11161825 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6AF51924 for ; Thu, 26 Sep 2019 02:13:38 +0000 (UTC) Received: from 
From: Boaz Harrosh To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams Subject: [PATCH 11/16] zuf: Write/Read implementation Date: Thu, 26 Sep 2019 05:07:20 +0300 Message-Id: <20190926020725.19601-12-boazh@netapp.com> In-Reply-To: <20190926020725.19601-1-boazh@netapp.com> References: <20190926020725.19601-1-boazh@netapp.com>

zufs has two ways to do IO.

1. The elegant way: map application buffers into the Server VM. This is much simpler for the zusFS to implement, but it is slow and does not scale well.

2. The fast way (called NIO): the Server returns physical block information, and the pmem_memcpy is done in the Kernel. This way is more complicated. Each block needs a ZUFS_GET_MULTI but also a ZUFS_PUT_MULTI to indicate that the Kernel has finished the copy and the pmem block may be recycled. But going to the Server and back twice for each IOP would kill our performance. So what we do is the pigy_put mechanism (see zuf-core.c): pigy_put delays the put operation, so that when a new operation goes to the Server it takes all accumulated put operations along with it.
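The batching idea above can be sketched in plain user-space C. This is only a model of the concept, not the kernel code: the names (pigy_queue, get_block, MAX_PENDING) are hypothetical, and the real mechanism in zuf-core.c is lock-free and rides on the ZT channels.

```c
#include <assert.h>

/* Model of the pigy_put idea: PUTs are queued instead of dispatched,
 * and the next trip to the "Server" carries all accumulated PUTs,
 * saving one round trip per GET/PUT pair. */
#define MAX_PENDING 8

struct pigy_queue {
	unsigned long pending[MAX_PENDING]; /* block numbers awaiting PUT */
	unsigned int n_pending;
	unsigned int trips;                 /* round trips to the "Server" */
};

/* Queue a PUT; dispatch immediately only when forced (do_now, as with
 * an immediate-PUT request from the Server) or when the queue is full. */
static void pigy_put(struct pigy_queue *q, unsigned long bn, int do_now)
{
	q->pending[q->n_pending++] = bn;
	if (do_now || q->n_pending == MAX_PENDING) {
		q->trips++;        /* one trip flushes all pending PUTs */
		q->n_pending = 0;
	}
}

/* A GET always goes to the Server; pending PUTs ride along for free. */
static unsigned long get_block(struct pigy_queue *q, unsigned long want)
{
	q->trips++;
	q->n_pending = 0;          /* piggy-backed PUTs are now flushed */
	return want;               /* pretend the Server mapped this bn */
}
```

Four GET/PUT pairs then cost five trips (four GETs plus one forced final PUT) instead of the naive eight.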
So in one go we might fetch new block info as well as PUT the previous IO. Don't worry, all this is done zuf-core style, without any locks or atomics. There are times when the Server may request an immediate PUT and/or keep the ZT-channel locked to guarantee forward progress.

It is up to the zusFS to decide which mode it wants to operate in, [1] or [2] above, and more flags govern aspects of the requested IO.

The dispatch to the Server can operate on buffers up to ZUS_API_MAP_MAX_SIZE (4M). Any bigger operations are split up and dispatched at this size. Also, if a multi-segment AIO is used, each segment is dispatched on its own.

rw.c here also includes some operations for mmap, which will be used in the next patch.

The fallocate operation with its various mode flags is also dispatched through the rw.c IO API, because it might need to do some t1/t2 IO as part of the operation, for example for COW of cloned inodes or read/write of the unaligned edges. zufs also implements truncate via a private fallocate flag.

There is also code for comparing two buffers for the implementation of the dedup operation, as well as the facility to SWAP on a zufs system.

There is also an IOCTL facility to execute IO (ZU_IOC_IOMAP_EXEC) from Server background threads. We use this in NetApp for tiering down cold blocks to slower storage. Both ZU_IOC_IOMAP_EXEC and the IO dispatch operate on a facility we call zufs_iomap, which is a varlen buffer that may request and encode many types of operations and block/memory targets for IO. It is an IO executor of sorts: the zusFS encodes such an iomap to tell the Kernel what needs to be done.

[v2] zuf: Range of _IO_gm_inner must fit API (PXS-5151)
Zuf must never request pages which may fall out of range of ZUS_API_MAP_MAX_PAGES. When an IO request is not page-aligned, limit the size based on the start offset.

[v3] zufc: bad bugs in zufc_goose_all_zts
* The bad bug was that we called the internal smp_call_function instead of the proper on_each_cpu.
This was bad because smp_call_function calls all other CPUs but us; in any case, the proper public API for this is on_each_cpu.
* Another bug is that zufc_goose_all_zts always needs to be called with an inode. This is because we assume we are holding the inode_w_lock, so no more puts can come in parallel to the goose_all.
* In clone, the goose target is the destination file, which is going to be truncated. (See above: we must have a locked inode at hand.)
* Call zufc_goose_all_zts under the inode_w_lock in evict.
* One more change is to *not* rely on the Server to turn off the ZUFS_H_HAS_PIGY_PUT flag. We will use this later to fix another theoretical race window with pigy_put. (In fact, there is a zus patch to stop resetting that bit.)

[v4] Remove the swap activate code; it will come in later Kernels. To do it properly we should send a small patch to the Kernel so as to not force the FS to use the page_cache. The code had a hack to bypass this, but I would rather remove the code instead.

[v5] Fix the warning: "the frame size of 8712 bytes is larger than 8192 bytes"
We allocate the maximum stack space allowed by the Kernel configuration without triggering the warning. If the needed space fits on the stack, it is used; if not, we allocate an 8K buffer from a new dedicated kmem_cache to store our block numbers. 8K is the maximum allowed by the zufs API, which is 1024 data blocks. The above logic is hidden under the big_alloc facility, which was already used in other places.
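The big_alloc stack-or-heap logic referred to in [v5] can be sketched in plain user-space C. Names and the malloc fallback are illustrative only; the kernel version draws the large case from a dedicated kmem_cache instead of the heap.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of a big_alloc-style facility: use the caller's on-stack
 * buffer when the request fits, fall back to a real allocation
 * otherwise, so the common small case costs no allocation and no
 * oversized stack frame. */
enum big_alloc_type { BA_STACK, BA_HEAP };

static void *big_alloc(size_t size, size_t max_on_stack, void *on_stack,
		       enum big_alloc_type *bat)
{
	if (size <= max_on_stack) {
		*bat = BA_STACK;   /* reuse the caller's stack buffer */
		return on_stack;
	}
	*bat = BA_HEAP;            /* too big: take the slow path */
	return malloc(size);
}

static void big_free(void *ptr, enum big_alloc_type bat)
{
	if (bat == BA_HEAP)        /* stack memory needs no freeing */
		free(ptr);
}
```

The caller declares a fixed array, passes its address and size, and always pairs big_alloc with big_free, never caring which path was taken.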
Signed-off-by: Sagi Manole Signed-off-by: Boaz Harrosh --- fs/zuf/Makefile | 1 + fs/zuf/_extern.h | 22 ++ fs/zuf/file.c | 73 ++++ fs/zuf/inode.c | 13 + fs/zuf/rw.c | 959 ++++++++++++++++++++++++++++++++++++++++++++++ fs/zuf/zuf-core.c | 400 ++++++++++++++++++- fs/zuf/zuf.h | 7 + fs/zuf/zus_api.h | 251 ++++++++++++ 8 files changed, 1724 insertions(+), 2 deletions(-) create mode 100644 fs/zuf/rw.c diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile index 04c31b7bb9ff..23bc3791a001 100644 --- a/fs/zuf/Makefile +++ b/fs/zuf/Makefile @@ -17,5 +17,6 @@ zuf-y += md.o t1.o t2.o zuf-y += zuf-core.o zuf-root.o # Main FS +zuf-y += rw.o zuf-y += super.o inode.o directory.o namei.o file.o symlink.o zuf-y += module.o diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index 95413f65c47f..745d0cc9e719 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -43,6 +43,9 @@ int zufc_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr, zuf_dispatch_init(&zdo, hdr, pages, nump); return __zufc_dispatch(zri, &zdo); } +int zufc_pigy_put(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo, + struct zufs_ioc_IO *io, uint iom_n, ulong *bns, bool do_now); +void zufc_goose_all_zts(struct zuf_root_info *zri, struct inode *inode); /* zuf-root.c */ int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs); @@ -92,6 +95,25 @@ int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode); uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode, const char *symname, ulong len, struct page *pages[2]); +/* rw.c */ +int zuf_rw_read_page(struct zuf_sb_info *sbi, struct inode *inode, + struct page *page, u64 filepos); +ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode, + struct kiocb *kiocb, struct iov_iter *ii); +ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode, + struct kiocb *kiocb, struct iov_iter *ii); +int _zufs_IO_get_multy(struct zuf_sb_info *sbi, struct inode *inode, + loff_t pos, ulong len, struct 
_io_gb_multy *io_gb); +void _zufs_IO_put_multy(struct zuf_sb_info *sbi, struct inode *inode, + struct _io_gb_multy *io_gb); +int zuf_rw_fallocate(struct inode *inode, uint mode, loff_t offset, loff_t len); +int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode, + __u64 *iom_e, uint iom_n); +int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb, + __u64 *iom_e_user, uint iom_n); +int zuf_rw_file_range_compare(struct inode *i_in, loff_t pos_in, + struct inode *i_out, loff_t pos_out, loff_t len); + /* t1.c */ int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma); diff --git a/fs/zuf/file.c b/fs/zuf/file.c index 619dada43666..8711b44371e0 100644 --- a/fs/zuf/file.c +++ b/fs/zuf/file.c @@ -13,6 +13,9 @@ * Sagi Manole " */ +#include +#include + #include "zuf.h" long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) @@ -20,8 +23,78 @@ long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) return -ENOTSUPP; } +static ssize_t zuf_read_iter(struct kiocb *kiocb, struct iov_iter *ii) +{ + struct inode *inode = file_inode(kiocb->ki_filp); + struct zuf_inode_info *zii = ZUII(inode); + ssize_t ret; + + zuf_dbg_rw("[%ld] ppos=0x%llx len=0x%zx\n", + inode->i_ino, kiocb->ki_pos, iov_iter_count(ii)); + + file_accessed(kiocb->ki_filp); + + zuf_r_lock(zii); + + ret = zuf_rw_read_iter(inode->i_sb, inode, kiocb, ii); + + zuf_r_unlock(zii); + + zuf_dbg_rw("[%ld] => 0x%lx\n", inode->i_ino, ret); + return ret; +} + +static ssize_t zuf_write_iter(struct kiocb *kiocb, struct iov_iter *ii) +{ + struct inode *inode = file_inode(kiocb->ki_filp); + struct zuf_inode_info *zii = ZUII(inode); + ssize_t ret; + loff_t end_offset; + + ret = generic_write_checks(kiocb, ii); + if (unlikely(ret < 0)) { + zuf_dbg_vfs("[%ld] generic_write_checks => 0x%lx\n", + inode->i_ino, ret); + return ret; + } + + zuf_r_lock(zii); + + ret = file_remove_privs(kiocb->ki_filp); + if (unlikely(ret < 0)) + goto out; + + 
end_offset = kiocb->ki_pos + iov_iter_count(ii); + if (inode->i_size < end_offset) { + spin_lock(&inode->i_lock); + if (inode->i_size < end_offset) { + zii->zi->i_size = cpu_to_le64(end_offset); + i_size_write(inode, end_offset); + } + spin_unlock(&inode->i_lock); + } + + zus_inode_cmtime_now(inode, zii->zi); + + ret = zuf_rw_write_iter(inode->i_sb, inode, kiocb, ii); + if (unlikely(ret < 0)) { + /* TODO(sagi): do we want to truncate i_size? */ + goto out; + } + + inode->i_blocks = le64_to_cpu(zii->zi->i_blocks); + +out: + zuf_r_unlock(zii); + + zuf_dbg_rw("[%ld] => 0x%lx\n", inode->i_ino, ret); + return ret; +} + const struct file_operations zuf_file_operations = { .open = generic_file_open, + .read_iter = zuf_read_iter, + .write_iter = zuf_write_iter, }; const struct inode_operations zuf_file_inode_operations = { diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c index bf3f8b27f918..27660979ed6f 100644 --- a/fs/zuf/inode.c +++ b/fs/zuf/inode.c @@ -287,6 +287,8 @@ void zuf_evict_inode(struct inode *inode) zuf_w_lock(zii); + zufc_goose_all_zts(ZUF_ROOT(SBI(sb)), inode); + zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_FREE_INODE, 0); inode->i_mtime = inode->i_ctime = current_time(inode); @@ -298,6 +300,8 @@ void zuf_evict_inode(struct inode *inode) zuf_smw_lock(zii); + zufc_goose_all_zts(ZUF_ROOT(SBI(sb)), inode); + zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_EVICT_INODE, 0); zuf_smw_unlock(zii); @@ -585,5 +589,14 @@ void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi) inode_has_no_xattr(inode); } +/* direct_IO is not called. 
We set an empty one so open(O_DIRECT) will be happy + */ +static ssize_t zuf_direct_IO(struct kiocb *iocb, struct iov_iter *iter) +{ + WARN_ON(1); + return 0; +} + const struct address_space_operations zuf_aops = { + .direct_IO = zuf_direct_IO, }; diff --git a/fs/zuf/rw.c b/fs/zuf/rw.c new file mode 100644 index 000000000000..48f584e71a03 --- /dev/null +++ b/fs/zuf/rw.c @@ -0,0 +1,959 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * Read/Write operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + */ +#include +#include +#include +#include + +#include "zuf.h" +#include "t2.h" + +#define rand_tag(kiocb) \ + ((kiocb->ki_filp->f_mode & FMODE_RANDOM) ? ZUFS_RW_RAND : 0) +#define kiocb_ra(kiocb) (&kiocb->ki_filp->f_ra) + +static const char *_pr_rw(uint rw) +{ + return (rw & WRITE) ? "WRITE" : "READ"; +} + +static int _ioc_bounds_check(struct zufs_iomap *ziom, + struct zufs_iomap *user_ziom, void *ziom_end) +{ + size_t iom_max_bytes = ziom_end - (void *)&user_ziom->iom_e; + + if (unlikely((iom_max_bytes / sizeof(__u64) < ziom->iom_max))) { + zuf_err("kernel-buff-size(0x%zx) < ziom->iom_max(0x%x)\n", + (iom_max_bytes / sizeof(__u64)), ziom->iom_max); + return -EINVAL; + } + + if (unlikely(ziom->iom_max < ziom->iom_n)) { + zuf_err("ziom->iom_max(0x%x) < ziom->iom_n(0x%x)\n", + ziom->iom_max, ziom->iom_n); + return -EINVAL; + } + + return 0; +} + +static void _extract_gb_multy_bns(struct _io_gb_multy *io_gb, + struct zufs_ioc_IO *io_user) +{ + uint i; + + /* Return of some T1 pages from GET_MULTY */ + io_gb->iom_n = 0; + for (i = 0; i < io_gb->IO.ziom.iom_n; ++i) { + ulong bn = _zufs_iom_t1_bn(io_user->iom_e[i]); + + if (unlikely(bn == -1)) { + zuf_err("!!!!"); + break; + } + io_gb->bns[io_gb->iom_n++] = bn; + } +} + +static int rw_overflow_handler(struct zuf_dispatch_op *zdo, void *arg, + ulong max_bytes) +{ + struct zufs_ioc_IO *io 
= container_of(zdo->hdr, typeof(*io), hdr); + struct zufs_ioc_IO *io_user = arg; + int err; + + *io = *io_user; + + err = _ioc_bounds_check(&io->ziom, &io_user->ziom, arg + max_bytes); + if (unlikely(err)) + return err; + + if ((io->hdr.err == -EZUFS_RETRY) && + io->ziom.iom_n && _zufs_iom_pop(io->iom_e)) { + + zuf_dbg_rw( + "[%s]zuf_iom_execute_sync(%d) max=0x%lx iom_e[%d] => %d\n", + zuf_op_name(io->hdr.operation), io->ziom.iom_n, + max_bytes, _zufs_iom_opt_type(io_user->iom_e), + io->hdr.err); + + io->hdr.err = zuf_iom_execute_sync(zdo->sb, zdo->inode, + io_user->iom_e, + io->ziom.iom_n); + return EZUF_RETRY_DONE; + } + + /* No tier ups needed */ + + if (io->hdr.err == -EZUFS_RETRY) { + zuf_warn("ZUSfs violating API EZUFS_RETRY with no payload\n"); + /* continue any way because we want to PUT all these GETs + * we did. But the Server is buggy + */ + io->hdr.err = 0; + } + + if (io->hdr.operation != ZUFS_OP_GET_MULTY) + return 0; /* We are finished */ + + /* ZUFS_OP_GET_MULTY Decoding at ZT context */ + + if (io->ziom.iom_n) { + struct _io_gb_multy *io_gb = + container_of(io, typeof(*io_gb), IO); + + zuf_dbg_rw("[%s] _extract_bns(%d) iom_e[0x%llx]\n", + zuf_op_name(io->hdr.operation), io->ziom.iom_n, + io_user->iom_e[0]); + + if (unlikely(ZUS_API_MAP_MAX_PAGES < io->ziom.iom_n)) { + zuf_err("[%s] leaking T1 (%d) iom_e[0x%llx]\n", + zuf_op_name(io->hdr.operation), io->ziom.iom_n, + io_user->iom_e[0]); + + io->ziom.iom_n = ZUS_API_MAP_MAX_PAGES; + } + + _extract_gb_multy_bns(io_gb, io_user); + } + + return 0; +} + +static int _IO_dispatch(struct zuf_sb_info *sbi, struct zufs_ioc_IO *IO, + struct zuf_inode_info *zii, int operation, + uint pgoffset, struct page **pages, uint nump, + u64 filepos, uint len) +{ + struct zuf_dispatch_op zdo; + int err; + + IO->hdr.operation = operation; + IO->hdr.in_len = sizeof(*IO); + IO->hdr.out_len = sizeof(*IO); + IO->hdr.offset = pgoffset; + IO->hdr.len = len; + IO->zus_ii = zii->zus_ii; + IO->filepos = filepos; + + 
zuf_dispatch_init(&zdo, &IO->hdr, pages, nump); + zdo.oh = rw_overflow_handler; + zdo.sb = sbi->sb; + zdo.inode = &zii->vfs_inode; + + zuf_dbg_verbose("[%ld][%s] fp=0x%llx nump=0x%x len=0x%x\n", + zdo.inode ? zdo.inode->i_ino : -1, + zuf_op_name(operation), filepos, nump, len); + + err = __zufc_dispatch(ZUF_ROOT(sbi), &zdo); + if (unlikely(err == -EZUFS_RETRY)) { + zuf_err("Unexpected ZUS return => %d\n", err); + err = -EIO; + } + return err; +} + +int zuf_rw_read_page(struct zuf_sb_info *sbi, struct inode *inode, + struct page *page, u64 filepos) +{ + struct zufs_ioc_IO io = {}; + struct page *pages[1]; + uint nump; + int err; + + pages[0] = page; + nump = 1; + + err = _IO_dispatch(sbi, &io, ZUII(inode), ZUFS_OP_READ, 0, pages, nump, + filepos, PAGE_SIZE); + return err; +} + + +/* return < 0 - is err. 0 compairs */ +int zuf_rw_file_range_compare(struct inode *i_in, loff_t pos_in, + struct inode *i_out, loff_t pos_out, loff_t len) +{ + struct super_block *sb = i_in->i_sb; + ulong bs = sb->s_blocksize; + struct page *p_in, *p_out; + void *a_in, *a_out; + int err = 0; + + if (unlikely((pos_in & (bs - 1)) || (pos_out & (bs - 1)) || + (bs != PAGE_SIZE))) { + zuf_err("[%ld]@0x%llx & [%ld]@0x%llx len=0x%llx bs=0x%lx\n", + i_in->i_ino, pos_in, i_out->i_ino, pos_out, len, bs); + return -EINVAL; + } + + zuf_dbg_rw("[%ld]@0x%llx & [%ld]@0x%llx len=0x%llx\n", + i_in->i_ino, pos_in, i_out->i_ino, pos_out, len); + + p_in = alloc_page(GFP_KERNEL); + p_out = alloc_page(GFP_KERNEL); + if (unlikely(!p_in || !p_out)) { + err = -ENOMEM; + goto out; + } + a_in = page_address(p_in); + a_out = page_address(p_out); + + while (len) { + ulong l; + + err = zuf_rw_read_page(SBI(sb), i_in, p_in, pos_in); + if (unlikely(err)) + goto out; + + err = zuf_rw_read_page(SBI(sb), i_out, p_out, pos_out); + if (unlikely(err)) + goto out; + + l = min_t(ulong, PAGE_SIZE, len); + if (memcmp(a_in, a_out, l)) { + err = -EBADE; + goto out; + } + + pos_in += l; + pos_out += l; + len -= l; + } + +out: + 
__free_page(p_in); + __free_page(p_out); + + return err; +} + +/* ZERO a part of a single block. len does not cross a block boundary */ +int zuf_rw_fallocate(struct inode *inode, uint mode, loff_t pos, loff_t len) +{ + struct zufs_ioc_IO io = {}; + int err; + + io.last_pos = (len == ~0ULL) ? ~0ULL : pos + len; + io.rw = mode; + + err = _IO_dispatch(SBI(inode->i_sb), &io, ZUII(inode), + ZUFS_OP_FALLOCATE, 0, NULL, 0, pos, 0); + return err; + +} + +static struct page *_addr_to_page(unsigned long addr) +{ + const void *p = (const void *)addr; + + return is_vmalloc_addr(p) ? vmalloc_to_page(p) : virt_to_page(p); +} + +static ssize_t _iov_iter_get_pages_kvec(struct iov_iter *ii, + struct page **pages, size_t maxsize, uint maxpages, + size_t *start) +{ + ssize_t bytes; + size_t i, nump; + unsigned long addr = (unsigned long)ii->kvec->iov_base; + + *start = addr & (PAGE_SIZE - 1); + bytes = min_t(ssize_t, iov_iter_single_seg_count(ii), maxsize); + nump = min_t(size_t, DIV_ROUND_UP(bytes + *start, PAGE_SIZE), maxpages); + + /* TODO: FUSE assumes single page for ITER_KVEC. Boaz: Remove? */ + WARN_ON(nump > 1); + + for (i = 0; i < nump; ++i) { + pages[i] = _addr_to_page(addr + (i * PAGE_SIZE)); + + get_page(pages[i]); + } + return bytes; +} + +static ssize_t _iov_iter_get_pages_any(struct iov_iter *ii, + struct page **pages, size_t maxsize, uint maxpages, + size_t *start) +{ + ssize_t bytes; + + bytes = unlikely(ii->type & ITER_KVEC) ? 
+ _iov_iter_get_pages_kvec(ii, pages, maxsize, maxpages, start) : + iov_iter_get_pages(ii, pages, maxsize, maxpages, start); + + if (unlikely(bytes < 0)) + zuf_dbg_err("[%d] bytes=%ld type=%d count=%lu", + smp_processor_id(), bytes, ii->type, ii->count); + + return bytes; +} + +static ssize_t _zufs_IO(struct zuf_sb_info *sbi, struct inode *inode, + void *on_stack, uint max_on_stack, + struct iov_iter *ii, struct kiocb *kiocb, + struct file_ra_state *ra, int operation, uint rw) +{ + int err = 0; + loff_t start_pos = kiocb->ki_pos; + loff_t pos = start_pos; + enum big_alloc_type bat; + struct page **pages; + uint max_pages = min_t(uint, + md_o2p_up(iov_iter_count(ii) + (pos & ~PAGE_MASK)), + ZUS_API_MAP_MAX_PAGES); + + pages = big_alloc(max_pages * sizeof(*pages), max_on_stack, on_stack, + GFP_NOFS, &bat); + if (unlikely(!pages)) { + zuf_err("Sigh on stack is best max_pages=%d\n", max_pages); + return -ENOMEM; + }; + + while (iov_iter_count(ii)) { + struct zufs_ioc_IO io = {}; + uint nump; + ssize_t bytes; + size_t pgoffset; + uint i; + + if (ra) { + io.ra.start = ra->start; + io.ra.ra_pages = ra->ra_pages; + io.ra.prev_pos = ra->prev_pos; + } + io.rw = rw; + + bytes = _iov_iter_get_pages_any(ii, pages, + ZUS_API_MAP_MAX_SIZE, + ZUS_API_MAP_MAX_PAGES, &pgoffset); + if (unlikely(bytes < 0)) { + err = bytes; + break; + } + + nump = DIV_ROUND_UP(bytes + pgoffset, PAGE_SIZE); + + io.last_pos = pos; + err = _IO_dispatch(sbi, &io, ZUII(inode), operation, + pgoffset, pages, nump, pos, bytes); + + bytes = io.last_pos - pos; + + zuf_dbg_rw("[%ld] %s [0x%llx-0x%zx]\n", + inode->i_ino, _pr_rw(rw), pos, bytes); + + iov_iter_advance(ii, bytes); + pos += bytes; + + if (ra) { + ra->start = io.ra.start; + ra->ra_pages = io.ra.ra_pages; + ra->prev_pos = io.ra.prev_pos; + } + if (io.wr_unmap.len) + unmap_mapping_range(inode->i_mapping, + io.wr_unmap.offset, + io.wr_unmap.len, 0); + + for (i = 0; i < nump; ++i) + put_page(pages[i]); + + if (unlikely(err)) + break; + } + + 
big_free(pages, bat); + + if (unlikely(pos == start_pos)) + return err; + + kiocb->ki_pos = pos; + return pos - start_pos; +} + +int _zufs_IO_get_multy(struct zuf_sb_info *sbi, struct inode *inode, + loff_t pos, ulong len, struct _io_gb_multy *io_gb) +{ + struct zufs_ioc_IO *IO = &io_gb->IO; + int err; + + IO->hdr.operation = ZUFS_OP_GET_MULTY; + IO->hdr.in_len = sizeof(*IO); + IO->hdr.out_len = sizeof(*IO); + IO->hdr.len = len; + IO->zus_ii = ZUII(inode)->zus_ii; + IO->filepos = pos; + IO->last_pos = pos; + + zuf_dispatch_init(&io_gb->zdo, &IO->hdr, NULL, 0); + io_gb->zdo.oh = rw_overflow_handler; + io_gb->zdo.sb = sbi->sb; + io_gb->zdo.inode = inode; + io_gb->zdo.bns = io_gb->bns; + + + err = __zufc_dispatch(ZUF_ROOT(sbi), &io_gb->zdo); + if (unlikely(err == -EZUFS_RETRY)) { + zuf_err("Unexpected ZUS return => %d\n", err); + err = -EIO; + } + + if (unlikely(err)) { + /* err from Server means no contract and NO bns locked + * so no puts + */ + if ((err != -ENOSPC) && (err != -EIO) && (err != -EINTR)) + zuf_warn("At this early stage show me %d\n", err); + if (io_gb->IO.ziom.iom_n) + zuf_err("Server Smoking iom_n=%u err=%d\n", + io_gb->IO.ziom.iom_n, err); + zuf_dbg_err("_IO_dispatch => %d\n", err); + return err; + } + if (unlikely(!io_gb->iom_n)) { + if (!io_gb->IO.ziom.iom_n) { + zuf_err("WANT tO SEE => %d\n", err); + return err; + } + + _extract_gb_multy_bns(io_gb, &io_gb->IO); + if (unlikely(!io_gb->iom_n)) { + zuf_err("WHAT ????\n"); + return err; + } + } + /* Even if _IO_dispatch returned a theoretical error but also some + * pages, we do the few pages and do an OP_PUT_MULTY (error ignored) + */ + return 0; +} + +void _zufs_IO_put_multy(struct zuf_sb_info *sbi, struct inode *inode, + struct _io_gb_multy *io_gb) +{ + bool put_now; + int err; + + put_now = io_gb->IO.ret_flags & + (ZUFS_RET_PUT_NOW | ZUFS_RET_NEW | ZUFS_RET_LOCKED_PUT); + + err = zufc_pigy_put(ZUF_ROOT(sbi), &io_gb->zdo, &io_gb->IO, + io_gb->iom_n, io_gb->bns, put_now); + if (unlikely(err)) + 
zuf_warn("zufc_pigy_put => %d\n", err); +} + +static inline int _read_one(struct zuf_sb_info *sbi, struct iov_iter *ii, + ulong bn, uint offset, uint len, int i) +{ + uint retl; + + if (!bn) { + retl = iov_iter_zero(len, ii); + } else { + void *addr = md_addr_verify(sbi->md, md_p2o(bn)); + + if (unlikely(!addr)) { + zuf_err("Server bad bn[%d]=0x%lx bytes_more=0x%lx\n", + i, bn, iov_iter_count(ii)); + return -EIO; + } + retl = copy_to_iter(addr + offset, len, ii); + } + if (unlikely(retl != len)) { + /* This can happen if we get a read_only Prt from App */ + zuf_dbg_err("copy_to_iter bn=0x%lx off=0x%x len=0x%x retl=0x%x\n", + bn, offset, len, retl); + return -EFAULT; + } + + return 0; +} + +static inline int _write_one(struct zuf_sb_info *sbi, struct iov_iter *ii, + ulong bn, uint offset, uint len, int i) +{ + void *addr = md_addr_verify(sbi->md, md_p2o(bn)); + uint retl; + + if (unlikely(!addr)) { + zuf_err("Server bad page[%d] bn=0x%lx bytes_more=0x%lx\n", + i, bn, iov_iter_count(ii)); + return -EIO; + } + + retl = _copy_from_iter_flushcache(addr + offset, len, ii); + if (unlikely(retl != len)) { + /* FIXME: This can happen if we get a read_only Prt from App */ + zuf_err("copy_to_iter bn=0x%lx off=0x%x len=0x%x retl=0x%x\n", + bn, offset, len, retl); + return -EFAULT; + } + return 0; +} + +static ssize_t _IO_gm_inner(struct zuf_sb_info *sbi, struct inode *inode, + ulong *bns, uint max_bns, + struct iov_iter *ii, struct file_ra_state *ra, + loff_t start, uint rw) +{ + loff_t pos = start; + uint offset = pos & (PAGE_SIZE - 1); + struct _io_gb_multy io_gb = { .bns = bns, }; + ssize_t size; + int err; + uint i; + + if (ra) { + io_gb.IO.ra.start = ra->start; + io_gb.IO.ra.ra_pages = ra->ra_pages; + io_gb.IO.ra.prev_pos = ra->prev_pos; + } + io_gb.IO.rw = rw; + + size = min_t(ssize_t, ZUS_API_MAP_MAX_SIZE - offset, + iov_iter_count(ii)); + err = _zufs_IO_get_multy(sbi, inode, pos, size, &io_gb); + if (unlikely(err)) + return err; + + if (ra) { + ra->start = 
io_gb.IO.ra.start; + ra->ra_pages = io_gb.IO.ra.ra_pages; + ra->prev_pos = io_gb.IO.ra.prev_pos; + } + + if (unlikely(io_gb.IO.last_pos != (pos + size))) { + if (unlikely(io_gb.IO.last_pos < pos)) { + zuf_err("Server bad last_pos(0x%llx) <= pos(0x%llx) len=0x%lx\n", + io_gb.IO.last_pos, pos, iov_iter_count(ii)); + err = -EIO; + goto out; + } + + zuf_dbg_err("Short %s start(0x%llx) len=0x%lx last_pos(0x%llx)\n", + _pr_rw(rw), pos, iov_iter_count(ii), + io_gb.IO.last_pos); + size = io_gb.IO.last_pos - pos; + } + + i = 0; + while (size) { + uint len; + ulong bn; + + len = min_t(uint, PAGE_SIZE - offset, size); + + bn = io_gb.bns[i]; + if (rw & WRITE) + err = _write_one(sbi, ii, bn, offset, len, i); + else + err = _read_one(sbi, ii, bn, offset, len, i); + if (unlikely(err)) + break; + + zuf_dbg_rw("[%ld] %s [0x%llx-0x%x] bn=0x%lx [%d]\n", + inode->i_ino, _pr_rw(rw), pos, len, bn, i); + + pos += len; + size -= len; + offset = 0; + if (io_gb.iom_n <= ++i) + break; + } +out: + _zufs_IO_put_multy(sbi, inode, &io_gb); + if (io_gb.IO.wr_unmap.len) + unmap_mapping_range(inode->i_mapping, io_gb.IO.wr_unmap.offset, + io_gb.IO.wr_unmap.len, 0); + + return unlikely(pos == start) ? 
err : pos - start; +} + +static ssize_t _IO_gm(struct zuf_sb_info *sbi, struct inode *inode, + ulong *on_stack, uint max_on_stack, + struct iov_iter *ii, struct kiocb *kiocb, + struct file_ra_state *ra, uint rw) +{ + ssize_t size = 0; + ssize_t ret = 0; + enum big_alloc_type bat; + ulong *bns; + uint max_bns = min_t(uint, + md_o2p_up(iov_iter_count(ii) + (kiocb->ki_pos & ~PAGE_MASK)), + ZUS_API_MAP_MAX_PAGES); + + bns = big_alloc(max_bns * sizeof(ulong), max_on_stack, on_stack, + GFP_NOFS, &bat); + if (unlikely(!bns)) { + zuf_err("life was more simple on the stack max_bns=%d\n", + max_bns); + return -ENOMEM; + } + + while (iov_iter_count(ii)) { + ret = _IO_gm_inner(sbi, inode, bns, max_bns, ii, ra, + kiocb->ki_pos, rw); + if (unlikely(ret < 0)) + break; + + kiocb->ki_pos += ret; + size += ret; + } + + big_free(bns, bat); + + return size ?: ret; +} + +ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode, + struct kiocb *kiocb, struct iov_iter *ii) +{ + long on_stack[ZUF_MAX_STACK(8) / sizeof(long)]; + ulong rw = READ | rand_tag(kiocb); + + /* EOF protection */ + if (unlikely(kiocb->ki_pos > i_size_read(inode))) + return 0; + + iov_iter_truncate(ii, i_size_read(inode) - kiocb->ki_pos); + if (unlikely(!iov_iter_count(ii))) { + /* Don't let zero len reads have any effect */ + zuf_dbg_rw("called with NULL len\n"); + return 0; + } + + if (zuf_is_nio_reads(inode)) + return _IO_gm(SBI(sb), inode, on_stack, sizeof(on_stack), + ii, kiocb, kiocb_ra(kiocb), rw); + + return _zufs_IO(SBI(sb), inode, on_stack, sizeof(on_stack), ii, + kiocb, kiocb_ra(kiocb), ZUFS_OP_READ, rw); +} + +ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode, + struct kiocb *kiocb, struct iov_iter *ii) +{ + long on_stack[ZUF_MAX_STACK(8) / sizeof(long)]; + ulong rw = WRITE; + + if (kiocb->ki_filp->f_flags & O_DSYNC || + IS_SYNC(kiocb->ki_filp->f_mapping->host)) + rw |= ZUFS_RW_DSYNC; + if (kiocb->ki_filp->f_flags & O_DIRECT) + rw |= ZUFS_RW_DIRECT; + + if 
(zuf_is_nio_writes(inode)) + return _IO_gm(SBI(sb), inode, on_stack, sizeof(on_stack), + ii, kiocb, kiocb_ra(kiocb), rw); + + return _zufs_IO(SBI(sb), inode, on_stack, sizeof(on_stack), + ii, kiocb, kiocb_ra(kiocb), ZUFS_OP_WRITE, rw); +} + +/* ~~~~ iom_dec.c ~~~ */ +/* for now here (at rw.c) looks logical */ + +static int __iom_add_t2_io_len(struct super_block *sb, struct t2_io_state *tis, + zu_dpp_t t1, ulong t2_bn, __u64 num_pages) +{ + void *ptr; + struct page *page; + int i, err; + + ptr = zuf_dpp_t_addr(sb, t1); + if (unlikely(!ptr)) { + zuf_err("Bad t1 zu_dpp_t t1=0x%llx t2=0x%lx num_pages=0x%llx\n", + t1, t2_bn, num_pages); + return -EFAULT; /* zuf_dpp_t_addr already yeld */ + } + + page = virt_to_page(ptr); + if (unlikely(!page)) { + zuf_err("bad t1(0x%llx)\n", t1); + return -EFAULT; + } + + for (i = 0; i < num_pages; ++i) { + err = t2_io_add(tis, t2_bn++, page++); + if (unlikely(err)) + return err; + } + return 0; +} + +static int iom_add_t2_io_len(struct super_block *sb, struct t2_io_state *tis, + __u64 **cur_e) +{ + struct zufs_iom_t2_io_len *t2iol = (void *)*cur_e; + int err = __iom_add_t2_io_len(sb, tis, t2iol->iom.t1_val, + _zufs_iom_first_val(&t2iol->iom.t2_val), + t2iol->num_pages); + + *cur_e = (void *)(t2iol + 1); + return err; +} + +static int iom_add_t2_io(struct super_block *sb, struct t2_io_state *tis, + __u64 **cur_e) +{ + struct zufs_iom_t2_io *t2io = (void *)*cur_e; + + int err = __iom_add_t2_io_len(sb, tis, t2io->t1_val, + _zufs_iom_first_val(&t2io->t2_val), 1); + + *cur_e = (void *)(t2io + 1); + return err; +} + +static int iom_t2_zusmem_io(struct super_block *sb, struct t2_io_state *tis, + __u64 **cur_e) +{ + struct zufs_iom_t2_zusmem_io *mem_io = (void *)*cur_e; + ulong t2_bn = _zufs_iom_first_val(&mem_io->t2_val); + ulong user_ptr = (ulong)mem_io->zus_mem_ptr; + int rw = _zufs_iom_opt_type(*cur_e) == IOM_T2_ZUSMEM_WRITE ? 
+ WRITE : READ; + int num_p = md_o2p_up(mem_io->len); + int num_p_r; + struct page *pages[16]; + int i, err = 0; + + if (16 < num_p) { + zuf_err("num_p(%d) > 16\n", num_p); + return -EINVAL; + } + + num_p_r = get_user_pages_fast(user_ptr, num_p, rw, + pages); + if (num_p_r != num_p) { + zuf_err("!!!! get_user_pages_fast num_p_r(%d) != num_p(%d)\n", + num_p_r, num_p); + err = -EFAULT; + goto out; + } + + for (i = 0; i < num_p_r && !err; ++i) + err = t2_io_add(tis, t2_bn++, pages[i]); + +out: + for (i = 0; i < num_p_r; ++i) + put_page(pages[i]); + + *cur_e = (void *)(mem_io + 1); + return err; +} + +static int iom_unmap(struct super_block *sb, struct inode *inode, __u64 **cur_e) +{ + struct zufs_iom_unmap *iom_unmap = (void *)*cur_e; + struct inode *inode_look = NULL; + ulong unmap_index = _zufs_iom_first_val(&iom_unmap->unmap_index); + ulong unmap_n = iom_unmap->unmap_n; + ulong ino = iom_unmap->ino; + + if (!inode || ino) { + if (WARN_ON(!ino)) { + zuf_err("[%ld] 0x%lx-0x%lx\n", + inode ? inode->i_ino : -1, unmap_index, + unmap_n); + goto out; + } + inode_look = ilookup(sb, ino); + if (!inode_look) { + /* From the time we requested an unmap to now + * inode was evicted from cache so surely it no longer + * have any mappings. Cool job was already done for us. + * Even if a racing thread reloads the inode it will + * not have this mapping we wanted to clear, but only + * new ones. + * TODO: For now warn when this happen, because in + * current usage it cannot happen. 
But before + * upstream we should convert to zuf_dbg_err + */ + zuf_warn("[%ld] 0x%lx-0x%lx\n", + ino, unmap_index, unmap_n); + goto out; + } + + inode = inode_look; + } + + zuf_dbg_rw("[%ld] 0x%lx-0x%lx\n", inode->i_ino, unmap_index, unmap_n); + + unmap_mapping_range(inode->i_mapping, md_p2o(unmap_index), + md_p2o(unmap_n), 0); + + if (inode_look) + iput(inode_look); + +out: + *cur_e = (void *)(iom_unmap + 1); + return 0; +} + +static int iom_wbinv(__u64 **cur_e) +{ + wbinvd(); + + ++*cur_e; + + return 0; +} + +struct _iom_exec_info { + struct super_block *sb; + struct inode *inode; + struct t2_io_state *rd_tis; + struct t2_io_state *wr_tis; + __u64 *iom_e; + uint iom_n; + bool print; +}; + +static int _iom_execute_inline(struct _iom_exec_info *iei) +{ + __u64 *cur_e, *end_e; + int err = 0; +#ifdef CONFIG_ZUF_DEBUG + uint wrs = 0; + uint rds = 0; + uint uns = 0; + uint wrmem = 0; + uint rdmem = 0; + uint wbinv = 0; +# define WRS() (++wrs) +# define RDS() (++rds) +# define UNS() (++uns) +# define WRMEM() (++wrmem) +# define RDMEM() (++rdmem) +# define WBINV() (++wbinv) +#else +# define WRS() +# define RDS() +# define UNS() +# define WRMEM() +# define RDMEM() +# define WBINV() +#endif /* !def CONFIG_ZUF_DEBUG */ + + cur_e = iei->iom_e; + end_e = cur_e + iei->iom_n; + while (cur_e && (cur_e < end_e)) { + uint op; + + op = _zufs_iom_opt_type(cur_e); + + switch (op) { + case IOM_NONE: + return 0; + + case IOM_T2_WRITE: + err = iom_add_t2_io(iei->sb, iei->wr_tis, &cur_e); + WRS(); + break; + case IOM_T2_READ: + err = iom_add_t2_io(iei->sb, iei->rd_tis, &cur_e); + RDS(); + break; + + case IOM_T2_WRITE_LEN: + err = iom_add_t2_io_len(iei->sb, iei->wr_tis, &cur_e); + WRS(); + break; + case IOM_T2_READ_LEN: + err = iom_add_t2_io_len(iei->sb, iei->rd_tis, &cur_e); + RDS(); + break; + + case IOM_T2_ZUSMEM_WRITE: + err = iom_t2_zusmem_io(iei->sb, iei->wr_tis, &cur_e); + WRMEM(); + break; + case IOM_T2_ZUSMEM_READ: + err = iom_t2_zusmem_io(iei->sb, iei->rd_tis, &cur_e); + 
RDMEM(); + break; + + case IOM_UNMAP: + err = iom_unmap(iei->sb, iei->inode, &cur_e); + UNS(); + break; + + case IOM_WBINV: + err = iom_wbinv(&cur_e); + WBINV(); + break; + + default: + zuf_err("!!!!! Bad opt %d\n", + _zufs_iom_opt_type(cur_e)); + err = -EIO; + break; + } + + if (unlikely(err)) + break; + } + +#ifdef CONFIG_ZUF_DEBUG + zuf_dbg_rw("exec wrs=%d rds=%d uns=%d rdmem=%d wrmem=%d => %d\n", + wrs, rds, uns, rdmem, wrmem, err); +#endif + + return err; +} + +/* inode here is the default inode if ioc_unmap->ino is zero + * this is an optimization for the unmap done at the write_iter hot path. + */ +int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode, + __u64 *iom_e_user, uint iom_n) +{ + struct zuf_sb_info *sbi = SBI(sb); + struct t2_io_state rd_tis = {}; + struct t2_io_state wr_tis = {}; + struct _iom_exec_info iei = {}; + int err, err_r, err_w; + + t2_io_begin(sbi->md, READ, NULL, 0, -1, &rd_tis); + t2_io_begin(sbi->md, WRITE, NULL, 0, -1, &wr_tis); + + iei.sb = sb; + iei.inode = inode; + iei.rd_tis = &rd_tis; + iei.wr_tis = &wr_tis; + iei.iom_e = iom_e_user; + iei.iom_n = iom_n; + iei.print = 0; + + err = _iom_execute_inline(&iei); + + err_r = t2_io_end(&rd_tis, true); + err_w = t2_io_end(&wr_tis, true); + + /* TODO: not sure if OK when _iom_execute returns with -ENOMEM + * In such a case, we might be better off skipping t2_io_ends. 
+ */ + return err ?: (err_r ?: err_w); +} + +int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb, + __u64 *iom_e_user, uint iom_n) +{ + zuf_err("Async IOM NOT supported Yet!!!\n"); + return -EFAULT; +} diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index c0049c1d5ba3..11300fd79929 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -25,6 +25,20 @@ #include "relay.h" enum { INITIAL_ZT_CHANNELS = 3 }; +#define _ZT_MAX_PIGY_PUT \ + ((ZUS_API_MAP_MAX_PAGES * sizeof(__u64) + \ + sizeof(struct zufs_ioc_IO)) * INITIAL_ZT_CHANNELS) + +enum { PG0 = 0, PG1 = 1, PG2 = 2, PG3 = 3, PG4 = 4, PG5 = 5 }; +struct __pigi_put_it { + void *buff; + void *waiter; + uint s; /* total encoded bytes */ + uint last; /* So we can update last zufs_ioc_hdr->flags */ + bool needs_goosing; + ulong inodes[PG5 + 1]; + uint ic; +}; struct zufc_thread { struct zuf_special_file hdr; @@ -40,6 +54,12 @@ struct zufc_thread { /* Next operation*/ struct zuf_dispatch_op *zdo; + + /* Secondary chans point to the 0-channel's + * pigi_put_chan0 + */ + struct __pigi_put_it pigi_put_chan0; + struct __pigi_put_it *pigi_put; }; struct zuf_threads_pool { @@ -76,7 +96,14 @@ const char *zuf_op_name(enum e_zufs_operation op) CASE_ENUM_NAME(ZUFS_OP_RENAME); CASE_ENUM_NAME(ZUFS_OP_READDIR); + CASE_ENUM_NAME(ZUFS_OP_READ); + CASE_ENUM_NAME(ZUFS_OP_PRE_READ); + CASE_ENUM_NAME(ZUFS_OP_WRITE); CASE_ENUM_NAME(ZUFS_OP_SETATTR); + + CASE_ENUM_NAME(ZUFS_OP_GET_MULTY); + CASE_ENUM_NAME(ZUFS_OP_PUT_MULTY); + CASE_ENUM_NAME(ZUFS_OP_NOOP); case ZUFS_OP_MAX_OPT: default: return "UNKNOWN"; @@ -543,6 +570,238 @@ static void _prep_header_size_op(struct zufs_ioc_hdr *hdr, hdr->err = err; } +/* ~~~~~ pigi_put logic ~~~~~ */ +struct _goose_waiter { + struct kref kref; + struct zuf_root_info *zri; + ulong inode; /* We use the inode address as a unique tag */ +}; + +static void _last_goose(struct kref *kref) +{ + struct _goose_waiter *gw = container_of(kref, typeof(*gw), kref); + + 
wake_up_var(&gw->kref); +} + +static void _goose_put(struct _goose_waiter *gw) +{ + kref_put(&gw->kref, _last_goose); +} + +static void _goose_get(struct _goose_waiter *gw) +{ + kref_get(&gw->kref); +} + +static void _goose_wait(struct _goose_waiter *gw) +{ + wait_var_event(&gw->kref, !kref_read(&gw->kref)); +} + +static void _pigy_put_encode(struct zufs_ioc_IO *io, + struct zufs_ioc_IO *io_user, ulong *bns) +{ + uint i; + + *io_user = *io; + for (i = 0; i < io->ziom.iom_n; ++i) + _zufs_iom_enc_bn(&io_user->ziom.iom_e[i], bns[i], 0); + + io_user->hdr.in_len = _ioc_IO_size(io->ziom.iom_n); +} + +static void pigy_put_dh(struct zuf_dispatch_op *zdo, void *pzt, void *parg) +{ + struct zufs_ioc_IO *io = container_of(zdo->hdr, typeof(*io), hdr); + struct zufs_ioc_IO *io_user = parg; + + _pigy_put_encode(io, io_user, zdo->bns); +} + +static int _pigy_put_now(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo) +{ + int err; + + zdo->dh = pigy_put_dh; + + err = __zufc_dispatch(zri, zdo); + if (unlikely(err == -EZUFS_RETRY)) { + zuf_err("Unexpected ZUS return => %d\n", err); + err = -EIO; + } + return err; +} + +int zufc_pigy_put(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo, + struct zufs_ioc_IO *io, uint iom_n, ulong *bns, bool do_now) +{ + struct zufc_thread *zt; + struct zufs_ioc_IO *io_user; + uint pigi_put_s; + int cpu; + + io->hdr.operation = ZUFS_OP_PUT_MULTY; + io->hdr.out_len = 0; /* No returns from put */ + io->ret_flags = 0; + io->ziom.iom_n = iom_n; + zdo->bns = bns; + + pigi_put_s = _ioc_IO_size(iom_n); + + /* FIXME: Pedantic check remove please */ + if (WARN_ON(zdo->__locked_zt && !do_now)) + do_now = true; + + cpu = get_cpu(); + + zt = _zt_from_cpu(zri, cpu, 0); + if (do_now || (zt->pigi_put->s + pigi_put_s > _ZT_MAX_PIGY_PUT) || + (zt->pigi_put->ic > PG5)) { + put_cpu(); + + /* NOTE: pigy_put buffer is full, We dispatch a put NOW + * which will also take with it the full pigy_put buffer. 
+ * At the server the pigy_put will be done first then this + * one, so order of puts is preserved, not that it matters + */ + if (!do_now) + zuf_dbg_perf( + "[%ld] iom_n=0x%x zt->pigi_put->s=0x%x + 0x%x > 0x%lx ic=%d\n", + zdo->inode->i_ino, iom_n, zt->pigi_put->s, + pigi_put_s, _ZT_MAX_PIGY_PUT, + zt->pigi_put->ic++); + + return _pigy_put_now(zri, zdo); + } + + /* Mark last one as has more */ + if (zt->pigi_put->s) { + io_user = zt->pigi_put->buff + zt->pigi_put->last; + io_user->hdr.flags |= ZUFS_H_HAS_PIGY_PUT; + } + + io_user = zt->pigi_put->buff + zt->pigi_put->s; + _pigy_put_encode(io, io_user, bns); + zt->pigi_put->last = zt->pigi_put->s; + zt->pigi_put->s += pigi_put_s; + zt->pigi_put->inodes[zt->pigi_put->ic++] = (ulong)zdo->inode; + + put_cpu(); + return 0; +} + +/* Add the pigy_put accumulated buff to current command + * Always runs in the context of a ZT + */ +static void _pigy_put_add_to_ioc(struct zuf_root_info *zri, + struct zufc_thread *zt) +{ + struct zufs_ioc_hdr *hdr = zt->opt_buff; + struct __pigi_put_it *pigi = zt->pigi_put; + + if (unlikely(!pigi->s)) + return; + + if (unlikely(pigi->s + hdr->in_len > zt->max_zt_command)) { + zuf_err("!!! 
Should not pigi_put->s(%d) + in_len(%d) > max_zt_command(%ld)\n", + pigi->s, hdr->in_len, zt->max_zt_command); + /*TODO we must check at init time that max_zt_command not too + * small + */ + return; + } + + memcpy((void *)hdr + hdr->in_len, pigi->buff, pigi->s); + hdr->flags |= ZUFS_H_HAS_PIGY_PUT; + pigi->s = pigi->last = 0; + pigi->ic = 0; + /* for every 3 channels */ + pigi->inodes[PG0] = pigi->inodes[PG1] = pigi->inodes[PG2] = 0; + pigi->inodes[PG3] = pigi->inodes[PG4] = pigi->inodes[PG5] = 0; +} + +static void _goose_prep(struct zuf_root_info *zri, + struct zufc_thread *zt) +{ + _prep_header_size_op(zt->opt_buff, ZUFS_OP_NOOP, 0); + _pigy_put_add_to_ioc(zri, zt); + + zt->pigi_put->needs_goosing = false; +} + +static inline bool _zt_pigi_has_inode(struct __pigi_put_it *pigi, + ulong inode) +{ + return pigi->ic && + ((pigi->inodes[PG0] == inode) || + (pigi->inodes[PG1] == inode) || + (pigi->inodes[PG2] == inode) || + (pigi->inodes[PG3] == inode) || + (pigi->inodes[PG4] == inode) || + (pigi->inodes[PG5] == inode)); +} + +static void _goose_one(void *info) +{ + struct _goose_waiter *gw = info; + struct zuf_root_info *zri = gw->zri; + struct zufc_thread *zt; + int cpu = smp_processor_id(); + uint c; + + /* Look for least busy channel. All busy we are left with zt0 */ + for (c = INITIAL_ZT_CHANNELS; c; --c) { + zt = _zt_from_cpu(zri, cpu, c - 1); + if (unlikely(!(zt && zt->hdr.file))) + return; /* We are crashing */ + + if (!zt->pigi_put->s || zt->pigi_put->needs_goosing) + return; /* this cpu is goose empty */ + + if (!_zt_pigi_has_inode(zt->pigi_put, gw->inode)) + return; + if (!zt->zdo) + break; + } + + /* Tell them to ... 
*/ + zt->pigi_put->needs_goosing = true; + _goose_get(gw); + zt->pigi_put->waiter = gw; + if (!zt->zdo) + relay_fss_wakeup(&zt->relay); +} + +/* NOTE: @inode must not be NULL */ +void zufc_goose_all_zts(struct zuf_root_info *zri, struct inode *inode) +{ + struct _goose_waiter gw; + + if (!S_ISREG(inode->i_mode) || !(inode->i_size || inode->i_blocks)) + return; + + /* No point in two goosers fighting; we are goosing for everyone. + * This ensures only one zt->pigi_put->waiter at a time. + */ + mutex_lock(&zri->sbl_lock); + + gw.zri = zri; + kref_init(&gw.kref); + gw.inode = (ulong)inode; + + on_each_cpu(_goose_one, &gw, true); + + if (kref_read(&gw.kref) == 1) + goto out; + + _goose_put(&gw); /* put kref_init's 1 */ + _goose_wait(&gw); + +out: + mutex_unlock(&zri->sbl_lock); +} + /* ~~~~~ ZT thread operations ~~~~~ */ static int _zu_init(struct file *file, void *parg) @@ -591,6 +850,24 @@ static int _zu_init(struct file *file, void *parg) goto out; } + if (zt->chan == 0) { + zt->pigi_put = &zt->pigi_put_chan0; + + zt->pigi_put->buff = vmalloc(_ZT_MAX_PIGY_PUT); + if (unlikely(!zt->pigi_put->buff)) { + vfree(zt->opt_buff); + zi_init.hdr.err = -ENOMEM; + goto out; + } + zt->pigi_put->needs_goosing = false; + zt->pigi_put->last = zt->pigi_put->s = 0; + } else { + struct zufc_thread *zt0; + + zt0 = _zt_from_cpu(ZRI(file->f_inode->i_sb), cpu, 0); + zt->pigi_put = &zt0->pigi_put_chan0; + } + file->private_data = &zt->hdr; out: err = copy_to_user(parg, &zi_init, sizeof(zi_init)); @@ -625,6 +902,9 @@ static void zufc_zt_release(struct file *file) msleep(1000); /* crap */ } + if (zt->chan == 0) + vfree(zt->pigi_put->buff); + vfree(zt->opt_buff); memset(zt, 0, sizeof(*zt)); } @@ -706,9 +986,25 @@ static int _copy_outputs(struct zufc_thread *zt, void *arg) } } +static bool _need_channel_lock(struct zufc_thread *zt) +{ + struct zufs_ioc_IO *ret_io = zt->opt_buff; + + /* Only ZUF_GET_MULTY is allowed channel locking + * because it absolutely must and I trust the code. 
+ * If You need a new channel locking command come talk + * to me first. + */ + return (ret_io->hdr.err == 0) && + (ret_io->hdr.operation == ZUFS_OP_GET_MULTY) && + (ret_io->ret_flags & ZUFS_RET_LOCKED_PUT) && + (ret_io->ziom.iom_n != 0); +} + static int _zu_wait(struct file *file, void *parg) { struct zufc_thread *zt; + struct zufs_ioc_hdr *user_hdr; bool __chan_is_locked = false; int err; @@ -730,6 +1026,10 @@ static int _zu_wait(struct file *file, void *parg) goto err; } + user_hdr = zt->opt_buff; + if (user_hdr->flags & ZUFS_H_HAS_PIGY_PUT) + user_hdr->flags &= ~ZUFS_H_HAS_PIGY_PUT; + if (relay_is_app_waiting(&zt->relay)) { if (unlikely(!zt->zdo)) { zuf_err("User has gone...\n"); @@ -751,13 +1051,29 @@ static int _zu_wait(struct file *file, void *parg) _unmap_pages(zt, zt->zdo->pages, zt->zdo->nump); - zt->zdo = NULL; + if (unlikely(!err && _need_channel_lock(zt))) { + zt->zdo->__locked_zt = zt; + __chan_is_locked = true; + } else { + zt->zdo = NULL; + } if (unlikely(err)) /* _copy_outputs returned an err */ goto err; relay_app_wakeup(&zt->relay); } + if (zt->pigi_put->needs_goosing && !__chan_is_locked) { + /* go do a cycle and come back */ + _goose_prep(ZRI(file->f_inode->i_sb), zt); + return 0; + } + + if (zt->pigi_put->waiter) { + _goose_put(zt->pigi_put->waiter); + zt->pigi_put->waiter = NULL; + } + err = __relay_fss_wait(&zt->relay, __chan_is_locked); if (err) zuf_dbg_err("[%d] relay error: %d\n", zt->no, err); @@ -770,8 +1086,16 @@ static int _zu_wait(struct file *file, void *parg) * we should have a bit set in zt->zdo->hdr set per operation. * TODO: Why this does not work? 
*/ - _map_pages(zt, zt->zdo->pages, zt->zdo->nump, 0); + _map_pages(zt, zt->zdo->pages, zt->zdo->nump, + zt->zdo->hdr->operation == ZUFS_OP_WRITE); + if (zt->pigi_put->s) + _pigy_put_add_to_ioc(ZRI(file->f_inode->i_sb), zt); } else { + if (zt->pigi_put->needs_goosing) { + _goose_prep(ZRI(file->f_inode->i_sb), zt); + return 0; + } + /* This Means we were released by _zu_break */ zuf_dbg_zus("_zu_break? => %d\n", err); _prep_header_size_op(zt->opt_buff, ZUFS_OP_BREAK, err); @@ -953,6 +1277,30 @@ static inline struct zu_exec_buff *_ebuff_from_file(struct file *file) return ebuff; } +static int _ebuff_bounds_check(struct zu_exec_buff *ebuff, ulong buff, + struct zufs_iomap *ziom, + struct zufs_iomap *user_ziom, void *ziom_end) +{ + size_t iom_max_bytes = ziom_end - (void *)&user_ziom->iom_e; + + if (buff != ebuff->vma->vm_start || + ebuff->vma->vm_end < buff + iom_max_bytes) { + WARN_ON_ONCE(1); + zuf_err("Executing out off bound vm_start=0x%lx vm_end=0x%lx buff=0x%lx buff_end=0x%lx\n", + ebuff->vma->vm_start, ebuff->vma->vm_end, buff, + buff + iom_max_bytes); + return -EINVAL; + } + + if (unlikely((iom_max_bytes / sizeof(__u64) < ziom->iom_max))) + return -EINVAL; + + if (unlikely(ziom->iom_max < ziom->iom_n)) + return -EINVAL; + + return 0; +} + static int _zu_ebuff_alloc(struct file *file, void *arg) { struct zufs_ioc_alloc_buffer ioc_alloc; @@ -1004,6 +1352,52 @@ static void zufc_ebuff_release(struct file *file) kfree(ebuff); } +static int _zu_iomap_exec(struct file *file, void *arg) +{ + struct zuf_root_info *zri = ZRI(file->f_inode->i_sb); + struct zu_exec_buff *ebuff = _ebuff_from_file(file); + struct zufs_ioc_iomap_exec ioc_iomap; + struct zufs_ioc_iomap_exec *user_iomap; + + struct super_block *sb; + int err; + + if (unlikely(!ebuff)) + return -EINVAL; + + user_iomap = ebuff->opt_buff; + /* do all checks on a kernel copy so malicious Server cannot + * crash the Kernel + */ + ioc_iomap = *user_iomap; + + err = _ebuff_bounds_check(ebuff, (ulong)arg, 
&ioc_iomap.ziom, + &user_iomap->ziom, + ebuff->opt_buff + ebuff->alloc_size); + if (unlikely(err)) { + zuf_err("illegal iomap: iom_max=%u iom_n=%u\n", + ioc_iomap.ziom.iom_max, ioc_iomap.ziom.iom_n); + return err; + } + + /* The ID of the super block received in mount */ + sb = zuf_sb_from_id(zri, ioc_iomap.sb_id, ioc_iomap.zus_sbi); + if (unlikely(!sb)) + return -EINVAL; + + if (ioc_iomap.wait_for_done) + err = zuf_iom_execute_sync(sb, NULL, user_iomap->ziom.iom_e, + ioc_iomap.ziom.iom_n); + else + err = zuf_iom_execute_async(sb, ioc_iomap.ziom.iomb, + user_iomap->ziom.iom_e, + ioc_iomap.ziom.iom_n); + + user_iomap->hdr.err = err; + zuf_dbg_core("OUT => %d\n", err); + return 0; /* report err at hdr, but the command was executed */ +}; + /* ~~~~ ioctl & release handlers ~~~~ */ static int _zu_register_fs(struct file *file, void *parg) { @@ -1069,6 +1463,8 @@ long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg) return _zu_wait(file, parg); case ZU_IOC_ALLOC_BUFFER: return _zu_ebuff_alloc(file, parg); + case ZU_IOC_IOMAP_EXEC: + return _zu_iomap_exec(file, parg); case ZU_IOC_PRIVATE_MOUNT: return _zu_private_mounter(file, parg); case ZU_IOC_BREAK_ALL: diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h index 2d5327e1d2b1..2c57c51a2099 100644 --- a/fs/zuf/zuf.h +++ b/fs/zuf/zuf.h @@ -402,6 +402,13 @@ static inline int zuf_flt_to_err(vm_fault_t flt) return -EACCES; } +struct _io_gb_multy { + struct zuf_dispatch_op zdo; + struct zufs_ioc_IO IO; + ulong iom_n; + ulong *bns; +}; + /* Keep this include last thing in file */ #include "_extern.h" diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index 2bdf047282e8..e3a783748ce6 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -456,7 +456,15 @@ enum e_zufs_operation { ZUFS_OP_RENAME = 10, ZUFS_OP_READDIR = 11, + ZUFS_OP_READ = 14, + ZUFS_OP_PRE_READ = 15, + ZUFS_OP_WRITE = 16, ZUFS_OP_SETATTR = 19, + ZUFS_OP_FALLOCATE = 21, + + ZUFS_OP_GET_MULTY = 29, + ZUFS_OP_PUT_MULTY = 30, + ZUFS_OP_NOOP = 31, ZUFS_OP_MAX_OPT, }; @@ 
-646,10 +654,253 @@ struct zufs_ioc_attr { __u32 pad; }; +/* ~~~~ io_map structures && IOCTL(s) ~~~~ */ +/* + * This set of structures and helpers is used in the return of zufs_ioc_IO and + * also at ZU_IOC_IOMAP_EXEC, as a NULL-terminated list (array). + * + * Each iom_element starts with an __u64 whose 8 high bits carry an + * operation_type, and whose 56-bit value denotes a page offset (md_o2p()) or a + * length. operation_type is one of the ZUFS_IOM_TYPE enum. + * The interpreter then jumps to the next operation depending on the size + * of the defined operation. + */ + +enum ZUFS_IOM_TYPE { + IOM_NONE = 0, + IOM_T1_WRITE = 1, + IOM_T1_READ = 2, + + IOM_T2_WRITE = 3, + IOM_T2_READ = 4, + IOM_T2_WRITE_LEN = 5, + IOM_T2_READ_LEN = 6, + + IOM_T2_ZUSMEM_WRITE = 7, + IOM_T2_ZUSMEM_READ = 8, + + IOM_UNMAP = 9, + IOM_WBINV = 10, + IOM_REPEAT = 11, + + IOM_NUM_LEGAL_OPT, +}; + +#define ZUFS_IOM_VAL_BITS 56 +#define ZUFS_IOM_FIRST_VAL_MASK ((1UL << ZUFS_IOM_VAL_BITS) - 1) + +static inline enum ZUFS_IOM_TYPE _zufs_iom_opt_type(__u64 *iom_e) +{ + uint ret = (*iom_e) >> ZUFS_IOM_VAL_BITS; + + if (ret >= IOM_NUM_LEGAL_OPT) + return IOM_NONE; + return (enum ZUFS_IOM_TYPE)ret; +} + +static inline bool _zufs_iom_pop(__u64 *iom_e) +{ + return _zufs_iom_opt_type(iom_e) != IOM_NONE; +} + +static inline ulong _zufs_iom_first_val(__u64 *iom_elemets) +{ + return *iom_elemets & ZUFS_IOM_FIRST_VAL_MASK; +} + +static inline void _zufs_iom_enc_type_val(__u64 *ptr, enum ZUFS_IOM_TYPE type, + ulong val) +{ + *ptr = (__u64)val | ((__u64)type << ZUFS_IOM_VAL_BITS); +} + +static inline ulong _zufs_iom_t1_bn(__u64 val) +{ + if (unlikely(_zufs_iom_opt_type(&val) != IOM_T1_READ)) + return -1; + + return zu_dpp_t_bn(_zufs_iom_first_val(&val)); +} + +static inline void _zufs_iom_enc_bn(__u64 *ptr, ulong bn, uint pool) +{ + _zufs_iom_enc_type_val(ptr, IOM_T1_READ, zu_enc_dpp_t_bn(bn, pool)); +} + +/* IOM_T1_WRITE / IOM_T1_READ + * May be followed by an IOM_REPEAT + */ +struct zufs_iom_t1_io { + /* 
Special dpp_t that denotes a page ie: bn << 3 | zu_dpp_t_pool */ + __u64 t1_val; +}; + +/* IOM_T2_WRITE / IOM_T2_READ */ +struct zufs_iom_t2_io { + __u64 t2_val; + zu_dpp_t t1_val; +}; + +/* IOM_T2_WRITE_LEN / IOM_T2_READ_LEN */ +struct zufs_iom_t2_io_len { + struct zufs_iom_t2_io iom; + __u64 num_pages; +}; + +/* IOM_T2_ZUSMEM_WRITE / IOM_T2_ZUSMEM_READ */ +struct zufs_iom_t2_zusmem_io { + __u64 t2_val; + __u64 zus_mem_ptr; /* needs a get_user_pages() */ + __u64 len; +}; + +/* IOM_UNMAP: + * Executes unmap_mapping_range & removal of zuf's block-caching + * + * For now iom_unmap means even_cows=0, because the Kernel takes care of all + * the even_cows=1 cases. In the future, if needed, it will be on the high + * bit of unmap_n. + */ +struct zufs_iom_unmap { + __u64 unmap_index; /* Offset in pages of inode */ + __u64 unmap_n; /* Num pages to unmap (0 means: to eof) */ + __u64 ino; /* Pages of this inode */ +}; + +#define ZUFS_WRITE_OP_SPACE \ + ((sizeof(struct zufs_iom_unmap) + \ + sizeof(struct zufs_iom_t2_io)) / sizeof(__u64) + sizeof(__u64)) + +struct zus_iomap_build; +/* For ZUFS_OP_IOM_DONE */ +struct zufs_ioc_iomap_done { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_sb_info *zus_sbi; + + /* The cookie received from zufs_ioc_iomap_exec */ + struct zus_iomap_build *iomb; +}; + +struct zufs_iomap { + /* A cookie from zus to return when execution is done */ + struct zus_iomap_build *iomb; + + __u32 iom_max; /* num of __u64 allocated */ + __u32 iom_n; /* num of valid __u64 in iom_e */ + __u64 iom_e[0]; /* encoded operations to execute */ + + /* This struct must be last */ +}; + +/* + * Execute an iomap on behalf of the Server + * + * NOTE: this IOCTL must come on a ZU_IOC_ALLOC_BUFFER type file (see above) + * and the passed arg-buffer must be the pointer returned from an mmap + * call performed on the file, before the call to this IOC. + * If this is not done, the IOCTL will return EINVAL. 
+ */ +struct zufs_ioc_iomap_exec { + struct zufs_ioc_hdr hdr; + /* The ID of the super block received in mount */ + __u64 sb_id; + /* We verify the sb_id validity against zus_sbi */ + struct zus_sb_info *zus_sbi; + /* If application buffers they are from this IO*/ + __u64 zt_iocontext; + /* Only return from IOCTL when finished. iomap_done NOT called */ + __u32 wait_for_done; + __u32 __pad; + + struct zufs_iomap ziom; /* must be last */ +}; +#define ZU_IOC_IOMAP_EXEC _IOWR('Z', 19, struct zufs_ioc_iomap_exec) + +/* + * ZUFS_OP_READ / ZUFS_OP_WRITE / ZUFS_OP_FALLOCATE + * also + * ZUFS_OP_GET_MULTY / ZUFS_OP_PUT_MULTY + */ +/* flags for zufs_ioc_IO->ret_flags */ +enum { + ZUFS_RET_RESERVED = 0x0001, /* Not used */ + ZUFS_RET_NEW = 0x0002, /* In WRITE, allocated a new block */ + ZUFS_RET_IOM_ALL_PMEM = 0x0004, /* iom_e[] is encoded with pmem-bn */ + ZUFS_RET_PUT_NOW = 0x0008, /* GET_MULTY demands no pigi-puts */ + ZUFS_RET_LOCKED_PUT = 0x0010, /* Same as PUT_NOW but must lock a zt + * channel, Because GET took a lock + */ +}; + +/* flags for zufs_ioc_IO->rw */ +#define ZUFS_RW_WRITE BIT(0) /* SAME as WRITE in Kernel */ +#define ZUFS_RW_MMAP BIT(1) + +#define ZUFS_RW_RAND BIT(4) /* fadvise(random) */ + +/* Same meaning as IOCB_XXXX different bits */ +#define ZUFS_RW_KERN 8 +#define ZUFS_RW_EVENTFD BIT(ZUFS_RW_KERN + 0) +#define ZUFS_RW_APPEND BIT(ZUFS_RW_KERN + 1) +#define ZUFS_RW_DIRECT BIT(ZUFS_RW_KERN + 2) +#define ZUFS_RW_HIPRI BIT(ZUFS_RW_KERN + 3) +#define ZUFS_RW_DSYNC BIT(ZUFS_RW_KERN + 4) +#define ZUFS_RW_SYNC BIT(ZUFS_RW_KERN + 5) +#define ZUFS_RW_NOWAIT BIT(ZUFS_RW_KERN + 7) +#define ZUFS_RW_LAST_USED_BIT (ZUFS_RW_KERN + 7) +/* ^^ PLEASE update (keep last) ^^ */ + +/* 8 bits left for user */ +#define ZUFS_RW_USER_BITS 0xFF000000 +#define ZUFS_RW_USER BIT(24) + /* Special flag for ZUFS_OP_FALLOCATE to specify a setattr(SIZE) * IE. same as punch hole but set_i_size to be @filepos. 
In this * case @last_pos == ~0ULL */ #define ZUFS_FL_TRUNCATE 0x80000000 +struct zufs_ioc_IO { + struct zufs_ioc_hdr hdr; + + /* IN */ + struct zus_inode_info *zus_ii; + __u64 filepos; + __u64 rw; /* One or more of ZUFS_RW_XXX */ + __u32 ret_flags; /* OUT - ZUFS_RET_XXX OUT */ + __u32 pool; /* All dpp_t(s) belong to this pool */ + __u64 cookie; /* For FS private use */ + + /* in / OUT */ + /* For read-ahead (or alloc ahead) */ + struct __zufs_ra { + union { + ulong start; + __u64 __start; + }; + __u64 prev_pos; + __u32 ra_pages; + __u32 ra_pad; /* we need this */ + } ra; + + /* For writes TODO: encode at iom_e? */ + struct __zufs_write_unmap { + __u32 offset; + __u32 len; + } wr_unmap; + + /* The last offset in this IO. If 0, then the error code is at .hdr.err */ + /* for ZUFS_OP_FALLOCATE this is the requested end offset */ + __u64 last_pos; + + struct zufs_iomap ziom; + __u64 iom_e[ZUFS_WRITE_OP_SPACE]; /* One tier_up for WRITE or GB */ +}; + +static inline uint _ioc_IO_size(uint iom_n) +{ + return offsetof(struct zufs_ioc_IO, iom_e) + iom_n * sizeof(__u64); +} + #endif /* _LINUX_ZUFS_API_H */ From patchwork Thu Sep 26 02:07:21 2019 X-Patchwork-Submitter: Boaz Harrosh X-Patchwork-Id: 11161827
From: Boaz Harrosh To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams Subject: [PATCH 12/16] zuf: mmap & sync Date: Thu, 26 Sep 2019 05:07:21 +0300 Message-Id: <20190926020725.19601-13-boazh@netapp.com> In-Reply-To: <20190926020725.19601-1-boazh@netapp.com> References: <20190926020725.19601-1-boazh@netapp.com> On page-fault, call the zusFS for the page information. We always mmap pmem pages directly (no page cache). With write-mmap and pmem, we need to keep track of dirty inodes and call the zusFS when one of the sync variants is called. This is because the Server will need to do a cl_flush on all dirty pages. If we do not have any write-mmapped pages on the inode, sync does nothing. [v2] zuf: pmem mmap must be 2M aligned We only support huge pages on pmem mmap (2M). Prevent mmap on pmem with VM addresses unaligned to 2M. [Under valgrind it would try to give us an unaligned address and bypass the zufr_get_unmapped_area(). 
By returning an error valgrind backs off and everything works again ] Signed-off-by: Boaz Harrosh --- fs/zuf/Makefile | 2 +- fs/zuf/_extern.h | 6 + fs/zuf/file.c | 66 ++++++++++ fs/zuf/inode.c | 10 ++ fs/zuf/mmap.c | 300 ++++++++++++++++++++++++++++++++++++++++++++++ fs/zuf/super.c | 89 ++++++++++++++ fs/zuf/t1.c | 9 ++ fs/zuf/zuf-core.c | 2 + fs/zuf/zuf.h | 3 + fs/zuf/zus_api.h | 26 ++++ 10 files changed, 512 insertions(+), 1 deletion(-) create mode 100644 fs/zuf/mmap.c diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile index 23bc3791a001..02df1374a946 100644 --- a/fs/zuf/Makefile +++ b/fs/zuf/Makefile @@ -17,6 +17,6 @@ zuf-y += md.o t1.o t2.o zuf-y += zuf-core.o zuf-root.o # Main FS -zuf-y += rw.o +zuf-y += rw.o mmap.o zuf-y += super.o inode.o directory.o namei.o file.o symlink.o zuf-y += module.o diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index 745d0cc9e719..cafda97c973c 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -64,8 +64,11 @@ int zuf_private_mount(struct zuf_root_info *zri, struct register_fs_info *rfi, int zuf_private_umount(struct zuf_root_info *zri, struct super_block *sb); struct super_block *zuf_sb_from_id(struct zuf_root_info *zri, __u64 sb_id, struct zus_sb_info *zus_sbi); +void zuf_sync_inc(struct inode *inode); +void zuf_sync_dec(struct inode *inode, ulong write_unmapped); /* file.c */ +int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync); long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len); /* namei.c */ @@ -114,6 +117,9 @@ int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb, int zuf_rw_file_range_compare(struct inode *i_in, loff_t pos_in, struct inode *i_out, loff_t pos_out, loff_t len); +/* mmap.c */ +int zuf_file_mmap(struct file *file, struct vm_area_struct *vma); + /* t1.c */ int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma); diff --git a/fs/zuf/file.c b/fs/zuf/file.c index 8711b44371e0..7fcaf085bf8e 100644 --- a/fs/zuf/file.c +++ 
b/fs/zuf/file.c @@ -23,6 +23,70 @@ long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) return -ENOTSUPP; } +/* This function is called by both msync() and fsync(). */ +int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync) +{ + struct zuf_inode_info *zii = ZUII(inode); + struct zufs_ioc_sync ioc_range = { + .hdr.in_len = sizeof(ioc_range), + .hdr.operation = ZUFS_OP_SYNC, + .zus_ii = zii->zus_ii, + .offset = start, + .flags = datasync ? ZUFS_SF_DATASYNC : 0, + }; + loff_t isize; + ulong uend = end + 1; + int err = 0; + + zuf_dbg_vfs( + "[%ld] start=0x%llx end=0x%llx datasync=%d write_mapped=%d\n", + inode->i_ino, start, end, datasync, + atomic_read(&zii->write_mapped)); + + /* We want to serialize the syncs so they don't fight with each other + * and is though more efficient, but we do not want to lock out + * read/writes and page-faults so we have a special sync semaphore + */ + zuf_smw_lock(zii); + + isize = i_size_read(inode); + if (!isize) { + zuf_dbg_mmap("[%ld] file is empty\n", inode->i_ino); + goto out; + } + if (isize < uend) + uend = isize; + if (uend < start) { + zuf_dbg_mmap("[%ld] isize=0x%llx start=0x%llx end=0x%lx\n", + inode->i_ino, isize, start, uend); + err = -ENODATA; + goto out; + } + + if (!atomic_read(&zii->write_mapped)) + goto out; /* Nothing to do on this inode */ + + ioc_range.length = uend - start; + unmap_mapping_range(inode->i_mapping, start, ioc_range.length, 0); + zufc_goose_all_zts(ZUF_ROOT(SBI(inode->i_sb)), inode); + + err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_range.hdr, + NULL, 0); + if (unlikely(err)) + zuf_dbg_err("zufc_dispatch failed => %d\n", err); + + zuf_sync_dec(inode, ioc_range.write_unmapped); + +out: + zuf_smw_unlock(zii); + return err; +} + +static int zuf_fsync(struct file *file, loff_t start, loff_t end, int datasync) +{ + return zuf_isync(file_inode(file), start, end, datasync); +} + static ssize_t zuf_read_iter(struct kiocb *kiocb, struct iov_iter *ii) { 
struct inode *inode = file_inode(kiocb->ki_filp); @@ -95,6 +159,8 @@ const struct file_operations zuf_file_operations = { .open = generic_file_open, .read_iter = zuf_read_iter, .write_iter = zuf_write_iter, + .mmap = zuf_file_mmap, + .fsync = zuf_fsync, }; const struct inode_operations zuf_file_inode_operations = { diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c index 27660979ed6f..1e3dba654f34 100644 --- a/fs/zuf/inode.c +++ b/fs/zuf/inode.c @@ -271,6 +271,7 @@ void zuf_evict_inode(struct inode *inode) { struct super_block *sb = inode->i_sb; struct zuf_inode_info *zii = ZUII(inode); + int write_mapped; if (!inode->i_nlink) { if (unlikely(!zii->zi)) { @@ -311,6 +312,15 @@ void zuf_evict_inode(struct inode *inode) zii->zus_ii = NULL; zii->zi = NULL; + /* ZUS on evict has synced all mmap dirty pages, YES? */ + write_mapped = atomic_read(&zii->write_mapped); + if (unlikely(write_mapped || !list_empty(&zii->i_mmap_dirty))) { + zuf_dbg_mmap("[%ld] !!!! write_mapped=%d list_empty=%d\n", + inode->i_ino, write_mapped, + list_empty(&zii->i_mmap_dirty)); + zuf_sync_dec(inode, write_mapped); + } + clear_inode(inode); } diff --git a/fs/zuf/mmap.c b/fs/zuf/mmap.c new file mode 100644 index 000000000000..318c701f7d7d --- /dev/null +++ b/fs/zuf/mmap.c @@ -0,0 +1,300 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * mmap operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. 
+ * + * Authors: + * Boaz Harrosh + */ + +#include +#include "zuf.h" + +/* ~~~ Functions for mmap and page faults ~~~ */ + +/* MAP_PRIVATE, copy data to user private page (cow_page) */ +static int _cow_private_page(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct inode *inode = vma->vm_file->f_mapping->host; + struct zuf_sb_info *sbi = SBI(inode->i_sb); + int err; + + /* Basically a READ into vmf->cow_page */ + err = zuf_rw_read_page(sbi, inode, vmf->cow_page, + md_p2o(vmf->pgoff)); + if (unlikely(err && err != -EINTR)) { + zuf_err("[%ld] read_page failed bn=0x%lx address=0x%lx => %d\n", + inode->i_ino, vmf->pgoff, vmf->address, err); + /* FIXME: Probably return VM_FAULT_SIGBUS */ + } + + /* HACK: This is a hack since kernel v4.7, where a VM_FAULT_LOCKED with + * vmf->page==NULL is no longer supported. Looks like for now this way + * works well. We let mm mess around with unlocking and putting its own + * cow_page. + */ + vmf->page = vmf->cow_page; + get_page(vmf->page); + lock_page(vmf->page); + + return VM_FAULT_LOCKED; +} + +static inline ulong _gb_bn(struct zufs_ioc_IO *get_block) +{ + if (unlikely(!get_block->ziom.iom_n)) + return 0; + + return _zufs_iom_t1_bn(get_block->iom_e[0]); +} + +static vm_fault_t zuf_write_fault(struct vm_area_struct *vma, + struct vm_fault *vmf) +{ + struct inode *inode = vma->vm_file->f_mapping->host; + struct zuf_sb_info *sbi = SBI(inode->i_sb); + struct zuf_inode_info *zii = ZUII(inode); + struct zus_inode *zi = zii->zi; + ulong bn; + struct _io_gb_multy io_gb = { + .IO.rw = WRITE | ZUFS_RW_MMAP, + .bns = &bn, + }; + vm_fault_t fault = VM_FAULT_SIGBUS; + ulong addr = vmf->address; + ulong pmem_bn; + pgoff_t size; + pfn_t pfnt; + ulong pfn; + int err; + + zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx " + "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n", + _zi_ino(zi), vma->vm_start, vma->vm_end, addr, vmf->pgoff, + vmf->flags, vmf->cow_page, vmf->page); + + sb_start_pagefault(inode->i_sb); + 
zuf_smr_lock_pagefault(zii); + + size = md_o2p_up(i_size_read(inode)); + if (unlikely(vmf->pgoff >= size)) { + ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start); + + zuf_dbg_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n", + _zi_ino(zi), vmf->pgoff, pgoff, size); + + fault = VM_FAULT_SIGBUS; + goto out; + } + + if (vmf->cow_page) { + fault = _cow_private_page(vma, vmf); + goto out; + } + + zus_inode_cmtime_now(inode, zi); + /* NOTE: zus needs to flush the zi */ + + err = _zufs_IO_get_multy(sbi, inode, md_p2o(vmf->pgoff), PAGE_SIZE, + &io_gb); + if (unlikely(err)) { + zuf_dbg_err("_get_put_block failed => %d\n", err); + goto out; + } + pmem_bn = _gb_bn(&io_gb.IO); + if (unlikely(pmem_bn == 0)) { + zuf_err("[%ld] pmem_bn=0 rw=0x%llx ret_flags=0x%x but no error?\n", + _zi_ino(zi), io_gb.IO.rw, io_gb.IO.ret_flags); + fault = VM_FAULT_SIGBUS; + goto out; + } + + if (io_gb.IO.ret_flags & ZUFS_RET_NEW) { + /* newly created block */ + inode->i_blocks = le64_to_cpu(zii->zi->i_blocks); + } + unmap_mapping_range(inode->i_mapping, vmf->pgoff << PAGE_SHIFT, + PAGE_SIZE, 0); + + pfn = md_pfn(sbi->md, pmem_bn); + pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV); + fault = vmf_insert_mixed_mkwrite(vma, addr, pfnt); + err = zuf_flt_to_err(fault); + if (unlikely(err)) { + zuf_err("[%ld] vm_insert_mixed_mkwrite failed => fault=0x%x err=%d\n", + _zi_ino(zi), (int)fault, err); + goto put; + } + + zuf_dbg_mmap("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n", + _zi_ino(zi), pfn, vma->vm_page_prot.pgprot, err); + + zuf_sync_inc(inode); +put: + _zufs_IO_put_multy(sbi, inode, &io_gb); +out: + zuf_smr_unlock(zii); + sb_end_pagefault(inode->i_sb); + return fault; +} + +static vm_fault_t zuf_pfn_mkwrite(struct vm_fault *vmf) +{ + return zuf_write_fault(vmf->vma, vmf); +} + +static vm_fault_t zuf_read_fault(struct vm_area_struct *vma, + struct vm_fault *vmf) +{ + struct inode *inode = vma->vm_file->f_mapping->host; + struct zuf_sb_info *sbi = SBI(inode->i_sb); + 
struct zuf_inode_info *zii = ZUII(inode); + struct zus_inode *zi = zii->zi; + ulong bn; + struct _io_gb_multy io_gb = { + .IO.rw = READ | ZUFS_RW_MMAP, + .bns = &bn, + }; + vm_fault_t fault = VM_FAULT_SIGBUS; + ulong addr = vmf->address; + ulong pmem_bn; + pgoff_t size; + pfn_t pfnt; + int err; + + zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx " + "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n", + _zi_ino(zi), vma->vm_start, vma->vm_end, addr, vmf->pgoff, + vmf->flags, vmf->cow_page, vmf->page); + + zuf_smr_lock_pagefault(zii); + + size = md_o2p_up(i_size_read(inode)); + if (unlikely(vmf->pgoff >= size)) { + ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start); + + zuf_dbg_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n", + _zi_ino(zi), vmf->pgoff, pgoff, size); + goto out; + } + + if (vmf->cow_page) { + zuf_warn("cow is read\n"); + fault = _cow_private_page(vma, vmf); + goto out; + } + + file_accessed(vma->vm_file); + /* NOTE: zus needs to flush the zi */ + + err = _zufs_IO_get_multy(sbi, inode, md_p2o(vmf->pgoff), PAGE_SIZE, + &io_gb); + if (unlikely(err && err != -EINTR)) { + zuf_err("_get_put_block failed => %d\n", err); + goto out; + } + + pmem_bn = _gb_bn(&io_gb.IO); + if (pmem_bn == 0) { + /* Hole in file */ + pfnt = pfn_to_pfn_t(my_zero_pfn(vmf->address)); + } else { + /* We have a real page */ + pfnt = phys_to_pfn_t(PFN_PHYS(md_pfn(sbi->md, pmem_bn)), + PFN_MAP | PFN_DEV); + } + fault = vmf_insert_mixed(vma, addr, pfnt); + err = zuf_flt_to_err(fault); + if (unlikely(err)) { + zuf_err("[%ld] vm_insert_mixed => fault=0x%x err=%d\n", + _zi_ino(zi), (int)fault, err); + goto put; + } + + zuf_dbg_mmap("[%ld] vm_insert_mixed pmem_bn=0x%lx fault=%d\n", + _zi_ino(zi), pmem_bn, fault); + +put: + if (pmem_bn) + _zufs_IO_put_multy(sbi, inode, &io_gb); +out: + zuf_smr_unlock(zii); + return fault; +} + +static vm_fault_t zuf_fault(struct vm_fault *vmf) +{ + bool write_fault = (0 != (vmf->flags & FAULT_FLAG_WRITE)); + + if (write_fault) + 
return zuf_write_fault(vmf->vma, vmf); + else + return zuf_read_fault(vmf->vma, vmf); +} + +static void zuf_mmap_open(struct vm_area_struct *vma) +{ + struct zuf_inode_info *zii = ZUII(file_inode(vma->vm_file)); + + atomic_inc(&zii->vma_count); +} + +static void zuf_mmap_close(struct vm_area_struct *vma) +{ + struct inode *inode = file_inode(vma->vm_file); + int vma_count = atomic_dec_return(&ZUII(inode)->vma_count); + + if (unlikely(vma_count < 0)) + zuf_err("[%ld] WHAT??? vma_count=%d\n", + inode->i_ino, vma_count); + else if (unlikely(vma_count == 0)) { + struct zuf_inode_info *zii = ZUII(inode); + struct zufs_ioc_mmap_close mmap_close = {}; + int err; + + mmap_close.hdr.operation = ZUFS_OP_MMAP_CLOSE; + mmap_close.hdr.in_len = sizeof(mmap_close); + + mmap_close.zus_ii = zii->zus_ii; + mmap_close.rw = 0; /* TODO: Do we need this */ + + zuf_smr_lock(zii); + + err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &mmap_close.hdr, + NULL, 0); + if (unlikely(err)) + zuf_dbg_err("[%ld] err=%d\n", inode->i_ino, err); + + zuf_smr_unlock(zii); + } +} + +static const struct vm_operations_struct zuf_vm_ops = { + .fault = zuf_fault, + .pfn_mkwrite = zuf_pfn_mkwrite, + .open = zuf_mmap_open, + .close = zuf_mmap_close, +}; + +int zuf_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct inode *inode = file_inode(file); + struct zuf_inode_info *zii = ZUII(inode); + + file_accessed(file); + + vma->vm_ops = &zuf_vm_ops; + + atomic_inc(&zii->vma_count); + + zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n", + file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end, + vma->vm_flags, pgprot_val(vma->vm_page_prot)); + + return 0; +} diff --git a/fs/zuf/super.c b/fs/zuf/super.c index abd7e6cb2a4a..2a0db11b51d6 100644 --- a/fs/zuf/super.c +++ b/fs/zuf/super.c @@ -737,6 +737,90 @@ static int zuf_update_s_wtime(struct super_block *sb) return 0; } +static void _sync_add_inode(struct inode *inode) +{ + struct zuf_sb_info *sbi = SBI(inode->i_sb); + struct 
zuf_inode_info *zii = ZUII(inode); + + zuf_dbg_mmap("[%ld] write_mapped=%d\n", + inode->i_ino, atomic_read(&zii->write_mapped)); + + spin_lock(&sbi->s_mmap_dirty_lock); + + /* Because we remove the inodes lazily, only on an fsync + * or an evict_inode, it is fine if we are called multiple times. + */ + if (list_empty(&zii->i_mmap_dirty)) + list_add(&zii->i_mmap_dirty, &sbi->s_mmap_dirty); + + spin_unlock(&sbi->s_mmap_dirty_lock); +} + +static void _sync_remove_inode(struct inode *inode) +{ + struct zuf_sb_info *sbi = SBI(inode->i_sb); + struct zuf_inode_info *zii = ZUII(inode); + + zuf_dbg_mmap("[%ld] write_mapped=%d\n", + inode->i_ino, atomic_read(&zii->write_mapped)); + + spin_lock(&sbi->s_mmap_dirty_lock); + list_del_init(&zii->i_mmap_dirty); + spin_unlock(&sbi->s_mmap_dirty_lock); +} + +void zuf_sync_inc(struct inode *inode) +{ + struct zuf_inode_info *zii = ZUII(inode); + + if (1 == atomic_inc_return(&zii->write_mapped)) + _sync_add_inode(inode); +} + +/* zuf_sync_dec will unmap in batches */ +void zuf_sync_dec(struct inode *inode, ulong write_unmapped) +{ + struct zuf_inode_info *zii = ZUII(inode); + + if (0 == atomic_sub_return(write_unmapped, &zii->write_mapped)) + _sync_remove_inode(inode); +} + +/* + * We must fsync any mmap-active inodes + */ +static int zuf_sync_fs(struct super_block *sb, int wait) +{ + struct zuf_sb_info *sbi = SBI(sb); + struct zuf_inode_info *zii, *t; + enum {to_clean_size = 120}; + struct zuf_inode_info *zii_to_clean[to_clean_size]; + uint i, to_clean; + + zuf_dbg_vfs("Syncing wait=%d\n", wait); +more_inodes: + spin_lock(&sbi->s_mmap_dirty_lock); + to_clean = 0; + list_for_each_entry_safe(zii, t, &sbi->s_mmap_dirty, i_mmap_dirty) { + list_del_init(&zii->i_mmap_dirty); + zii_to_clean[to_clean++] = zii; + if (to_clean >= to_clean_size) + break; + } + spin_unlock(&sbi->s_mmap_dirty_lock); + + if (!to_clean) + return 0; + + for (i = 0; i < to_clean; ++i) + zuf_isync(&zii_to_clean[i]->vfs_inode, 0, ~0 - 1, 1); + + if 
(to_clean == to_clean_size) + goto more_inodes; + + return 0; +} + static struct inode *zuf_alloc_inode(struct super_block *sb) { struct zuf_inode_info *zii; @@ -759,7 +843,11 @@ static void _init_once(void *foo) struct zuf_inode_info *zii = foo; inode_init_once(&zii->vfs_inode); + INIT_LIST_HEAD(&zii->i_mmap_dirty); zii->zi = NULL; + init_rwsem(&zii->in_sync); + atomic_set(&zii->vma_count, 0); + atomic_set(&zii->write_mapped, 0); } int __init zuf_init_inodecache(void) @@ -789,6 +877,7 @@ static struct super_operations zuf_sops = { .put_super = zuf_put_super, .freeze_fs = zuf_update_s_wtime, .unfreeze_fs = zuf_update_s_wtime, + .sync_fs = zuf_sync_fs, .statfs = zuf_statfs, .remount_fs = zuf_remount, .show_options = zuf_show_options, diff --git a/fs/zuf/t1.c b/fs/zuf/t1.c index 46ea7f6181fc..1f2db5a674d5 100644 --- a/fs/zuf/t1.c +++ b/fs/zuf/t1.c @@ -124,6 +124,15 @@ int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma) if (!zsf || zsf->type != zlfs_e_pmem) return -EPERM; + /* Valgrind may interfere with our 2M mmap aligned vma start + * (See zufr_get_unmapped_area). 
Tell the guys to back off + */ + if (unlikely(vma->vm_start & ~PMD_MASK)) { + zuf_err("mmap is not 2M aligned vm_start=0x%lx\n", + vma->vm_start); + return -EINVAL; + } + vma->vm_flags |= VM_HUGEPAGE; vma->vm_ops = &t1_vm_ops; diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index 11300fd79929..cb4a4def646f 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -99,7 +99,9 @@ const char *zuf_op_name(enum e_zufs_operation op) CASE_ENUM_NAME(ZUFS_OP_READ); CASE_ENUM_NAME(ZUFS_OP_PRE_READ); CASE_ENUM_NAME(ZUFS_OP_WRITE); + CASE_ENUM_NAME(ZUFS_OP_MMAP_CLOSE); CASE_ENUM_NAME(ZUFS_OP_SETATTR); + CASE_ENUM_NAME(ZUFS_OP_SYNC); CASE_ENUM_NAME(ZUFS_OP_GET_MULTY); CASE_ENUM_NAME(ZUFS_OP_PUT_MULTY); diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h index 2c57c51a2099..fe479cb70f97 100644 --- a/fs/zuf/zuf.h +++ b/fs/zuf/zuf.h @@ -132,6 +132,9 @@ struct zuf_inode_info { /* Stuff for mmap write */ struct rw_semaphore in_sync; + struct list_head i_mmap_dirty; + atomic_t write_mapped; + atomic_t vma_count; /* cookies from Server */ struct zus_inode *zi; diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index e3a783748ce6..e70bd8b7ff69 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -459,7 +459,9 @@ enum e_zufs_operation { ZUFS_OP_READ = 14, ZUFS_OP_PRE_READ = 15, ZUFS_OP_WRITE = 16, + ZUFS_OP_MMAP_CLOSE = 17, ZUFS_OP_SETATTR = 19, + ZUFS_OP_SYNC = 20, ZUFS_OP_FALLOCATE = 21, ZUFS_OP_GET_MULTY = 29, @@ -645,6 +647,13 @@ static inline bool zufs_zde_emit(struct zufs_readdir_iter *rdi, __u64 ino, } #endif /* ndef __cplusplus */ +struct zufs_ioc_mmap_close { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode_info *zus_ii; + __u64 rw; /* Some flags + READ or WRITE */ +}; + /* ZUFS_OP_SETATTR */ struct zufs_ioc_attr { struct zufs_ioc_hdr hdr; @@ -654,6 +663,23 @@ struct zufs_ioc_attr { __u32 pad; }; +/* ZUFS_OP_SYNC */ +enum ZUFS_SYNC_FLAGS { + ZUFS_SF_DATASYNC = 0x00000001, + ZUFS_SF_DONTNEED = 0x00000100, +}; + +struct zufs_ioc_sync { + struct zufs_ioc_hdr hdr; + /* IN */ + 
struct zus_inode_info *zus_ii; + __u64 offset, length; + __u64 flags; + + /* OUT */ + __u64 write_unmapped; +}; + /* ~~~~ io_map structures && IOCTL(s) ~~~~ */ /* * These set of structures and helpers are used in return of zufs_ioc_IO and From patchwork Thu Sep 26 02:07:22 2019 X-Patchwork-Submitter: Boaz Harrosh X-Patchwork-Id: 11161829 From: Boaz Harrosh To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams Subject: [PATCH 13/16] zuf: More file operation Date: Thu, 26 Sep 2019 05:07:22 +0300 Message-Id: <20190926020725.19601-14-boazh@netapp.com> In-Reply-To: <20190926020725.19601-1-boazh@netapp.com> References: <20190926020725.19601-1-boazh@netapp.com> X-Mailing-List: linux-fsdevel@vger.kernel.org Add more file/inode operations: vector function operation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .llseek zuf_llseek 
ZUFS_OP_LLSEEK .fallocate zuf_fallocate ZUFS_OP_FALLOCATE .copy_file_range zuf_copy_file_range ZUFS_OP_COPY .remap_file_range zuf_clone_file_range ZUFS_OP_CLONE .fadvise zuf_fadvise (multiple see rw.c) .fiemap zuf_fiemap ZUFS_OP_FIEMAP See more comments in source code. [v2] SQUASHME zuf: fadvise fix up missing operations Mainly there was a bug found by Vlad: POSIX_FADV_RANDOM was missing and therefore was returning an error, and some tests were failing. But while at it, actually implement all the missing advice. Just punch the proper flags into file->ra. FIXME: There is a pending patch by Jan to export generic_fadvise; for now duplicate what we need inline. [v3] zuf: lock two zii fix [v4] zuf: Reduce stack usage (fiemap) Same as for IO, use big_alloc to prevent a compilation warning Signed-off-by: Boaz Harrosh --- fs/zuf/_extern.h | 3 + fs/zuf/file.c | 650 +++++++++++++++++++++++++++++++++++++++++++++- fs/zuf/rw.c | 92 +++++++ fs/zuf/zuf-core.c | 5 + fs/zuf/zus_api.h | 83 ++++++ 5 files changed, 832 insertions(+), 1 deletion(-) diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index cafda97c973c..2c7456724ef6 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -110,6 +110,9 @@ int _zufs_IO_get_multy(struct zuf_sb_info *sbi, struct inode *inode, void _zufs_IO_put_multy(struct zuf_sb_info *sbi, struct inode *inode, struct _io_gb_multy *io_gb); int zuf_rw_fallocate(struct inode *inode, uint mode, loff_t offset, loff_t len); +int zuf_rw_fadvise(struct super_block *sb, struct file *file, + loff_t offset, loff_t len, int advise, bool rand); + int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode, __u64 *iom_e, uint iom_n); int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb, diff --git a/fs/zuf/file.c b/fs/zuf/file.c index 7fcaf085bf8e..1c51529694e7 100644 --- a/fs/zuf/file.c +++ b/fs/zuf/file.c @@ -15,12 +15,158 @@ #include #include +#include +#include +#include #include "zuf.h" long __zuf_fallocate(struct inode *inode, 
int mode, loff_t offset, loff_t len) { - return -ENOTSUPP; + struct zuf_inode_info *zii = ZUII(inode); + bool need_len_check, need_unmap; + loff_t unmap_len = 0; /* 0 means all file */ + loff_t new_size = len + offset; + loff_t i_size = i_size_read(inode); + int err = 0; + + zuf_dbg_vfs("[%ld] mode=0x%x offset=0x%llx len=0x%llx\n", + inode->i_ino, mode, offset, len); + + if (!S_ISREG(inode->i_mode)) + return -EINVAL; + if (IS_SWAPFILE(inode)) + return -ETXTBSY; + + /* These are all the FL flags we know how to handle on the kernel side + * a zusFS that does not support one of these can just return + * EOPNOTSUPP. + */ + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | + FALLOC_FL_NO_HIDE_STALE | FALLOC_FL_COLLAPSE_RANGE | + FALLOC_FL_ZERO_RANGE | FALLOC_FL_INSERT_RANGE | + FALLOC_FL_UNSHARE_RANGE | ZUFS_FL_TRUNCATE)){ + zuf_dbg_err("Unsupported mode(0x%x)\n", mode); + return -EOPNOTSUPP; + } + + if (mode & FALLOC_FL_PUNCH_HOLE) { + need_len_check = false; + need_unmap = true; + unmap_len = len; + } else if (mode & ZUFS_FL_TRUNCATE) { + need_len_check = true; + new_size = offset; + need_unmap = true; + } else if (mode & FALLOC_FL_COLLAPSE_RANGE) { + need_len_check = false; + need_unmap = true; + } else if (mode & FALLOC_FL_INSERT_RANGE) { + need_len_check = true; + new_size = i_size + len; + need_unmap = true; + } else if (mode & FALLOC_FL_ZERO_RANGE) { + need_len_check = !(mode & FALLOC_FL_KEEP_SIZE); + need_unmap = true; + } else { + /* FALLOC_FL_UNSHARE_RANGE same as regular */ + need_len_check = !(mode & FALLOC_FL_KEEP_SIZE); + need_unmap = false; + } + + if (need_len_check && (new_size > i_size)) { + err = inode_newsize_ok(inode, new_size); + if (unlikely(err)) { + zuf_dbg_err("inode_newsize_ok(0x%llx) => %d\n", + new_size, err); + goto out; + } + } + + if (need_unmap) { + zufc_goose_all_zts(ZUF_ROOT(SBI(inode->i_sb)), inode); + unmap_mapping_range(inode->i_mapping, offset, unmap_len, 1); + } + + zus_inode_cmtime_now(inode, zii->zi); + + err = 
zuf_rw_fallocate(inode, mode, offset, len); + + /* Even if we had an error these might have changed */ + i_size_write(inode, le64_to_cpu(zii->zi->i_size)); + inode->i_blocks = le64_to_cpu(zii->zi->i_blocks); + +out: + return err; +} + +static long zuf_fallocate(struct file *file, int mode, loff_t offset, + loff_t len) +{ + struct inode *inode = file->f_inode; + struct zuf_inode_info *zii = ZUII(inode); + int err; + + zuf_w_lock(zii); + + err = __zuf_fallocate(inode, mode, offset, len); + + zuf_w_unlock(zii); + return err; +} + +static loff_t zuf_llseek(struct file *file, loff_t offset, int whence) +{ + struct inode *inode = file->f_path.dentry->d_inode; + struct zuf_inode_info *zii = ZUII(inode); + struct zufs_ioc_seek ioc_seek = { + .hdr.in_len = sizeof(ioc_seek), + .hdr.out_len = sizeof(ioc_seek), + .hdr.operation = ZUFS_OP_LLSEEK, + .zus_ii = zii->zus_ii, + .offset_in = offset, + .whence = whence, + }; + int err = 0; + + zuf_dbg_vfs("[%ld] offset=0x%llx whence=%d\n", + inode->i_ino, offset, whence); + + if (whence != SEEK_DATA && whence != SEEK_HOLE) + return generic_file_llseek(file, offset, whence); + + zuf_r_lock(zii); + + if ((offset < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET)) || + offset > inode->i_sb->s_maxbytes) { + err = -EINVAL; + goto out; + } else if (inode->i_size <= offset) { + err = -ENXIO; + goto out; + } else if (!inode->i_blocks) { + if (whence == SEEK_HOLE) + ioc_seek.offset_out = i_size_read(inode); + else + err = -ENXIO; + goto out; + } + + err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_seek.hdr, NULL, 0); + if (unlikely(err)) { + zuf_dbg_err("zufc_dispatch failed => %d\n", err); + goto out; + } + + if (ioc_seek.offset_out != file->f_pos) { + file->f_pos = ioc_seek.offset_out; + file->f_version = 0; + } + +out: + zuf_r_unlock(zii); + + return err ?: ioc_seek.offset_out; } /* This function is called by both msync() and fsync(). 
*/ @@ -87,6 +233,481 @@ static int zuf_fsync(struct file *file, loff_t start, loff_t end, int datasync) return zuf_isync(file_inode(file), start, end, datasync); } +/* This callback is called when a file is closed */ +static int zuf_flush(struct file *file, fl_owner_t id) +{ + zuf_dbg_vfs("[%ld]\n", file->f_inode->i_ino); + return 0; +} + +static int zuf_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 offset, u64 length) +{ + struct super_block *sb = inode->i_sb; + struct zuf_inode_info *zii = ZUII(inode); + struct zufs_ioc_fiemap ioc_fiemap = { + .hdr.operation = ZUFS_OP_FIEMAP, + .hdr.in_len = sizeof(ioc_fiemap), + .hdr.out_len = sizeof(ioc_fiemap), + .zus_ii = zii->zus_ii, + .start = offset, + .length = length, + .flags = fieinfo->fi_flags, + }; + long on_stack[ZUF_MAX_STACK(160) / sizeof(long)]; + struct page **pages = NULL; + enum big_alloc_type bat = 0; + uint nump = 0, extents_max = 0; + int i, err; + + zuf_dbg_vfs("[%ld] offset=0x%llx len=0x%llx extents_max=%u flags=0x%x\n", + inode->i_ino, offset, length, fieinfo->fi_extents_max, + fieinfo->fi_flags); + + /* TODO: Have support for FIEMAP_FLAG_XATTR */ + err = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC); + if (unlikely(err)) + return err; + + if (likely(fieinfo->fi_extents_max)) { + ulong start = (ulong)fieinfo->fi_extents_start; + ulong len = fieinfo->fi_extents_max * + sizeof(struct fiemap_extent); + ulong offset = start & (PAGE_SIZE - 1); + ulong end_offset = (offset + len) & (PAGE_SIZE - 1); + ulong __len; + uint nump_r; + + nump = md_o2p_up(offset + len); + if (ZUS_API_MAP_MAX_PAGES < nump) + nump = ZUS_API_MAP_MAX_PAGES; + + __len = nump * PAGE_SIZE - offset; + if (end_offset) + __len -= (PAGE_SIZE - end_offset); + + extents_max = __len / sizeof(struct fiemap_extent); + + ioc_fiemap.hdr.len = extents_max * sizeof(struct fiemap_extent); + ioc_fiemap.hdr.offset = offset; + + pages = big_alloc(nump * sizeof(*pages), sizeof(on_stack), + on_stack, GFP_KERNEL, &bat); + if 
(unlikely(!pages)) + return -ENOMEM; + + nump_r = get_user_pages_fast(start, nump, WRITE, pages); + if (unlikely(nump != nump_r)) { + err = -EFAULT; + goto free; + } + } + ioc_fiemap.extents_max = extents_max; + + zuf_r_lock(zii); + + err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_fiemap.hdr, pages, nump); + if (unlikely(err)) { + zuf_dbg_err("zufs_dispatch failed => %d\n", err); + goto out; + } + + fieinfo->fi_extents_mapped = ioc_fiemap.extents_mapped; + if (unlikely(extents_max && + (extents_max < ioc_fiemap.extents_mapped))) { + zuf_err("extents_max=%d extents_mapped=%d\n", extents_max, + ioc_fiemap.extents_mapped); + err = -EINVAL; + } + +out: + zuf_r_unlock(zii); + + for (i = 0; i < nump; ++i) + put_page(pages[i]); +free: + big_free(pages, bat); + + return err; +} + +/* ~~~~~ clone/copy range ~~~~~ */ + +/* + * Copy/paste from Kernel mm/filemap.c::generic_remap_checks + * FIXME: make it EXPORT_GPL + */ +static int _access_check_limits(struct file *file, loff_t pos, + loff_t *count) +{ + struct inode *inode = file->f_mapping->host; + loff_t max_size = inode->i_sb->s_maxbytes; + + if (!(file->f_flags & O_LARGEFILE)) + max_size = MAX_NON_LFS; + + if (unlikely(pos >= max_size)) + return -EFBIG; + *count = min(*count, max_size - pos); + return 0; +} + +static int _write_check_limits(struct file *file, loff_t pos, + loff_t *count) +{ + + loff_t limit = rlimit(RLIMIT_FSIZE); + + if (limit != RLIM_INFINITY) { + if (pos >= limit) { + send_sig(SIGXFSZ, current, 0); + return -EFBIG; + } + *count = min(*count, limit - pos); + } + + return _access_check_limits(file, pos, count); +} + +static int _remap_checks(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t *req_count, unsigned int remap_flags) +{ + struct inode *inode_in = file_in->f_mapping->host; + struct inode *inode_out = file_out->f_mapping->host; + uint64_t count = *req_count; + uint64_t bcount; + loff_t size_in, size_out; + loff_t bs = inode_out->i_sb->s_blocksize; + int ret; + + 
/* The start of both ranges must be aligned to an fs block. */ + if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs)) + return -EINVAL; + + /* Ensure offsets don't wrap. */ + if (pos_in + count < pos_in || pos_out + count < pos_out) + return -EINVAL; + + size_in = i_size_read(inode_in); + size_out = i_size_read(inode_out); + + /* Dedupe requires both ranges to be within EOF. */ + if ((remap_flags & REMAP_FILE_DEDUP) && + (pos_in >= size_in || pos_in + count > size_in || + pos_out >= size_out || pos_out + count > size_out)) + return -EINVAL; + + /* Ensure the infile range is within the infile. */ + if (pos_in >= size_in) + return -EINVAL; + count = min(count, size_in - (uint64_t)pos_in); + + ret = _access_check_limits(file_in, pos_in, &count); + if (ret) + return ret; + + ret = _write_check_limits(file_out, pos_out, &count); + if (ret) + return ret; + + /* + * If the user wanted us to link to the infile's EOF, round up to the + * next block boundary for this check. + * + * Otherwise, make sure the count is also block-aligned, having + * already confirmed the starting offsets' block alignment. + */ + if (pos_in + count == size_in) { + bcount = ALIGN(size_in, bs) - pos_in; + } else { + if (!IS_ALIGNED(count, bs)) + count = ALIGN_DOWN(count, bs); + bcount = count; + } + + /* Don't allow overlapped cloning within the same file. */ + if (inode_in == inode_out && + pos_out + bcount > pos_in && + pos_out < pos_in + bcount) + return -EINVAL; + + /* + * We shortened the request but the caller can't deal with that, so + * bounce the request back to userspace. + */ + if (*req_count != count && !(remap_flags & REMAP_FILE_CAN_SHORTEN)) + return -EINVAL; + + *req_count = count; + return 0; +} + +/* + * Copy/paste from generic_remap_file_range_prep(). We cannot call + * generic_remap_file_range_prep because it calls fsync twice and we do not + * want to go to the Server so many times. + * So below is just the checks. 
+ * FIXME: Send a patch upstream to split the generic_remap_file_range_prep + * or receive a flag if to do the syncs + * + * Check that the two inodes are eligible for cloning, the ranges make + * sense. + * + * If there's an error, then the usual negative error code is returned. + * Otherwise returns 0 with *len set to the request length. + */ +static int _remap_file_range_prep(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t *len, unsigned int remap_flags) +{ + struct inode *inode_in = file_inode(file_in); + struct inode *inode_out = file_inode(file_out); + int ret; + + /* Don't touch certain kinds of inodes */ + if (IS_IMMUTABLE(inode_out)) + return -EPERM; + + if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out)) + return -ETXTBSY; + + /* Don't reflink dirs, pipes, sockets... */ + if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) + return -EISDIR; + if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode)) + return -EINVAL; + + /* Zero length dedupe exits immediately; reflink goes to EOF. */ + if (*len == 0) { + loff_t isize = i_size_read(inode_in); + + if ((remap_flags & REMAP_FILE_DEDUP) || pos_in == isize) + return 0; + if (pos_in > isize) + return -EINVAL; + *len = isize - pos_in; + if (*len == 0) + return 0; + } + + /* Check that we don't violate system file offset limits. */ + ret = _remap_checks(file_in, pos_in, file_out, pos_out, len, + remap_flags); + if (ret) + return ret; + + /* + * REMAP_FILE_DEDUP see if extents are the same. 
+ */ + if (remap_flags & REMAP_FILE_DEDUP) + ret = zuf_rw_file_range_compare(inode_in, pos_in, + inode_out, pos_out, *len); + + return ret; +} + +static void _lock_two_ziis(struct zuf_inode_info *zii1, + struct zuf_inode_info *zii2) +{ + if (zii1 > zii2) + swap(zii1, zii2); + + zuf_w_lock(zii1); + if (zii1 != zii2) + zuf_w_lock_nested(zii2); +} + +static void _unlock_two_ziis(struct zuf_inode_info *zii1, + struct zuf_inode_info *zii2) +{ + if (zii1 > zii2) + swap(zii1, zii2); + + if (zii1 != zii2) + zuf_w_unlock(zii2); + zuf_w_unlock(zii1); +} + +static int _clone_file_range(struct inode *src_inode, loff_t pos_in, + struct file *file_out, + struct inode *dst_inode, loff_t pos_out, + u64 len, u64 len_up, int operation) +{ + struct zuf_inode_info *src_zii = ZUII(src_inode); + struct zuf_inode_info *dst_zii = ZUII(dst_inode); + struct zus_inode *dst_zi = dst_zii->zi; + struct super_block *sb = src_inode->i_sb; + struct zufs_ioc_clone ioc_clone = { + .hdr.in_len = sizeof(ioc_clone), + .hdr.out_len = sizeof(ioc_clone), + .hdr.operation = operation, + .src_zus_ii = src_zii->zus_ii, + .dst_zus_ii = dst_zii->zus_ii, + .pos_in = pos_in, + .pos_out = pos_out, + .len = len, + .len_up = len_up, + }; + int err; + + /* NOTE: len==0 means to-end-of-file which is what we want */ + unmap_mapping_range(src_inode->i_mapping, pos_in, len, 0); + unmap_mapping_range(dst_inode->i_mapping, pos_out, len, 0); + + zufc_goose_all_zts(ZUF_ROOT(SBI(dst_inode->i_sb)), dst_inode); + + if ((len_up == 0) && (pos_in || pos_out)) { + zuf_err("Boaz Smoking 0x%llx 0x%llx 0x%llx\n", + pos_in, pos_out, len); + /* Bad caller */ + return -EINVAL; + } + + err = zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_clone.hdr, NULL, 0); + if (unlikely(err && err != -EINTR)) { + zuf_dbg_err("failed to clone %ld -> %ld ; err=%d\n", + src_inode->i_ino, dst_inode->i_ino, err); + return err; + } + + dst_inode->i_blocks = le64_to_cpu(dst_zi->i_blocks); + i_size_write(dst_inode, dst_zi->i_size); + + return err; +} + +/* FIXME: Old 
checks are not needed. Kept for now to verify they
+ * do not trigger. Will remove _zuf_old_checks soon.
+ */
+static int _zuf_old_checks(struct super_block *sb,
+			   struct inode *src_inode, loff_t pos_in,
+			   struct inode *dst_inode, loff_t pos_out, loff_t len)
+{
+	if (src_inode == dst_inode) {
+		if (pos_in == pos_out) {
+			zuf_warn("[%ld] Clone nothing!!\n",
+				 src_inode->i_ino);
+			return 0;
+		}
+		if (pos_in < pos_out) {
+			if (pos_in + len > pos_out) {
+				zuf_warn("[%ld] overlapping pos_in < pos_out?? => EINVAL\n",
+					 src_inode->i_ino);
+				return -EINVAL;
+			}
+		} else {
+			if (pos_out + len > pos_in) {
+				zuf_warn("[%ld] overlapping pos_out < pos_in?? => EINVAL\n",
+					 src_inode->i_ino);
+				return -EINVAL;
+			}
+		}
+	}
+
+	if ((pos_in & (sb->s_blocksize - 1)) ||
+	    (pos_out & (sb->s_blocksize - 1))) {
+		zuf_err("[%ld] Not aligned len=0x%llx pos_in=0x%llx "
+			"pos_out=0x%llx src-size=0x%llx dst-size=0x%llx\n",
+			src_inode->i_ino, len, pos_in, pos_out,
+			i_size_read(src_inode), i_size_read(dst_inode));
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static loff_t zuf_clone_file_range(struct file *file_in, loff_t pos_in,
+				   struct file *file_out, loff_t pos_out,
+				   loff_t len, uint remap_flags)
+{
+	struct inode *src_inode = file_inode(file_in);
+	struct inode *dst_inode = file_inode(file_out);
+	struct zuf_inode_info *src_zii = ZUII(src_inode);
+	struct zuf_inode_info *dst_zii = ZUII(dst_inode);
+	ulong src_size = i_size_read(src_inode);
+	ulong dst_size = i_size_read(dst_inode);
+	struct super_block *sb = src_inode->i_sb;
+	ulong len_up;
+	int err;
+
+	zuf_dbg_vfs("IN: [%ld]{0x%llx} => [%ld]{0x%llx} length=0x%llx flags=0x%x\n",
+		    src_inode->i_ino, pos_in, dst_inode->i_ino, pos_out, len,
+		    remap_flags);
+
+	if (remap_flags & ~(REMAP_FILE_CAN_SHORTEN | REMAP_FILE_DEDUP)) {
+		/* New flags we do not know */
+		zuf_dbg_err("[%ld] Unknown remap_flags(0x%x)\n",
+			    src_inode->i_ino, remap_flags);
+		return -EINVAL;
+	}
+
+	if ((pos_in + len > sb->s_maxbytes) || (pos_out + len >
sb->s_maxbytes)) + return -EINVAL; + + _lock_two_ziis(src_zii, dst_zii); + + err = _remap_file_range_prep(file_in, pos_in, file_out, pos_out, &len, + remap_flags); + if (err < 0 || len == 0) + goto out; + err = _zuf_old_checks(sb, src_inode, pos_in, dst_inode, pos_out, len); + if (unlikely(err)) + goto out; + + err = file_remove_privs(file_out); + if (unlikely(err)) + goto out; + + if (!(remap_flags & REMAP_FILE_DEDUP)) + zus_inode_cmtime_now(dst_inode, dst_zii->zi); + + /* See about all-file-clone optimization */ + len_up = len; + if (!pos_in && !pos_out && (src_size <= pos_in + len) && + (dst_size <= src_size)) { + len_up = 0; + } else if (len & (sb->s_blocksize - 1)) { + /* un-aligned len, see if it is beyond EOF */ + if ((src_size > pos_in + len) || + (dst_size > pos_out + len)) { + zuf_err("[%ld][%ld] Not aligned len=0x%llx pos_in=0x%llx " + "pos_out=0x%llx src-size=0x%lx dst-size=0x%lx\n", + src_inode->i_ino, dst_inode->i_ino, len, + pos_in, pos_out, src_size, dst_size); + err = -EINVAL; + goto out; + } + len_up = md_p2o(md_o2p_up(len)); + } + + err = _clone_file_range(src_inode, pos_in, file_out, dst_inode, pos_out, + len, len_up, ZUFS_OP_CLONE); + if (unlikely(err)) + zuf_dbg_err("_clone_file_range failed => %d\n", err); + +out: + _unlock_two_ziis(src_zii, dst_zii); + return err ? 
err : len; +} + +static ssize_t zuf_copy_file_range(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t len, uint flags) +{ + struct inode *src_inode = file_inode(file_in); + struct inode *dst_inode = file_inode(file_out); + ssize_t ret; + + zuf_dbg_vfs("ino-in=%ld ino-out=%ld pos_in=0x%llx pos_out=0x%llx length=0x%lx\n", + src_inode->i_ino, dst_inode->i_ino, pos_in, pos_out, len); + + ret = zuf_clone_file_range(file_in, pos_in, file_out, pos_out, len, + REMAP_FILE_ADVISORY); + + return ret ?: len; +} + static ssize_t zuf_read_iter(struct kiocb *kiocb, struct iov_iter *ii) { struct inode *inode = file_inode(kiocb->ki_filp); @@ -155,16 +776,43 @@ static ssize_t zuf_write_iter(struct kiocb *kiocb, struct iov_iter *ii) return ret; } +static int zuf_fadvise(struct file *file, loff_t offset, loff_t len, + int advise) +{ + struct inode *inode = file_inode(file); + struct zuf_inode_info *zii = ZUII(inode); + int err; + + if (!S_ISREG(inode->i_mode)) + return -EINVAL; + + zuf_r_lock(zii); + + err = zuf_rw_fadvise(inode->i_sb, file, offset, len, advise, + file->f_mode & FMODE_RANDOM); + + zuf_r_unlock(zii); + + return err; +} + const struct file_operations zuf_file_operations = { .open = generic_file_open, .read_iter = zuf_read_iter, .write_iter = zuf_write_iter, .mmap = zuf_file_mmap, .fsync = zuf_fsync, + .llseek = zuf_llseek, + .flush = zuf_flush, + .fallocate = zuf_fallocate, + .copy_file_range = zuf_copy_file_range, + .remap_file_range = zuf_clone_file_range, + .fadvise = zuf_fadvise, }; const struct inode_operations zuf_file_inode_operations = { .setattr = zuf_setattr, .getattr = zuf_getattr, .update_time = zuf_update_time, + .fiemap = zuf_fiemap, }; diff --git a/fs/zuf/rw.c b/fs/zuf/rw.c index 48f584e71a03..60b7a3e07e17 100644 --- a/fs/zuf/rw.c +++ b/fs/zuf/rw.c @@ -664,6 +664,98 @@ ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode, ii, kiocb, kiocb_ra(kiocb), ZUFS_OP_WRITE, rw); } +static int 
_fadv_willneed(struct super_block *sb, struct inode *inode,
+			  loff_t offset, loff_t len, bool rand)
+{
+	struct zufs_ioc_IO io = {};
+	struct __zufs_ra ra = {
+		.start = md_o2p(offset),
+		.ra_pages = md_o2p_up(len),
+		.prev_pos = offset - 1,
+	};
+	int err;
+
+	io.ra.start = ra.start;
+	io.ra.ra_pages = ra.ra_pages;
+	io.ra.prev_pos = ra.prev_pos;
+	io.rw = rand ? ZUFS_RW_RAND : 0;
+
+	err = _IO_dispatch(SBI(sb), &io, ZUII(inode), ZUFS_OP_PRE_READ, 0,
+			   NULL, 0, offset, 0);
+	return err;
+}
+
+static int _fadv_dontneed(struct super_block *sb, struct inode *inode,
+			  loff_t offset, loff_t len)
+{
+	struct zufs_ioc_sync ioc_range = {
+		.hdr.in_len = sizeof(ioc_range),
+		.hdr.operation = ZUFS_OP_SYNC,
+		.zus_ii = ZUII(inode)->zus_ii,
+		.offset = offset,
+		.length = len,
+		.flags = ZUFS_SF_DONTNEED,
+	};
+
+	return zufc_dispatch(ZUF_ROOT(SBI(sb)), &ioc_range.hdr, NULL, 0);
+}
+
+/* FIXME: There is a pending patch from Jan Kara to export generic_fadvise.
+ * Until then, duplicate here what we need.
+ */
+#include
+
+static int _generic_fadvise(struct file *file, loff_t offset, loff_t len,
+			    int advise)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(file_inode(file));
+
+	switch (advise) {
+	case POSIX_FADV_NORMAL:
+		file->f_ra.ra_pages = bdi->ra_pages;
+		spin_lock(&file->f_lock);
+		file->f_mode &= ~FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
+		break;
+	case POSIX_FADV_RANDOM:
+		spin_lock(&file->f_lock);
+		file->f_mode |= FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
+		break;
+	case POSIX_FADV_SEQUENTIAL:
+		file->f_ra.ra_pages = bdi->ra_pages * 2;
+		spin_lock(&file->f_lock);
+		file->f_mode &= ~FMODE_RANDOM;
+		spin_unlock(&file->f_lock);
+		break;
+	case POSIX_FADV_NOREUSE:
+		break;
+	}
+
+	return 0;
+}
+
+int zuf_rw_fadvise(struct super_block *sb, struct file *file,
+		   loff_t offset, loff_t len, int advise, bool rand)
+{
+	switch (advise) {
+	case POSIX_FADV_WILLNEED:
+		return _fadv_willneed(sb, file_inode(file), offset, len, rand);
+	case POSIX_FADV_DONTNEED:
+		return
_fadv_dontneed(sb, file_inode(file), offset, len); + + case POSIX_FADV_SEQUENTIAL: + case POSIX_FADV_NORMAL: + case POSIX_FADV_RANDOM: + case POSIX_FADV_NOREUSE: + return _generic_fadvise(file, offset, len, advise); + default: + zuf_warn("Unknown advise %d\n", advise); + return -EINVAL; + } + return -EINVAL; +} + /* ~~~~ iom_dec.c ~~~ */ /* for now here (at rw.c) looks logical */ diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index cb4a4def646f..4284d2298906 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -95,6 +95,8 @@ const char *zuf_op_name(enum e_zufs_operation op) CASE_ENUM_NAME(ZUFS_OP_REMOVE_DENTRY); CASE_ENUM_NAME(ZUFS_OP_RENAME); CASE_ENUM_NAME(ZUFS_OP_READDIR); + CASE_ENUM_NAME(ZUFS_OP_CLONE); + CASE_ENUM_NAME(ZUFS_OP_COPY); CASE_ENUM_NAME(ZUFS_OP_READ); CASE_ENUM_NAME(ZUFS_OP_PRE_READ); @@ -102,6 +104,9 @@ const char *zuf_op_name(enum e_zufs_operation op) CASE_ENUM_NAME(ZUFS_OP_MMAP_CLOSE); CASE_ENUM_NAME(ZUFS_OP_SETATTR); CASE_ENUM_NAME(ZUFS_OP_SYNC); + CASE_ENUM_NAME(ZUFS_OP_FALLOCATE); + CASE_ENUM_NAME(ZUFS_OP_LLSEEK); + CASE_ENUM_NAME(ZUFS_OP_FIEMAP); CASE_ENUM_NAME(ZUFS_OP_GET_MULTY); CASE_ENUM_NAME(ZUFS_OP_PUT_MULTY); diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index e70bd8b7ff69..c8bcb6006fab 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -455,6 +455,8 @@ enum e_zufs_operation { ZUFS_OP_REMOVE_DENTRY = 9, ZUFS_OP_RENAME = 10, ZUFS_OP_READDIR = 11, + ZUFS_OP_CLONE = 12, + ZUFS_OP_COPY = 13, ZUFS_OP_READ = 14, ZUFS_OP_PRE_READ = 15, @@ -463,6 +465,8 @@ enum e_zufs_operation { ZUFS_OP_SETATTR = 19, ZUFS_OP_SYNC = 20, ZUFS_OP_FALLOCATE = 21, + ZUFS_OP_LLSEEK = 22, + ZUFS_OP_FIEMAP = 28, ZUFS_OP_GET_MULTY = 29, ZUFS_OP_PUT_MULTY = 30, @@ -680,6 +684,85 @@ struct zufs_ioc_sync { __u64 write_unmapped; }; +/* ZUFS_OP_CLONE */ +struct zufs_ioc_clone { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode_info *src_zus_ii; + struct zus_inode_info *dst_zus_ii; + __u64 pos_in, pos_out; + __u64 len; + __u64 len_up; +}; + +/* 
ZUFS_OP_LLSEEK */ +struct zufs_ioc_seek { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode_info *zus_ii; + __u64 offset_in; + __u32 whence; + __u32 pad; + + /* OUT */ + __u64 offset_out; +}; + +/* ZUFS_OP_FIEMAP */ +struct zufs_ioc_fiemap { + struct zufs_ioc_hdr hdr; + + /* IN */ + struct zus_inode_info *zus_ii; + __u64 start; + __u64 length; + __u32 flags; + __u32 extents_max; + + /* OUT */ + __u32 extents_mapped; + __u32 pad; + +} __packed; + +struct zufs_fiemap_extent_info { + struct fiemap_extent *fi_extents_start; + __u32 fi_flags; + __u32 fi_extents_mapped; + __u32 fi_extents_max; + __u32 __pad; +}; + +static inline +int zufs_fiemap_fill_next_extent(struct zufs_fiemap_extent_info *fieinfo, + __u64 logical, __u64 phys, + __u64 len, __u32 flags) +{ + struct fiemap_extent *dest = fieinfo->fi_extents_start; + + if (fieinfo->fi_extents_max == 0) { + fieinfo->fi_extents_mapped++; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; + } + + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) + return 1; + + dest += fieinfo->fi_extents_mapped; + dest->fe_logical = logical; + dest->fe_physical = phys; + dest->fe_length = len; + dest->fe_flags = flags; + + fieinfo->fi_extents_mapped++; + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) + return 1; + + return (flags & FIEMAP_EXTENT_LAST) ? 
1 : 0;
+}
+
+
+/* ~~~~ io_map structures && IOCTL(s) ~~~~ */
+/*
+ * This set of structures and helpers is used in the return of zufs_ioc_IO and

From patchwork Thu Sep 26 02:07:23 2019
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 11161831
From: Boaz Harrosh
To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin
Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams
Subject: [PATCH 14/16] zuf: ioctl implementation
Date: Thu, 26 Sep 2019 05:07:23 +0300
Message-Id: <20190926020725.19601-15-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>
References: <20190926020725.19601-1-boazh@netapp.com>

* support for some generic IOCTLs: FS_IOC_GETFLAGS, FS_IOC_SETFLAGS,
  FS_IOC_GETVERSION, FS_IOC_SETVERSION
* Simple support for zusFS-defined IOCTLs. We only support flat structures
  (no embedded pointers within the IOCTL structures). We try to deduce
the size of the IOCTL from _IOC_SIZE(cmd). If the zusFS needs a bigger
  copy it will send a retry with the new size, so badly defined IOCTLs
  always take two trips to userland.
* zusFS may also retry if it wants an fs_freeze to implement its IOCTL
  (TODO: keep a map)

[v2] zuf: Reduce stack usage (ioctl)
    As with IO, use big_alloc for buffers too big for the stack.

Signed-off-by: Boaz Harrosh
---
 fs/zuf/Makefile    |   1 +
 fs/zuf/_extern.h   |   6 +
 fs/zuf/directory.c |   4 +
 fs/zuf/file.c      |   4 +
 fs/zuf/ioctl.c     | 309 +++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-core.c  |   1 +
 fs/zuf/zus_api.h   |  37 ++++++
 7 files changed, 362 insertions(+)
 create mode 100644 fs/zuf/ioctl.c

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 02df1374a946..d3257bfc69ba 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,6 +17,7 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
+zuf-y += ioctl.o
 zuf-y += rw.o mmap.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o

diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 2c7456724ef6..04e0515469e7 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -126,6 +126,12 @@ int zuf_file_mmap(struct file *file, struct vm_area_struct *vma);
 /* t1.c */
 int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
 
+/* ioctl.c */
+long zuf_ioctl(struct file *filp, uint cmd, ulong arg);
+#ifdef CONFIG_COMPAT
+long zuf_compat_ioctl(struct file *file, uint cmd, ulong arg);
+#endif
+
 /*
  * Inode and files operations
  */

diff --git a/fs/zuf/directory.c b/fs/zuf/directory.c
index 7417aeb77773..612b6e410615 100644
--- a/fs/zuf/directory.c
+++ b/fs/zuf/directory.c
@@ -164,4 +164,8 @@ const struct file_operations zuf_dir_operations = {
 	.read = generic_read_dir,
 	.iterate_shared = zuf_readdir,
 	.fsync = noop_fsync,
+	.unlocked_ioctl = zuf_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = zuf_compat_ioctl,
+#endif
 };

diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 1c51529694e7..e0bd60e095e7 100644
---
a/fs/zuf/file.c +++ b/fs/zuf/file.c @@ -808,6 +808,10 @@ const struct file_operations zuf_file_operations = { .copy_file_range = zuf_copy_file_range, .remap_file_range = zuf_clone_file_range, .fadvise = zuf_fadvise, + .unlocked_ioctl = zuf_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = zuf_compat_ioctl, +#endif }; const struct inode_operations zuf_file_inode_operations = { diff --git a/fs/zuf/ioctl.c b/fs/zuf/ioctl.c new file mode 100644 index 000000000000..77b8d7627a74 --- /dev/null +++ b/fs/zuf/ioctl.c @@ -0,0 +1,309 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BRIEF DESCRIPTION + * + * Ioctl operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#include +#include + +#include "zuf.h" + +#define ZUFS_SUPPORTED_FS_FLAGS (FS_SYNC_FL | FS_APPEND_FL | FS_IMMUTABLE_FL | \ + FS_NOATIME_FL | FS_DIRTY_FL) + +noinline +static int _ioctl_dispatch(struct inode *inode, uint cmd, ulong arg, + void *on_stack, uint max_stack) +{ + enum big_alloc_type bat; + struct zufs_ioc_ioctl *ioc_ioctl; + size_t ioc_size = _IOC_SIZE(cmd); + void __user *parg = (void __user *)arg; + struct timespec64 time = current_time(inode); + size_t size; + bool retry = false; + int err; + bool freeze = false; + +realloc: + size = sizeof(*ioc_ioctl) + ioc_size; + + zuf_dbg_vfs("[%ld] cmd=0x%x arg=0x%lx size=0x%zx cap_admin=%u IOC(%d, %d, %zd)\n", + inode->i_ino, cmd, arg, size, capable(CAP_SYS_ADMIN), + _IOC_TYPE(cmd), _IOC_NR(cmd), ioc_size); + + ioc_ioctl = big_alloc(size, max_stack, on_stack, GFP_KERNEL, &bat); + if (unlikely(!ioc_ioctl)) + return -ENOMEM; + + memset(ioc_ioctl, 0, sizeof(*ioc_ioctl)); + ioc_ioctl->hdr.in_len = size; + ioc_ioctl->hdr.out_start = offsetof(struct zufs_ioc_ioctl, out_start); + ioc_ioctl->hdr.out_max = size; + ioc_ioctl->hdr.out_len = 0; + ioc_ioctl->hdr.operation = ZUFS_OP_IOCTL; + ioc_ioctl->zus_ii = ZUII(inode)->zus_ii; + 
ioc_ioctl->cmd = cmd;
+	ioc_ioctl->kflags = capable(CAP_SYS_ADMIN) ? ZUFS_IOC_CAP_ADMIN : 0;
+	timespec_to_mt(&ioc_ioctl->time, &time);
+
+dispatch:
+	if (arg && ioc_size) {
+		if (copy_from_user(ioc_ioctl->arg, parg, ioc_size)) {
+			err = -EFAULT;
+			goto out;
+		}
+	}
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_ioctl->hdr,
+			    NULL, 0);
+
+	if (unlikely(err == -EZUFS_RETRY)) {
+		if (unlikely(retry)) {
+			zuf_err("Server => EZUFS_RETRY again uflags=%d\n",
+				ioc_ioctl->uflags);
+			err = -EBUSY;
+			goto out;
+		}
+		retry = true;
+		switch (ioc_ioctl->uflags) {
+		case ZUFS_IOC_REALLOC:
+			ioc_size = ioc_ioctl->new_size - sizeof(*ioc_ioctl);
+			big_free(ioc_ioctl, bat);
+			goto realloc;
+		case ZUFS_IOC_FREEZE_REQ:
+			err = freeze_super(inode->i_sb);
+			if (unlikely(err)) {
+				zuf_warn("unable to freeze fs err=%d\n", err);
+				goto out;
+			}
+			freeze = true;
+			ioc_ioctl->kflags |= ZUFS_IOC_FSFROZEN;
+			goto dispatch;
+		default:
+			zuf_err("unknown ZUFS retry type uflags=%d\n",
+				ioc_ioctl->uflags);
+			err = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (unlikely(err)) {
+		zuf_dbg_err("zufc_dispatch failed => %d IOC(%d, %d, %zd)\n",
+			    err, _IOC_TYPE(cmd), _IOC_NR(cmd), ioc_size);
+		goto out;
+	}
+
+	if (ioc_ioctl->hdr.out_len) {
+		if (copy_to_user(parg, ioc_ioctl->arg,
+				 ioc_ioctl->hdr.out_len)) {
+			err = -EFAULT;
+			goto out;
+		}
+	}
+
+out:
+	if (freeze) {
+		int thaw_err = thaw_super(inode->i_sb);
+
+		if (unlikely(thaw_err))
+			zuf_err("post-ioctl thaw of file system failed err=%d\n",
+				thaw_err);
+	}
+
+	big_free(ioc_ioctl, bat);
+
+	return err;
+}
+
+static uint _translate_to_ioc_flags(struct zus_inode *zi)
+{
+	uint zi_flags = le16_to_cpu(zi->i_flags);
+	uint ioc_flags = 0;
+
+	if (zi_flags & S_SYNC)
+		ioc_flags |= FS_SYNC_FL;
+	if (zi_flags & S_APPEND)
+		ioc_flags |= FS_APPEND_FL;
+	if (zi_flags & S_IMMUTABLE)
+		ioc_flags |= FS_IMMUTABLE_FL;
+	if (zi_flags & S_NOATIME)
+		ioc_flags |= FS_NOATIME_FL;
+	if (zi_flags & S_DIRSYNC)
+		ioc_flags |= FS_DIRSYNC_FL;
+
+	return ioc_flags;
+}
+
+static int _ioc_getflags(struct inode *inode, uint __user *parg)
+{
+	struct zus_inode *zi = zus_zi(inode);
+	uint flags = _translate_to_ioc_flags(zi);
+
+	return put_user(flags, parg);
+}
+
+static void _translate_to_zi_flags(struct zus_inode *zi, unsigned int flags)
+{
+	uint zi_flags = le16_to_cpu(zi->i_flags);
+
+	zi_flags &=
+		~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC);
+
+	if (flags & FS_SYNC_FL)
+		zi_flags |= S_SYNC;
+	if (flags & FS_APPEND_FL)
+		zi_flags |= S_APPEND;
+	if (flags & FS_IMMUTABLE_FL)
+		zi_flags |= S_IMMUTABLE;
+	if (flags & FS_NOATIME_FL)
+		zi_flags |= S_NOATIME;
+	if (flags & FS_DIRSYNC_FL)
+		zi_flags |= S_DIRSYNC;
+
+	zi->i_flags = cpu_to_le16(zi_flags);
+}
+
+/* use statx ioc to flush zi changes to fs */
+static int __ioc_dispatch_zi_update(struct inode *inode, uint flags)
+{
+	struct zufs_ioc_attr ioc_attr = {
+		.hdr.in_len = sizeof(ioc_attr),
+		.hdr.out_len = sizeof(ioc_attr),
+		.hdr.operation = ZUFS_OP_SETATTR,
+		.zus_ii = ZUII(inode)->zus_ii,
+		.zuf_attr = flags,
+	};
+	int err;
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_attr.hdr, NULL, 0);
+	if (unlikely(err && err != -EINTR))
+		zuf_err("zufc_dispatch failed => %d\n", err);
+
+	return err;
+}
+
+static int _ioc_setflags(struct inode *inode, uint __user *parg)
+{
+	struct zus_inode *zi = zus_zi(inode);
+	uint flags, oldflags;
+	int err;
+
+	if (!inode_owner_or_capable(inode))
+		return -EPERM;
+
+	if (get_user(flags, parg))
+		return -EFAULT;
+
+	if (flags & ~ZUFS_SUPPORTED_FS_FLAGS)
+		return -EOPNOTSUPP;
+
+	if (zi->i_flags & ZUFS_S_IMMUTABLE)
+		return -EPERM;
+
+	inode_lock(inode);
+
+	/* zi->i_flags is __le16 and holds S_* bits; compare in FS_*_FL space */
+	oldflags = _translate_to_ioc_flags(zi);
+
+	if ((flags ^ oldflags) &
+	    (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
+		if (!capable(CAP_LINUX_IMMUTABLE)) {
+			inode_unlock(inode);
+			return -EPERM;
+		}
+	}
+
+	if (!S_ISDIR(inode->i_mode))
+		flags &= ~FS_DIRSYNC_FL;
+
+	flags = flags & FS_FL_USER_MODIFIABLE;
+	flags |= oldflags & ~FS_FL_USER_MODIFIABLE;
+	inode->i_ctime =
current_time(inode); + timespec_to_mt(&zi->i_ctime, &inode->i_ctime); + _translate_to_zi_flags(zi, flags); + zuf_set_inode_flags(inode, zi); + + err = __ioc_dispatch_zi_update(inode, ZUFS_STATX_FLAGS | STATX_CTIME); + + inode_unlock(inode); + return err; +} + +static int _ioc_setversion(struct inode *inode, uint __user *parg) +{ + struct zus_inode *zi = zus_zi(inode); + __u32 generation; + int err; + + if (!inode_owner_or_capable(inode)) + return -EPERM; + + if (get_user(generation, parg)) + return -EFAULT; + + inode_lock(inode); + + inode->i_ctime = current_time(inode); + inode->i_generation = generation; + timespec_to_mt(&zi->i_ctime, &inode->i_ctime); + zi->i_generation = cpu_to_le32(inode->i_generation); + + err = __ioc_dispatch_zi_update(inode, ZUFS_STATX_VERSION | STATX_CTIME); + + inode_unlock(inode); + return err; +} + +long zuf_ioctl(struct file *filp, unsigned int cmd, ulong arg) +{ + void __user *parg = (void __user *)arg; + char on_stack[ZUF_MAX_STACK(8)]; + + switch (cmd) { + case FS_IOC_GETFLAGS: + return _ioc_getflags(filp->f_inode, parg); + case FS_IOC_SETFLAGS: + return _ioc_setflags(filp->f_inode, parg); + case FS_IOC_GETVERSION: + return put_user(filp->f_inode->i_generation, (int __user *)arg); + case FS_IOC_SETVERSION: + return _ioc_setversion(filp->f_inode, parg); + default: + return _ioctl_dispatch(filp->f_inode, cmd, arg, on_stack, + sizeof(on_stack)); + } +} + +#ifdef CONFIG_COMPAT +long zuf_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case FS_IOC32_GETFLAGS: + cmd = FS_IOC_GETFLAGS; + break; + case FS_IOC32_SETFLAGS: + cmd = FS_IOC_SETFLAGS; + break; + case FS_IOC32_GETVERSION: + cmd = FS_IOC_GETVERSION; + break; + case FS_IOC32_SETVERSION: + cmd = FS_IOC_SETVERSION; + break; + default: + return -ENOIOCTLCMD; + } + return zuf_ioctl(file, cmd, (unsigned long)compat_ptr(arg)); +} +#endif + diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index 4284d2298906..9b8fe3bff0cd 100644 --- 
a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -106,6 +106,7 @@ const char *zuf_op_name(enum e_zufs_operation op)
 	CASE_ENUM_NAME(ZUFS_OP_SYNC);
 	CASE_ENUM_NAME(ZUFS_OP_FALLOCATE);
 	CASE_ENUM_NAME(ZUFS_OP_LLSEEK);
+	CASE_ENUM_NAME(ZUFS_OP_IOCTL);
 	CASE_ENUM_NAME(ZUFS_OP_FIEMAP);
 
 	CASE_ENUM_NAME(ZUFS_OP_GET_MULTY);

diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index c8bcb6006fab..4ebb067c0719 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -466,6 +466,7 @@ enum e_zufs_operation {
 	ZUFS_OP_SYNC = 20,
 	ZUFS_OP_FALLOCATE = 21,
 	ZUFS_OP_LLSEEK = 22,
+	ZUFS_OP_IOCTL = 23,
 
 	ZUFS_OP_FIEMAP = 28,
 	ZUFS_OP_GET_MULTY = 29,
@@ -708,6 +709,42 @@ struct zufs_ioc_seek {
 	__u64 offset_out;
 };
 
+/* ZUFS_OP_IOCTL */
+/* Flags for zufs_ioc_ioctl->kflags */
+enum e_ZUFS_IOCTL_KFLAGS {
+	ZUFS_IOC_FSFROZEN = 0x1,  /* Tell Server we froze the FS */
+	ZUFS_IOC_CAP_ADMIN = 0x2, /* The ioctl caller had CAP_ADMIN */
+};
+
+/* received by zus on zufs_ioc_ioctl->uflags */
+enum e_ZUFS_IOCTL_UFLAGS {
+	ZUFS_IOC_REALLOC = 0x1,	/* _IOC_SIZE(cmd) was not enough and Server
+				 * needs a deeper copy
+				 */
+	ZUFS_IOC_FREEZE_REQ = 0x2, /* Server needs a freeze and a recall */
+};
+
+struct zufs_ioc_ioctl {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 time;
+	__u32 cmd;
+	__u32 kflags; /* zuf/kernel state and flags */
+
+	/* OUT */
+	/* This is just a zero-size marker for the start of output */
+	char out_start[0];
+	union {
+		struct { /* If return was -EZUFS_RETRY */
+			__u32 uflags; /* flags returned from zus */
+			__u32 new_size;
+		};
+
+		char arg[0];
+	};
+};
+
 /* ZUFS_OP_FIEMAP */
 struct zufs_ioc_fiemap {
 	struct zufs_ioc_hdr hdr;

From patchwork Thu Sep 26 02:07:24 2019
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 11161833
From: Boaz Harrosh
To: linux-fsdevel , Anna Schumaker , Al Viro , Matt Benjamin
Cc: Miklos Szeredi , Amir Goldstein , Sagi Manole , Matthew Wilcox , Dan Williams
Subject: [PATCH 15/16] zuf: xattr && acl implementation
Date: Thu, 26 Sep 2019 05:07:24 +0300
Message-Id: <20190926020725.19601-16-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>
References: <20190926020725.19601-1-boazh@netapp.com>

We establish the usual dispatch API to user-mode for get/set/list_xattr.
Since the buffers are variable-length, we utilize the zdo->overflow_handler
for the extra copy from the Server. (See also zuf-core.c.)

The ACL support is all in Kernel. There is no new API with the zusFS. We
define the internal structure of the ACL inside an opaque xattr and store
it via the xattr zus_api.

TODO: Future FSs that have their own ACL on-disk format, and/or Network
zusFSs that have their own verifiers for the ACL, will need to establish an
alternative API for the ACL.
Signed-off-by: Boaz Harrosh --- fs/zuf/Makefile | 2 +- fs/zuf/_extern.h | 20 +++ fs/zuf/acl.c | 270 +++++++++++++++++++++++++++++++++++++++ fs/zuf/file.c | 3 + fs/zuf/inode.c | 18 +++ fs/zuf/namei.c | 6 + fs/zuf/super.c | 2 + fs/zuf/symlink.c | 1 + fs/zuf/xattr.c | 314 ++++++++++++++++++++++++++++++++++++++++++++++ fs/zuf/zuf-core.c | 3 + fs/zuf/zuf.h | 34 +++++ fs/zuf/zus_api.h | 25 +++- 12 files changed, 696 insertions(+), 2 deletions(-) create mode 100644 fs/zuf/acl.c create mode 100644 fs/zuf/xattr.c diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile index d3257bfc69ba..abc7dcda0029 100644 --- a/fs/zuf/Makefile +++ b/fs/zuf/Makefile @@ -17,7 +17,7 @@ zuf-y += md.o t1.o t2.o zuf-y += zuf-core.o zuf-root.o # Main FS -zuf-y += ioctl.o +zuf-y += ioctl.o acl.o xattr.o zuf-y += rw.o mmap.o zuf-y += super.o inode.o directory.o namei.o file.o symlink.o zuf-y += module.o diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index 04e0515469e7..d0d83eae75c1 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -132,6 +132,26 @@ long zuf_ioctl(struct file *filp, uint cmd, ulong arg); long zuf_compat_ioctl(struct file *file, uint cmd, ulong arg); #endif +/* xattr.c */ +int zuf_initxattrs(struct inode *inode, const struct xattr *xattr_array, + void *fs_info); +ssize_t __zuf_getxattr(struct inode *inode, int type, const char *name, + void *buffer, size_t size); +int __zuf_setxattr(struct inode *inode, int type, const char *name, + const void *value, size_t size, int flags); +ssize_t zuf_listxattr(struct dentry *dentry, char *buffer, size_t size); +extern const struct xattr_handler *zuf_xattr_handlers[]; + +/* acl.c */ +int zuf_set_acl(struct inode *inode, struct posix_acl *acl, int type); +struct posix_acl *zuf_get_acl(struct inode *inode, int type); +int zuf_acls_create_pre(struct inode *dir, umode_t *mode, + struct posix_acl **def_acl, struct posix_acl **acl); +int zuf_acls_create_post(struct inode *dir, struct inode *inode, + struct posix_acl *def_acl, struct posix_acl *acl); 
+extern const struct xattr_handler zuf_acl_access_xattr_handler; +extern const struct xattr_handler zuf_acl_default_xattr_handler; + /* * Inode and files operations */ diff --git a/fs/zuf/acl.c b/fs/zuf/acl.c new file mode 100644 index 000000000000..fe2bcd2096bf --- /dev/null +++ b/fs/zuf/acl.c @@ -0,0 +1,270 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Access Control List + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + */ + +#include +#include +#include +#include "zuf.h" + +static void _acl_to_value(const struct posix_acl *acl, void *value) +{ + int n; + struct zuf_acl *macl = value; + + zuf_dbg_acl("acl->count=%d\n", acl->a_count); + + for (n = 0; n < acl->a_count; n++) { + const struct posix_acl_entry *entry = &acl->a_entries[n]; + + zuf_dbg_acl("aclno=%d tag=0x%x perm=0x%x\n", + n, entry->e_tag, entry->e_perm); + + macl->tag = cpu_to_le16(entry->e_tag); + macl->perm = cpu_to_le16(entry->e_perm); + + switch (entry->e_tag) { + case ACL_USER: + macl->id = cpu_to_le32( + from_kuid(&init_user_ns, entry->e_uid)); + break; + case ACL_GROUP: + macl->id = cpu_to_le32( + from_kgid(&init_user_ns, entry->e_gid)); + break; + case ACL_USER_OBJ: + case ACL_GROUP_OBJ: + case ACL_MASK: + case ACL_OTHER: + break; + default: + zuf_dbg_err("e_tag=0x%x\n", entry->e_tag); + return; + } + macl++; + } +} + +static int __set_acl(struct inode *inode, struct posix_acl *acl, int type, + bool set_mode) +{ + char *name = NULL; + void *buf; + int err; + size_t size; + umode_t old_mode = inode->i_mode; + + zuf_dbg_acl("[%ld] acl=%p type=0x%x\n", inode->i_ino, acl, type); + + switch (type) { + case ACL_TYPE_ACCESS: { + struct zus_inode *zi = ZUII(inode)->zi; + + name = XATTR_POSIX_ACL_ACCESS; + if (acl && set_mode) { + err = posix_acl_update_mode(inode, &inode->i_mode, + &acl); + if (err) + return err; + + zuf_dbg_acl("old=0x%x new=0x%x acl_count=%d\n", + old_mode, inode->i_mode, 
+ acl ? acl->a_count : -1); + inode->i_ctime = current_time(inode); + timespec_to_mt(&zi->i_ctime, &inode->i_ctime); + zi->i_mode = cpu_to_le16(inode->i_mode); + } + break; + } + case ACL_TYPE_DEFAULT: + name = XATTR_POSIX_ACL_DEFAULT; + if (!S_ISDIR(inode->i_mode)) + return acl ? -EACCES : 0; + break; + default: + return -EINVAL; + } + + size = acl ? acl->a_count * sizeof(struct zuf_acl) : 0; + buf = kmalloc(size, GFP_KERNEL); + if (unlikely(!buf)) + return -ENOMEM; + + if (acl) + _acl_to_value(acl, buf); + + /* NOTE: Server's zus_setxattr implementers should cl_flush the zi. + * In the case it returned an error it should not cl_flush. + * We will restore to old i_mode. + */ + err = __zuf_setxattr(inode, ZUF_XF_SYSTEM, name, buf, size, 0); + if (likely(!err)) { + set_cached_acl(inode, type, acl); + } else { + /* On error we need to restore the changes (xfstest/generic/449) */ + struct zus_inode *zi = ZUII(inode)->zi; + + inode->i_mode = old_mode; + zi->i_mode = cpu_to_le16(inode->i_mode); + } + + kfree(buf); + return err; +} + +int zuf_set_acl(struct inode *inode, struct posix_acl *acl, int type) +{ + return __set_acl(inode, acl, type, true); +} + +static struct posix_acl *_value_to_acl(void *value, size_t size) +{ + int n, count; + struct posix_acl *acl; + struct zuf_acl *macl = value; + void *end = value + size; + + if (!value) + return NULL; + + count = size / sizeof(struct zuf_acl); + if (count < 0) + return ERR_PTR(-EINVAL); + if (count == 0) + return NULL; + + acl = posix_acl_alloc(count, GFP_NOFS); + if (unlikely(!acl)) + return ERR_PTR(-ENOMEM); + + for (n = 0; n < count; n++) { + if (end < (void *)macl + sizeof(struct zuf_acl)) + goto fail; + + zuf_dbg_acl("aclno=%d tag=0x%x perm=0x%x id=0x%x\n", + n, le16_to_cpu(macl->tag), le16_to_cpu(macl->perm), + le32_to_cpu(macl->id)); + + acl->a_entries[n].e_tag = le16_to_cpu(macl->tag); + acl->a_entries[n].e_perm = le16_to_cpu(macl->perm); + + switch (acl->a_entries[n].e_tag) { + case ACL_USER_OBJ: + case ACL_GROUP_OBJ: +
case ACL_MASK: + case ACL_OTHER: + macl++; + break; + case ACL_USER: + acl->a_entries[n].e_uid = make_kuid(&init_user_ns, + le32_to_cpu(macl->id)); + macl++; + if (end < (void *)macl) + goto fail; + break; + case ACL_GROUP: + acl->a_entries[n].e_gid = make_kgid(&init_user_ns, + le32_to_cpu(macl->id)); + macl++; + if (end < (void *)macl) + goto fail; + break; + + default: + goto fail; + } + } + if (macl != end) + goto fail; + return acl; + +fail: + posix_acl_release(acl); + return ERR_PTR(-EINVAL); +} + +struct posix_acl *zuf_get_acl(struct inode *inode, int type) +{ + struct zuf_inode_info *zii = ZUII(inode); + char *name = NULL; + void *buf; + struct posix_acl *acl = NULL; + int ret; + + zuf_dbg_acl("[%ld] type=0x%x\n", inode->i_ino, type); + + buf = (void *)__get_free_page(GFP_KERNEL); + if (unlikely(!buf)) + return ERR_PTR(-ENOMEM); + + switch (type) { + case ACL_TYPE_ACCESS: + name = XATTR_POSIX_ACL_ACCESS; + break; + case ACL_TYPE_DEFAULT: + name = XATTR_POSIX_ACL_DEFAULT; + break; + default: + WARN_ON(1); + free_page((ulong)buf); + return ERR_PTR(-EINVAL); + } + + zuf_smr_lock(zii); + + ret = __zuf_getxattr(inode, ZUF_XF_SYSTEM, name, buf, PAGE_SIZE); + if (likely(ret > 0)) { + acl = _value_to_acl(buf, ret); + } else if (ret != -ENODATA) { + if (ret != 0) + zuf_dbg_err("failed to getxattr ret=%d\n", ret); + acl = ERR_PTR(ret); + } + + if (!IS_ERR(acl)) + set_cached_acl(inode, type, acl); + + zuf_smr_unlock(zii); + + free_page((ulong)buf); + + return acl; +} + +/* Used by creation of new inodes */ +int zuf_acls_create_pre(struct inode *dir, umode_t *mode, + struct posix_acl **def_acl, struct posix_acl **acl) +{ + int err = posix_acl_create(dir, mode, def_acl, acl); + + return err; +} + +int zuf_acls_create_post(struct inode *dir, struct inode *inode, + struct posix_acl *def_acl, struct posix_acl *acl) +{ + int err = 0, err2 = 0; + + zuf_dbg_acl("def_acl_count=%d acl_count=%d\n", + def_acl ? def_acl->a_count : -1, + acl ?
acl->a_count : -1); + + if (def_acl) + err = __set_acl(inode, def_acl, ACL_TYPE_DEFAULT, false); + else + inode->i_default_acl = NULL; + + if (acl) + err2 = __set_acl(inode, acl, ACL_TYPE_ACCESS, false); + else + inode->i_acl = NULL; + + return err ?: err2; +} diff --git a/fs/zuf/file.c b/fs/zuf/file.c index e0bd60e095e7..a4a788dcdc87 100644 --- a/fs/zuf/file.c +++ b/fs/zuf/file.c @@ -819,4 +819,7 @@ const struct inode_operations zuf_file_inode_operations = { .getattr = zuf_getattr, .update_time = zuf_update_time, .fiemap = zuf_fiemap, + .get_acl = zuf_get_acl, + .set_acl = zuf_set_acl, + .listxattr = zuf_listxattr, }; diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c index 1e3dba654f34..ed324701a20b 100644 --- a/fs/zuf/inode.c +++ b/fs/zuf/inode.c @@ -287,6 +287,7 @@ void zuf_evict_inode(struct inode *inode) _warn_inode_dirty(inode, zii->zi); zuf_w_lock(zii); + zuf_xaw_lock(zii); /* Needed? probably not but playing it safe */ zufc_goose_all_zts(ZUF_ROOT(SBI(sb)), inode); @@ -295,6 +296,7 @@ void zuf_evict_inode(struct inode *inode) inode->i_mtime = inode->i_ctime = current_time(inode); inode->i_size = 0; + zuf_xaw_unlock(zii); zuf_w_unlock(zii); } else { zuf_dbg_vfs("[%ld] inode is going down?\n", inode->i_ino); @@ -341,6 +343,7 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode, .flags = tmpfile ?
ZI_TMPFILE : 0, .str.len = qstr->len, }; + struct posix_acl *acl = NULL, *def_acl = NULL; struct inode *inode; struct zus_inode *zi = NULL; struct page *pages[2]; @@ -360,6 +363,15 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode, zuf_dbg_verbose("inode=%p name=%s\n", inode, qstr->name); + err = security_inode_init_security(inode, dir, qstr, zuf_initxattrs, + NULL); + if (err && err != -EOPNOTSUPP) + goto fail; + + err = zuf_acls_create_pre(dir, &inode->i_mode, &def_acl, &acl); + if (unlikely(err)) + goto fail; + zuf_set_inode_flags(inode, &ioc_new_inode.zi); if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) || @@ -400,6 +412,12 @@ struct inode *zuf_new_inode(struct inode *dir, umode_t mode, zuf_dbg_verbose("allocating inode %ld (zi=%p)\n", _zi_ino(zi), zi); + if ((def_acl || acl) && !symname) { + err = zuf_acls_create_post(dir, inode, def_acl, acl); + if (unlikely(err)) + goto fail; + } + err = insert_inode_locked(inode); if (unlikely(err)) { zuf_dbg_err("[%ld:%s] generation=%lld insert_inode_locked => %d\n", diff --git a/fs/zuf/namei.c b/fs/zuf/namei.c index e78aa04f10d5..a33745c328b9 100644 --- a/fs/zuf/namei.c +++ b/fs/zuf/namei.c @@ -420,10 +420,16 @@ const struct inode_operations zuf_dir_inode_operations = { .setattr = zuf_setattr, .getattr = zuf_getattr, .update_time = zuf_update_time, + .get_acl = zuf_get_acl, + .set_acl = zuf_set_acl, + .listxattr = zuf_listxattr, }; const struct inode_operations zuf_special_inode_operations = { .setattr = zuf_setattr, .getattr = zuf_getattr, .update_time = zuf_update_time, + .get_acl = zuf_get_acl, + .set_acl = zuf_set_acl, + .listxattr = zuf_listxattr, }; diff --git a/fs/zuf/super.c b/fs/zuf/super.c index 2a0db11b51d6..8f760e8b3fdc 100644 --- a/fs/zuf/super.c +++ b/fs/zuf/super.c @@ -553,6 +553,7 @@ static int zuf_fill_super(struct super_block *sb, void *data, int silent) sb->s_flags |= SB_POSIXACL; sb->s_op = &zuf_sops; + sb->s_xattr = zuf_xattr_handlers; root_i = zuf_iget(sb, ioc_mount->zmi.zus_ii, 
ioc_mount->zmi._zi, &exist); @@ -845,6 +846,7 @@ static void _init_once(void *foo) inode_init_once(&zii->vfs_inode); INIT_LIST_HEAD(&zii->i_mmap_dirty); zii->zi = NULL; + init_rwsem(&zii->xa_rwsem); init_rwsem(&zii->in_sync); atomic_set(&zii->vma_count, 0); atomic_set(&zii->write_mapped, 0); diff --git a/fs/zuf/symlink.c b/fs/zuf/symlink.c index 1446bdf60cb9..5e9115ba4cbd 100644 --- a/fs/zuf/symlink.c +++ b/fs/zuf/symlink.c @@ -70,4 +70,5 @@ const struct inode_operations zuf_symlink_inode_operations = { .update_time = zuf_update_time, .setattr = zuf_setattr, .getattr = zuf_getattr, + .listxattr = zuf_listxattr, }; diff --git a/fs/zuf/xattr.c b/fs/zuf/xattr.c new file mode 100644 index 000000000000..3c239bb7ec7e --- /dev/null +++ b/fs/zuf/xattr.c @@ -0,0 +1,314 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Extended Attributes + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + */ + +#include +#include +#include + +#include "zuf.h" + +/* ~~~~~~~~~~~~~~~ xattr get ~~~~~~~~~~~~~~~ */ + +struct _xxxattr { + void *user_buffer; + union { + struct zufs_ioc_xattr ioc_xattr; + char buf[512]; + } d; +}; + +static inline uint _XXXATTR_SIZE(uint ioc_size) +{ + struct _xxxattr *_xxxattr; + + return ioc_size + (sizeof(*_xxxattr) - sizeof(_xxxattr->d)); +} + +static int _xattr_oh(struct zuf_dispatch_op *zdo, void *parg, ulong max_bytes) +{ + struct zufs_ioc_hdr *hdr = zdo->hdr; + struct zufs_ioc_xattr *ioc_xattr = + container_of(hdr, typeof(*ioc_xattr), hdr); + struct _xxxattr *_xxattr = + container_of(ioc_xattr, typeof(*_xxattr), d.ioc_xattr); + struct zufs_ioc_xattr *user_ioc_xattr = parg; + + if (hdr->err) + return 0; + + ioc_xattr->user_buf_size = user_ioc_xattr->user_buf_size; + + hdr->out_len -= sizeof(ioc_xattr->user_buf_size); + memcpy(_xxattr->user_buffer, user_ioc_xattr->buf, hdr->out_len); + return 0; +} + +ssize_t __zuf_getxattr(struct inode *inode, int type, 
const char *name, + void *buffer, size_t size) +{ + size_t name_len = strlen(name) + 1; /* plus \NUL */ + struct _xxxattr *p_xattr; + struct _xxxattr s_xattr; + enum big_alloc_type bat; + struct zufs_ioc_xattr *ioc_xattr; + size_t ioc_size = sizeof(*ioc_xattr) + name_len; + struct zuf_dispatch_op zdo; + int err; + ssize_t ret; + + zuf_dbg_vfs("[%ld] type=%d name=%s size=%lu ioc_size=%lu\n", + inode->i_ino, type, name, size, ioc_size); + + p_xattr = big_alloc(_XXXATTR_SIZE(ioc_size), sizeof(s_xattr), &s_xattr, + GFP_KERNEL, &bat); + if (unlikely(!p_xattr)) + return -ENOMEM; + + ioc_xattr = &p_xattr->d.ioc_xattr; + memset(ioc_xattr, 0, sizeof(*ioc_xattr)); + p_xattr->user_buffer = buffer; + + ioc_xattr->hdr.in_len = ioc_size; + ioc_xattr->hdr.out_start = + offsetof(struct zufs_ioc_xattr, user_buf_size); + /* out_len updated by zus */ + ioc_xattr->hdr.out_len = sizeof(ioc_xattr->user_buf_size); + ioc_xattr->hdr.out_max = 0; + ioc_xattr->hdr.operation = ZUFS_OP_XATTR_GET; + ioc_xattr->zus_ii = ZUII(inode)->zus_ii; + ioc_xattr->type = type; + ioc_xattr->name_len = name_len; + ioc_xattr->user_buf_size = size; + + strcpy(ioc_xattr->buf, name); + + zuf_dispatch_init(&zdo, &ioc_xattr->hdr, NULL, 0); + zdo.oh = _xattr_oh; + err = __zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &zdo); + ret = ioc_xattr->user_buf_size; + + big_free(p_xattr, bat); + + if (unlikely(err)) + return err; + + return ret; +} + +/* ~~~~~~~~~~~~~~~ xattr set ~~~~~~~~~~~~~~~ */ + +int __zuf_setxattr(struct inode *inode, int type, const char *name, + const void *value, size_t size, int flags) +{ + size_t name_len = strlen(name) + 1; + struct _xxxattr *p_xattr; + struct _xxxattr s_xattr; + enum big_alloc_type bat; + struct zufs_ioc_xattr *ioc_xattr; + size_t ioc_size = sizeof(*ioc_xattr) + name_len + size; + int err; + + zuf_dbg_vfs("[%ld] type=%d name=%s size=%lu ioc_size=%lu\n", + inode->i_ino, type, name, size, ioc_size); + + p_xattr = big_alloc(_XXXATTR_SIZE(ioc_size), sizeof(s_xattr), &s_xattr, + 
GFP_KERNEL, &bat); + if (unlikely(!p_xattr)) + return -ENOMEM; + + ioc_xattr = &p_xattr->d.ioc_xattr; + memset(ioc_xattr, 0, sizeof(*ioc_xattr)); + + ioc_xattr->hdr.in_len = ioc_size; + ioc_xattr->hdr.out_len = 0; + ioc_xattr->hdr.operation = ZUFS_OP_XATTR_SET; + ioc_xattr->zus_ii = ZUII(inode)->zus_ii; + ioc_xattr->type = type; + ioc_xattr->name_len = name_len; + ioc_xattr->user_buf_size = size; + ioc_xattr->flags = flags; + + if (value && !size) + ioc_xattr->ioc_flags = ZUFS_XATTR_SET_EMPTY; + + strcpy(ioc_xattr->buf, name); + if (value) + memcpy(ioc_xattr->buf + name_len, value, size); + + err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_xattr->hdr, + NULL, 0); + + big_free(p_xattr, bat); + + return err; +} + +/* ~~~~~~~~~~~~~~~ xattr list ~~~~~~~~~~~~~~~ */ + +static ssize_t __zuf_listxattr(struct inode *inode, char *buffer, size_t size) +{ + struct zuf_inode_info *zii = ZUII(inode); + struct _xxxattr s_xattr; + struct zufs_ioc_xattr *ioc_xattr; + struct zuf_dispatch_op zdo; + + int err; + + zuf_dbg_vfs("[%ld] size=%lu\n", inode->i_ino, size); + + ioc_xattr = &s_xattr.d.ioc_xattr; + memset(ioc_xattr, 0, sizeof(*ioc_xattr)); + s_xattr.user_buffer = buffer; + + ioc_xattr->hdr.in_len = sizeof(*ioc_xattr); + ioc_xattr->hdr.out_start = + offsetof(struct zufs_ioc_xattr, user_buf_size); + /* out_len updated by zus */ + ioc_xattr->hdr.out_len = sizeof(ioc_xattr->user_buf_size); + ioc_xattr->hdr.out_max = 0; + ioc_xattr->hdr.operation = ZUFS_OP_XATTR_LIST; + ioc_xattr->zus_ii = zii->zus_ii; + ioc_xattr->name_len = 0; + ioc_xattr->user_buf_size = size; + ioc_xattr->ioc_flags = capable(CAP_SYS_ADMIN) ? 
ZUFS_XATTR_TRUSTED : 0; + + zuf_dispatch_init(&zdo, &ioc_xattr->hdr, NULL, 0); + zdo.oh = _xattr_oh; + err = __zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &zdo); + if (unlikely(err)) + return err; + + return ioc_xattr->user_buf_size; +} + +ssize_t zuf_listxattr(struct dentry *dentry, char *buffer, size_t size) +{ + struct inode *inode = dentry->d_inode; + struct zuf_inode_info *zii = ZUII(inode); + ssize_t ret; + + zuf_xar_lock(zii); + + ret = __zuf_listxattr(inode, buffer, size); + + zuf_xar_unlock(zii); + + return ret; +} + +/* ~~~~~~~~~~~~~~~ xattr sb handlers ~~~~~~~~~~~~~~~ */ +static bool zuf_xattr_handler_list(struct dentry *dentry) +{ + return true; +} + +static +int zuf_xattr_handler_get(const struct xattr_handler *handler, + struct dentry *dentry, struct inode *inode, + const char *name, void *value, size_t size) +{ + struct zuf_inode_info *zii = ZUII(inode); + int ret; + + zuf_dbg_xattr("[%ld] name=%s\n", inode->i_ino, name); + + zuf_xar_lock(zii); + + ret = __zuf_getxattr(inode, handler->flags, name, value, size); + + zuf_xar_unlock(zii); + + return ret; +} + +static +int zuf_xattr_handler_set(const struct xattr_handler *handler, + struct dentry *d_notused, struct inode *inode, + const char *name, const void *value, size_t size, + int flags) +{ + struct zuf_inode_info *zii = ZUII(inode); + int err; + + zuf_dbg_xattr("[%ld] name=%s size=0x%lx flags=0x%x\n", + inode->i_ino, name, size, flags); + + zuf_xaw_lock(zii); + + err = __zuf_setxattr(inode, handler->flags, name, value, size, flags); + + zuf_xaw_unlock(zii); + + return err; +} + +const struct xattr_handler zuf_xattr_security_handler = { + .prefix = XATTR_SECURITY_PREFIX, + .flags = ZUF_XF_SECURITY, + .list = zuf_xattr_handler_list, + .get = zuf_xattr_handler_get, + .set = zuf_xattr_handler_set, +}; + +const struct xattr_handler zuf_xattr_trusted_handler = { + .prefix = XATTR_TRUSTED_PREFIX, + .flags = ZUF_XF_TRUSTED, + .list = zuf_xattr_handler_list, + .get = zuf_xattr_handler_get, + .set = 
zuf_xattr_handler_set, +}; + +const struct xattr_handler zuf_xattr_user_handler = { + .prefix = XATTR_USER_PREFIX, + .flags = ZUF_XF_USER, + .list = zuf_xattr_handler_list, + .get = zuf_xattr_handler_get, + .set = zuf_xattr_handler_set, +}; + +const struct xattr_handler *zuf_xattr_handlers[] = { + &zuf_xattr_user_handler, + &zuf_xattr_trusted_handler, + &zuf_xattr_security_handler, + &posix_acl_access_xattr_handler, + &posix_acl_default_xattr_handler, + NULL +}; + +/* + * Callback for security_inode_init_security() for acquiring xattrs. + */ +int zuf_initxattrs(struct inode *inode, const struct xattr *xattr_array, + void *fs_info) +{ + const struct xattr *xattr; + + for (xattr = xattr_array; xattr->name != NULL; xattr++) { + int err; + + /* REMOVEME: We had a BUG here for a long time that never + * crashed, I want to see this is called, please. + */ + zuf_warn("Yes it is name=%s value-size=%zd\n", + xattr->name, xattr->value_len); + + err = zuf_xattr_handler_set(&zuf_xattr_security_handler, NULL, + inode, xattr->name, xattr->value, + xattr->value_len, 0); + if (unlikely(err)) { + zuf_err("[%ld] failed to init xattrs err=%d\n", + inode->i_ino, err); + return err; + } + } + return 0; +} diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index 9b8fe3bff0cd..d3252ca7d2d1 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -107,6 +107,9 @@ const char *zuf_op_name(enum e_zufs_operation op) CASE_ENUM_NAME(ZUFS_OP_FALLOCATE); CASE_ENUM_NAME(ZUFS_OP_LLSEEK); CASE_ENUM_NAME(ZUFS_OP_IOCTL); + CASE_ENUM_NAME(ZUFS_OP_XATTR_GET); + CASE_ENUM_NAME(ZUFS_OP_XATTR_SET); + CASE_ENUM_NAME(ZUFS_OP_XATTR_LIST); CASE_ENUM_NAME(ZUFS_OP_FIEMAP); CASE_ENUM_NAME(ZUFS_OP_GET_MULTY); diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h index fe479cb70f97..4a1d474eb80b 100644 --- a/fs/zuf/zuf.h +++ b/fs/zuf/zuf.h @@ -130,6 +130,8 @@ enum { struct zuf_inode_info { struct inode vfs_inode; + /* Lock for xattr operations */ + struct rw_semaphore xa_rwsem; /* Stuff for mmap write */ struct rw_semaphore 
in_sync; struct list_head i_mmap_dirty; @@ -313,6 +315,38 @@ static inline void ZUF_CHECK_I_W_LOCK(struct inode *inode) up_write(&inode->i_rwsem); #endif } +static inline void zuf_xar_lock(struct zuf_inode_info *zii) +{ + down_read(&zii->xa_rwsem); +} + +static inline void zuf_xar_unlock(struct zuf_inode_info *zii) +{ + up_read(&zii->xa_rwsem); +} + +static inline void zuf_xaw_lock(struct zuf_inode_info *zii) +{ + down_write(&zii->xa_rwsem); +} + +static inline void zuf_xaw_unlock(struct zuf_inode_info *zii) +{ + up_write(&zii->xa_rwsem); +} + +/* xattr types */ +enum { ZUF_XF_SECURITY = 1, + ZUF_XF_SYSTEM = 2, + ZUF_XF_TRUSTED = 3, + ZUF_XF_USER = 4, +}; + +struct zuf_acl { + __le16 tag; + __le16 perm; + __le32 id; +}; enum big_alloc_type { ba_stack, ba_8k, ba_vmalloc }; #define S_8K (1024UL * 8) diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h index 4ebb067c0719..1359f0384f82 100644 --- a/fs/zuf/zus_api.h +++ b/fs/zuf/zus_api.h @@ -467,6 +467,9 @@ enum e_zufs_operation { ZUFS_OP_FALLOCATE = 21, ZUFS_OP_LLSEEK = 22, ZUFS_OP_IOCTL = 23, + ZUFS_OP_XATTR_GET = 24, + ZUFS_OP_XATTR_SET = 25, + ZUFS_OP_XATTR_LIST = 27, ZUFS_OP_FIEMAP = 28, ZUFS_OP_GET_MULTY = 29, @@ -745,6 +748,26 @@ struct zufs_ioc_ioctl { }; }; +/* ZUFS_OP_XATTR */ +/* xattr ioc_flags */ +#define ZUFS_XATTR_SET_EMPTY (1 << 0) +#define ZUFS_XATTR_TRUSTED (1 << 1) + +struct zufs_ioc_xattr { + struct zufs_ioc_hdr hdr; + /* IN */ + struct zus_inode_info *zus_ii; + __u32 flags; + __u32 type; + __u16 name_len; + __u16 ioc_flags; + + /* OUT */ + __u32 user_buf_size; + char buf[0]; +}; + + /* ZUFS_OP_FIEMAP */ struct zufs_ioc_fiemap { struct zufs_ioc_hdr hdr; @@ -760,7 +783,7 @@ struct zufs_ioc_fiemap { __u32 extents_mapped; __u32 pad; -} __packed; +}; struct zufs_fiemap_extent_info { struct fiemap_extent *fi_extents_start; From patchwork Thu Sep 26 02:07:25 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boaz Harrosh X-Patchwork-Id: 
11161835
From: Boaz Harrosh
To: linux-fsdevel, Anna Schumaker, Al Viro, Matt Benjamin
Cc: Miklos Szeredi, Amir Goldstein, Sagi Manole, Matthew Wilcox, Dan Williams
Subject: [PATCH 16/16] zuf: Support for dynamic-debug of zusFSs
Date: Thu, 26 Sep 2019 05:07:25 +0300
Message-Id: <20190926020725.19601-17-boazh@netapp.com>
In-Reply-To: <20190926020725.19601-1-boazh@netapp.com>
References: <20190926020725.19601-1-boazh@netapp.com>

[THIS PATCH will be changed or dropped before final submission]

In zus we support dynamic-debug prints, i.e. the user can turn the prints on and off at run time by writing to special files. The API is exactly the same as the Kernel's dynamic-debug; the only difference is that the special file we read/write is /sys/fs/zuf/ddbg.

The Kernel code is a thin wrapper that dispatches the reads/writes of the /sys/fs/zuf/ddbg file to/from the zus server.
The heavy lifting is done by the zus project's build system and core code; see the zus project for how this is done.

This facility is dispatched on the mount-thread and not on the regular ZTs, because it is available globally, before any mounts.

Signed-off-by: Boaz Harrosh --- fs/zuf/_extern.h | 3 ++ fs/zuf/zuf-root.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 79 insertions(+) diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index d0d83eae75c1..40cc228e4c99 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -29,6 +29,9 @@ int zufc_release(struct inode *inode, struct file *file); int zufc_mmap(struct file *file, struct vm_area_struct *vma); const char *zuf_op_name(enum e_zufs_operation op); +int __zufc_dispatch_mount(struct zuf_root_info *zri, + enum e_mount_operation op, + struct zufs_ioc_mount *zim); int zufc_dispatch_mount(struct zuf_root_info *zri, struct zus_fs_info *zus_zfi, enum e_mount_operation operation, struct zufs_ioc_mount *zim); diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c index ecf240bd3e3f..3c3126d676a6 100644 --- a/fs/zuf/zuf-root.c +++ b/fs/zuf/zuf-root.c @@ -70,6 +70,81 @@ static void _fs_type_free(struct zuf_fs_type *zft) } #endif /*CONFIG_LOCKDEP*/ +#define DDBG_MAX_BUF_SIZE (8 * PAGE_SIZE) +/* We use ppos as a cookie for the dynamic debug ID we want to read from */ +static ssize_t _zus_ddbg_read(struct file *file, char __user *buf, size_t len, + loff_t *ppos) +{ + struct zufs_ioc_mount *zim; + size_t buf_size = (DDBG_MAX_BUF_SIZE <= len) ?
DDBG_MAX_BUF_SIZE : len; + size_t zim_size = sizeof(zim->hdr) + sizeof(zim->zdi); + ssize_t err; + + zim = vzalloc(zim_size + buf_size); + if (unlikely(!zim)) + return -ENOMEM; + + /* null terminate the 1st character in the buffer, hence the '+ 1' */ + zim->hdr.in_len = zim_size + 1; + zim->hdr.out_len = zim_size + buf_size; + zim->zdi.len = buf_size; + zim->zdi.id = *ppos; + *ppos = 0; + + err = __zufc_dispatch_mount(ZRI(file->f_inode->i_sb), ZUFS_M_DDBG_RD, + zim); + if (unlikely(err)) { + zuf_err("error dispatching control message => %ld\n", err); + goto out; + } + + err = simple_read_from_buffer(buf, zim->zdi.len, ppos, zim->zdi.msg, + buf_size); + if (unlikely(err <= 0)) + goto out; + + *ppos = zim->zdi.id; +out: + vfree(zim); + return err; +} + +static ssize_t _zus_ddbg_write(struct file *file, const char __user *buf, + size_t len, loff_t *ofst) +{ + struct _ddbg_info { + struct zufs_ioc_mount zim; + char buf[512]; + } ddi = {}; + ssize_t err; + + if (unlikely(512 < len)) { + zuf_err("ddbg control message too long\n"); + return -EINVAL; + } + + memset(&ddi, 0, sizeof(ddi)); + if (copy_from_user(ddi.zim.zdi.msg, buf, len)) + return -EFAULT; + + ddi.zim.hdr.in_len = sizeof(ddi); + ddi.zim.hdr.out_len = sizeof(ddi.zim); + err = __zufc_dispatch_mount(ZRI(file->f_inode->i_sb), ZUFS_M_DDBG_WR, + &ddi.zim); + if (unlikely(err)) { + zuf_err("error dispatching control message => %ld\n", err); + return err; + } + + return len; +} + +static const struct file_operations _zus_ddbg_ops = { + .open = nonseekable_open, + .read = _zus_ddbg_read, + .write = _zus_ddbg_write, + .llseek = no_llseek, +}; static ssize_t _state_read(struct file *file, char __user *buf, size_t len, loff_t *ppos) @@ -338,6 +413,7 @@ static int zufr_fill_super(struct super_block *sb, void *data, int silent) static struct tree_descr zufr_files[] = { [2] = {"state", &_state_ops, S_IFREG | 0400}, [3] = {"registered_fs", &_registered_fs_ops, S_IFREG | 0400}, + [4] = {"ddbg", &_zus_ddbg_ops, S_IFREG | 0600},
{""}, }; struct zuf_root_info *zri;