From patchwork Tue Feb 19 11:51:31 2019
X-Patchwork-Submitter: Boaz Harrosh
X-Patchwork-Id: 10819745
From: Boaz Harrosh
To: linux-fsdevel, Anna Schumaker, Al Viro
Cc: Ric Wheeler, Miklos Szeredi, Steven Whitehouse, Jeff Moyer,
    Amir Goldstein, Amit Golander, Sagi Manole
Subject: [RFC PATCH 12/17] zuf: mmap & sync
Date: Tue, 19 Feb 2019 13:51:31 +0200
Message-Id: <20190219115136.29952-13-boaz@plexistor.com>
In-Reply-To: <20190219115136.29952-1-boaz@plexistor.com>
References: <20190219115136.29952-1-boaz@plexistor.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org
From: Boaz Harrosh

On page-fault, call out to the zusFS for the page information. We always
mmap pmem pages directly (no page cache).

With write-mmap on pmem we need to keep track of dirty inodes and call
the zusFS when one of the sync variants is called, because the Server
will need to do a cl_flush on all dirty pages. If write-mmap was never
called on the inode, we do nothing on sync.

Signed-off-by: Boaz Harrosh
---
 fs/zuf/Makefile   |   2 +-
 fs/zuf/_extern.h  |  10 ++
 fs/zuf/_pr.h      |   1 +
 fs/zuf/file.c     |  65 +++++++++
 fs/zuf/inode.c    |  10 ++
 fs/zuf/mmap.c     | 339 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/super.c    |  91 +++++++++++++
 fs/zuf/zuf-core.c |   2 +
 fs/zuf/zuf.h      |   3 +
 fs/zuf/zus_api.h  |   9 ++
 10 files changed, 531 insertions(+), 1 deletion(-)
 create mode 100644 fs/zuf/mmap.c

diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile
index 9b7123f2af3e..970062d6b13f 100644
--- a/fs/zuf/Makefile
+++ b/fs/zuf/Makefile
@@ -17,6 +17,6 @@ zuf-y += md.o t1.o t2.o
 zuf-y += zuf-core.o zuf-root.o
 
 # Main FS
-zuf-y += rw.o
+zuf-y += rw.o mmap.o
 zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
 zuf-y += module.o
diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h
index 2905fe20cec7..5029f865655a 100644
--- a/fs/zuf/_extern.h
+++ b/fs/zuf/_extern.h
@@ -46,6 +46,10 @@ bool zuf_dir_emit(struct super_block *sb, struct dir_context *ctx,
 uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
			 const char *symname, ulong len,
			 struct page *pages[2]);
+
+/* mmap.c */
+int zuf_file_mmap(struct file *file, struct vm_area_struct *vma);
+
 /* rw.c */
 int _zuf_get_put_block(struct zuf_sb_info *sbi, struct zuf_inode_info *zii,
		       enum e_zufs_operation op, int rw, ulong index,
@@ -61,11 +65,17 @@ int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
			 __u64 *iom_e, uint iom_n);
 int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
			  __u64 *iom_e_user, uint iom_n);
+/* file.c */
+int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync);
+
 /* super.c */
 int zuf_init_inodecache(void);
 void zuf_destroy_inodecache(void);
 
+void zuf_sync_inc(struct inode *inode);
+void zuf_sync_dec(struct inode *inode, ulong write_unmapped);
+
 struct dentry *zuf_mount(struct file_system_type *fs_type, int flags,
			 const char *dev_name, void *data);
diff --git a/fs/zuf/_pr.h b/fs/zuf/_pr.h
index 151e127f513b..a1ceab2abce2 100644
--- a/fs/zuf/_pr.h
+++ b/fs/zuf/_pr.h
@@ -45,6 +45,7 @@
 #define zuf_dbg_t2(s, args ...)		zuf_chan_debug("t2dbg", s, ##args)
 #define zuf_dbg_t2_rw(s, args ...)	zuf_chan_debug("t2grw", s, ##args)
 #define zuf_dbg_core(s, args ...)	zuf_chan_debug("core ", s, ##args)
+#define zuf_dbg_mmap(s, args ...)	zuf_chan_debug("mmap ", s, ##args)
 #define zuf_dbg_zus(s, args ...)	zuf_chan_debug("zusdg", s, ##args)
 #define zuf_dbg_verbose(s, args ...)	zuf_chan_debug("d-oto", s, ##args)
diff --git a/fs/zuf/file.c b/fs/zuf/file.c
index 0e62145e923a..392b1a0d5881 100644
--- a/fs/zuf/file.c
+++ b/fs/zuf/file.c
@@ -174,6 +174,69 @@ static loff_t zuf_llseek(struct file *file, loff_t offset, int whence)
 	return err ?: ioc_seek.offset_out;
 }
 
+/* This function is called by both msync() and fsync(). */
+int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zufs_ioc_range ioc_range = {
+		.hdr.in_len = sizeof(ioc_range),
+		.hdr.operation = ZUFS_OP_SYNC,
+		.zus_ii = zii->zus_ii,
+		.offset = start,
+		.opflags = datasync,
+	};
+	loff_t isize;
+	ulong uend = end + 1;
+	int err = 0;
+
+	zuf_dbg_vfs(
+		"[%ld] start=0x%llx end=0x%llx datasync=%d write_mapped=%d\n",
+		inode->i_ino, start, end, datasync,
+		atomic_read(&zii->write_mapped));
+
+	/* We want to serialize the syncs so they don't fight with each other,
+	 * which is also more efficient; but we do not want to lock out
+	 * read/writes and page-faults, so we have a special sync semaphore.
+	 */
+	zuf_smw_lock(zii);
+
+	isize = i_size_read(inode);
+	if (!isize) {
+		zuf_dbg_mmap("[%ld] file is empty\n", inode->i_ino);
+		goto out;
+	}
+	if (isize < uend)
+		uend = isize;
+	if (uend < start) {
+		zuf_dbg_mmap("[%ld] isize=0x%llx start=0x%llx end=0x%lx\n",
+			     inode->i_ino, isize, start, uend);
+		err = -ENODATA;
+		goto out;
+	}
+
+	if (!atomic_read(&zii->write_mapped))
+		goto out; /* Nothing to do on this inode */
+
+	ioc_range.length = uend - start;
+	unmap_mapping_range(inode->i_mapping, start, ioc_range.length, 0);
+
+	err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &ioc_range.hdr,
+			    NULL, 0);
+	if (unlikely(err))
+		zuf_dbg_err("zufc_dispatch failed => %d\n", err);
+
+	zuf_sync_dec(inode, ioc_range.write_unmapped);
+
+out:
+	zuf_smw_unlock(zii);
+	return err;
+}
+
+static int zuf_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+{
+	return zuf_isync(file_inode(file), start, end, datasync);
+}
+
 /* This callback is called when a file is closed */
 static int zuf_flush(struct file *file, fl_owner_t id)
 {
@@ -439,7 +502,9 @@ const struct file_operations zuf_file_operations = {
 	.llseek		= zuf_llseek,
 	.read_iter	= zuf_read_iter,
 	.write_iter	= zuf_write_iter,
+	.mmap		= zuf_file_mmap,
 	.open		= generic_file_open,
+	.fsync		= zuf_fsync,
 	.flush		= zuf_flush,
 	.release	= zuf_file_release,
 	.fallocate	= zuf_fallocate,
diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c
index 2b49a0c31a02..8f9b4f28c556 100644
--- a/fs/zuf/inode.c
+++ b/fs/zuf/inode.c
@@ -270,6 +270,7 @@ void zuf_evict_inode(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	struct zuf_inode_info *zii = ZUII(inode);
+	int write_mapped;
 
 	if (!inode->i_nlink) {
 		if (unlikely(!zii->zi)) {
@@ -312,6 +313,15 @@ void zuf_evict_inode(struct inode *inode)
 		zii->zero_page = NULL;
 	}
 
+	/* ZUS on evict has synced all mmap dirty pages, YES? */
+	write_mapped = atomic_read(&zii->write_mapped);
+	if (unlikely(write_mapped || !list_empty(&zii->i_mmap_dirty))) {
+		zuf_dbg_mmap("[%ld] !!!! write_mapped=%d list_empty=%d\n",
+			     inode->i_ino, write_mapped,
+			     list_empty(&zii->i_mmap_dirty));
+		zuf_sync_dec(inode, write_mapped);
+	}
+
 	clear_inode(inode);
 }
diff --git a/fs/zuf/mmap.c b/fs/zuf/mmap.c
new file mode 100644
index 000000000000..4a4eb117a6b0
--- /dev/null
+++ b/fs/zuf/mmap.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Read/Write operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ *	Boaz Harrosh
+ */
+
+#include
+#include "zuf.h"
+
+/* ~~~ Functions for mmap and page faults ~~~ */
+
+/* MAP_PRIVATE, copy data to user private page (cow_page) */
+static int _cow_private_page(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	int err;
+
+	/* Basically a READ into vmf->cow_page */
+	err = zuf_rw_read_page(sbi, inode, vmf->cow_page,
+			       md_p2o(vmf->pgoff));
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err("[%ld] read_page failed bn=0x%lx address=0x%lx => %d\n",
+			inode->i_ino, vmf->pgoff, vmf->address, err);
+		/* FIXME: Probably return VM_FAULT_SIGBUS */
+	}
+
+	/* HACK: This is a hack, since Kernel v4.7 a VM_FAULT_LOCKED with
+	 * vmf->page==NULL is no longer supported. Looks like for now this way
+	 * works well. We let mm mess around with unlocking and putting its own
+	 * cow_page.
+	 */
+	vmf->page = vmf->cow_page;
+	get_page(vmf->page);
+	lock_page(vmf->page);
+
+	return VM_FAULT_LOCKED;
+}
+
+int _rw_init_zero_page(struct zuf_inode_info *zii)
+{
+	if (zii->zero_page)
+		return 0;
+
+	zii->zero_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (unlikely(!zii->zero_page))
+		return -ENOMEM;
+	zii->zero_page->mapping = zii->vfs_inode.i_mapping;
+	return 0;
+}
+
+static int zuf_write_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			   bool pfn_mkwrite)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	struct zufs_ioc_IO get_block = {};
+	int fault = VM_FAULT_SIGBUS;
+	ulong addr = vmf->address;
+	pgoff_t size;
+	pfn_t pfnt;
+	ulong pfn;
+	int err;
+
+	zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		     "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n",
+		     _zi_ino(zi), vma->vm_start, vma->vm_end, addr, vmf->pgoff,
+		     vmf->flags, vmf->cow_page, vmf->page);
+
+	if (unlikely(vmf->page && vmf->page != zii->zero_page)) {
+		zuf_err("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+			"pgoff=0x%lx vmf_flags=0x%x page=%p cow_page=%p\n",
+			_zi_ino(zi), vma->vm_start, vma->vm_end, addr,
+			vmf->pgoff, vmf->flags, vmf->page, vmf->cow_page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	sb_start_pagefault(inode->i_sb);
+	zuf_smr_lock_pagefault(zii);
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start);
+
+		zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			_zi_ino(zi), vmf->pgoff, pgoff, size);
+
+		fault = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	if (vmf->cow_page) {
+		fault = _cow_private_page(vma, vmf);
+		goto out;
+	}
+
+	zus_inode_cmtime_now(inode, zi);
+	/* NOTE: zus needs to flush the zi */
+
+	err = _zuf_get_put_block(sbi, zii, ZUFS_OP_GET_BLOCK, WRITE, vmf->pgoff,
+				 &get_block);
+	if (unlikely(err)) {
+		zuf_dbg_err("_get_put_block failed => %d\n", err);
+		goto out;
+	}
+
+	if ((get_block.gp_block.ret_flags & ZUFS_GBF_NEW) || !pfn_mkwrite) {
+		inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+		/* newly created block */
+		unmap_mapping_range(inode->i_mapping, vmf->pgoff << PAGE_SHIFT,
+				    PAGE_SIZE, 0);
+	} else if (pfn_mkwrite) {
+		/* If the block did not change just tell mm to flip
+		 * the write bit
+		 */
+		fault = VM_FAULT_WRITE;
+		goto skip_insert;
+	}
+
+	if (unlikely(get_block.gp_block.pmem_bn == 0)) {
+		zuf_err("[%ld] pmem_bn=0 rw=0x%x ret_flags=0x%x priv=0x%lx but no error?\n",
+			_zi_ino(zi), get_block.gp_block.rw,
+			get_block.gp_block.ret_flags,
+			(ulong)get_block.gp_block.priv);
+		fault = VM_FAULT_SIGBUS;
+		goto out;
+	}
+
+	pfn = md_pfn(sbi->md, get_block.gp_block.pmem_bn);
+	pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+	fault = vmf_insert_mixed_mkwrite(vma, addr, pfnt);
+	err = zuf_flt_to_err(fault);
+	if (unlikely(err)) {
+		zuf_err("vm_insert_mixed_mkwrite failed => %d\n", err);
+		goto put;
+	}
+
+	zuf_dbg_mmap("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n",
+		     _zi_ino(zi), pfn, vma->vm_page_prot.pgprot, err);
+
+skip_insert:
+	zuf_sync_inc(inode);
+put:
+	_zuf_get_put_block(sbi, zii, ZUFS_OP_PUT_BLOCK, WRITE, vmf->pgoff,
+			   &get_block);
+out:
+	zuf_smr_unlock(zii);
+	sb_end_pagefault(inode->i_sb);
+	return fault;
+}
+
+static int zuf_pfn_mkwrite(struct vm_fault *vmf)
+{
+	return zuf_write_fault(vmf->vma, vmf, true);
+}
+
+static int zuf_read_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+	struct zus_inode *zi = zii->zi;
+	struct zufs_ioc_IO get_block = {};
+	int fault = VM_FAULT_SIGBUS;
+	ulong addr = vmf->address;
+	pgoff_t size;
+	pfn_t pfnt;
+	ulong pfn;
+	int err;
+
+	zuf_dbg_mmap("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx "
+		     "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n",
+		     _zi_ino(zi), vma->vm_start, vma->vm_end, addr, vmf->pgoff,
+		     vmf->flags, vmf->cow_page, vmf->page);
+
+	zuf_smr_lock_pagefault(zii);
+
+	size = md_o2p_up(i_size_read(inode));
+	if (unlikely(vmf->pgoff >= size)) {
+		ulong pgoff = vma->vm_pgoff + md_o2p(addr - vma->vm_start);
+
+		zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n",
+			_zi_ino(zi), vmf->pgoff, pgoff, size);
+		goto out;
+	}
+
+	if (vmf->cow_page) {
+		zuf_warn("cow is read\n");
+		fault = _cow_private_page(vma, vmf);
+		goto out;
+	}
+
+	file_accessed(vma->vm_file);
+	/* NOTE: zus needs to flush the zi */
+
+	err = _zuf_get_put_block(sbi, zii, ZUFS_OP_GET_BLOCK, READ, vmf->pgoff,
+				 &get_block);
+	if (unlikely(err && err != -EINTR)) {
+		zuf_err("_get_put_block failed => %d\n", err);
+		goto out;
+	}
+
+	if (get_block.gp_block.pmem_bn == 0) {
+		/* Hole in file */
+		err = _rw_init_zero_page(zii);
+		if (unlikely(err))
+			goto out;
+
+		err = vm_insert_page(vma, addr, zii->zero_page);
+		zuf_dbg_mmap("[%ld] inserted zero\n", _zi_ino(zi));
+
+		/* NOTE: we are fooling mm, we do not need this page
+		 * to be locked and get(ed)
+		 */
+		fault = VM_FAULT_NOPAGE;
+		goto out;
+	}
+
+	/* We have a real page */
+	pfn = md_pfn(sbi->md, get_block.gp_block.pmem_bn);
+	pfnt = phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV);
+	fault = vmf_insert_mixed(vma, addr, pfnt);
+	err = zuf_flt_to_err(fault);
+	if (unlikely(err)) {
+		zuf_err("[%ld] vm_insert_page/mixed => %d\n", _zi_ino(zi), err);
+		goto put;
+	}
+
+	zuf_dbg_mmap("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n",
+		     _zi_ino(zi), pfn, vma->vm_page_prot.pgprot, err);
+
+put:
+	_zuf_get_put_block(sbi, zii, ZUFS_OP_PUT_BLOCK, READ, vmf->pgoff,
+			   &get_block);
+out:
+	zuf_smr_unlock(zii);
+	return fault;
+}
+
+static int zuf_fault(struct vm_fault *vmf)
+{
+	bool write_fault = (0 != (vmf->flags & FAULT_FLAG_WRITE));
+
+	if (write_fault)
+		return zuf_write_fault(vmf->vma, vmf, false);
+	else
+		return zuf_read_fault(vmf->vma, vmf);
+}
+
+static int zuf_page_mkwrite(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct inode *inode = vma->vm_file->f_mapping->host;
+	ulong addr = vmf->address;
+
+	/* our zero page doesn't really hold the correct offset to the file in
+	 * page->index, so vmf->pgoff is incorrect; let's fix that
+	 */
+	vmf->pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
+
+	zuf_dbg_mmap("[%ld] pgoff=0x%lx\n", inode->i_ino, vmf->pgoff);
+
+	/* call fault handler to get a real page for writing */
+	return zuf_write_fault(vma, vmf, false);
+}
+
+static void zuf_mmap_open(struct vm_area_struct *vma)
+{
+	struct zuf_inode_info *zii = ZUII(file_inode(vma->vm_file));
+
+	atomic_inc(&zii->vma_count);
+}
+
+static void zuf_mmap_close(struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	int vma_count = atomic_dec_return(&ZUII(inode)->vma_count);
+
+	if (unlikely(vma_count < 0))
+		zuf_err("[%ld] WHAT??? vma_count=%d\n",
+			inode->i_ino, vma_count);
+	else if (unlikely(vma_count == 0)) {
+		struct zuf_inode_info *zii = ZUII(inode);
+		struct zufs_ioc_mmap_close mmap_close = {};
+		int err;
+
+		mmap_close.hdr.operation = ZUFS_OP_MMAP_CLOSE;
+		mmap_close.hdr.in_len = sizeof(mmap_close);
+
+		mmap_close.zus_ii = zii->zus_ii;
+		mmap_close.rw = 0; /* TODO: Do we need this */
+
+		zuf_smr_lock(zii);
+
+		err = zufc_dispatch(ZUF_ROOT(SBI(inode->i_sb)), &mmap_close.hdr,
+				    NULL, 0);
+		if (unlikely(err))
+			zuf_dbg_err("[%ld] err=%d\n", inode->i_ino, err);
+
+		zuf_smr_unlock(zii);
+	}
+}
+
+static const struct vm_operations_struct zuf_vm_ops = {
+	.fault		= zuf_fault,
+	.page_mkwrite	= zuf_page_mkwrite,
+	.pfn_mkwrite	= zuf_pfn_mkwrite,
+	.open		= zuf_mmap_open,
+	.close		= zuf_mmap_close,
+};
+
+int zuf_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	file_accessed(file);
+
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &zuf_vm_ops;
+
+	atomic_inc(&zii->vma_count);
+
+	zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n",
+		    file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end,
+		    vma->vm_flags, pgprot_val(vma->vm_page_prot));
+
+	return 0;
+}
diff --git a/fs/zuf/super.c b/fs/zuf/super.c
index 2afa7b405945..2f1dd44290a2 100644
--- a/fs/zuf/super.c
+++ b/fs/zuf/super.c
@@ -570,6 +570,90 @@ static int zuf_update_s_wtime(struct super_block *sb)
 	return 0;
 }
 
+static void _sync_add_inode(struct inode *inode)
+{
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_mmap("[%ld] write_mapped=%d\n",
+		     inode->i_ino, atomic_read(&zii->write_mapped));
+
+	spin_lock(&sbi->s_mmap_dirty_lock);
+
+	/* Because we are lazy about removing the inodes (only on an fsync
+	 * or an evict_inode), it is fine if we are called multiple times.
+	 */
+	if (list_empty(&zii->i_mmap_dirty))
+		list_add(&zii->i_mmap_dirty, &sbi->s_mmap_dirty);
+
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+}
+
+static void _sync_remove_inode(struct inode *inode)
+{
+	struct zuf_sb_info *sbi = SBI(inode->i_sb);
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	zuf_dbg_mmap("[%ld] write_mapped=%d\n",
+		     inode->i_ino, atomic_read(&zii->write_mapped));
+
+	spin_lock(&sbi->s_mmap_dirty_lock);
+	list_del_init(&zii->i_mmap_dirty);
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+}
+
+void zuf_sync_inc(struct inode *inode)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (1 == atomic_inc_return(&zii->write_mapped))
+		_sync_add_inode(inode);
+}
+
+/* zuf_sync_dec will be called with the unmapped counts, in batches */
+void zuf_sync_dec(struct inode *inode, ulong write_unmapped)
+{
+	struct zuf_inode_info *zii = ZUII(inode);
+
+	if (0 == atomic_sub_return(write_unmapped, &zii->write_mapped))
+		_sync_remove_inode(inode);
+}
+
+/*
+ * We must fsync any mmap-active inodes
+ */
+static int zuf_sync_fs(struct super_block *sb, int wait)
+{
+	struct zuf_sb_info *sbi = SBI(sb);
+	struct zuf_inode_info *zii, *t;
+	enum {to_clean_size = 120};
+	struct zuf_inode_info *zii_to_clean[to_clean_size];
+	uint i, to_clean;
+
+	zuf_dbg_vfs("Syncing wait=%d\n", wait);
+more_inodes:
+	spin_lock(&sbi->s_mmap_dirty_lock);
+	to_clean = 0;
+	list_for_each_entry_safe(zii, t, &sbi->s_mmap_dirty, i_mmap_dirty) {
+		list_del_init(&zii->i_mmap_dirty);
+		zii_to_clean[to_clean++] = zii;
+		if (to_clean >= to_clean_size)
+			break;
+	}
+	spin_unlock(&sbi->s_mmap_dirty_lock);
+
+	if (!to_clean)
+		return 0;
+
+	for (i = 0; i < to_clean; ++i)
+		zuf_isync(&zii_to_clean[i]->vfs_inode, 0, ~0 - 1, 1);
+
+	if (to_clean == to_clean_size)
+		goto more_inodes;
+
+	return 0;
+}
+
 static struct inode *zuf_alloc_inode(struct super_block *sb)
 {
 	struct zuf_inode_info *zii;
@@ -592,6 +676,12 @@ static void _init_once(void *foo)
 	struct zuf_inode_info *zii = foo;
 
 	inode_init_once(&zii->vfs_inode);
+	INIT_LIST_HEAD(&zii->i_mmap_dirty);
+	zii->zi = NULL;
+	zii->zero_page = NULL;
+	init_rwsem(&zii->in_sync);
+	atomic_set(&zii->vma_count, 0);
+	atomic_set(&zii->write_mapped, 0);
 }
 
 int __init zuf_init_inodecache(void)
@@ -621,6 +711,7 @@ static struct super_operations zuf_sops = {
 	.put_super	= zuf_put_super,
 	.freeze_fs	= zuf_update_s_wtime,
 	.unfreeze_fs	= zuf_update_s_wtime,
+	.sync_fs	= zuf_sync_fs,
 	.statfs		= zuf_statfs,
 	.remount_fs	= zuf_remount,
 	.show_options	= zuf_show_options,
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 371c2e93dd81..86f624031d8d 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -781,8 +781,10 @@ const char *zuf_op_name(enum e_zufs_operation op)
 		CASE_ENUM_NAME(ZUFS_OP_WRITE		);
 		CASE_ENUM_NAME(ZUFS_OP_GET_BLOCK	);
 		CASE_ENUM_NAME(ZUFS_OP_PUT_BLOCK	);
+		CASE_ENUM_NAME(ZUFS_OP_MMAP_CLOSE	);
 		CASE_ENUM_NAME(ZUFS_OP_GET_SYMLINK	);
 		CASE_ENUM_NAME(ZUFS_OP_SETATTR		);
+		CASE_ENUM_NAME(ZUFS_OP_SYNC		);
 		CASE_ENUM_NAME(ZUFS_OP_FALLOCATE	);
 		CASE_ENUM_NAME(ZUFS_OP_LLSEEK		);
 		CASE_ENUM_NAME(ZUFS_OP_BREAK		);
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 7d79189bfe60..98f4ea088671 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -158,6 +158,9 @@ struct zuf_inode_info {
 
 	/* Stuff for mmap write */
 	struct rw_semaphore	in_sync;
+	struct list_head	i_mmap_dirty;
+	atomic_t		write_mapped;
+	atomic_t		vma_count;
 	struct page		*zero_page; /* TODO: Remove */
 
 	/* cookies from Server */
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 26b7b56f96c4..3d6481768308 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -344,8 +344,10 @@ enum e_zufs_operation {
 	ZUFS_OP_WRITE,
 	ZUFS_OP_GET_BLOCK,
 	ZUFS_OP_PUT_BLOCK,
+	ZUFS_OP_MMAP_CLOSE,
 	ZUFS_OP_GET_SYMLINK,
 	ZUFS_OP_SETATTR,
+	ZUFS_OP_SYNC,
 	ZUFS_OP_FALLOCATE,
 	ZUFS_OP_LLSEEK,
@@ -516,6 +518,13 @@ static inline bool zufs_zde_emit(struct zufs_readdir_iter *rdi, __u64 ino,
 	return true;
 }
 
+struct zufs_ioc_mmap_close {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_inode_info *zus_ii;
+	__u64 rw; /* Some flags + READ or WRITE */
+};
+
 /* ZUFS_OP_GET_SYMLINK */
 struct zufs_ioc_get_link {
 	struct zufs_ioc_hdr hdr;