@@ -17,5 +17,6 @@ zuf-y += md.o t1.o t2.o
zuf-y += zuf-core.o zuf-root.o
# Main FS
+zuf-y += rw.o
zuf-y += super.o inode.o directory.o namei.o file.o symlink.o
zuf-y += module.o
@@ -43,6 +43,9 @@ int zufc_dispatch(struct zuf_root_info *zri, struct zufs_ioc_hdr *hdr,
zuf_dispatch_init(&zdo, hdr, pages, nump);
return __zufc_dispatch(zri, &zdo);
}
+int zufc_pigy_put(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo,
+ struct zufs_ioc_IO *io, uint iom_n, ulong *bns, bool do_now);
+void zufc_goose_all_zts(struct zuf_root_info *zri, struct inode *inode);
/* zuf-root.c */
int zufr_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs);
@@ -92,6 +95,25 @@ int zuf_remove_dentry(struct inode *dir, struct qstr *str, struct inode *inode);
uint zuf_prepare_symname(struct zufs_ioc_new_inode *ioc_new_inode,
const char *symname, ulong len, struct page *pages[2]);
+/* rw.c */
+int zuf_rw_read_page(struct zuf_sb_info *sbi, struct inode *inode,
+ struct page *page, u64 filepos);
+ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode,
+ struct kiocb *kiocb, struct iov_iter *ii);
+ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
+ struct kiocb *kiocb, struct iov_iter *ii);
+int _zufs_IO_get_multy(struct zuf_sb_info *sbi, struct inode *inode,
+ loff_t pos, ulong len, struct _io_gb_multy *io_gb);
+void _zufs_IO_put_multy(struct zuf_sb_info *sbi, struct inode *inode,
+ struct _io_gb_multy *io_gb);
+int zuf_rw_fallocate(struct inode *inode, uint mode, loff_t offset, loff_t len);
+int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
+ __u64 *iom_e, uint iom_n);
+int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
+ __u64 *iom_e_user, uint iom_n);
+int zuf_rw_file_range_compare(struct inode *i_in, loff_t pos_in,
+ struct inode *i_out, loff_t pos_out, loff_t len);
+
/* t1.c */
int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma);
@@ -13,6 +13,9 @@
* Sagi Manole <sagim@netapp.com>"
*/
+#include <linux/fs.h>
+#include <linux/uio.h>
+
#include "zuf.h"
long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
@@ -20,8 +23,78 @@ long __zuf_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
return -ENOTSUPP;
}
+static ssize_t zuf_read_iter(struct kiocb *kiocb, struct iov_iter *ii)
+{
+ struct inode *inode = file_inode(kiocb->ki_filp);
+ struct zuf_inode_info *zii = ZUII(inode);
+ ssize_t ret;
+
+ zuf_dbg_rw("[%ld] ppos=0x%llx len=0x%zx\n",
+ inode->i_ino, kiocb->ki_pos, iov_iter_count(ii));
+
+ file_accessed(kiocb->ki_filp);
+
+ zuf_r_lock(zii);
+
+ ret = zuf_rw_read_iter(inode->i_sb, inode, kiocb, ii);
+
+ zuf_r_unlock(zii);
+
+ zuf_dbg_rw("[%ld] => 0x%lx\n", inode->i_ino, ret);
+ return ret;
+}
+
+static ssize_t zuf_write_iter(struct kiocb *kiocb, struct iov_iter *ii)
+{
+ struct inode *inode = file_inode(kiocb->ki_filp);
+ struct zuf_inode_info *zii = ZUII(inode);
+ ssize_t ret;
+ loff_t end_offset;
+
+ ret = generic_write_checks(kiocb, ii);
+ if (unlikely(ret < 0)) {
+ zuf_dbg_vfs("[%ld] generic_write_checks => 0x%lx\n",
+ inode->i_ino, ret);
+ return ret;
+ }
+
+ zuf_r_lock(zii);
+
+ ret = file_remove_privs(kiocb->ki_filp);
+ if (unlikely(ret < 0))
+ goto out;
+
+ end_offset = kiocb->ki_pos + iov_iter_count(ii);
+ if (inode->i_size < end_offset) {
+ spin_lock(&inode->i_lock);
+ if (inode->i_size < end_offset) {
+ zii->zi->i_size = cpu_to_le64(end_offset);
+ i_size_write(inode, end_offset);
+ }
+ spin_unlock(&inode->i_lock);
+ }
+
+ zus_inode_cmtime_now(inode, zii->zi);
+
+ ret = zuf_rw_write_iter(inode->i_sb, inode, kiocb, ii);
+ if (unlikely(ret < 0)) {
+ /* TODO(sagi): do we want to truncate i_size? */
+ goto out;
+ }
+
+ inode->i_blocks = le64_to_cpu(zii->zi->i_blocks);
+
+out:
+ zuf_r_unlock(zii);
+
+ zuf_dbg_rw("[%ld] => 0x%lx\n", inode->i_ino, ret);
+ return ret;
+}
+
const struct file_operations zuf_file_operations = {
.open = generic_file_open,
+ .read_iter = zuf_read_iter,
+ .write_iter = zuf_write_iter,
};
const struct inode_operations zuf_file_inode_operations = {
@@ -287,6 +287,8 @@ void zuf_evict_inode(struct inode *inode)
zuf_w_lock(zii);
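+ /* Flush any pending pigy-put entries that still reference this inode */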
+ zufc_goose_all_zts(ZUF_ROOT(SBI(sb)), inode);
+
zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_FREE_INODE, 0);
inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -298,6 +300,8 @@ void zuf_evict_inode(struct inode *inode)
zuf_smw_lock(zii);
+ zufc_goose_all_zts(ZUF_ROOT(SBI(sb)), inode);
+
zuf_evict_dispatch(sb, zii->zus_ii, ZUFS_OP_EVICT_INODE, 0);
zuf_smw_unlock(zii);
@@ -585,5 +589,14 @@ void zuf_set_inode_flags(struct inode *inode, struct zus_inode *zi)
inode_has_no_xattr(inode);
}
+/* direct_IO is not called. We set an empty one so open(O_DIRECT) will be happy
+ */
+static ssize_t zuf_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+{
+ WARN_ON(1);
+ return 0;
+}
+
const struct address_space_operations zuf_aops = {
+ .direct_IO = zuf_direct_IO,
};
new file mode 100644
@@ -0,0 +1,959 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BRIEF DESCRIPTION
+ *
+ * Read/Write operations.
+ *
+ * Copyright (c) 2018 NetApp Inc. All rights reserved.
+ *
+ * ZUFS-License: GPL-2.0. See module.c for LICENSE details.
+ *
+ * Authors:
+ * Boaz Harrosh <boazh@netapp.com>
+ */
+#include <linux/fadvise.h>
+#include <linux/uio.h>
+#include <linux/delay.h>
+#include <asm/cacheflush.h>
+
+#include "zuf.h"
+#include "t2.h"
+
+#define rand_tag(kiocb) \
+ ((kiocb->ki_filp->f_mode & FMODE_RANDOM) ? ZUFS_RW_RAND : 0)
+#define kiocb_ra(kiocb) (&kiocb->ki_filp->f_ra)
+
+static const char *_pr_rw(uint rw)
+{
+ return (rw & WRITE) ? "WRITE" : "READ";
+}
+
+static int _ioc_bounds_check(struct zufs_iomap *ziom,
+ struct zufs_iomap *user_ziom, void *ziom_end)
+{
+ size_t iom_max_bytes = ziom_end - (void *)&user_ziom->iom_e;
+
+ if (unlikely((iom_max_bytes / sizeof(__u64) < ziom->iom_max))) {
+ zuf_err("kernel-buff-size(0x%zx) < ziom->iom_max(0x%x)\n",
+ (iom_max_bytes / sizeof(__u64)), ziom->iom_max);
+ return -EINVAL;
+ }
+
+ if (unlikely(ziom->iom_max < ziom->iom_n)) {
+ zuf_err("ziom->iom_max(0x%x) < ziom->iom_n(0x%x)\n",
+ ziom->iom_max, ziom->iom_n);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void _extract_gb_multy_bns(struct _io_gb_multy *io_gb,
+ struct zufs_ioc_IO *io_user)
+{
+ uint i;
+
+ /* Return of some T1 pages from GET_MULTY */
+ io_gb->iom_n = 0;
+ for (i = 0; i < io_gb->IO.ziom.iom_n; ++i) {
+ ulong bn = _zufs_iom_t1_bn(io_user->iom_e[i]);
+
+ if (unlikely(bn == -1)) {
+ zuf_err("!!!!");
+ break;
+ }
+ io_gb->bns[io_gb->iom_n++] = bn;
+ }
+}
+
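+/* Overflow handler (zdo->oh), runs in ZT context on Server return: takes a
+ * kernel copy of the returned IO header, bounds-checks the ziom, executes a
+ * returned iom payload when the Server asked for EZUFS_RETRY, and for
+ * ZUFS_OP_GET_MULTY decodes the returned t1 block numbers into io_gb->bns.
+ */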
+static int rw_overflow_handler(struct zuf_dispatch_op *zdo, void *arg,
+ ulong max_bytes)
+{
+ struct zufs_ioc_IO *io = container_of(zdo->hdr, typeof(*io), hdr);
+ struct zufs_ioc_IO *io_user = arg;
+ int err;
+
+ *io = *io_user;
+
+ err = _ioc_bounds_check(&io->ziom, &io_user->ziom, arg + max_bytes);
+ if (unlikely(err))
+ return err;
+
+ if ((io->hdr.err == -EZUFS_RETRY) &&
+ io->ziom.iom_n && _zufs_iom_pop(io->iom_e)) {
+
+ zuf_dbg_rw(
+ "[%s]zuf_iom_execute_sync(%d) max=0x%lx iom_e[%d] => %d\n",
+ zuf_op_name(io->hdr.operation), io->ziom.iom_n,
+ max_bytes, _zufs_iom_opt_type(io_user->iom_e),
+ io->hdr.err);
+
+ io->hdr.err = zuf_iom_execute_sync(zdo->sb, zdo->inode,
+ io_user->iom_e,
+ io->ziom.iom_n);
+ return EZUF_RETRY_DONE;
+ }
+
+ /* No tier ups needed */
+
+ if (io->hdr.err == -EZUFS_RETRY) {
+ zuf_warn("ZUSfs violating API EZUFS_RETRY with no payload\n");
+ /* Continue anyway, because we want to PUT all the GETs
+ * we did. But the Server is buggy
+ */
+ io->hdr.err = 0;
+ }
+
+ if (io->hdr.operation != ZUFS_OP_GET_MULTY)
+ return 0; /* We are finished */
+
+ /* ZUFS_OP_GET_MULTY Decoding at ZT context */
+
+ if (io->ziom.iom_n) {
+ struct _io_gb_multy *io_gb =
+ container_of(io, typeof(*io_gb), IO);
+
+ zuf_dbg_rw("[%s] _extract_bns(%d) iom_e[0x%llx]\n",
+ zuf_op_name(io->hdr.operation), io->ziom.iom_n,
+ io_user->iom_e[0]);
+
+ if (unlikely(ZUS_API_MAP_MAX_PAGES < io->ziom.iom_n)) {
+ zuf_err("[%s] leaking T1 (%d) iom_e[0x%llx]\n",
+ zuf_op_name(io->hdr.operation), io->ziom.iom_n,
+ io_user->iom_e[0]);
+
+ io->ziom.iom_n = ZUS_API_MAP_MAX_PAGES;
+ }
+
+ _extract_gb_multy_bns(io_gb, io_user);
+ }
+
+ return 0;
+}
+
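+/* Fill the common zufs_ioc_IO header fields and dispatch the operation to
+ * the Server, with rw_overflow_handler attached for the return path.
+ */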
+static int _IO_dispatch(struct zuf_sb_info *sbi, struct zufs_ioc_IO *IO,
+ struct zuf_inode_info *zii, int operation,
+ uint pgoffset, struct page **pages, uint nump,
+ u64 filepos, uint len)
+{
+ struct zuf_dispatch_op zdo;
+ int err;
+
+ IO->hdr.operation = operation;
+ IO->hdr.in_len = sizeof(*IO);
+ IO->hdr.out_len = sizeof(*IO);
+ IO->hdr.offset = pgoffset;
+ IO->hdr.len = len;
+ IO->zus_ii = zii->zus_ii;
+ IO->filepos = filepos;
+
+ zuf_dispatch_init(&zdo, &IO->hdr, pages, nump);
+ zdo.oh = rw_overflow_handler;
+ zdo.sb = sbi->sb;
+ zdo.inode = &zii->vfs_inode;
+
+ zuf_dbg_verbose("[%ld][%s] fp=0x%llx nump=0x%x len=0x%x\n",
+ zdo.inode ? zdo.inode->i_ino : -1,
+ zuf_op_name(operation), filepos, nump, len);
+
+ err = __zufc_dispatch(ZUF_ROOT(sbi), &zdo);
+ if (unlikely(err == -EZUFS_RETRY)) {
+ zuf_err("Unexpected ZUS return => %d\n", err);
+ err = -EIO;
+ }
+ return err;
+}
+
+int zuf_rw_read_page(struct zuf_sb_info *sbi, struct inode *inode,
+ struct page *page, u64 filepos)
+{
+ struct zufs_ioc_IO io = {};
+ struct page *pages[1];
+ uint nump;
+ int err;
+
+ pages[0] = page;
+ nump = 1;
+
+ err = _IO_dispatch(sbi, &io, ZUII(inode), ZUFS_OP_READ, 0, pages, nump,
+ filepos, PAGE_SIZE);
+ return err;
+}
+
+
+/* return < 0 on error; 0 means the ranges compare equal */
+int zuf_rw_file_range_compare(struct inode *i_in, loff_t pos_in,
+ struct inode *i_out, loff_t pos_out, loff_t len)
+{
+ struct super_block *sb = i_in->i_sb;
+ ulong bs = sb->s_blocksize;
+ struct page *p_in, *p_out;
+ void *a_in, *a_out;
+ int err = 0;
+
+ if (unlikely((pos_in & (bs - 1)) || (pos_out & (bs - 1)) ||
+ (bs != PAGE_SIZE))) {
+ zuf_err("[%ld]@0x%llx & [%ld]@0x%llx len=0x%llx bs=0x%lx\n",
+ i_in->i_ino, pos_in, i_out->i_ino, pos_out, len, bs);
+ return -EINVAL;
+ }
+
+ zuf_dbg_rw("[%ld]@0x%llx & [%ld]@0x%llx len=0x%llx\n",
+ i_in->i_ino, pos_in, i_out->i_ino, pos_out, len);
+
+ p_in = alloc_page(GFP_KERNEL);
+ p_out = alloc_page(GFP_KERNEL);
+ if (unlikely(!p_in || !p_out)) {
+ err = -ENOMEM;
+ goto out;
+ }
+ a_in = page_address(p_in);
+ a_out = page_address(p_out);
+
+ while (len) {
+ ulong l;
+
+ err = zuf_rw_read_page(SBI(sb), i_in, p_in, pos_in);
+ if (unlikely(err))
+ goto out;
+
+ err = zuf_rw_read_page(SBI(sb), i_out, p_out, pos_out);
+ if (unlikely(err))
+ goto out;
+
+ l = min_t(ulong, PAGE_SIZE, len);
+ if (memcmp(a_in, a_out, l)) {
+ err = -EBADE;
+ goto out;
+ }
+
+ pos_in += l;
+ pos_out += l;
+ len -= l;
+ }
+
+out:
+ __free_page(p_in);
+ __free_page(p_out);
+
+ return err;
+}
+
+/* ZERO a part of a single block. len does not cross a block boundary */
+int zuf_rw_fallocate(struct inode *inode, uint mode, loff_t pos, loff_t len)
+{
+ struct zufs_ioc_IO io = {};
+ int err;
+
+ io.last_pos = (len == ~0ULL) ? ~0ULL : pos + len;
+ io.rw = mode;
+
+ err = _IO_dispatch(SBI(inode->i_sb), &io, ZUII(inode),
+ ZUFS_OP_FALLOCATE, 0, NULL, 0, pos, 0);
+ return err;
+}
+
+static struct page *_addr_to_page(unsigned long addr)
+{
+ const void *p = (const void *)addr;
+
+ return is_vmalloc_addr(p) ? vmalloc_to_page(p) : virt_to_page(p);
+}
+
+static ssize_t _iov_iter_get_pages_kvec(struct iov_iter *ii,
+ struct page **pages, size_t maxsize, uint maxpages,
+ size_t *start)
+{
+ ssize_t bytes;
+ size_t i, nump;
+ unsigned long addr = (unsigned long)ii->kvec->iov_base;
+
+ *start = addr & (PAGE_SIZE - 1);
+ bytes = min_t(ssize_t, iov_iter_single_seg_count(ii), maxsize);
+ nump = min_t(size_t, DIV_ROUND_UP(bytes + *start, PAGE_SIZE), maxpages);
+
+ /* TODO: FUSE assumes single page for ITER_KVEC. Boaz: Remove? */
+ WARN_ON(nump > 1);
+
+ for (i = 0; i < nump; ++i) {
+ pages[i] = _addr_to_page(addr + (i * PAGE_SIZE));
+
+ get_page(pages[i]);
+ }
+ return bytes;
+}
+
+static ssize_t _iov_iter_get_pages_any(struct iov_iter *ii,
+ struct page **pages, size_t maxsize, uint maxpages,
+ size_t *start)
+{
+ ssize_t bytes;
+
+ bytes = unlikely(ii->type & ITER_KVEC) ?
+ _iov_iter_get_pages_kvec(ii, pages, maxsize, maxpages, start) :
+ iov_iter_get_pages(ii, pages, maxsize, maxpages, start);
+
+ if (unlikely(bytes < 0))
+ zuf_dbg_err("[%d] bytes=%ld type=%d count=%lu",
+ smp_processor_id(), bytes, ii->type, ii->count);
+
+ return bytes;
+}
+
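+/* Traditional IO path: pin the application pages of @ii in chunks of up to
+ * ZUS_API_MAP_MAX_PAGES and dispatch them to the Server, which acts on the
+ * pages directly and returns the advanced position in io.last_pos.
+ */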
+static ssize_t _zufs_IO(struct zuf_sb_info *sbi, struct inode *inode,
+ void *on_stack, uint max_on_stack,
+ struct iov_iter *ii, struct kiocb *kiocb,
+ struct file_ra_state *ra, int operation, uint rw)
+{
+ int err = 0;
+ loff_t start_pos = kiocb->ki_pos;
+ loff_t pos = start_pos;
+ enum big_alloc_type bat;
+ struct page **pages;
+ uint max_pages = min_t(uint,
+ md_o2p_up(iov_iter_count(ii) + (pos & ~PAGE_MASK)),
+ ZUS_API_MAP_MAX_PAGES);
+
+ pages = big_alloc(max_pages * sizeof(*pages), max_on_stack, on_stack,
+ GFP_NOFS, &bat);
+ if (unlikely(!pages)) {
+ zuf_err("Sigh on stack is best max_pages=%d\n", max_pages);
+ return -ENOMEM;
+ }
+
+ while (iov_iter_count(ii)) {
+ struct zufs_ioc_IO io = {};
+ uint nump;
+ ssize_t bytes;
+ size_t pgoffset;
+ uint i;
+
+ if (ra) {
+ io.ra.start = ra->start;
+ io.ra.ra_pages = ra->ra_pages;
+ io.ra.prev_pos = ra->prev_pos;
+ }
+ io.rw = rw;
+
+ bytes = _iov_iter_get_pages_any(ii, pages,
+ ZUS_API_MAP_MAX_SIZE,
+ ZUS_API_MAP_MAX_PAGES, &pgoffset);
+ if (unlikely(bytes < 0)) {
+ err = bytes;
+ break;
+ }
+
+ nump = DIV_ROUND_UP(bytes + pgoffset, PAGE_SIZE);
+
+ io.last_pos = pos;
+ err = _IO_dispatch(sbi, &io, ZUII(inode), operation,
+ pgoffset, pages, nump, pos, bytes);
+
+ bytes = io.last_pos - pos;
+
+ zuf_dbg_rw("[%ld] %s [0x%llx-0x%zx]\n",
+ inode->i_ino, _pr_rw(rw), pos, bytes);
+
+ iov_iter_advance(ii, bytes);
+ pos += bytes;
+
+ if (ra) {
+ ra->start = io.ra.start;
+ ra->ra_pages = io.ra.ra_pages;
+ ra->prev_pos = io.ra.prev_pos;
+ }
+ if (io.wr_unmap.len)
+ unmap_mapping_range(inode->i_mapping,
+ io.wr_unmap.offset,
+ io.wr_unmap.len, 0);
+
+ for (i = 0; i < nump; ++i)
+ put_page(pages[i]);
+
+ if (unlikely(err))
+ break;
+ }
+
+ big_free(pages, bat);
+
+ if (unlikely(pos == start_pos))
+ return err;
+
+ kiocb->ki_pos = pos;
+ return pos - start_pos;
+}
+
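+/* Ask the Server for a GET_MULTY contract: a set of t1 block numbers
+ * (io_gb->bns) covering @pos..@pos+len. The caller copies to/from these
+ * blocks in Kernel context and must release the contract with
+ * _zufs_IO_put_multy().
+ */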
+int _zufs_IO_get_multy(struct zuf_sb_info *sbi, struct inode *inode,
+ loff_t pos, ulong len, struct _io_gb_multy *io_gb)
+{
+ struct zufs_ioc_IO *IO = &io_gb->IO;
+ int err;
+
+ IO->hdr.operation = ZUFS_OP_GET_MULTY;
+ IO->hdr.in_len = sizeof(*IO);
+ IO->hdr.out_len = sizeof(*IO);
+ IO->hdr.len = len;
+ IO->zus_ii = ZUII(inode)->zus_ii;
+ IO->filepos = pos;
+ IO->last_pos = pos;
+
+ zuf_dispatch_init(&io_gb->zdo, &IO->hdr, NULL, 0);
+ io_gb->zdo.oh = rw_overflow_handler;
+ io_gb->zdo.sb = sbi->sb;
+ io_gb->zdo.inode = inode;
+ io_gb->zdo.bns = io_gb->bns;
+
+ err = __zufc_dispatch(ZUF_ROOT(sbi), &io_gb->zdo);
+ if (unlikely(err == -EZUFS_RETRY)) {
+ zuf_err("Unexpected ZUS return => %d\n", err);
+ err = -EIO;
+ }
+
+ if (unlikely(err)) {
+ /* An err from the Server means no contract and NO bns locked,
+ * so no puts
+ */
+ if ((err != -ENOSPC) && (err != -EIO) && (err != -EINTR))
+ zuf_warn("At this early stage show me %d\n", err);
+ if (io_gb->IO.ziom.iom_n)
+ zuf_err("Server Smoking iom_n=%u err=%d\n",
+ io_gb->IO.ziom.iom_n, err);
+ zuf_dbg_err("_IO_dispatch => %d\n", err);
+ return err;
+ }
+ if (unlikely(!io_gb->iom_n)) {
+ if (!io_gb->IO.ziom.iom_n) {
+ zuf_err("WANT tO SEE => %d\n", err);
+ return err;
+ }
+
+ _extract_gb_multy_bns(io_gb, &io_gb->IO);
+ if (unlikely(!io_gb->iom_n)) {
+ zuf_err("WHAT ????\n");
+ return err;
+ }
+ }
+ /* Even if _IO_dispatch returned an error but also returned some
+ * pages, we process those few pages and do an OP_PUT_MULTY (error ignored)
+ */
+ return 0;
+}
+
+void _zufs_IO_put_multy(struct zuf_sb_info *sbi, struct inode *inode,
+ struct _io_gb_multy *io_gb)
+{
+ bool put_now;
+ int err;
+
+ put_now = io_gb->IO.ret_flags &
+ (ZUFS_RET_PUT_NOW | ZUFS_RET_NEW | ZUFS_RET_LOCKED_PUT);
+
+ err = zufc_pigy_put(ZUF_ROOT(sbi), &io_gb->zdo, &io_gb->IO,
+ io_gb->iom_n, io_gb->bns, put_now);
+ if (unlikely(err))
+ zuf_warn("zufc_pigy_put => %d\n", err);
+}
+
+static inline int _read_one(struct zuf_sb_info *sbi, struct iov_iter *ii,
+ ulong bn, uint offset, uint len, int i)
+{
+ uint retl;
+
+ if (!bn) {
+ retl = iov_iter_zero(len, ii);
+ } else {
+ void *addr = md_addr_verify(sbi->md, md_p2o(bn));
+
+ if (unlikely(!addr)) {
+ zuf_err("Server bad bn[%d]=0x%lx bytes_more=0x%lx\n",
+ i, bn, iov_iter_count(ii));
+ return -EIO;
+ }
+ retl = copy_to_iter(addr + offset, len, ii);
+ }
+ if (unlikely(retl != len)) {
+ /* This can happen if we get a read-only ptr from the App */
+ zuf_dbg_err("copy_to_iter bn=0x%lx off=0x%x len=0x%x retl=0x%x\n",
+ bn, offset, len, retl);
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
+static inline int _write_one(struct zuf_sb_info *sbi, struct iov_iter *ii,
+ ulong bn, uint offset, uint len, int i)
+{
+ void *addr = md_addr_verify(sbi->md, md_p2o(bn));
+ uint retl;
+
+ if (unlikely(!addr)) {
+ zuf_err("Server bad page[%d] bn=0x%lx bytes_more=0x%lx\n",
+ i, bn, iov_iter_count(ii));
+ return -EIO;
+ }
+
+ retl = _copy_from_iter_flushcache(addr + offset, len, ii);
+ if (unlikely(retl != len)) {
+ /* FIXME: This can happen if we get a read-only ptr from the App */
+ zuf_err("copy_from_iter bn=0x%lx off=0x%x len=0x%x retl=0x%x\n",
+ bn, offset, len, retl);
+ return -EFAULT;
+ }
+ return 0;
+}
+
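+/* One GET_MULTY round: obtain up to ZUS_API_MAP_MAX_SIZE worth of t1 block
+ * numbers, copy to/from them in Kernel, then PUT the contract back.
+ */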
+static ssize_t _IO_gm_inner(struct zuf_sb_info *sbi, struct inode *inode,
+ ulong *bns, uint max_bns,
+ struct iov_iter *ii, struct file_ra_state *ra,
+ loff_t start, uint rw)
+{
+ loff_t pos = start;
+ uint offset = pos & (PAGE_SIZE - 1);
+ struct _io_gb_multy io_gb = { .bns = bns, };
+ ssize_t size;
+ int err;
+ uint i;
+
+ if (ra) {
+ io_gb.IO.ra.start = ra->start;
+ io_gb.IO.ra.ra_pages = ra->ra_pages;
+ io_gb.IO.ra.prev_pos = ra->prev_pos;
+ }
+ io_gb.IO.rw = rw;
+
+ size = min_t(ssize_t, ZUS_API_MAP_MAX_SIZE - offset,
+ iov_iter_count(ii));
+ err = _zufs_IO_get_multy(sbi, inode, pos, size, &io_gb);
+ if (unlikely(err))
+ return err;
+
+ if (ra) {
+ ra->start = io_gb.IO.ra.start;
+ ra->ra_pages = io_gb.IO.ra.ra_pages;
+ ra->prev_pos = io_gb.IO.ra.prev_pos;
+ }
+
+ if (unlikely(io_gb.IO.last_pos != (pos + size))) {
+ if (unlikely(io_gb.IO.last_pos < pos)) {
+ zuf_err("Server bad last_pos(0x%llx) <= pos(0x%llx) len=0x%lx\n",
+ io_gb.IO.last_pos, pos, iov_iter_count(ii));
+ err = -EIO;
+ goto out;
+ }
+
+ zuf_dbg_err("Short %s start(0x%llx) len=0x%lx last_pos(0x%llx)\n",
+ _pr_rw(rw), pos, iov_iter_count(ii),
+ io_gb.IO.last_pos);
+ size = io_gb.IO.last_pos - pos;
+ }
+
+ i = 0;
+ while (size) {
+ uint len;
+ ulong bn;
+
+ len = min_t(uint, PAGE_SIZE - offset, size);
+
+ bn = io_gb.bns[i];
+ if (rw & WRITE)
+ err = _write_one(sbi, ii, bn, offset, len, i);
+ else
+ err = _read_one(sbi, ii, bn, offset, len, i);
+ if (unlikely(err))
+ break;
+
+ zuf_dbg_rw("[%ld] %s [0x%llx-0x%x] bn=0x%lx [%d]\n",
+ inode->i_ino, _pr_rw(rw), pos, len, bn, i);
+
+ pos += len;
+ size -= len;
+ offset = 0;
+ if (io_gb.iom_n <= ++i)
+ break;
+ }
+out:
+ _zufs_IO_put_multy(sbi, inode, &io_gb);
+ if (io_gb.IO.wr_unmap.len)
+ unmap_mapping_range(inode->i_mapping, io_gb.IO.wr_unmap.offset,
+ io_gb.IO.wr_unmap.len, 0);
+
+ return unlikely(pos == start) ? err : pos - start;
+}
+
+static ssize_t _IO_gm(struct zuf_sb_info *sbi, struct inode *inode,
+ ulong *on_stack, uint max_on_stack,
+ struct iov_iter *ii, struct kiocb *kiocb,
+ struct file_ra_state *ra, uint rw)
+{
+ ssize_t size = 0;
+ ssize_t ret = 0;
+ enum big_alloc_type bat;
+ ulong *bns;
+ uint max_bns = min_t(uint,
+ md_o2p_up(iov_iter_count(ii) + (kiocb->ki_pos & ~PAGE_MASK)),
+ ZUS_API_MAP_MAX_PAGES);
+
+ bns = big_alloc(max_bns * sizeof(ulong), max_on_stack, on_stack,
+ GFP_NOFS, &bat);
+ if (unlikely(!bns)) {
+ zuf_err("life was more simple on the stack max_bns=%d\n",
+ max_bns);
+ return -ENOMEM;
+ }
+
+ while (iov_iter_count(ii)) {
+ ret = _IO_gm_inner(sbi, inode, bns, max_bns, ii, ra,
+ kiocb->ki_pos, rw);
+ if (unlikely(ret < 0))
+ break;
+
+ kiocb->ki_pos += ret;
+ size += ret;
+ }
+
+ big_free(bns, bat);
+
+ return size ?: ret;
+}
+
+ssize_t zuf_rw_read_iter(struct super_block *sb, struct inode *inode,
+ struct kiocb *kiocb, struct iov_iter *ii)
+{
+ long on_stack[ZUF_MAX_STACK(8) / sizeof(long)];
+ ulong rw = READ | rand_tag(kiocb);
+
+ /* EOF protection */
+ if (unlikely(kiocb->ki_pos > i_size_read(inode)))
+ return 0;
+
+ iov_iter_truncate(ii, i_size_read(inode) - kiocb->ki_pos);
+ if (unlikely(!iov_iter_count(ii))) {
+ /* Don't let zero len reads have any effect */
+ zuf_dbg_rw("called with NULL len\n");
+ return 0;
+ }
+
+ if (zuf_is_nio_reads(inode))
+ return _IO_gm(SBI(sb), inode, on_stack, sizeof(on_stack),
+ ii, kiocb, kiocb_ra(kiocb), rw);
+
+ return _zufs_IO(SBI(sb), inode, on_stack, sizeof(on_stack), ii,
+ kiocb, kiocb_ra(kiocb), ZUFS_OP_READ, rw);
+}
+
+ssize_t zuf_rw_write_iter(struct super_block *sb, struct inode *inode,
+ struct kiocb *kiocb, struct iov_iter *ii)
+{
+ long on_stack[ZUF_MAX_STACK(8) / sizeof(long)];
+ ulong rw = WRITE;
+
+ if (kiocb->ki_filp->f_flags & O_DSYNC ||
+ IS_SYNC(kiocb->ki_filp->f_mapping->host))
+ rw |= ZUFS_RW_DSYNC;
+ if (kiocb->ki_filp->f_flags & O_DIRECT)
+ rw |= ZUFS_RW_DIRECT;
+
+ if (zuf_is_nio_writes(inode))
+ return _IO_gm(SBI(sb), inode, on_stack, sizeof(on_stack),
+ ii, kiocb, kiocb_ra(kiocb), rw);
+
+ return _zufs_IO(SBI(sb), inode, on_stack, sizeof(on_stack),
+ ii, kiocb, kiocb_ra(kiocb), ZUFS_OP_WRITE, rw);
+}
+
+/* ~~~~ iom_dec.c ~~~ */
+/* for now keeping this here (at rw.c) looks logical */
+
+static int __iom_add_t2_io_len(struct super_block *sb, struct t2_io_state *tis,
+ zu_dpp_t t1, ulong t2_bn, __u64 num_pages)
+{
+ void *ptr;
+ struct page *page;
+ int i, err;
+
+ ptr = zuf_dpp_t_addr(sb, t1);
+ if (unlikely(!ptr)) {
+ zuf_err("Bad t1 zu_dpp_t t1=0x%llx t2=0x%lx num_pages=0x%llx\n",
+ t1, t2_bn, num_pages);
+ return -EFAULT; /* zuf_dpp_t_addr already yelled */
+ }
+
+ page = virt_to_page(ptr);
+ if (unlikely(!page)) {
+ zuf_err("bad t1(0x%llx)\n", t1);
+ return -EFAULT;
+ }
+
+ for (i = 0; i < num_pages; ++i) {
+ err = t2_io_add(tis, t2_bn++, page++);
+ if (unlikely(err))
+ return err;
+ }
+ return 0;
+}
+
+static int iom_add_t2_io_len(struct super_block *sb, struct t2_io_state *tis,
+ __u64 **cur_e)
+{
+ struct zufs_iom_t2_io_len *t2iol = (void *)*cur_e;
+ int err = __iom_add_t2_io_len(sb, tis, t2iol->iom.t1_val,
+ _zufs_iom_first_val(&t2iol->iom.t2_val),
+ t2iol->num_pages);
+
+ *cur_e = (void *)(t2iol + 1);
+ return err;
+}
+
+static int iom_add_t2_io(struct super_block *sb, struct t2_io_state *tis,
+ __u64 **cur_e)
+{
+ struct zufs_iom_t2_io *t2io = (void *)*cur_e;
+
+ int err = __iom_add_t2_io_len(sb, tis, t2io->t1_val,
+ _zufs_iom_first_val(&t2io->t2_val), 1);
+
+ *cur_e = (void *)(t2io + 1);
+ return err;
+}
+
+static int iom_t2_zusmem_io(struct super_block *sb, struct t2_io_state *tis,
+ __u64 **cur_e)
+{
+ struct zufs_iom_t2_zusmem_io *mem_io = (void *)*cur_e;
+ ulong t2_bn = _zufs_iom_first_val(&mem_io->t2_val);
+ ulong user_ptr = (ulong)mem_io->zus_mem_ptr;
+ int rw = _zufs_iom_opt_type(*cur_e) == IOM_T2_ZUSMEM_WRITE ?
+ WRITE : READ;
+ int num_p = md_o2p_up(mem_io->len);
+ int num_p_r;
+ struct page *pages[16];
+ int i, err = 0;
+
+ if (16 < num_p) {
+ zuf_err("num_p(%d) > 16\n", num_p);
+ return -EINVAL;
+ }
+
+ num_p_r = get_user_pages_fast(user_ptr, num_p, rw,
+ pages);
+ if (num_p_r != num_p) {
+ zuf_err("!!!! get_user_pages_fast num_p_r(%d) != num_p(%d)\n",
+ num_p_r, num_p);
+ err = -EFAULT;
+ goto out;
+ }
+
+ for (i = 0; i < num_p_r && !err; ++i)
+ err = t2_io_add(tis, t2_bn++, pages[i]);
+
+out:
+ for (i = 0; i < num_p_r; ++i)
+ put_page(pages[i]);
+
+ *cur_e = (void *)(mem_io + 1);
+ return err;
+}
+
+static int iom_unmap(struct super_block *sb, struct inode *inode, __u64 **cur_e)
+{
+ struct zufs_iom_unmap *iom_unmap = (void *)*cur_e;
+ struct inode *inode_look = NULL;
+ ulong unmap_index = _zufs_iom_first_val(&iom_unmap->unmap_index);
+ ulong unmap_n = iom_unmap->unmap_n;
+ ulong ino = iom_unmap->ino;
+
+ if (!inode || ino) {
+ if (WARN_ON(!ino)) {
+ zuf_err("[%ld] 0x%lx-0x%lx\n",
+ inode ? inode->i_ino : -1, unmap_index,
+ unmap_n);
+ goto out;
+ }
+ inode_look = ilookup(sb, ino);
+ if (!inode_look) {
+ /* Between the time we requested an unmap and now, the inode
+ * was evicted from the cache, so it surely no longer has any
+ * mappings. The job was already done for us. Even if a racing
+ * thread reloads the inode it will not have the mapping we
+ * wanted to clear, only new ones.
+ * TODO: For now warn when this happens, because in current
+ * usage it cannot happen. But before upstream we should
+ * convert to zuf_dbg_err
+ */
+ zuf_warn("[%ld] 0x%lx-0x%lx\n",
+ ino, unmap_index, unmap_n);
+ goto out;
+ }
+
+ inode = inode_look;
+ }
+
+ zuf_dbg_rw("[%ld] 0x%lx-0x%lx\n", inode->i_ino, unmap_index, unmap_n);
+
+ unmap_mapping_range(inode->i_mapping, md_p2o(unmap_index),
+ md_p2o(unmap_n), 0);
+
+ if (inode_look)
+ iput(inode_look);
+
+out:
+ *cur_e = (void *)(iom_unmap + 1);
+ return 0;
+}
+
+static int iom_wbinv(__u64 **cur_e)
+{
+ wbinvd();
+
+ ++*cur_e;
+
+ return 0;
+}
+
+struct _iom_exec_info {
+ struct super_block *sb;
+ struct inode *inode;
+ struct t2_io_state *rd_tis;
+ struct t2_io_state *wr_tis;
+ __u64 *iom_e;
+ uint iom_n;
+ bool print;
+};
+
+static int _iom_execute_inline(struct _iom_exec_info *iei)
+{
+ __u64 *cur_e, *end_e;
+ int err = 0;
+#ifdef CONFIG_ZUF_DEBUG
+ uint wrs = 0;
+ uint rds = 0;
+ uint uns = 0;
+ uint wrmem = 0;
+ uint rdmem = 0;
+ uint wbinv = 0;
+# define WRS() (++wrs)
+# define RDS() (++rds)
+# define UNS() (++uns)
+# define WRMEM() (++wrmem)
+# define RDMEM() (++rdmem)
+# define WBINV() (++wbinv)
+#else
+# define WRS()
+# define RDS()
+# define UNS()
+# define WRMEM()
+# define RDMEM()
+# define WBINV()
+#endif /* !def CONFIG_ZUF_DEBUG */
+
+ cur_e = iei->iom_e;
+ end_e = cur_e + iei->iom_n;
+ while (cur_e && (cur_e < end_e)) {
+ uint op;
+
+ op = _zufs_iom_opt_type(cur_e);
+
+ switch (op) {
+ case IOM_NONE:
+ return 0;
+
+ case IOM_T2_WRITE:
+ err = iom_add_t2_io(iei->sb, iei->wr_tis, &cur_e);
+ WRS();
+ break;
+ case IOM_T2_READ:
+ err = iom_add_t2_io(iei->sb, iei->rd_tis, &cur_e);
+ RDS();
+ break;
+
+ case IOM_T2_WRITE_LEN:
+ err = iom_add_t2_io_len(iei->sb, iei->wr_tis, &cur_e);
+ WRS();
+ break;
+ case IOM_T2_READ_LEN:
+ err = iom_add_t2_io_len(iei->sb, iei->rd_tis, &cur_e);
+ RDS();
+ break;
+
+ case IOM_T2_ZUSMEM_WRITE:
+ err = iom_t2_zusmem_io(iei->sb, iei->wr_tis, &cur_e);
+ WRMEM();
+ break;
+ case IOM_T2_ZUSMEM_READ:
+ err = iom_t2_zusmem_io(iei->sb, iei->rd_tis, &cur_e);
+ RDMEM();
+ break;
+
+ case IOM_UNMAP:
+ err = iom_unmap(iei->sb, iei->inode, &cur_e);
+ UNS();
+ break;
+
+ case IOM_WBINV:
+ err = iom_wbinv(&cur_e);
+ WBINV();
+ break;
+
+ default:
+ zuf_err("!!!!! Bad opt %d\n",
+ _zufs_iom_opt_type(cur_e));
+ err = -EIO;
+ break;
+ }
+
+ if (unlikely(err))
+ break;
+ }
+
+#ifdef CONFIG_ZUF_DEBUG
+ zuf_dbg_rw("exec wrs=%d rds=%d uns=%d rdmem=%d wrmem=%d => %d\n",
+ wrs, rds, uns, rdmem, wrmem, err);
+#endif
+
+ return err;
+}
+
+/* @inode here is the default inode, used when ioc_unmap->ino is zero;
+ * this is an optimization for the unmap done on the write_iter hot path.
+ */
+int zuf_iom_execute_sync(struct super_block *sb, struct inode *inode,
+ __u64 *iom_e_user, uint iom_n)
+{
+ struct zuf_sb_info *sbi = SBI(sb);
+ struct t2_io_state rd_tis = {};
+ struct t2_io_state wr_tis = {};
+ struct _iom_exec_info iei = {};
+ int err, err_r, err_w;
+
+ t2_io_begin(sbi->md, READ, NULL, 0, -1, &rd_tis);
+ t2_io_begin(sbi->md, WRITE, NULL, 0, -1, &wr_tis);
+
+ iei.sb = sb;
+ iei.inode = inode;
+ iei.rd_tis = &rd_tis;
+ iei.wr_tis = &wr_tis;
+ iei.iom_e = iom_e_user;
+ iei.iom_n = iom_n;
+ iei.print = 0;
+
+ err = _iom_execute_inline(&iei);
+
+ err_r = t2_io_end(&rd_tis, true);
+ err_w = t2_io_end(&wr_tis, true);
+
+ /* TODO: not sure if OK when _iom_execute returns -ENOMEM.
+ * In such a case we might be better off skipping the t2_io_end()s.
+ */
+ return err ?: (err_r ?: err_w);
+}
+
+int zuf_iom_execute_async(struct super_block *sb, struct zus_iomap_build *iomb,
+ __u64 *iom_e_user, uint iom_n)
+{
+ zuf_err("Async IOM NOT supported Yet!!!\n");
+ return -EFAULT;
+}
@@ -25,6 +25,20 @@
#include "relay.h"
enum { INITIAL_ZT_CHANNELS = 3 };
+#define _ZT_MAX_PIGY_PUT \
+ ((ZUS_API_MAP_MAX_PAGES * sizeof(__u64) + \
+ sizeof(struct zufs_ioc_IO)) * INITIAL_ZT_CHANNELS)
+
+enum { PG0 = 0, PG1 = 1, PG2 = 2, PG3 = 3, PG4 = 4, PG5 = 5 };
+struct __pigi_put_it {
+ void *buff;
+ void *waiter;
+ uint s; /* total encoded bytes */
+ uint last; /* So we can update last zufs_ioc_hdr->flags */
+ bool needs_goosing;
+ ulong inodes[PG5 + 1];
+ uint ic;
+};
struct zufc_thread {
struct zuf_special_file hdr;
@@ -40,6 +54,12 @@ struct zufc_thread {
/* Next operation*/
struct zuf_dispatch_op *zdo;
+
+ /* Secondary chans point to the 0-channel's
+ * pigi_put_chan0
+ */
+ struct __pigi_put_it pigi_put_chan0;
+ struct __pigi_put_it *pigi_put;
};
struct zuf_threads_pool {
@@ -76,7 +96,14 @@ const char *zuf_op_name(enum e_zufs_operation op)
CASE_ENUM_NAME(ZUFS_OP_RENAME);
CASE_ENUM_NAME(ZUFS_OP_READDIR);
+ CASE_ENUM_NAME(ZUFS_OP_READ);
+ CASE_ENUM_NAME(ZUFS_OP_PRE_READ);
+ CASE_ENUM_NAME(ZUFS_OP_WRITE);
CASE_ENUM_NAME(ZUFS_OP_SETATTR);
+
+ CASE_ENUM_NAME(ZUFS_OP_GET_MULTY);
+ CASE_ENUM_NAME(ZUFS_OP_PUT_MULTY);
+ CASE_ENUM_NAME(ZUFS_OP_NOOP);
case ZUFS_OP_MAX_OPT:
default:
return "UNKNOWN";
@@ -543,6 +570,238 @@ static void _prep_header_size_op(struct zufs_ioc_hdr *hdr,
hdr->err = err;
}
+/* ~~~~~ pigi_put logic ~~~~~ */
+struct _goose_waiter {
+ struct kref kref;
+ struct zuf_root_info *zri;
+ ulong inode; /* We use the inode address as a unique tag */
+};
+
+static void _last_goose(struct kref *kref)
+{
+ struct _goose_waiter *gw = container_of(kref, typeof(*gw), kref);
+
+ wake_up_var(&gw->kref);
+}
+
+static void _goose_put(struct _goose_waiter *gw)
+{
+ kref_put(&gw->kref, _last_goose);
+}
+
+static void _goose_get(struct _goose_waiter *gw)
+{
+ kref_get(&gw->kref);
+}
+
+static void _goose_wait(struct _goose_waiter *gw)
+{
+ wait_var_event(&gw->kref, !kref_read(&gw->kref));
+}
+
+static void _pigy_put_encode(struct zufs_ioc_IO *io,
+ struct zufs_ioc_IO *io_user, ulong *bns)
+{
+ uint i;
+
+ *io_user = *io;
+ for (i = 0; i < io->ziom.iom_n; ++i)
+ _zufs_iom_enc_bn(&io_user->ziom.iom_e[i], bns[i], 0);
+
+ io_user->hdr.in_len = _ioc_IO_size(io->ziom.iom_n);
+}
+
+static void pigy_put_dh(struct zuf_dispatch_op *zdo, void *pzt, void *parg)
+{
+ struct zufs_ioc_IO *io = container_of(zdo->hdr, typeof(*io), hdr);
+ struct zufs_ioc_IO *io_user = parg;
+
+ _pigy_put_encode(io, io_user, zdo->bns);
+}
+
+static int _pigy_put_now(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo)
+{
+ int err;
+
+ zdo->dh = pigy_put_dh;
+
+ err = __zufc_dispatch(zri, zdo);
+ if (unlikely(err == -EZUFS_RETRY)) {
+ zuf_err("Unexpected ZUS return => %d\n", err);
+ err = -EIO;
+ }
+ return err;
+}
+
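+/* Release a GET_MULTY contract. When possible the PUT is only encoded into
+ * the per-cpu chan-0 pigy_put buffer and piggybacks on the next command
+ * that goes out through this cpu's ZT. When @do_now is set, the buffer is
+ * full, or too many inodes are already pending, an explicit
+ * ZUFS_OP_PUT_MULTY is dispatched instead.
+ */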
+int zufc_pigy_put(struct zuf_root_info *zri, struct zuf_dispatch_op *zdo,
+ struct zufs_ioc_IO *io, uint iom_n, ulong *bns, bool do_now)
+{
+ struct zufc_thread *zt;
+ struct zufs_ioc_IO *io_user;
+ uint pigi_put_s;
+ int cpu;
+
+ io->hdr.operation = ZUFS_OP_PUT_MULTY;
+ io->hdr.out_len = 0; /* No returns from put */
+ io->ret_flags = 0;
+ io->ziom.iom_n = iom_n;
+ zdo->bns = bns;
+
+ pigi_put_s = _ioc_IO_size(iom_n);
+
+ /* FIXME: Pedantic check remove please */
+ if (WARN_ON(zdo->__locked_zt && !do_now))
+ do_now = true;
+
+ cpu = get_cpu();
+
+ zt = _zt_from_cpu(zri, cpu, 0);
+ if (do_now || (zt->pigi_put->s + pigi_put_s > _ZT_MAX_PIGY_PUT) ||
+ (zt->pigi_put->ic > PG5)) {
+ put_cpu();
+
+ /* NOTE: The pigy_put buffer is full, so we dispatch a put NOW,
+ * which also takes the full pigy_put buffer with it.
+ * At the Server the pigy_put entries are executed first, then
+ * this one, so the order of puts is preserved (not that it matters)
+ */
+ if (!do_now)
+ zuf_dbg_perf(
+ "[%ld] iom_n=0x%x zt->pigi_put->s=0x%x + 0x%x > 0x%lx ic=%d\n",
+ zdo->inode->i_ino, iom_n, zt->pigi_put->s,
+ pigi_put_s, _ZT_MAX_PIGY_PUT,
+ zt->pigi_put->ic++);
+
+ return _pigy_put_now(zri, zdo);
+ }
+
+ /* Mark last one as has more */
+ if (zt->pigi_put->s) {
+ io_user = zt->pigi_put->buff + zt->pigi_put->last;
+ io_user->hdr.flags |= ZUFS_H_HAS_PIGY_PUT;
+ }
+
+ io_user = zt->pigi_put->buff + zt->pigi_put->s;
+ _pigy_put_encode(io, io_user, bns);
+ zt->pigi_put->last = zt->pigi_put->s;
+ zt->pigi_put->s += pigi_put_s;
+ zt->pigi_put->inodes[zt->pigi_put->ic++] = (ulong)zdo->inode;
+
+ put_cpu();
+ return 0;
+}
+
+/* Add the accumulated pigy_put buff to the current command.
+ * Always runs in the context of a ZT
+ */
+static void _pigy_put_add_to_ioc(struct zuf_root_info *zri,
+ struct zufc_thread *zt)
+{
+ struct zufs_ioc_hdr *hdr = zt->opt_buff;
+ struct __pigi_put_it *pigi = zt->pigi_put;
+
+ if (unlikely(!pigi->s))
+ return;
+
+ if (unlikely(pigi->s + hdr->in_len > zt->max_zt_command)) {
+ zuf_err("!!! Should not pigi_put->s(%d) + in_len(%d) > max_zt_command(%ld)\n",
+ pigi->s, hdr->in_len, zt->max_zt_command);
+ /* TODO: we must check at init time that max_zt_command is not
+ * too small
+ */
+ return;
+ }
+
+ memcpy((void *)hdr + hdr->in_len, pigi->buff, pigi->s);
+ hdr->flags |= ZUFS_H_HAS_PIGY_PUT;
+ pigi->s = pigi->last = 0;
+ pigi->ic = 0;
+ /* for every 3 channels */
+ pigi->inodes[PG0] = pigi->inodes[PG1] = pigi->inodes[PG2] = 0;
+ pigi->inodes[PG3] = pigi->inodes[PG4] = pigi->inodes[PG5] = 0;
+}
+
+static void _goose_prep(struct zuf_root_info *zri,
+ struct zufc_thread *zt)
+{
+ _prep_header_size_op(zt->opt_buff, ZUFS_OP_NOOP, 0);
+ _pigy_put_add_to_ioc(zri, zt);
+
+ zt->pigi_put->needs_goosing = false;
+}
+
+static inline bool _zt_pigi_has_inode(struct __pigi_put_it *pigi,
+ ulong inode)
+{
+ return pigi->ic &&
+ ((pigi->inodes[PG0] == inode) ||
+ (pigi->inodes[PG1] == inode) ||
+ (pigi->inodes[PG2] == inode) ||
+ (pigi->inodes[PG3] == inode) ||
+ (pigi->inodes[PG4] == inode) ||
+ (pigi->inodes[PG5] == inode));
+}
+
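+/* Called on each cpu via on_each_cpu(): if this cpu's (shared chan-0)
+ * pigy_put buffer holds puts for @gw->inode, pick an idle channel (or fall
+ * back to chan 0), mark it for goosing, register the waiter and, if that
+ * ZT is idle, wake it so the buffer gets flushed.
+ */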
+static void _goose_one(void *info)
+{
+ struct _goose_waiter *gw = info;
+ struct zuf_root_info *zri = gw->zri;
+ struct zufc_thread *zt;
+ int cpu = smp_processor_id();
+ uint c;
+
+ /* Look for the least busy channel. If all are busy we are left with zt0 */
+ for (c = INITIAL_ZT_CHANNELS; c; --c) {
+ zt = _zt_from_cpu(zri, cpu, c - 1);
+ if (unlikely(!(zt && zt->hdr.file)))
+ return; /* We are crashing */
+
+ if (!zt->pigi_put->s || zt->pigi_put->needs_goosing)
+ return; /* this cpu is goose empty */
+
+ if (!_zt_pigi_has_inode(zt->pigi_put, gw->inode))
+ return;
+ if (!zt->zdo)
+ break;
+ }
+
+ /* Tell them to ... */
+ zt->pigi_put->needs_goosing = true;
+ _goose_get(gw);
+ zt->pigi_put->waiter = gw;
+ if (!zt->zdo)
+ relay_fss_wakeup(&zt->relay);
+}
+
+/* NOTE: @inode must not be NULL */
+void zufc_goose_all_zts(struct zuf_root_info *zri, struct inode *inode)
+{
+ struct _goose_waiter gw;
+
+ if (!S_ISREG(inode->i_mode) || !(inode->i_size || inode->i_blocks))
+ return;
+
+ /* No point in two goosers fighting; we are goosing for everyone.
+ * This also ensures there is only one zt->pigi_put->waiter at a time
+ */
+ mutex_lock(&zri->sbl_lock);
+
+ gw.zri = zri;
+ kref_init(&gw.kref);
+ gw.inode = (ulong)inode;
+
+ on_each_cpu(_goose_one, &gw, true);
+
+ if (kref_read(&gw.kref) == 1)
+ goto out;
+
+ _goose_put(&gw); /* put kref_init's 1 */
+ _goose_wait(&gw);
+
+out:
+ mutex_unlock(&zri->sbl_lock);
+}
+
/* ~~~~~ ZT thread operations ~~~~~ */
static int _zu_init(struct file *file, void *parg)
@@ -591,6 +850,24 @@ static int _zu_init(struct file *file, void *parg)
goto out;
}
+ if (zt->chan == 0) {
+ zt->pigi_put = &zt->pigi_put_chan0;
+
+ zt->pigi_put->buff = vmalloc(_ZT_MAX_PIGY_PUT);
+ if (unlikely(!zt->pigi_put->buff)) {
+ vfree(zt->opt_buff);
+ zi_init.hdr.err = -ENOMEM;
+ goto out;
+ }
+ zt->pigi_put->needs_goosing = false;
+ zt->pigi_put->last = zt->pigi_put->s = 0;
+ } else {
+ struct zufc_thread *zt0;
+
+ zt0 = _zt_from_cpu(ZRI(file->f_inode->i_sb), cpu, 0);
+ zt->pigi_put = &zt0->pigi_put_chan0;
+ }
+
file->private_data = &zt->hdr;
out:
err = copy_to_user(parg, &zi_init, sizeof(zi_init));
@@ -625,6 +902,9 @@ static void zufc_zt_release(struct file *file)
msleep(1000); /* crap */
}
+ if (zt->chan == 0)
+ vfree(zt->pigi_put->buff);
+
vfree(zt->opt_buff);
memset(zt, 0, sizeof(*zt));
}
@@ -706,9 +986,25 @@ static int _copy_outputs(struct zufc_thread *zt, void *arg)
}
}
+static bool _need_channel_lock(struct zufc_thread *zt)
+{
+ struct zufs_ioc_IO *ret_io = zt->opt_buff;
+
+ /* Only ZUFS_OP_GET_MULTY is allowed channel locking,
+ * because it absolutely must and I trust the code.
+ * If you need a new channel-locking command come talk
+ * to me first.
+ */
+ return (ret_io->hdr.err == 0) &&
+ (ret_io->hdr.operation == ZUFS_OP_GET_MULTY) &&
+ (ret_io->ret_flags & ZUFS_RET_LOCKED_PUT) &&
+ (ret_io->ziom.iom_n != 0);
+}
+
static int _zu_wait(struct file *file, void *parg)
{
struct zufc_thread *zt;
+ struct zufs_ioc_hdr *user_hdr;
bool __chan_is_locked = false;
int err;
@@ -730,6 +1026,10 @@ static int _zu_wait(struct file *file, void *parg)
goto err;
}
+ user_hdr = zt->opt_buff;
+ if (user_hdr->flags & ZUFS_H_HAS_PIGY_PUT)
+ user_hdr->flags &= ~ZUFS_H_HAS_PIGY_PUT;
+
if (relay_is_app_waiting(&zt->relay)) {
if (unlikely(!zt->zdo)) {
zuf_err("User has gone...\n");
@@ -751,13 +1051,29 @@ static int _zu_wait(struct file *file, void *parg)
_unmap_pages(zt, zt->zdo->pages, zt->zdo->nump);
- zt->zdo = NULL;
+ if (unlikely(!err && _need_channel_lock(zt))) {
+ zt->zdo->__locked_zt = zt;
+ __chan_is_locked = true;
+ } else {
+ zt->zdo = NULL;
+ }
if (unlikely(err)) /* _copy_outputs returned an err */
goto err;
relay_app_wakeup(&zt->relay);
}
+ if (zt->pigi_put->needs_goosing && !__chan_is_locked) {
+ /* go do a cycle and come back */
+ _goose_prep(ZRI(file->f_inode->i_sb), zt);
+ return 0;
+ }
+
+ if (zt->pigi_put->waiter) {
+ _goose_put(zt->pigi_put->waiter);
+ zt->pigi_put->waiter = NULL;
+ }
+
err = __relay_fss_wait(&zt->relay, __chan_is_locked);
if (err)
zuf_dbg_err("[%d] relay error: %d\n", zt->no, err);
@@ -770,8 +1086,16 @@ static int _zu_wait(struct file *file, void *parg)
* we should have a bit set in zt->zdo->hdr set per operation.
* TODO: Why this does not work?
*/
- _map_pages(zt, zt->zdo->pages, zt->zdo->nump, 0);
+ _map_pages(zt, zt->zdo->pages, zt->zdo->nump,
+ zt->zdo->hdr->operation == ZUFS_OP_WRITE);
+ if (zt->pigi_put->s)
+ _pigy_put_add_to_ioc(ZRI(file->f_inode->i_sb), zt);
} else {
+ if (zt->pigi_put->needs_goosing) {
+ _goose_prep(ZRI(file->f_inode->i_sb), zt);
+ return 0;
+ }
+
/* This Means we were released by _zu_break */
zuf_dbg_zus("_zu_break? => %d\n", err);
_prep_header_size_op(zt->opt_buff, ZUFS_OP_BREAK, err);
@@ -953,6 +1277,30 @@ static inline struct zu_exec_buff *_ebuff_from_file(struct file *file)
return ebuff;
}
+static int _ebuff_bounds_check(struct zu_exec_buff *ebuff, ulong buff,
+ struct zufs_iomap *ziom,
+ struct zufs_iomap *user_ziom, void *ziom_end)
+{
+ size_t iom_max_bytes = ziom_end - (void *)&user_ziom->iom_e;
+
+ if (buff != ebuff->vma->vm_start ||
+ ebuff->vma->vm_end < buff + iom_max_bytes) {
+ WARN_ON_ONCE(1);
+ zuf_err("Executing out off bound vm_start=0x%lx vm_end=0x%lx buff=0x%lx buff_end=0x%lx\n",
+ ebuff->vma->vm_start, ebuff->vma->vm_end, buff,
+ buff + iom_max_bytes);
+ return -EINVAL;
+ }
+
+ if (unlikely((iom_max_bytes / sizeof(__u64) < ziom->iom_max)))
+ return -EINVAL;
+
+ if (unlikely(ziom->iom_max < ziom->iom_n))
+ return -EINVAL;
+
+ return 0;
+}
+
static int _zu_ebuff_alloc(struct file *file, void *arg)
{
struct zufs_ioc_alloc_buffer ioc_alloc;
@@ -1004,6 +1352,52 @@ static void zufc_ebuff_release(struct file *file)
kfree(ebuff);
}
+static int _zu_iomap_exec(struct file *file, void *arg)
+{
+ struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+ struct zu_exec_buff *ebuff = _ebuff_from_file(file);
+ struct zufs_ioc_iomap_exec ioc_iomap;
+ struct zufs_ioc_iomap_exec *user_iomap;
+
+ struct super_block *sb;
+ int err;
+
+ if (unlikely(!ebuff))
+ return -EINVAL;
+
+ user_iomap = ebuff->opt_buff;
+ /* do all checks on a kernel copy so a malicious Server cannot
+ * crash the Kernel
+ */
+ ioc_iomap = *user_iomap;
+
+ err = _ebuff_bounds_check(ebuff, (ulong)arg, &ioc_iomap.ziom,
+ &user_iomap->ziom,
+ ebuff->opt_buff + ebuff->alloc_size);
+ if (unlikely(err)) {
+ zuf_err("illegal iomap: iom_max=%u iom_n=%u\n",
+ ioc_iomap.ziom.iom_max, ioc_iomap.ziom.iom_n);
+ return err;
+ }
+
+ /* The ID of the super block received in mount */
+ sb = zuf_sb_from_id(zri, ioc_iomap.sb_id, ioc_iomap.zus_sbi);
+ if (unlikely(!sb))
+ return -EINVAL;
+
+ if (ioc_iomap.wait_for_done)
+ err = zuf_iom_execute_sync(sb, NULL, user_iomap->ziom.iom_e,
+ ioc_iomap.ziom.iom_n);
+ else
+ err = zuf_iom_execute_async(sb, ioc_iomap.ziom.iomb,
+ user_iomap->ziom.iom_e,
+ ioc_iomap.ziom.iom_n);
+
+ user_iomap->hdr.err = err;
+ zuf_dbg_core("OUT => %d\n", err);
+ return 0; /* report err at hdr, but the command was executed */
+};
+
/* ~~~~ ioctl & release handlers ~~~~ */
static int _zu_register_fs(struct file *file, void *parg)
{
@@ -1069,6 +1463,8 @@ long zufc_ioctl(struct file *file, unsigned int cmd, ulong arg)
return _zu_wait(file, parg);
case ZU_IOC_ALLOC_BUFFER:
return _zu_ebuff_alloc(file, parg);
+ case ZU_IOC_IOMAP_EXEC:
+ return _zu_iomap_exec(file, parg);
case ZU_IOC_PRIVATE_MOUNT:
return _zu_private_mounter(file, parg);
case ZU_IOC_BREAK_ALL:
@@ -402,6 +402,13 @@ static inline int zuf_flt_to_err(vm_fault_t flt)
return -EACCES;
}
+struct _io_gb_multy {
+ struct zuf_dispatch_op zdo;
+ struct zufs_ioc_IO IO;
+ ulong iom_n;
+ ulong *bns;
+};
+
/* Keep this include last thing in file */
#include "_extern.h"
@@ -456,7 +456,15 @@ enum e_zufs_operation {
ZUFS_OP_RENAME = 10,
ZUFS_OP_READDIR = 11,
+ ZUFS_OP_READ = 14,
+ ZUFS_OP_PRE_READ = 15,
+ ZUFS_OP_WRITE = 16,
ZUFS_OP_SETATTR = 19,
+ ZUFS_OP_FALLOCATE = 21,
+
+ ZUFS_OP_GET_MULTY = 29,
+ ZUFS_OP_PUT_MULTY = 30,
+ ZUFS_OP_NOOP = 31,
ZUFS_OP_MAX_OPT,
};
@@ -646,10 +654,253 @@ struct zufs_ioc_attr {
__u32 pad;
};
+/* ~~~~ io_map structures && IOCTL(s) ~~~~ */
+/*
+ * This set of structures and helpers is used in the return of zufs_ioc_IO
+ * and also at ZU_IOC_IOMAP_EXEC; it is a NULL-terminated list (array).
+ *
+ * Each iom_element starts with an __u64 of which the 8 high bits carry an
+ * operation_type, and the 56-bit value denotes a page offset (md_o2p()) or a
+ * length. operation_type is one of the ZUFS_IOM_TYPE enum.
+ * The interpreter then jumps to the next operation depending on the size
+ * of the defined operation.
+ */
+
+enum ZUFS_IOM_TYPE {
+ IOM_NONE = 0,
+ IOM_T1_WRITE = 1,
+ IOM_T1_READ = 2,
+
+ IOM_T2_WRITE = 3,
+ IOM_T2_READ = 4,
+ IOM_T2_WRITE_LEN = 5,
+ IOM_T2_READ_LEN = 6,
+
+ IOM_T2_ZUSMEM_WRITE = 7,
+ IOM_T2_ZUSMEM_READ = 8,
+
+ IOM_UNMAP = 9,
+ IOM_WBINV = 10,
+ IOM_REPEAT = 11,
+
+ IOM_NUM_LEGAL_OPT,
+};
+
+#define ZUFS_IOM_VAL_BITS 56
+#define ZUFS_IOM_FIRST_VAL_MASK ((1UL << ZUFS_IOM_VAL_BITS) - 1)
+
+static inline enum ZUFS_IOM_TYPE _zufs_iom_opt_type(__u64 *iom_e)
+{
+ uint ret = (*iom_e) >> ZUFS_IOM_VAL_BITS;
+
+ if (ret >= IOM_NUM_LEGAL_OPT)
+ return IOM_NONE;
+ return (enum ZUFS_IOM_TYPE)ret;
+}
+
+static inline bool _zufs_iom_pop(__u64 *iom_e)
+{
+ return _zufs_iom_opt_type(iom_e) != IOM_NONE;
+}
+
+static inline ulong _zufs_iom_first_val(__u64 *iom_elemets)
+{
+ return *iom_elemets & ZUFS_IOM_FIRST_VAL_MASK;
+}
+
+static inline void _zufs_iom_enc_type_val(__u64 *ptr, enum ZUFS_IOM_TYPE type,
+ ulong val)
+{
+ *ptr = (__u64)val | ((__u64)type << ZUFS_IOM_VAL_BITS);
+}
+
+static inline ulong _zufs_iom_t1_bn(__u64 val)
+{
+ if (unlikely(_zufs_iom_opt_type(&val) != IOM_T1_READ))
+ return -1;
+
+ return zu_dpp_t_bn(_zufs_iom_first_val(&val));
+}
+
+static inline void _zufs_iom_enc_bn(__u64 *ptr, ulong bn, uint pool)
+{
+ _zufs_iom_enc_type_val(ptr, IOM_T1_READ, zu_enc_dpp_t_bn(bn, pool));
+}
+
+/* IOM_T1_WRITE / IOM_T1_READ
+ * May be followed by an IOM_REPEAT
+ */
+struct zufs_iom_t1_io {
+ /* Special dpp_t that denote a page ie: bn << 3 | zu_dpp_t_pool */
+ __u64 t1_val;
+};
+
+/* IOM_T2_WRITE / IOM_T2_READ */
+struct zufs_iom_t2_io {
+ __u64 t2_val;
+ zu_dpp_t t1_val;
+};
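+/* A minimal sketch (illustrative only, not part of the API): a Server-side
+ * encoder could emit a single IOM_T2_READ of t2 block 0x2000 into t1 page
+ * bn 0x30 roughly like this, where cur_e and pool are hypothetical locals:
+ *
+ *   struct zufs_iom_t2_io *t2io = (void *)cur_e;
+ *
+ *   _zufs_iom_enc_type_val(&t2io->t2_val, IOM_T2_READ, 0x2000);
+ *   t2io->t1_val = zu_enc_dpp_t_bn(0x30, pool);
+ *   cur_e = (__u64 *)(t2io + 1);
+ *   _zufs_iom_enc_type_val(cur_e, IOM_NONE, 0);  (terminator)
+ */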
+
+/* IOM_T2_WRITE_LEN / IOM_T2_READ_LEN */
+struct zufs_iom_t2_io_len {
+ struct zufs_iom_t2_io iom;
+ __u64 num_pages;
+};
+
+/* IOM_T2_ZUSMEM_WRITE / IOM_T2_ZUSMEM_READ */
+struct zufs_iom_t2_zusmem_io {
+ __u64 t2_val;
+ __u64 zus_mem_ptr; /* needs a get_user_pages() */
+ __u64 len;
+};
+
+/* IOM_UNMAP:
+ * Executes unmap_mapping_range & removal from zuf's block-caching
+ *
+ * For now iom_unmap means even_cows=0, because the Kernel takes care of
+ * all the even_cows=1 cases. In the future, if needed, it will be carried
+ * in the high bit of unmap_n.
+ */
+struct zufs_iom_unmap {
+ __u64 unmap_index; /* Offset in pages of inode */
+ __u64 unmap_n; /* Num pages to unmap (0 means: to eof) */
+ __u64 ino; /* Pages of this inode */
+};
+
+#define ZUFS_WRITE_OP_SPACE \
+ ((sizeof(struct zufs_iom_unmap) + \
+ sizeof(struct zufs_iom_t2_io)) / sizeof(__u64) + sizeof(__u64))
+
+struct zus_iomap_build;
+/* For ZUFS_OP_IOM_DONE */
+struct zufs_ioc_iomap_done {
+ struct zufs_ioc_hdr hdr;
+ /* IN */
+ struct zus_sb_info *zus_sbi;
+
+ /* The cookie received from zufs_ioc_iomap_exec */
+ struct zus_iomap_build *iomb;
+};
+
+struct zufs_iomap {
+ /* A cookie from zus to return when execution is done */
+ struct zus_iomap_build *iomb;
+
+ __u32 iom_max; /* num of __u64 allocated */
+ __u32 iom_n; /* num of valid __u64 in iom_e */
+ __u64 iom_e[0]; /* encoded operations to execute */
+
+ /* This struct must be last */
+};
+
+/*
+ * Execute an iomap on behalf of the Server
+ *
+ * NOTE: this IOCTL must come on a ZU_IOC_ALLOC_BUFFER type file (see above)
+ * and the passed arg-buffer must be the pointer returned from an mmap
+ * call performed on that file, before the call to this IOC.
+ * If this is not done the IOCTL will return EINVAL.
+ */
+struct zufs_ioc_iomap_exec {
+ struct zufs_ioc_hdr hdr;
+ /* The ID of the super block received in mount */
+ __u64 sb_id;
+ /* We verify the sb_id validity against zus_sbi */
+ struct zus_sb_info *zus_sbi;
+ /* If there are application buffers, they are from this IO context */
+ __u64 zt_iocontext;
+ /* Only return from IOCTL when finished. iomap_done NOT called */
+ __u32 wait_for_done;
+ __u32 __pad;
+
+ struct zufs_iomap ziom; /* must be last */
+};
+#define ZU_IOC_IOMAP_EXEC _IOWR('Z', 19, struct zufs_ioc_iomap_exec)
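+/* A rough usage sketch (illustrative; the mmap flags and fd name are
+ * assumptions): the Server allocates an exec-buffer with ZU_IOC_ALLOC_BUFFER,
+ * mmaps that same file, builds the zufs_ioc_iomap_exec + iom_e[] inside the
+ * mapping, then passes the mapping's address as the IOCTL argument:
+ *
+ *   buff = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, ebuff_fd, 0);
+ *   iomap_exec = buff;
+ *   ... fill iomap_exec->sb_id, ->zus_sbi, ->ziom and ziom.iom_e[] ...
+ *   ioctl(ebuff_fd, ZU_IOC_IOMAP_EXEC, buff);
+ */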
+
+/*
+ * ZUFS_OP_READ / ZUFS_OP_WRITE / ZUFS_OP_FALLOCATE
+ * also
+ * ZUFS_OP_GET_MULTY / ZUFS_OP_PUT_MULTY
+ */
+/* flags for zufs_ioc_IO->ret_flags */
+enum {
+ ZUFS_RET_RESERVED = 0x0001, /* Not used */
+ ZUFS_RET_NEW = 0x0002, /* In WRITE, allocated a new block */
+ ZUFS_RET_IOM_ALL_PMEM = 0x0004, /* iom_e[] is encoded with pmem-bn */
+ ZUFS_RET_PUT_NOW = 0x0008, /* GET_MULTY demands no pigi-puts */
+ ZUFS_RET_LOCKED_PUT = 0x0010, /* Same as PUT_NOW but must lock a zt
+ * channel, because the GET took a lock
+ */
+};
+
+/* flags for zufs_ioc_IO->rw */
+#define ZUFS_RW_WRITE BIT(0) /* SAME as WRITE in Kernel */
+#define ZUFS_RW_MMAP BIT(1)
+
+#define ZUFS_RW_RAND BIT(4) /* fadvise(random) */
+
+/* Same meaning as IOCB_XXXX different bits */
+#define ZUFS_RW_KERN 8
+#define ZUFS_RW_EVENTFD BIT(ZUFS_RW_KERN + 0)
+#define ZUFS_RW_APPEND BIT(ZUFS_RW_KERN + 1)
+#define ZUFS_RW_DIRECT BIT(ZUFS_RW_KERN + 2)
+#define ZUFS_RW_HIPRI BIT(ZUFS_RW_KERN + 3)
+#define ZUFS_RW_DSYNC BIT(ZUFS_RW_KERN + 4)
+#define ZUFS_RW_SYNC BIT(ZUFS_RW_KERN + 5)
+#define ZUFS_RW_NOWAIT BIT(ZUFS_RW_KERN + 7)
+#define ZUFS_RW_LAST_USED_BIT (ZUFS_RW_KERN + 7)
+/* ^^ PLEASE update (keep last) ^^ */
+
+/* 8 bits left for user */
+#define ZUFS_RW_USER_BITS 0xFF000000
+#define ZUFS_RW_USER BIT(24)
+
/* Special flag for ZUFS_OP_FALLOCATE to specify a setattr(SIZE)
* IE. same as punch hole but set_i_size to be @filepos. In this
* case @last_pos == ~0ULL
*/
#define ZUFS_FL_TRUNCATE 0x80000000
+struct zufs_ioc_IO {
+ struct zufs_ioc_hdr hdr;
+
+ /* IN */
+ struct zus_inode_info *zus_ii;
+ __u64 filepos;
+ __u64 rw; /* One or more of ZUFS_RW_XXX */
+ __u32 ret_flags; /* OUT - ZUFS_RET_XXX */
+ __u32 pool; /* All dpp_t(s) belong to this pool */
+ __u64 cookie; /* For FS private use */
+
+ /* in / OUT */
+ /* For read-ahead (or alloc ahead) */
+ struct __zufs_ra {
+ union {
+ ulong start;
+ __u64 __start;
+ };
+ __u64 prev_pos;
+ __u32 ra_pages;
+ __u32 ra_pad; /* we need this */
+ } ra;
+
+ /* For writes TODO: encode at iom_e? */
+ struct __zufs_write_unmap {
+ __u32 offset;
+ __u32 len;
+ } wr_unmap;
+
+ /* The last offset in this IO. If 0, then see the error code at .hdr.err */
+ /* for ZUFS_OP_FALLOCATE this is the requested end offset */
+ __u64 last_pos;
+
+ struct zufs_iomap ziom;
+ __u64 iom_e[ZUFS_WRITE_OP_SPACE]; /* One tier_up for WRITE or GB */
+};
+
+static inline uint _ioc_IO_size(uint iom_n)
+{
+ return offsetof(struct zufs_ioc_IO, iom_e) + iom_n * sizeof(__u64);
+}
+
#endif /* _LINUX_ZUFS_API_H */