From patchwork Tue Feb 14 21:33:15 2017
X-Patchwork-Submitter: Stefano Stabellini
X-Patchwork-Id: 9572969
Date: Tue, 14 Feb 2017 13:33:15 -0800 (PST)
From: Stefano Stabellini
To: Konrad Rzeszutek Wilk
Cc: Stefano Stabellini, xen-devel@lists.xenproject.org, wei.liu2@citrix.com, andr2000@gmail.com, andrew.cooper3@citrix.com
In-Reply-To: <20170214191829.GA16227@char.us.ORACLE.com>
Subject: Re: [Xen-devel] [DOC v5] Xen transport for 9pfs
List-Id: Xen developer discussion

On Tue, 14 Feb 2017, Konrad Rzeszutek Wilk wrote:
> On Mon, Feb 13, 2017 at 11:47:26AM -0800, Stefano Stabellini wrote:
> > Reviewed-by: Konrad Rzeszutek
Wilk

Thank you! For your convenience:

---

docs: add Xen transport for 9pfs

Signed-off-by: Stefano Stabellini
Reviewed-by: Konrad Rzeszutek Wilk

diff --git a/docs/misc/9pfs.markdown b/docs/misc/9pfs.markdown
new file mode 100644
index 0000000..7f13831
--- /dev/null
+++ b/docs/misc/9pfs.markdown
@@ -0,0 +1,419 @@

# Xen transport for 9pfs version 1

## Background

9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
simple and describes a series of commands and responses. It is
completely independent of the communication channels; in fact, many
clients and servers support multiple channels, usually called
"transports". For example, the Linux client supports tcp and unix
sockets, fds, virtio and rdma.


### 9pfs protocol

This document won't cover the full 9pfs specification. Please refer to
this [paper] and this [website] for a detailed description of it.
However, it is useful to know that each 9pfs request and response has the
following header:

    struct header {
    	uint32_t size;
    	uint8_t id;
    	uint16_t tag;
    } __attribute__((packed));

    0         4  5    7
    +---------+--+----+
    |  size   |id|tag |
    +---------+--+----+

- *size*
  The size of the request or response.

- *id*
  The 9pfs request or response operation.

- *tag*
  Unique id that identifies a specific request/response pair. It is used
  to multiplex operations on a single channel.

It is possible to have multiple requests in-flight at any given time.


## Rationale

This document describes a Xen based transport for 9pfs, in the
traditional PV frontend and backend format. The PV frontend is used by
the client to send commands to the server. The PV backend is used by the
9pfs server to receive commands from clients and send back responses.

The transport protocol supports multiple rings up to the maximum
supported by the backend.
The size of every ring is also configurable
and can span multiple pages, up to the maximum supported by the backend
(although it cannot be more than 2MB). The design is to exploit
parallelism at the vCPU level and support multiple outstanding requests
simultaneously.

This document does not cover the 9pfs client/server design or
implementation, only the transport for it.


## Xenstore

The frontend and the backend connect via xenstore to exchange
information. The toolstack creates front and back nodes with state
[XenbusStateInitialising]. The protocol node name is **9pfs**.

Multiple rings are supported for each frontend and backend connection.

### Backend XenBus Nodes

Backend specific properties, written by the backend, read by the
frontend:

    versions
         Values:         <string>

         List of comma separated protocol versions supported by the backend.
         For example "1,2,3". Currently the value is just "1", as there is
         only one version. N.B.: this is the version of the Xen transport
         protocol, not the version of 9pfs supported by the server.

    max-rings
         Values:         <uint32_t>

         The maximum supported number of rings per frontend.

    max-ring-page-order
         Values:         <uint32_t>

         The maximum supported size of a memory allocation in units of
         log2n(machine pages), e.g. 1 == 2 pages, 2 == 4 pages, etc. It
         must be at least 1.

Backend configuration nodes, written by the toolstack, read by the
backend:

    path
         Values:         <string>

         Host filesystem path to share.

    tag
         Values:         <string>

         Alphanumeric tag that identifies the 9pfs share. The client needs
         to know the tag to be able to mount it.

    security-model
         Values:         "none"

         *none*: files are stored using the same credentials as they are
         created on the guest (no user ownership squash or remap).
         Only "none" is supported in this version of the protocol.

### Frontend XenBus Nodes

    version
         Values:         <string>

         Protocol version, chosen among the ones supported by the backend
         (see **versions** under [Backend XenBus Nodes]).
         Currently the value must be "1".

    num-rings
         Values:         <uint32_t>

         Number of rings. It needs to be lower or equal to max-rings.

    event-channel-<num> (event-channel-0, event-channel-1, etc)
         Values:         <uint32_t>

         The identifier of the Xen event channel used to signal activity
         in the ring buffer. One for each ring.

    ring-ref<num> (ring-ref0, ring-ref1, etc)
         Values:         <uint32_t>

         The Xen grant reference granting permission for the backend to
         map a page with information to setup a share ring. One for each
         ring.

### State Machine

Initialization:

    *Front*                               *Back*
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.           - Publish backend features
    - Allocate and initialize the           and transport parameters
      request ring.                                      |
    - Publish transport parameters                       |
      that will be in effect during                      V
      this connection.                            XenbusStateInitWait
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

     - Query backend device properties.
     - Finalize OS virtual device
       instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they have a shared page per
ring, which is used to setup the rings, and an event channel per ring,
which is used to send notifications.
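Note that the two per-ring node families use slightly different spellings: `ring-ref<num>` takes the ring number directly, while `event-channel-<num>` keeps a dash before it. A minimal sketch of how a frontend might format these node names; the helper function names are hypothetical, not part of the protocol:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helpers formatting the per-ring frontend xenstore node
 * names described above: "ring-ref<num>" (number appended directly)
 * and "event-channel-<num>" (dash before the number). */
static void ring_ref_key(char *out, size_t len, unsigned int ring)
{
    snprintf(out, len, "ring-ref%u", ring);
}

static void evtchn_key(char *out, size_t len, unsigned int ring)
{
    snprintf(out, len, "event-channel-%u", ring);
}
```

A frontend advertising `num-rings` rings would publish one pair of these nodes per ring, numbered from 0.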
Shutdown:

    *Front*                            *Back*
    XenbusStateConnected               XenbusStateConnected
                |
                |
                V
    XenbusStateClosing

    - Unmap grants
    - Unbind evtchns
                |
                |
                V
    XenbusStateClosing

                                       - Unbind evtchns
                                       - Free rings
                                       - Free data structures
                                                    |
                                                    |
                                                    V
                                           XenbusStateClosed

    - Free remaining data structures
                |
                |
                V
    XenbusStateClosed


## Ring Setup

The shared page has the following layout:

    typedef uint32_t XEN_9PFS_RING_IDX;

    struct xen_9pfs_intf {
    	XEN_9PFS_RING_IDX in_cons, in_prod;
    	uint8_t pad[56];
    	XEN_9PFS_RING_IDX out_cons, out_prod;
    	uint8_t pad2[56];

    	uint32_t ring_order;
    	/* this is an array of (1 << ring_order) elements */
    	grant_ref_t ref[1];
    };

    /* not actually C compliant (ring_order changes from ring to ring) */
    struct ring_data {
        char in[((1 << ring_order) << PAGE_SHIFT) / 2];
        char out[((1 << ring_order) << PAGE_SHIFT) / 2];
    };

- **ring_order**
  It represents the order of the data ring. The following list of grant
  references is of `(1 << ring_order)` elements. It cannot be greater than
  **max-ring-page-order**, as specified by the backend on XenBus.
- **ref[]**
  The list of grant references which will contain the actual data. They are
  mapped contiguously in virtual memory. The first half of the pages is the
  **in** array, the second half is the **out** array. The array must
  have a power of two number of elements.
- **out** is an array used as circular buffer
  It contains client requests. The producer is the frontend, the
  consumer is the backend.
- **in** is an array used as circular buffer
  It contains server responses. The producer is the backend, the
  consumer is the frontend.
- **out_cons**, **out_prod**
  Consumer and producer indices for client requests. They keep track of
  how much data has been written by the frontend to **out** and how much
  data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**.
**out_cons** is increased by the backend, after reading data from **out**.
- **in_cons** and **in_prod**
  Consumer and producer indices for responses. They keep track of how
  much data has already been consumed by the frontend from the **in**
  array. **in_prod** is increased by the backend, after writing data to
  **in**. **in_cons** is increased by the frontend, after reading data
  from **in**.

The binary layout of `struct xen_9pfs_intf` follows:

    0         4         8           64        68        72          128       132
    +---------+---------+-----//-----+---------+---------+-----//-----+---------+
    | in_cons | in_prod |  padding   |out_cons |out_prod |  padding   |ring_orde|
    +---------+---------+-----//-----+---------+---------+-----//-----+---------+

    132       136       140     4092      4096
    +---------+---------+----//---+---------+
    | ref[0]  | ref[1]  |         |  ref[N] |
    +---------+---------+----//---+---------+

**N.B.** For one page, N is at most 991 ((4096-132)/4), but given that N
needs to be a power of two, the actual maximum is N == 512. As 512 == (1 << 9),
the maximum possible max-ring-page-order value is 9.

The binary layout of the ring buffers follows:

    0     ((1<<ring_order)<<PAGE_SHIFT)/2     ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |           out             |
    +------------//-------------+------------//-------------+

    in  -> backend to frontend only
    out -> frontend to backend only

In the case of the **in** ring, the frontend is the consumer, and the
backend is the producer. Everything is the same but mirrored for the
**out** ring.

The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of
this, leaving room for optimizations.

On the other end, the consumer always requires notifications, unless it
is already actively reading from the ring. The producer can figure this
out, without any additional fields in the protocol, by comparing the
indexes at the beginning and the end of the function. This is similar to
what [ring.h] does.
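The arithmetic in the N.B. above can be checked with a short sketch. Assumptions: 4KB machine pages (PAGE_SHIFT == 12), a 4-byte grant_ref_t, and ref[0] at offset 132 in the shared page, per the (4096-132)/4 figure:

```c
#include <assert.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1 << PAGE_SHIFT)    /* 4096 */
#define REF0_OFFSET 132                  /* assumed offset of ref[0] */

/* Grant references that fit in the rest of a one-page shared intf. */
static int max_refs_in_page(void)
{
    return (PAGE_SIZE - REF0_OFFSET) / 4;   /* grant_ref_t is 4 bytes */
}

/* Largest power-of-two ref count that fits, i.e. the max ring_order:
 * keep doubling while the next order still fits. */
static int max_ring_order(void)
{
    int order = 0;
    while ((2 << order) <= max_refs_in_page())
        order++;
    return order;
}

/* Total data ring size in bytes for a given order (in + out halves). */
static unsigned long ring_bytes(int order)
{
    return ((unsigned long)1 << order) << PAGE_SHIFT;
}
```

This reproduces the numbers in the text: 991 references fit in one page, the largest power of two below that is 512 == (1 << 9), and order 9 yields a 2MB data ring, matching the 2MB cap stated in [Rationale].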
## Ring Usage

The **in** and **out** arrays are used as circular buffers:

    0                     sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|     free     |to consume|
    +-----------------------------------+
               ^              ^
               prod           cons

    0                                                      sizeof(array)
    +-----------------------------------+
    |   free   |  to consume  |   free   |
    +-----------------------------------+
               ^              ^
               cons           prod

The following functions are provided to read from and write to the
rings:

    #define _MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))

    static inline void xen_9pfs_read(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *h, size_t len) {
    	if (*masked_cons < *masked_prod) {
    		memcpy(h, buf + *masked_cons, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
    			memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
    			memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf,
    			       len - (XEN_9PFS_RING_SIZE - *masked_cons));
    		} else {
    			memcpy(h, buf + *masked_cons, len);
    		}
    	}
    	*masked_cons = _MASK_XEN_9PFS_IDX(*masked_cons + len);
    }

    static inline void xen_9pfs_write(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *opaque, size_t len) {
    	if (*masked_prod < *masked_cons) {
    		memcpy(buf + *masked_prod, opaque, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
    			memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
    			memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod),
    			       len - (XEN_9PFS_RING_SIZE - *masked_prod));
    		} else {
    			memcpy(buf + *masked_prod, opaque, len);
    		}
    	}
    	*masked_prod = _MASK_XEN_9PFS_IDX(*masked_prod + len);
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way:

- read *cons*, *prod* from shared memory
- general memory barrier
- verify *prod* against local copy (consumer shouldn't change it)
- write to array at position *prod* up to *cons*, wrapping around the circular
  buffer when necessary
- write memory barrier
- increase *prod*
- notify the other end via event channel

The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way:

- read *prod*, *cons* from shared memory
- read memory barrier
- verify *cons* against local copy (producer shouldn't change it)
- read from array at position *cons* up to *prod*, wrapping around the circular
  buffer when necessary
- general memory barrier
- increase *cons*
- notify the other end via event channel

The producer takes care of writing only as many bytes as available in the buffer
up to *cons*.
The consumer takes care of reading only as many bytes as available
in the buffer up to *prod*.


## Request/Response Workflow

The client chooses one of the available rings, then it sends a request
to the other end on the *out* array, following the producer workflow
described in [Ring Usage].

The server receives the notification and reads the request, following
the consumer workflow described in [Ring Usage]. The server knows how
much to read because it is specified in the *size* field of the 9pfs
header. The server processes the request and sends back a response on
the *in* array of the same ring, following the producer workflow as
usual. Thus, every request/response pair is on one ring.

The client receives a notification and reads the response from the *in*
array. The client knows how much data to read because it is specified in
the *size* field of the 9pfs header.


[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf
[website]: https://github.com/chaos/diod/blob/master/protocol.md
[XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
[ring.h]: http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD
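The "as many bytes as available" bookkeeping in the workflows above reduces to unsigned index arithmetic. A sketch, under the assumption that the shared indices are free-running 32-bit counters that are only masked when the arrays are actually accessed (so unsigned wrap-around does the right thing); the function names are illustrative, not part of the protocol:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t XEN_9PFS_RING_IDX;

/* Bytes the producer may still write: the whole ring minus data that
 * has been produced but not yet consumed. */
static XEN_9PFS_RING_IDX prod_space(XEN_9PFS_RING_IDX prod,
                                    XEN_9PFS_RING_IDX cons,
                                    XEN_9PFS_RING_IDX ring_size)
{
    return ring_size - (prod - cons);
}

/* Bytes the consumer may read: data produced but not yet consumed.
 * Unsigned subtraction stays correct when the counters wrap 2^32. */
static XEN_9PFS_RING_IDX cons_avail(XEN_9PFS_RING_IDX prod,
                                    XEN_9PFS_RING_IDX cons)
{
    return prod - cons;
}
```

The producer bounds each write by `prod_space()` before copying, and the consumer bounds each read by `cons_avail()`, which is exactly the "up to *cons*" / "up to *prod*" rule stated above.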