From patchwork Fri Jul 22 18:48:24 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Bernard Metzler
X-Patchwork-Id: 1000492
From: Bernard Metzler
To: linux-rdma@vger.kernel.org
Cc: Bernard Metzler
Subject: [PATCH 14/14] SIWv3: Documentation: siw.txt
Date: Fri, 22 Jul 2011 20:48:24 +0200
Message-Id: <1311360504-15343-1-git-send-email-bmt@zurich.ibm.com>
X-Mailer: git-send-email 1.5.4.3
Sender:
 linux-rdma-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-rdma@vger.kernel.org

---
 Documentation/networking/siw.txt | 155 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 155 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/siw.txt

diff --git a/Documentation/networking/siw.txt b/Documentation/networking/siw.txt
new file mode 100644
index 0000000..fb51735
--- /dev/null
+++ b/Documentation/networking/siw.txt
@@ -0,0 +1,155 @@
+SoftiWARP: Software iWARP kernel driver module.
+
+General
+-------
+SoftiWARP (siw) implements the iWARP protocol suite (MPA/DDP/RDMAP,
+IETF-RFC 5044/5041/5040) completely in software as a Linux kernel module.
+siw runs on top of TCP kernel sockets and exports the Linux kernel ibverbs
+RDMA interface. siw interfaces with the iwcm connection manager.
+
+
+Transmit Path
+-------------
+If a send queue (SQ) work queue element gets posted, siw tries to send
+it directly out of the application context. If the SQ was non-empty,
+SQ processing is done asynchronously by a kernel worker thread. This
+thread is scheduled if the TCP socket signals new write space to
+be available. If during send operation the socket send space becomes
+exhausted, SQ processing is abandoned until new socket write space
+becomes available.
+
+
+Receive Path
+------------
+All application data is placed into target buffers within softirq
+socket callback. Application notification is asynchronous.
+
+
+User Interface
+--------------
+All user space fast path operations such as posting of work requests and
+reaping of work completions currently involve an asynchronous call into
+the siw kernel module via ib_uverbs interface. Kernel/user-mapped send
+and receive as well as completion queues are not part of the current code.
+In particular, mapped completion queues may improve performance,
+since reaping completion queue entries as well as re-arming
+the completion queue could be done more efficiently.
+
+
+Kernel Client Support
+---------------------
+To guarantee non-blocking fast path operations, for kernel clients
+all work queue elements (send/receive/shared-receive queue) are
+pre-allocated during connection resource setup.
+
+
+Memory Management
+-----------------
+siw currently uses the ib_umem_get() function of the ib_core module
+to pin memory for later use in data transfer operations. Transmit
+and receive memory are checked against correct access permissions only
+at the moment of access by the network input path or before pushing it
+to the TCP socket for transmission.
+ib_umem_get() provides DMA mappings for the requested address space which
+are not used by siw.
+
+
+Module Parameters
+-----------------
+The following siw module parameters are recognized.
+
+loopback_enabled:
+    If set, siw also attaches to the loopback device. Checked only
+    during module insertion.
+
+mpa_crc_required:
+    If set, the MPA CRC is generated and checked both in tx and rx
+    path. Without hardware support, setting this flag will severely
+    hurt throughput. Default setting is 0 (off).
+
+mpa_crc_strict:
+    If set, MPA CRC will not be enabled, even if peer requests
+    it. If the peer requests CRC generation, the connection setup
+    will be aborted. Default setting is 1 (on).
+
+zcopy_tx:
+    If set, payloads of non-signalled work requests
+    (such as non-signalled WRITE or SEND as well as all READ
+    responses) are transferred using the TCP socket's
+    sendpage interface. This parameter can be switched on and
+    off dynamically (echo 1 >> /sys/module/siw/parameters/zcopy_tx
+    for enablement, 0 for disabling). System load may benefit from
+    using zero copy data transmission. Zero copy is not enabled if
+    mpa_crc_required is set. Default setting is 1 (on).
+
+tcp_nodelay:
+    If set, the TCP_NODELAY option is set on the TCP socket.
+    Default setting is 1 (on).
+
+iface_list:
+    Comma-separated list of interfaces siw should attach to.
+    If no list is given, siw attaches to all available devices.
+    If a list is given, siw skips those devices not listed.
+    Currently, the list is restricted to 12 entries. If needed,
+    the 'SIW_MAX_IF' #define in siw_main.c can be modified.
+    This parameter might be useful to skip devices which are
+    attached to a real RNIC device. Default setting is an empty list.
+
+
+Compile Time Flags:
+-------------------
+-DCHECK_DMA_CAPABILITIES
+    Checks if the device siw wants to attach to provides
+    DMA capabilities. While DMA capabilities are currently not
+    needed (siw works on top of kernel TCP sockets), siw
+    uses ib_umem_get() which performs a (not used) DMA address
+    translation. Writing a siw private memory reservation and
+    pinning routine would solve the issue.
+
+-DSIW_TX_FULLSEGS
+    Experimental, not enabled by default. If set,
+    siw tries not to overrun the socket (not sending until
+    -EAGAIN return), but stops sending if the current segment
+    would not fit into the socket's estimated tx buffer. With that,
+    wire FPDUs may get truncated by the TCP stack far less often.
+    Since this feature manipulates the sock's SOCK_NOSPACE
+    bit, it violates strict layering and is therefore considered
+    proprietary.
+    Since TCP is a byte stream protocol, no guarantee can be given
+    that FPDUs are not fragmented.
+
+
+Debugging SoftiWARP:
+--------------------
+Runtime debugging:
+    The siw_debug.h file defines a 'dprint' macro which is used
+    to debug siw at runtime. Verbosity of debugging is controlled
+    at compile time via setting 'DPRINT_MASK' to an or'd list
+    of known values as defined in siw_debug.h,
+    e.g. '#define DPRINT_MASK (DBG_ON|DBG_CM)'
+    to debug errors and connection management. Defining DPRINT_MASK
+    to '0' avoids compiling any runtime debugging code.
+
+Debugfs support:
+    To track siw's usage of its objects (connection endpoints,
+    TCP sockets, protection domains, queue pairs, shared receive
+    queues, completion queues, memory registrations, work queue
+    elements), some debug filesystem support has been added.
+    To make use of it, the kernel must be enabled for debug
+    filesystem support (enable 'Kernel hacking -> Debug filesystem'
+    during kernel configuration). Furthermore, the debug filesystem
+    must be mounted, e.g. use
+
+        # mount -t debugfs none /sys/kernel/debug
+
+    If the siw kernel module is loaded, the siw/ directory now
+    contains the following entries for each siw device
+    (e.g. /sys/kernel/debug/siw/siw_eth0):
+
+    stats:  Summary of allocated WQEs, PDs, QPs, CQs, SRQs, MRs, CEPs.
+            WQE statistics are not gathered if 'DPRINT_MASK' is
+            set to '0' (see above).
+
+    qp:     Status of allocated queue pairs.
+
+    cep:    Status of allocated connection endpoints.
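
The runtime controls described in siw.txt (the zcopy_tx switch under
/sys/module/siw/parameters and the per-device stats, qp and cep debugfs
entries) could be exercised with a small shell sketch like the one below.
The function names siw_set_zcopy, siw_get_param and siw_debugfs_dump, as
well as the SIW_PARAM_DIR and SIW_DEBUGFS_DIR variables, are hypothetical
helpers introduced for this example and are not part of siw. The default
paths are the ones quoted in the text. It is assumed that writing the
parameter requires root and that debugfs is already mounted.

```shell
#!/bin/sh
# Sketch only: the helper names and the two *_DIR variables are assumptions
# made for this example; the default paths are those quoted in siw.txt.
SIW_PARAM_DIR="${SIW_PARAM_DIR:-/sys/module/siw/parameters}"
SIW_DEBUGFS_DIR="${SIW_DEBUGFS_DIR:-/sys/kernel/debug/siw}"

# Enable (1) or disable (0) zero copy transmission at runtime.
siw_set_zcopy() {
    case "$1" in
        0|1) ;;
        *) echo "usage: siw_set_zcopy 0|1" >&2; return 2 ;;
    esac
    if [ ! -w "$SIW_PARAM_DIR/zcopy_tx" ]; then
        echo "zcopy_tx not writable (module loaded? root?)" >&2
        return 1
    fi
    echo "$1" > "$SIW_PARAM_DIR/zcopy_tx"
}

# Print the current value of a module parameter, e.g. tcp_nodelay.
siw_get_param() {
    cat "$SIW_PARAM_DIR/$1"
}

# Dump the stats, qp and cep entries of one siw device, e.g. siw_eth0.
siw_debugfs_dump() {
    dev_dir="$SIW_DEBUGFS_DIR/$1"
    if [ ! -d "$dev_dir" ]; then
        echo "no debugfs directory for $1 (debugfs mounted?)" >&2
        return 1
    fi
    for entry in stats qp cep; do
        echo "== $entry =="
        cat "$dev_dir/$entry"
    done
}
```

With the module loaded, 'siw_set_zcopy 0' followed by
'siw_get_param zcopy_tx' should report the new setting, and
'siw_debugfs_dump siw_eth0' prints the three entries of that device in
turn. Pointing the two variables at another directory allows the helpers
to be tried without touching a live system.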