From patchwork Mon Oct 23 13:17:52 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yishai Hadas
X-Patchwork-Id: 10022477
From: Yishai Hadas
To: dledford@redhat.com
Cc: linux-rdma@vger.kernel.org, yishaih@mellanox.com, majd@mellanox.com, artemyko@mellanox.com
Subject: [PATCH rdma-core 01/10] verbs: Hardware tag matching
Date: Mon, 23 Oct 2017 16:17:52 +0300
Message-Id: <1508764681-4531-2-git-send-email-yishaih@mellanox.com>
X-Mailer: git-send-email 1.8.2.3
In-Reply-To: <1508764681-4531-1-git-send-email-yishaih@mellanox.com>
References: <1508764681-4531-1-git-send-email-yishaih@mellanox.com>

From: Artemy Kovalyov

Add a document that provides terms and core explanations for tag matching
(TM). It describes its protocols, matching process and the relation to
RDMA.

Signed-off-by: Artemy Kovalyov
Reviewed-by: Yishai Hadas
---
 Documentation/CMakeLists.txt  |   1 +
 Documentation/tag_matching.md | 132 ++++++++++++++++++++++++++++++++++++++++++
 debian/rdma-core.install      |   1 +
 redhat/rdma-core.spec         |   1 +
 suse/rdma-core.spec           |   1 +
 5 files changed, 136 insertions(+)
 create mode 100644 Documentation/tag_matching.md

diff --git a/Documentation/CMakeLists.txt b/Documentation/CMakeLists.txt
index d6e08de..4b9e07f 100644
--- a/Documentation/CMakeLists.txt
+++ b/Documentation/CMakeLists.txt
@@ -6,6 +6,7 @@ install(FILES
   librdmacm.md
   rxe.md
   udev.md
+  tag_matching.md
   ../README.md
   ../MAINTAINERS
   DESTINATION "${CMAKE_INSTALL_DOCDIR}")
diff --git a/Documentation/tag_matching.md b/Documentation/tag_matching.md
new file mode 100644
index 0000000..5b5cd7d
--- /dev/null
+++ b/Documentation/tag_matching.md
@@ -0,0 +1,132 @@
+# Hardware tag matching
+
+## Introduction
+
+The MPI standard defines a set of rules, known as tag-matching, for matching
+source send operations to destination receives according to the following
+attributes:
+
+* Communicator
+* User tag - wild card may be specified by the receiver
+* Source rank - wild card may be specified by the receiver
+* Destination rank - wild card may be specified by the receiver
+
+These matching attributes are specified by all Send and Receive operations.
+Send operations from a given source to a given destination are processed in
+the order in which the Sends were posted. Receive operations are associated
+with the earliest send operation (from any source) that matches the
+attributes, in the order in which the Receives were posted. Note that Receive
+tags are not necessarily consumed in the order they are created, e.g., a later
+generated tag may be consumed if earlier tags do not satisfy the matching
+rules.
+
+When a message arrives at the receiver, MPI implementations often classify it
+as either 'expected' or 'unexpected' according to whether a Receive operation
+with a matching tag has already been posted by the application. In the
+expected case, the message may be processed immediately. In the unexpected
+case, the message is saved in an unexpected message queue, and will be
+processed when a matching Receive operation is posted.
+
+To bound the amount of memory to hold unexpected messages, MPI implementations
+use 2 data transfer protocols. The 'eager' protocol is used for small
+messages. Eager messages are sent without any prior synchronization and
+processed/buffered at the receiver. Typically, with RDMA, a single RDMA-Send
+operation is used to transfer the data.
+
+The 'rendezvous' protocol is used for large messages. Initially, only the
+message tag is sent along with some meta-data. Only when the tag is matched to
+a Receive operation, will the receiver initiate the corresponding data
+transfer. A common RDMA implementation is to send the message tag with an
+RDMA-Send, and transfer the data with an RDMA-Read issued by the receiver.
+When the transfer is complete, the receiver will notify the sender that its
+buffer may be freed using an RDMA-Send.
+
+## RDMA tag-matching offload
+
+Tag-matching offload satisfies the following principles:
+- Tag-matching is viewed as an RDMA application, and thus does not affect the
+  RDMA transport in any way [(*)](#m1)
+- Tag-matching processing will be split between HW and SW.
+  * HW will hold a bounded prefix of Receive tags
+- HW will process and transfer any expected message that matches a tag held
+  in HW.
+  * In case the message uses the rendezvous protocol, HW will also initiate
+    the RDMA-Read data transfer and send a notification message when the
+    data transfer completes.
+- SW will handle any message that is either unexpected or whose tag is not
+  held in HW.
+
+(*)
+This concept can apply to additional application-specific offloads in the
+future.
+
+Tag-matching is initially defined for RC transport. Tag-matching messages are
+encapsulated in RDMA-Send messages and contain the following headers:
+
+```
+    0                   1                   2                   3
+    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+   Tag Matching Header (TMH):
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |   Operation   |                   reserved                    |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                     User data (optional)                      |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                              Tag                              |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                              Tag                              |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+   Rendezvous Header (RVH):
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                        Virtual Address                        |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                        Virtual Address                        |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                          Remote Key                           |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |                            Length                             |
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+```
+
+Tag-matching messages always contain a TMH. An RVH is added for Rendezvous
+request messages. The following message formats are defined:
+- Eager request: TMH | payload
+- Rendezvous request: TMH | RVH | optional meta-data [(**)](#m2)
+- Rendezvous response: TMH
+
+Note that rendezvous data transfers are standard RDMA-Reads.
+
+(**)
+Rendezvous request messages may also arrive unexpected; in this case, the
+message is handled in SW, optionally leveraging additional meta-data passed by
+the sender.
+
+As tag-matching messages are standard RDMA-Sends, no special HW support is
+needed at the sender. At the receiver, we introduce a new SRQ type - a
+Tag-Matching SRQ (TM-SRQ). The TM-SRQ forms the serialization point for
+matching messages coming from any of the associated RC connections, and reports
+all tag matching completions and events to a dedicated CQ.
+2 kinds of buffers may be posted to the TM-SRQ:
+- Buffers associated with tags (tagged-buffers), which are used when a match
+  is made by HW
+- Standard SRQ buffers, which are used for unexpected messages (from HW's
+  perspective)
+When a message is matched by HW, the payload is transferred directly to the
+application buffer (both in the eager and the rendezvous case), while skipping
+any TM headers. Otherwise, the entire message, including any TM headers, is
+scattered to the SRQ buffer.
+
+Since unexpected messages are handled in SW, there exists an inherent race
+between the arrival of messages from the wire and posting of new tagged
+buffers. For example, consider 2 incoming messages m1 and m2 and matching
+buffers b1 and b2 that are posted asynchronously. If b1 is posted after m1
+arrives but before m2, m1 would be delivered as an unexpected message while m2
+would match b1, violating the ordering rules.
+
+Consequently, whenever HW deems a message unexpected, tag matching must be
+disabled for new tags until SW and HW synchronize. This synchronization is
+achieved by reporting to HW the number of unexpected messages handled by SW
+(with respect to the currently posted tags). When the SW and HW are in sync, tag
+matching resumes normally.
+
diff --git a/debian/rdma-core.install b/debian/rdma-core.install
index ca08a9d..da3e73c 100644
--- a/debian/rdma-core.install
+++ b/debian/rdma-core.install
@@ -27,6 +27,7 @@ usr/share/doc/rdma-core/MAINTAINERS
 usr/share/doc/rdma-core/README.md
 usr/share/doc/rdma-core/rxe.md
 usr/share/doc/rdma-core/udev.md
+usr/share/doc/rdma-core/tag_matching.md
 usr/share/man/man5/iwpmd.conf.5
 usr/share/man/man7/rxe.7
 usr/share/man/man8/iwpmd.8
diff --git a/redhat/rdma-core.spec b/redhat/rdma-core.spec
index 48b7d30..ae399df 100644
--- a/redhat/rdma-core.spec
+++ b/redhat/rdma-core.spec
@@ -318,6 +318,7 @@ rm -rf %{buildroot}/%{_sbindir}/srp_daemon.sh
 %doc %{_docdir}/%{name}-%{version}/README.md
 %doc %{_docdir}/%{name}-%{version}/rxe.md
 %doc %{_docdir}/%{name}-%{version}/udev.md
+%doc %{_docdir}/%{name}-%{version}/tag_matching.md
 %config(noreplace) %{_sysconfdir}/rdma/mlx4.conf
 %config(noreplace) %{_sysconfdir}/rdma/modules/infiniband.conf
 %config(noreplace) %{_sysconfdir}/rdma/modules/iwarp.conf
diff --git a/suse/rdma-core.spec b/suse/rdma-core.spec
index defe3b4..03c4a29 100644
--- a/suse/rdma-core.spec
+++ b/suse/rdma-core.spec
@@ -544,6 +544,7 @@ rm -rf %{buildroot}/%{_sbindir}/srp_daemon.sh
 %doc %{_docdir}/%{name}-%{version}/libibverbs.md
 %doc %{_docdir}/%{name}-%{version}/rxe.md
 %doc %{_docdir}/%{name}-%{version}/udev.md
+%doc %{_docdir}/%{name}-%{version}/tag_matching.md
 %{_bindir}/rxe_cfg
 %{_mandir}/man7/rxe*
 %{_mandir}/man8/rxe*
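
As an illustration of the TMH and RVH layouts described in tag_matching.md, the following is a minimal C sketch of how a sender might serialize a rendezvous request (TMH | RVH) into a send buffer. It is not taken from this patch: the function names, the `tm_op` opcode values, and the choice of big-endian (network) byte order are all assumptions made for the example, and the operation field is read from the diagram as 8 bits followed by 24 reserved bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Each header is four 32-bit words in the diagrams, i.e. 16 bytes. */
enum { TMH_SIZE = 16, RVH_SIZE = 16 };

/* Hypothetical opcode values, chosen only for this sketch. */
enum tm_op { TM_OP_EAGER = 1, TM_OP_RNDV_REQ = 2, TM_OP_RNDV_RESP = 3 };

/* Big-endian store/load helpers (byte order is an assumption here). */
static inline void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

static inline void put_be64(uint8_t *p, uint64_t v)
{
	put_be32(p, (uint32_t)(v >> 32));
	put_be32(p + 4, (uint32_t)v);
}

static inline uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static inline uint64_t get_be64(const uint8_t *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* TMH: operation (8 bits), reserved (24 bits), user data (32), tag (64). */
size_t pack_tmh(uint8_t *buf, uint8_t op, uint32_t user_data, uint64_t tag)
{
	assert(buf != NULL);
	memset(buf, 0, TMH_SIZE);	/* also clears the reserved bits */
	buf[0] = op;
	put_be32(buf + 4, user_data);
	put_be64(buf + 8, tag);
	return TMH_SIZE;
}

/* RVH: virtual address (64 bits), remote key (32), length (32). */
size_t pack_rvh(uint8_t *buf, uint64_t va, uint32_t rkey, uint32_t len)
{
	assert(buf != NULL);
	put_be64(buf, va);
	put_be32(buf + 8, rkey);
	put_be32(buf + 12, len);
	return RVH_SIZE;
}

/* Rendezvous request without optional meta-data: TMH | RVH. */
size_t build_rndv_request(uint8_t *buf, uint64_t tag, uint64_t va,
			  uint32_t rkey, uint32_t len)
{
	size_t off = pack_tmh(buf, TM_OP_RNDV_REQ, 0, tag);

	off += pack_rvh(buf + off, va, rkey, len);
	return off;
}
```

A caller would reserve at least `TMH_SIZE + RVH_SIZE` bytes, call `build_rndv_request()`, and post the resulting buffer as an ordinary RDMA-Send; an eager request would instead use `pack_tmh()` alone, followed by the payload.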