From patchwork Wed Jun 6 15:25:04 2018
X-Patchwork-Submitter: Roman Pen
X-Patchwork-Id: 10450453
From: Roman Pen
To: linux-block@vger.kernel.org, linux-rdma@vger.kernel.org
Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Bart Van Assche, Or Gerlitz, Doug Ledford, Danil Kipnis, Jack Wang, Roman Pen
Subject: [PATCH v3 14/25] ibtrs: a bit of documentation
Date: Wed, 6 Jun 2018 17:25:04 +0200
Message-Id: <20180606152515.25807-15-roman.penyaev@profitbricks.com>
In-Reply-To: <20180606152515.25807-1-roman.penyaev@profitbricks.com>
References: <20180606152515.25807-1-roman.penyaev@profitbricks.com>
README with description of major sysfs entries.

Signed-off-by: Roman Pen
Signed-off-by: Danil Kipnis
Cc: Jack Wang
---
 drivers/infiniband/ulp/ibtrs/README | 390 ++++++++++++++++++++++++++++++++++++
 1 file changed, 390 insertions(+)
 create mode 100644 drivers/infiniband/ulp/ibtrs/README

diff --git a/drivers/infiniband/ulp/ibtrs/README b/drivers/infiniband/ulp/ibtrs/README
new file mode 100644
index 000000000000..d9d8cd69d44f
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/README
@@ -0,0 +1,390 @@
+****************************
+InfiniBand Transport (IBTRS)
+****************************
+
+IBTRS (InfiniBand Transport) is a reliable high-speed transport library
+which provides support to establish an optimal number of connections
+between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
+transport. It is optimized to transfer (read/write) IO blocks.
+
+In its core interface it follows the BIO semantics of providing the
+possibility to either write data from an sg list to the remote side
+or to request ("read") data transfer from the remote side into a given
+sg list.
+
+IBTRS provides I/O fail-over and load-balancing capabilities by using
+multipath I/O (see "add_path" and "mp_policy" configuration entries).
+
+IBTRS is used by the IBNBD (InfiniBand Network Block Device) modules.
+
+======================
+Client Sysfs Interface
+======================
+
+This chapter describes only the most important files of the sysfs interface
+on the client side.
+
+Entries under /sys/devices/virtual/ibtrs-client/
+================================================
+
+When a user of the IBTRS API creates a new session, a directory entry with
+the name of that session is created.
+
+Entries under /sys/devices/virtual/ibtrs-client//
+===============================================================
+
+add_path (RW)
+-------------
+
+Adds a new path (connection) to an existing session. Expected format is the
+following:
+
+  <[source addr,]destination addr>
+
+  *addr ::= [ ip: | gid: ]
+
+max_reconnect_attempts (RW)
+---------------------------
+
+Maximum number of reconnect attempts the client should make before giving up
+after the connection breaks unexpectedly.
+
+mp_policy (RW)
+--------------
+
+Multipath policy specifies which path should be selected on each IO:
+
+   round-robin (0):
+       select path in per CPU round-robin manner.
+
+   min-inflight (1):
+       select path with minimum inflights.
+
+Entries under /sys/devices/virtual/ibtrs-client//paths/
+=====================================================================
+
+
+Each path belonging to a given session is listed here by its source and
+destination address. When a new path is added to a session by writing to
+the "add_path" entry, a directory is created.
+
+Entries under /sys/devices/virtual/ibtrs-client//paths//
+===============================================================================
+
+state (R)
+---------
+
+Contains "connected" if the session is connected to the peer and fully
+functional. Otherwise the file contains "disconnected".
+
+reconnect (RW)
+--------------
+
+Write "1" to the file in order to reconnect the path.
+Operation is blocking and returns 0 if reconnect was successful.
+
+disconnect (RW)
+---------------
+
+Write "1" to the file in order to disconnect the path.
+Operation blocks until the IBTRS path is disconnected.
+
+remove_path (RW)
+----------------
+
+Write "1" to the file in order to disconnect and remove the path
+from the session. Operation blocks until the path is disconnected
+and removed from the session.
+
+hca_name (R)
+------------
+
+Contains the name of the HCA the connection is established on.
+
+hca_port (R)
+------------
+
+Contains the number of the active port the traffic is going through.
+
+src_addr (R)
+------------
+
+Contains the source address of the path.
+
+dst_addr (R)
+------------
+
+Contains the destination address of the path.
+
+
+Entries under /sys/devices/virtual/ibtrs-client//paths//stats/
+=====================================================================================
+
+Write "0" to any file in that directory to reset corresponding statistics.
+
+reset_all (RW)
+--------------
+
+Reading returns usage help; writing 0 clears all the statistics.
+
+sg_entries (RW)
+---------------
+
+Data to be transferred via RDMA is passed to IBTRS as a scatter-gather
+list. A scatter-gather list can contain multiple entries.
+Scatter-gather lists with fewer entries require less processing power
+and can therefore be transferred faster. The file sg_entries outputs a
+per-CPU distribution table for the number of entries in the
+scatter-gather lists that were passed to the IBTRS API function
+ibtrs_clt_request (READ or WRITE).
+
+cpu_migration (RW)
+------------------
+
+IBTRS expects that each HCA IRQ is pinned to a separate CPU. If that is
+not the case, an I/O response could be processed on a different CPU
+than the one on which it was originally submitted. This file shows
+how many interrupts were generated on an unexpected CPU.
+"from:" is the CPU on which the IRQ was expected, but not generated.
+"to:" is the CPU on which the IRQ was generated, but not expected.
+
+reconnects (RW)
+---------------
+
+Contains 2 unsigned int values: the first one records the number of
+successful reconnects in the path lifetime, the second one records the
+number of failed reconnects in the path lifetime.
+
+rdma_lat (RW)
+-------------
+
+Latency distribution of IBTRS requests.
+The format is:
+   1 ms:
+   2 ms:
+   4 ms:
+   8 ms:
+  16 ms:
+  ...
+  65536 ms:
+  >= 65536 ms:
+  maximum ms:
+
+wc_completion (RW)
+------------------
+
+Contains 2 unsigned int values: the first one records the max number of work
+requests processed in work_completion in session lifetime, the second
+one records the average number of work requests processed in work_completion
+in session lifetime.
+
+rdma (RW)
+---------
+
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 6 values:
+
+ \
+
+
+======================
+Server Sysfs Interface
+======================
+
+Entries under /sys/devices/virtual/ibtrs-server/
+================================================
+
+When a user of the IBTRS API creates a new session on the client side, a
+directory entry with the name of that session is created in here.
+
+Entries under /sys/devices/virtual/ibtrs-server//paths/
+=====================================================================
+
+When a new path is created by writing to the "add_path" entry on the client
+side, a directory entry named as @ is created
+on the server.
+
+Entries under /sys/devices/virtual/ibtrs-server//paths//
+===============================================================================
+
+disconnect (RW)
+---------------
+
+When "1" is written to the file, the IBTRS session is disconnected.
+Operation is non-blocking and returns control immediately to the caller.
+
+hca_name (R)
+------------
+
+Contains the name of the HCA the connection is established on.
+
+hca_port (R)
+------------
+
+Contains the number of the active port the traffic is going through.
+
+src_addr (R)
+------------
+
+Contains the source address of the path.
+
+dst_addr (R)
+------------
+
+Contains the destination address of the path.
+
+Entries under /sys/devices/virtual/ibtrs-server//paths//stats/
+=====================================================================================
+
+When "0" is written to a file in this directory, the corresponding counters
+will be reset.
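The reset convention described above can be sketched from userspace as follows. This is only an illustration in Python: the helper name `reset_stat` and its arguments are hypothetical and not part of IBTRS, and writing to real sysfs entries normally requires root privileges.

```python
import os


def reset_stat(stats_dir, name):
    """Write "0" to one statistics file to request a reset of its counters.

    stats_dir is the path of a .../stats/ directory and name is one of the
    files inside it, e.g. "rdma" or "reset_all". This is a hypothetical
    helper for illustration, not an IBTRS API.
    """
    path = os.path.join(stats_dir, name)
    with open(path, "w") as f:
        f.write("0")
    return path
```

Passing "reset_all" as the file name would, per the entry descriptions, clear every counter of that path at once.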
+
+reset_all (RW)
+--------------
+
+Reading returns usage help; writing 0 clears all the counters about
+stats.
+
+rdma (RW)
+---------
+
+Contains statistics regarding rdma operations and inflight operations.
+The output consists of 5 values:
+
+
+
+wc_completion (RW)
+------------------
+
+Contains 3 values: the first one, an int, records the max number of work
+requests processed in work_completion in session lifetime, the second
+one, a long int, records the total number of work requests processed in
+work_completion in session lifetime and the 3rd one, a long int, records
+the total number of calls to the cq completion handler. Dividing the 2nd
+number by the 3rd gives the average number of completions processed
+in the completion handler.
+
+==================
+Transport protocol
+==================
+
+Overview
+--------
+An established connection between a client and a server is called an IBTRS
+session. A session is associated with a set of memory chunks reserved on the
+server side for a given client for rdma transfer. A session
+consists of multiple paths, each representing a separate physical link
+between client and server. Those are used for load balancing and failover.
+Each path consists of as many connections (QPs) as there are cpus on
+the client.
+
+When processing an incoming rdma write or read request the IBTRS client uses
+memory chunks reserved for it on the server side. Their number, size and
+addresses need to be exchanged between client and server during the connection
+establishment phase. Apart from the memory related information the client
+needs to inform the server about the session name and identify each path and
+connection individually.
+
+On an established session the client sends write or read messages to the
+server. The server uses the immediate field to tell the client which request
+is being acknowledged and for errno. The client uses the immediate field to
+tell the server which of the memory chunks has been accessed and at which
+offset the message can be found.
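To make the two uses of the 32-bit immediate field concrete, here is a minimal Python sketch. The bit widths (`OFFSET_BITS`, `ERRNO_BITS`) and all function names are assumptions chosen for illustration only; this README does not state how IBTRS actually divides the field.

```python
# Hypothetical bit layout: the README only says the immediate field is
# 32 bits wide, not how it is split. These widths are illustrative.
OFFSET_BITS = 20  # assumed: low bits hold the offset inside a chunk
ERRNO_BITS = 9    # assumed: low bits hold the errno magnitude


def pack_io_req_imm(chunk_id, offset):
    """Client -> server: which memory chunk was written and at which offset."""
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit into the assumed offset bits")
    if not 0 <= chunk_id < (1 << (32 - OFFSET_BITS)):
        raise ValueError("chunk id does not fit into the remaining bits")
    return (chunk_id << OFFSET_BITS) | offset


def unpack_io_req_imm(imm):
    """Server side: recover (chunk_id, offset) from the immediate value."""
    return imm >> OFFSET_BITS, imm & ((1 << OFFSET_BITS) - 1)


def pack_io_rsp_imm(req_id, errno_value):
    """Server -> client: acknowledged request id plus errno (0 or negative)."""
    magnitude = -errno_value
    if not 0 <= magnitude < (1 << ERRNO_BITS):
        raise ValueError("errno does not fit into the assumed errno bits")
    if not 0 <= req_id < (1 << (32 - ERRNO_BITS)):
        raise ValueError("request id does not fit into the remaining bits")
    return (req_id << ERRNO_BITS) | magnitude


def unpack_io_rsp_imm(imm):
    """Client side: recover (req_id, errno) from the immediate value."""
    return imm >> ERRNO_BITS, -(imm & ((1 << ERRNO_BITS) - 1))
```

Under these assumed widths, a request for chunk 3 at offset 4096 round-trips through `pack_io_req_imm` and `unpack_io_req_imm`, and a response for request 7 with errno -5 round-trips through the response pair.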
+
+Connection establishment
+------------------------
+
+1. Client starts establishing connections belonging to a path of a session one
+by one via attaching IBTRS_MSG_CON_REQ messages to the rdma_connect requests.
+Those include the uuid of the session and the uuid of the path to be
+established. They are used by the server to find an existing session/path or
+to create a new one when necessary. The message also contains the protocol
+version and magic for compatibility, the total number of connections per
+session (as many as cpus on the client), the id of the current connection and
+the reconnect counter, which is used to resolve the situations where the
+client is trying to reconnect a path, while the server is still destroying the
+old one.
+
+2. Server accepts the connection requests one by one and attaches
+IBTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and
+protocol version, the messages include the error code, the queue depth
+supported by the server (number of memory chunks which are going to be
+allocated for that session) and the maximum size of one io.
+
+3. After all connections of a path are established the client sends to the
+server the IBTRS_MSG_INFO_REQ message, containing the name of the session.
+This message requests the address information from the server.
+
+4. Server replies to the session info request message with IBTRS_MSG_INFO_RSP,
+which contains the addresses and keys of the RDMA buffers allocated for that
+session.
+
+5. Session becomes connected after all paths to be established are connected
+(i.e. steps 1-4 finished for all paths requested for a session).
+
+6. Server and client periodically exchange heartbeat messages (empty rdma
+messages with an immediate field) which are used to detect a crash on the
+remote side or a network outage in the absence of IO.
+
+7. On any RDMA related error or in the case of a heartbeat timeout, the
+corresponding path is disconnected, all the inflight IO are failed over to a
+healthy path, if any, and the reconnect mechanism is triggered.
+
+CLT                                     SRV
+*for each connection belonging to a path and for each path:
+IBTRS_MSG_CON_REQ  ------------------->
+                   <------------------- IBTRS_MSG_CON_RSP
+...
+*after all connections are established:
+IBTRS_MSG_INFO_REQ ------------------->
+                   <------------------- IBTRS_MSG_INFO_RSP
+*heartbeat is started from both sides:
+                   -------------------> [IBTRS_HB_MSG_IMM]
+[IBTRS_HB_MSG_ACK] <-------------------
+[IBTRS_HB_MSG_IMM] <-------------------
+                   -------------------> [IBTRS_HB_MSG_ACK]
+
+IO path
+-------
+
+* Write *
+
+1. When processing a write request the client selects one of the memory chunks
+on the server side and rdma writes there the user data, user header and the
+IBTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
+contains the size of the user header. The client tells the server which chunk
+has been accessed and at what offset the IBTRS_MSG_RDMA_WRITE can be found by
+using the IMM field.
+
+2. When confirming a write request the server sends an "empty" rdma message
+with an immediate field. The 32 bit field is used to specify the outstanding
+inflight IO and for the error code.
+
+CLT                                                          SRV
+usr_data + usr_hdr + ibtrs_msg_rdma_write -----------------> [IBTRS_IO_REQ_IMM]
+[IBTRS_IO_RSP_IMM]                        <----------------- (id + errno)
+
+* Read *
+
+1. When processing a read request the client selects one of the memory chunks
+on the server side and rdma writes there the user header and the
+IBTRS_MSG_RDMA_READ message. This message contains the type (read), size of
+the user header, flags (specifying if memory invalidation is necessary) and the
+list of addresses along with keys for the data to be read into.
+
+2. When confirming a read request the server transfers the requested data
+first, attaches an invalidation message if requested and finally an "empty"
+rdma message with an immediate field. The 32 bit field is used to specify the
+outstanding inflight IO and the error code.
+
+CLT                                           SRV
+usr_hdr + ibtrs_msg_rdma_read --------------> [IBTRS_IO_REQ_IMM]
+[IBTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
+or in case client requested invalidation:
+[IBTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
+
+
+Contact
+-------
+
+Mailing list: "IBNBD/IBTRS Storage Team"