From patchwork Thu Jan 30 21:15:34 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13954981 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pj1-f74.google.com (mail-pj1-f74.google.com [209.85.216.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 485811F0E31 for ; Thu, 30 Jan 2025 21:15:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.74 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271747; cv=none; b=EXS77O8BA+5jgVSWVNxvLg/8zjIErr1+Wc9RKpBoJGW0GRgyps9i2e3D5RPdwZ4tlydUSnFFYBjUpU8vgPBawzCwubVNE5R0RMb654hoEs0m5CfE3GN3nxCy81TnUJNAlOtIFP1aAfKlg3aD9kwhnaoGV1+KUtsCwBLctcYFw6M= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271747; c=relaxed/simple; bh=3u/7jLBYe2KfRcVNNTNkFn5t4vUJKiOG+tUb7ZrbBhE=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=mk54iT23ajV9EOQTBfG8K0BiW8lAtyvNsfCJnETMcJP9hGXHqFPzTi724+bjHKHZ7UqKwRPGMSgLmBlSfHNCDc2FGPUCW5OcEV1XRfNEMBrgw4HkOG/tL/uFG54rIOT8PESOUsdujmiUJikK5aR+KeyzJkG+/LYqvNL7A3YGv+o= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=I+O6Bjj+; arc=none smtp.client-ip=209.85.216.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="I+O6Bjj+" Received: by mail-pj1-f74.google.com with SMTP id 98e67ed59e1d1-2f81a0d0a18so2486284a91.3 for ; Thu, 30 Jan 2025 13:15:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1738271744; x=1738876544; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=4kMF0lC9HmaCcCSq7dy1+BxkweYLRN4mbfBRIjXSD14=; b=I+O6Bjj+cNMe2s1B+u8TGCi18k/D7LFYzaR66kO1CNEwgXH7djWegR6QGkzj5JwQSJ Y9Qo/aIZjL8IVBGJdsO2zOwzcF3Wu55sI85G+Fe3rkcTQItm/fXUzlJZrJacmAsH43L2 XbkNZU6/y1alUi/QKO8B5ezpeOe4EBmxmBmbFlO0yHVJQByLdNtI2OaqVrjQql3IE/UN ugcJYz+k8WiLHa36AWHhp0f+UFVXKpwg1ZRqhgEzUVM7QsOdYyvW3vVzTqiZ5i0cGigB OTOIEI8IbPWIPGgcmh6eBMGuPplbl4n+wAl5Q9nCuWCxx2B7nI2LcTN2x7tYRf4bAHFn a7UA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738271744; x=1738876544; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=4kMF0lC9HmaCcCSq7dy1+BxkweYLRN4mbfBRIjXSD14=; b=sR97rxjLpzkPd1ynvBM6NPlR/+ApGz4fBJyYRhQp0pTIfZ+z4faG0BphuWDoZzQIta j81z1DMY2WF177sN3wywdQUW699XTAo2N1IdGGBSiO0lgjM99pYFnPEVI6Tx9YlO1VHR hNASecivXt8EnYD5DTDfKfXxgd8jGnzCImlmQSYTZT+xc3rRvvMpOXhMwxQyYdQQavHF A/HNHXJYep0W9UAaaR+tHDAT68f7/V+D4mNrOgOz1Km/t/Yi7CYls1F2f5LiiLqQ33P8 xNOWfp8IkHz2CKUTLNaEoKB4r1vdUgdpF8+H0GBbWMS8a/2juKwX1b67Z8G/BngS1BcQ aEBA== X-Gm-Message-State: AOJu0Yxn/RjrvjLU40YnEl3fBl4rdsbO10veP5nIRmPGaHRqk5n2dgjA x9oJWLSSvx4v0/TFlBioWut8gEe0kwwZuxyJtG+3ghqdfajwsGp0QvqSqcc6IQlFuxheZFTQ3Ey zAL4EczHsc5brVJTEk6gYKJtNltOsdsLQM5XcW5/sQza2podFwFNE3hLSUR+sHUlz2Giarxss/o lFiUBonuYOemLwA+jsP+i7n4xAIJyyF1E3GElCFjdlXGeY5k1etoahdUZUwCc= X-Google-Smtp-Source: AGHT+IHSaYi9y/zBICOD8Bg8IwjEpl0YXsgZ0bkByw1FvqGCzDUrTuYSRphV9YF7474nI+yW7N+2oAhVQjlB0p0oJA== X-Received: from pjbsw5.prod.google.com ([2002:a17:90b:2c85:b0:2d3:d4ca:5fb0]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a17:90b:258b:b0:2ee:fd53:2b17 with SMTP id 98e67ed59e1d1-2f83ac836d0mr12411031a91.29.1738271744350; Thu, 30 Jan 2025 13:15:44 -0800 (PST) Date: Thu, 30 Jan 2025 21:15:34 +0000 In-Reply-To: <20250130211539.428952-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250130211539.428952-1-almasrymina@google.com> X-Mailer: git-send-email 2.48.1.362.g079036d154-goog Message-ID: <20250130211539.428952-2-almasrymina@google.com> Subject: [PATCH RFC net-next v2 1/6] net: add devmem TCP TX documentation From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kvm@vger.kernel.org, virtualization@lists.linux.dev, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , " =?utf-8?q?Eugenio_P=C3=A9rez?= " , Shuah Khan , sdf@fomichev.me, asml.silence@gmail.com, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC Add documentation outlining the usage and details of the devmem TCP TX API. Signed-off-by: Mina Almasry --- v2: - Update documentation for iov_base is the dmabuf offset (Stan) --- Documentation/networking/devmem.rst | 144 +++++++++++++++++++++++++++- 1 file changed, 140 insertions(+), 4 deletions(-) diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst index d95363645331..8166fe09da13 100644 --- a/Documentation/networking/devmem.rst +++ b/Documentation/networking/devmem.rst @@ -62,15 +62,15 @@ More Info https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/ -Interface -========= +RX Interface +============ Example ------- -tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up -the RX path of this API. +./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of +setting up the RX path of this API. NIC Setup @@ -235,6 +235,142 @@ can be less than the tokens provided by the user in case of: (a) an internal kernel leak bug. (b) the user passed more than 1024 frags. +TX Interface +============ + + +Example +------- + +./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of +setting up the TX path of this API. + + +NIC Setup +--------- + +The user must bind a TX dmabuf to a given NIC using the netlink API:: + + struct netdev_bind_tx_req *req = NULL; + struct netdev_bind_tx_rsp *rsp = NULL; + struct ynl_error yerr; + + *ys = ynl_sock_create(&ynl_netdev_family, &yerr); + + req = netdev_bind_tx_req_alloc(); + netdev_bind_tx_req_set_ifindex(req, ifindex); + netdev_bind_tx_req_set_fd(req, dmabuf_fd); + + rsp = netdev_bind_tx(*ys, req); + + tx_dmabuf_id = rsp->id; + + +The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf +that has been bound. + +The user can unbind the dmabuf from the netdevice by closing the netlink socket +that established the binding. We do this so that the binding is automatically +unbound even if the userspace process crashes. + +Note that any reasonably well-behaved dmabuf from any exporter should work with +devmem TCP, even if the dmabuf is not actually backed by devmem. An example of +this is udmabuf, which wraps user memory (non-devmem) in a dmabuf. + +Socket Setup +------------ + +The user application must use MSG_ZEROCOPY flag when sending devmem TCP. Devmem +cannot be copied by the kernel, so the semantics of the devmem TX are similar +to the semantics of MSG_ZEROCOPY. + + ret = setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt)); + +Sending data +-------------- + +Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg. + +The user should create a msghdr where, + +iov_base is set to the offset into the dmabuf to start sending from. +iov_len is set to the number of bytes to be sent from the dmabuf. + +The user passes the dma-buf id to send from via the dmabuf_tx_cmsg.dmabuf_id. + +The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048 +from offset 2000 into the dmabuf. The dmabuf to send from is tx_dmabuf_id:: + + char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))]; + struct dmabuf_tx_cmsg ddmabuf; + struct msghdr msg = {}; + struct cmsghdr *cmsg; + struct iovec iov[2]; + + iov[0].iov_base = (void*)100; + iov[0].iov_len = 1024; + iov[1].iov_base = (void*)2000; + iov[1].iov_len = 2048; + + msg.msg_iov = iov; + msg.msg_iovlen = 2; + + msg.msg_control = ctrl_data; + msg.msg_controllen = sizeof(ctrl_data); + + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_DEVMEM_DMABUF; + cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg)); + + ddmabuf.dmabuf_id = tx_dmabuf_id; + + *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf; + + sendmsg(socket_fd, &msg, MSG_ZEROCOPY); + + +Reusing TX dmabufs +------------------ + +Similar to MSG_ZEROCOPY with regular memory, the user should not modify the +contents of the dma-buf while a send operation is in progress. This is because +the kernel does not keep a copy of the dmabuf contents. Instead, the kernel +will pin and send data from the buffer available to the userspace. + +Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions +using MSG_ERRQUEUE:: + + int64_t tstop = gettimeofday_ms() + waittime_ms; + char control[CMSG_SPACE(100)] = {}; + struct sock_extended_err *serr; + struct msghdr msg = {}; + struct cmsghdr *cm; + int retries = 10; + __u32 hi, lo; + + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + + while (gettimeofday_ms() < tstop) { + if (!do_poll(fd)) continue; + + ret = recvmsg(fd, &msg, MSG_ERRQUEUE); + + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + serr = (void *)CMSG_DATA(cm); + + hi = serr->ee_data; + lo = serr->ee_info; + + fprintf(stdout, "tx complete [%d,%d]\n", lo, hi); + } + } + +After the associated sendmsg has been completed, the dmabuf can be reused by +the userspace. + + Implementation & Caveats ======================== From patchwork Thu Jan 30 21:15:35 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13954982 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pl1-f201.google.com (mail-pl1-f201.google.com [209.85.214.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4769A1F1531 for ; Thu, 30 Jan 2025 21:15:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271749; cv=none; b=j6G4Or000jEB5Y7dTbQNKnIr5CD/KRitmeYF9/YhTaRv22aNWBRrFHBcS/df4zjuvUdm7hjQdKraWtXLXhgdSrTwoG1GF8mx873YfNI6oCT+H5UJthh5pLCAonce9RFxhdGXB9SpKiH3189OToLvjpaBEV0/kIWEWfW7W7Wc4m4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271749; c=relaxed/simple; bh=66rLAwnUv7rphgDJMF+7euFwFkLHnlg56rVm7MMIEb8=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=AzzpIVs0JFimRga2n4dKSNZL6SjyKNkm7H6DMS4aV2S2Ut7cwhk+fljhaHUGBKzdh1NhqtEo1UUH94lYChORfDtWZkVF9u3azc/Pg2QPd7oYDjhZaCKy+gFTnKg/2fa2zG8GtQh92a6on0dHUDKjsCknR0jGgpFLATQNmYL8pPE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=c3l0sL/1; arc=none smtp.client-ip=209.85.214.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="c3l0sL/1" Received: by mail-pl1-f201.google.com with SMTP id d9443c01a7336-2163dc0f689so37679535ad.1 for ; Thu, 30 Jan 2025 13:15:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1738271746; x=1738876546; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=y4JyUAFzTAYUC0Df8KEcLXbVKxtOHohKpnphfTgnZKU=; b=c3l0sL/1rBNUmsWHa5wLfT5qZL8+qDETeR/QeLM3XhzGgAQ8Ce71j8ue0Th4wTSV8M W5pBSt9Zuc8A5L03Ior1+RzdzpwZCUqVVum1Uc3sa+oIpFr/r9txkh2eGP1+8ql/Mvs8 l8Ce4a/gb+7oPS3V072fwEGi7CWIP3tSpyoNeEadxLc+h6fTNT9ti+BUyHM6D83egbDm IXuQyns6f8rN03PmOTTixFPD1PXzw2q0zyEm19WFe6gRxTAG9gfW0Sv/6JDpE1LFWDbo tRSpLejrtnTvZcOUEzZ92x5TL6HdblmGcLQk+j8Orz7ZALO8OiKXKWufyP7/ZLYZhqQX HCzA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738271746; x=1738876546; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=y4JyUAFzTAYUC0Df8KEcLXbVKxtOHohKpnphfTgnZKU=; b=UhGBgGdRnSzoo+emu9ks32lTxmH7V+xX1czxZKTJja4aLXhWhSN2xz+8T1EWVwv2Je h/sB1inFSOBfJzHUilEZJH5+LCdZidohZ9G+pJMIF392lGXMcv4zlxr7bmcvdrNJAH3/ k1jOrdVqbnANkm1YWMGPYz9h/n4YmZCzqAvDMDdFictiUCB+LcVWT/oLlrWWYhHgq9vF ek4jjA1JUTlFZaN1yQd0cd/wXdSIv4ox9fJTyQ/lj5dS0vVfN7GAbhy0/bT/MQf6G3yR zdcaQHj47Gx6FUIHdn+PcDa6drzhPwYCPDVlqcMxgYhxOU2dJb2pdAIsM0cuvcD0X6L5 q7XQ== X-Gm-Message-State: AOJu0Yx/SH3gij3IYGV11RrE2w6BJfzbEtSSOfHoaTwId9QGR5Pl683X IB/go+rebUcm3uqvyAPSUmIaTTMNjE5ahF112iYIZ09RGx66hI/UcX/KFW8+lo6It0bFgzPmpmd i0TjztfYkpjtWTgjOX2shaH0jAJKT/9w/Q3/aWrAzpWmTkyBBs//y7Q9xc1Uyq8KEE0bGr5xVBJ WGqk5+gfddNHLxUB+JXRreAi7cJYVF7ZNOgIAR4Z5ecPLzodBCAuKzrnTsA8Q= X-Google-Smtp-Source: AGHT+IGEXyT3j04bY/5ZNHe8pdFbFw8I9+JyrJY9uf1FZOjYzwUxuAtpHqu/TWcN6Zdk9Uk0syDHlmNub4OfHCJGYw== X-Received: from pjtu8.prod.google.com ([2002:a17:90a:c888:b0:2f7:f660:cfe7]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a17:902:e5d0:b0:216:42fd:79d2 with SMTP id d9443c01a7336-21dd7de4684mr138116805ad.49.1738271746028; Thu, 30 Jan 2025 13:15:46 -0800 (PST) Date: Thu, 30 Jan 2025 21:15:35 +0000 In-Reply-To: <20250130211539.428952-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250130211539.428952-1-almasrymina@google.com> X-Mailer: git-send-email 2.48.1.362.g079036d154-goog Message-ID: <20250130211539.428952-3-almasrymina@google.com> Subject: [PATCH RFC net-next v2 2/6] selftests: ncdevmem: Implement devmem TCP TX From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kvm@vger.kernel.org, virtualization@lists.linux.dev, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , " =?utf-8?q?Eugenio_P=C3=A9rez?= " , Shuah Khan , sdf@fomichev.me, asml.silence@gmail.com, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC Add support for devmem TX in ncdevmem. This is a combination of the ncdevmem from the devmem TCP series RFCv1 which included the TX path, and work by Stan to include the netlink API and refactored on top of his generic memory_provider support. Signed-off-by: Mina Almasry Signed-off-by: Stanislav Fomichev --- v2: - make errors a static variable so that we catch instances where there are less than 20 errors across different buffers. - Fix the issue where the seed is reset to 0 instead of its starting value 1. - Use 1000ULL instead of 1000 to guard against overflow (Willem). - Do not set POLLERR (Willem). - Update the test to use the new interface where iov_base is the dmabuf_offset. - Update the test to send 2 iov instead of 1, so we get some test coverage over sending multiple iovs at once. - Print the ifindex the test is using, useful for debugging issues where maybe the test may fail because the ifindex of the socket is different from the dmabuf binding. --- .../selftests/drivers/net/hw/ncdevmem.c | 276 +++++++++++++++++- 1 file changed, 272 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c index 19a6969643f4..8455f19ecd1a 100644 --- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c +++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c @@ -40,15 +40,18 @@ #include #include #include +#include #include #include #include #include #include +#include #include #include +#include #include #include #include @@ -80,6 +83,8 @@ static int num_queues = -1; static char *ifname; static unsigned int ifindex; static unsigned int dmabuf_id; +static uint32_t tx_dmabuf_id; +static int waittime_ms = 500; struct memory_buffer { int fd; @@ -93,6 +98,8 @@ struct memory_buffer { struct memory_provider { struct memory_buffer *(*alloc)(size_t size); void (*free)(struct memory_buffer *ctx); + void (*memcpy_to_device)(struct memory_buffer *dst, size_t off, + void *src, int n); void (*memcpy_from_device)(void *dst, struct memory_buffer *src, size_t off, int n); }; @@ -153,6 +160,20 @@ static void udmabuf_free(struct memory_buffer *ctx) free(ctx); } +static void udmabuf_memcpy_to_device(struct memory_buffer *dst, size_t off, + void *src, int n) +{ + struct dma_buf_sync sync = {}; + + sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE; + ioctl(dst->fd, DMA_BUF_IOCTL_SYNC, &sync); + + memcpy(dst->buf_mem + off, src, n); + + sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE; + ioctl(dst->fd, DMA_BUF_IOCTL_SYNC, &sync); +} + static void udmabuf_memcpy_from_device(void *dst, struct memory_buffer *src, size_t off, int n) { @@ -170,6 +191,7 @@ static void udmabuf_memcpy_from_device(void *dst, struct memory_buffer *src, static struct memory_provider udmabuf_memory_provider = { .alloc = udmabuf_alloc, .free = udmabuf_free, + .memcpy_to_device = udmabuf_memcpy_to_device, .memcpy_from_device = udmabuf_memcpy_from_device, }; @@ -188,7 +210,7 @@ void validate_buffer(void *line, size_t size) { static unsigned char seed = 1; unsigned char *ptr = line; - int errors = 0; + static int errors; size_t i; for (i = 0; i < size; i++) { @@ -202,7 +224,7 @@ void validate_buffer(void *line, size_t size) } seed++; if (seed == do_validation) - seed = 0; + seed = 1; } fprintf(stdout, "Validated buffer\n"); @@ -394,6 +416,49 @@ static int bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd, return -1; } +static int bind_tx_queue(unsigned int ifindex, unsigned int dmabuf_fd, + struct ynl_sock **ys) +{ + struct netdev_bind_tx_req *req = NULL; + struct netdev_bind_tx_rsp *rsp = NULL; + struct ynl_error yerr; + + *ys = ynl_sock_create(&ynl_netdev_family, &yerr); + if (!*ys) { + fprintf(stderr, "YNL: %s\n", yerr.msg); + return -1; + } + + req = netdev_bind_tx_req_alloc(); + netdev_bind_tx_req_set_ifindex(req, ifindex); + netdev_bind_tx_req_set_fd(req, dmabuf_fd); + + rsp = netdev_bind_tx(*ys, req); + if (!rsp) { + perror("netdev_bind_tx"); + goto err_close; + } + + if (!rsp->_present.id) { + perror("id not present"); + goto err_close; + } + + fprintf(stderr, "got tx dmabuf id=%d\n", rsp->id); + tx_dmabuf_id = rsp->id; + + netdev_bind_tx_req_free(req); + netdev_bind_tx_rsp_free(rsp); + + return 0; + +err_close: + fprintf(stderr, "YNL failed: %s\n", (*ys)->err.msg); + netdev_bind_tx_req_free(req); + ynl_sock_destroy(*ys); + return -1; +} + static void enable_reuseaddr(int fd) { int opt = 1; @@ -432,7 +497,7 @@ static int parse_address(const char *str, int port, struct sockaddr_in6 *sin6) return 0; } -int do_server(struct memory_buffer *mem) +static int do_server(struct memory_buffer *mem) { char ctrl_data[sizeof(int) * 20000]; struct netdev_queue_id *queues; @@ -686,6 +751,207 @@ void run_devmem_tests(void) provider->free(mem); } +static uint64_t gettimeofday_ms(void) +{ + struct timeval tv; + + gettimeofday(&tv, NULL); + return (tv.tv_sec * 1000ULL) + (tv.tv_usec / 1000ULL); +} + +static int do_poll(int fd) +{ + struct pollfd pfd; + int ret; + + pfd.revents = 0; + pfd.fd = fd; + + ret = poll(&pfd, 1, waittime_ms); + if (ret == -1) + error(1, errno, "poll"); + + return ret && (pfd.revents & POLLERR); +} + +static void wait_compl(int fd) +{ + int64_t tstop = gettimeofday_ms() + waittime_ms; + char control[CMSG_SPACE(100)] = {}; + struct sock_extended_err *serr; + struct msghdr msg = {}; + struct cmsghdr *cm; + int retries = 10; + __u32 hi, lo; + int ret; + + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + + while (gettimeofday_ms() < tstop) { + if (!do_poll(fd)) + continue; + + ret = recvmsg(fd, &msg, MSG_ERRQUEUE); + if (ret < 0) { + if (errno == EAGAIN) + continue; + error(1, ret, "recvmsg(MSG_ERRQUEUE)"); + return; + } + if (msg.msg_flags & MSG_CTRUNC) + error(1, 0, "MSG_CTRUNC\n"); + + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level != SOL_IP && + cm->cmsg_level != SOL_IPV6) + continue; + if (cm->cmsg_level == SOL_IP && + cm->cmsg_type != IP_RECVERR) + continue; + if (cm->cmsg_level == SOL_IPV6 && + cm->cmsg_type != IPV6_RECVERR) + continue; + + serr = (void *)CMSG_DATA(cm); + if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) + error(1, 0, "wrong origin %u", serr->ee_origin); + if (serr->ee_errno != 0) + error(1, 0, "wrong errno %d", serr->ee_errno); + + hi = serr->ee_data; + lo = serr->ee_info; + + fprintf(stderr, "tx complete [%d,%d]\n", lo, hi); + return; + } + } + + error(1, 0, "did not receive tx completion"); +} + +static int do_client(struct memory_buffer *mem) +{ + char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))]; + struct sockaddr_in6 server_sin; + struct sockaddr_in6 client_sin; + struct dmabuf_tx_cmsg ddmabuf; + struct ynl_sock *ys = NULL; + struct msghdr msg = {}; + ssize_t line_size = 0; + struct cmsghdr *cmsg; + struct iovec iov[2]; + uint64_t off = 100; + char *line = NULL; + size_t len = 0; + int socket_fd; + int ret, mid; + int opt = 1; + + ret = parse_address(server_ip, atoi(port), &server_sin); + if (ret < 0) + error(1, 0, "parse server address"); + + socket_fd = socket(AF_INET6, SOCK_STREAM, 0); + if (socket_fd < 0) + error(1, socket_fd, "create socket"); + + enable_reuseaddr(socket_fd); + + ret = setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, + strlen(ifname) + 1); + if (ret) + error(1, ret, "bindtodevice"); + + if (bind_tx_queue(ifindex, mem->fd, &ys)) + error(1, 0, "Failed to bind\n"); + + ret = parse_address(client_ip, atoi(port), &client_sin); + if (ret < 0) + error(1, 0, "parse client address"); + + ret = bind(socket_fd, &client_sin, sizeof(client_sin)); + if (ret) + error(1, ret, "bind"); + + ret = setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt)); + if (ret) + error(1, ret, "set sock opt"); + + fprintf(stderr, "Connect to %s %d (via %s)\n", server_ip, + ntohs(server_sin.sin6_port), ifname); + + ret = connect(socket_fd, &server_sin, sizeof(server_sin)); + if (ret) + error(1, ret, "connect"); + + while (1) { + free(line); + line = NULL; + /* Subtract 1 from line_size to remove trailing newlines that + * get_line are surely to parse... + */ + line_size = getline(&line, &len, stdin) - 1; + + if (line_size < 0) + break; + + mid = (line_size / 2) + 1; + + iov[0].iov_base = (void *)100; + iov[0].iov_len = mid; + iov[1].iov_base = (void *)2000; + iov[1].iov_len = line_size - mid; + + provider->memcpy_to_device(mem, (size_t)iov[0].iov_base, line, + iov[0].iov_len); + provider->memcpy_to_device(mem, (size_t)iov[1].iov_base, + line + iov[0].iov_len, + iov[1].iov_len); + + fprintf(stderr, + "read line_size=%ld off=%d iov[0].iov_base=%d, iov[0].iov_len=%d, iov[1].iov_base=%d, iov[1].iov_len=%d\n", + line_size, off, iov[0].iov_base, iov[0].iov_len, + iov[1].iov_base, iov[1].iov_len); + + msg.msg_iov = iov; + msg.msg_iovlen = 2; + + msg.msg_control = ctrl_data; + msg.msg_controllen = sizeof(ctrl_data); + + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_DEVMEM_DMABUF; + cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg)); + + ddmabuf.dmabuf_id = tx_dmabuf_id; + + *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf; + + ret = sendmsg(socket_fd, &msg, MSG_ZEROCOPY); + if (ret < 0) + error(1, errno, "Failed sendmsg"); + + fprintf(stderr, "sendmsg_ret=%d\n", ret); + + if (ret != line_size) + error(1, errno, "Did not send all bytes"); + + wait_compl(socket_fd); + } + + fprintf(stderr, "%s: tx ok\n", TEST_PREFIX); + + free(line); + close(socket_fd); + + if (ys) + ynl_sock_destroy(ys); + + return 0; +} + int main(int argc, char *argv[]) { struct memory_buffer *mem; @@ -729,6 +995,8 @@ int main(int argc, char *argv[]) ifindex = if_nametoindex(ifname); + fprintf(stderr, "using ifindex=%u\n", ifindex); + if (!server_ip && !client_ip) { if (start_queue < 0 && num_queues < 0) { num_queues = rxq_num(ifindex); @@ -779,7 +1047,7 @@ int main(int argc, char *argv[]) error(1, 0, "Missing -p argument\n"); mem = provider->alloc(getpagesize() * NUM_PAGES); - ret = is_server ? do_server(mem) : 1; + ret = is_server ? do_server(mem) : do_client(mem); provider->free(mem); return ret; From patchwork Thu Jan 30 21:15:36 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13954983 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pj1-f73.google.com (mail-pj1-f73.google.com [209.85.216.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6570319F101 for ; Thu, 30 Jan 2025 21:15:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.73 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271750; cv=none; b=EFohPAKK9NrcDPQeR2WJU+BnkpOjrkvljkqy6WwzP5kPrKykrsWt1cwLXTkoCOjQEtRuioBB3u4h4QSKqUDbUzeWSssvlhSPqh45JxJ9B3MiPgCt90QVFquVzDQwqFz4SLDzqFqztZoXJGy+6Lc8B977HVlwnKN8y3r2kmlZrIQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271750; c=relaxed/simple; bh=yALvHvPZ06XHv8lSBAulVCm+vGbBL+K++sbCUEtN9RI=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=e48xzxBZBGuFeO4+ZsxCCbtahOG9qlkVG6wvutgXXPH0ksHKt1hARt/etLO9X38MW63x79ThA6axcVqB8apaJhbasqfA/CAbROv525i20/Pd7cy3BLEnLJFyQPZTPeL5PaTT9v+3YawZjViJvL3VgxS0GZcsem1g4i1aj+kmDXg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=EaBYTkXI; arc=none smtp.client-ip=209.85.216.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="EaBYTkXI" Received: by mail-pj1-f73.google.com with SMTP id 98e67ed59e1d1-2ef114d8346so2464482a91.0 for ; Thu, 30 Jan 2025 13:15:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1738271748; x=1738876548; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=XpaVlDvetL7z/kA++uxshgtjRqWTMmKZzr7CrXSNsGU=; b=EaBYTkXIecVborws4XaW3yoNo8AdTRMBvo2ELma8ZejSV9tBpl6gqtQYpshW9gMSt9 xvk0uIMfPNAL3JT406+KIcCCGWi3QuW1OcPWc5jO/yY3d3ibLboa+0mu3fuMLZOqqgGx mOMQE+5BBzCNJyI6+OGVjhgogu9F40+ylUap4zvTDybeQdIWRMpBHX5gZHmI8vz/7N5Q 1PugbHNC0z/xFyMT8pZ1iaO2axQI/azDB6fZEsRtRkrtOZkAYOrdWZCaFmomOCURyQCW ROggWKHPe/DPCECBwBMVig+qTzRzgmNFe1QpJXVYAjUzLG0xv73qNIokEmMDTZLiHTew NznA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738271748; x=1738876548; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=XpaVlDvetL7z/kA++uxshgtjRqWTMmKZzr7CrXSNsGU=; b=KUkDjXajil4rpjN45Qd5kHaVKdrX7Mcks6HrOuCVTePBS97TeetlLgv9smkcex8gaX pcsKqqs8wDUHxtEGR2KalNjtIIks+/RFuMgoAwTm4HsfJStXNba6qh8wTtKd1QbUGpSY +O8dmhlWjf0gaGQ6lxHSnmSAnk1k/Z8zs4d4WJGu1KZ/TMe+teOpeoMnt4umgsVy0eKT hmMXNqp06UPSB2CHBhSXrDvtSFhxJLtQfw7qwD5y/6kRpc98jjWQyh9BIrWGJ3+N4OZ4 Jy3JUnI4wyJotHMx5o38PpR4Ia9hbKRyfFYus4V3t5UKsUMrCcPrOODxSYtPmlNAG9wi HkFA== X-Gm-Message-State: AOJu0Yyq0/GA5r3eHstU0FU+ye9WHdS3iSIIZFaANatJUZWZCMDci5yy 5FauReLziw4dYaz6PVVXlXtJbpjNP1+rfYdV+2vk4SQiGu03kw9LOusJqcaZK8lJ8vK47IcFxfD I6u4JkE+RUGUGKZR7Bs04q3XE0sOGvm1mBIMlI567LxrpA1U3ZP3DlMe/zzeu88wdYom14xFegp KFc62JH6UtCBt3ETmfboxq/enLlE2IsheAqLeTh3UnP8OIQnRDaQoL5+pVpmw= X-Google-Smtp-Source: AGHT+IElTnvOq/6i+2nm6/m4i2RArO/vZJviJEeOjIzLBOG1zPQlpTeO/2hdQq9Brx+Lb54xWkBOB5LE4XqIadamFg== X-Received: from pfbcd23.prod.google.com ([2002:a05:6a00:4217:b0:725:ceac:b484]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a00:4642:b0:724:f86e:e3d9 with SMTP id d2e1a72fcca58-72fd0c02ec9mr12270749b3a.14.1738271747460; Thu, 30 Jan 2025 13:15:47 -0800 (PST) Date: Thu, 30 Jan 2025 21:15:36 +0000 In-Reply-To: <20250130211539.428952-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250130211539.428952-1-almasrymina@google.com> X-Mailer: git-send-email 2.48.1.362.g079036d154-goog Message-ID: <20250130211539.428952-4-almasrymina@google.com> Subject: [PATCH RFC net-next v2 3/6] net: add get_netmem/put_netmem support From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kvm@vger.kernel.org, virtualization@lists.linux.dev, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , " =?utf-8?q?Eugenio_P=C3=A9rez?= " , Shuah Khan , sdf@fomichev.me, asml.silence@gmail.com, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC Currently net_iovs support only pp ref counts, and do not support a page ref equivalent. This is fine for the RX path as net_iovs are used exclusively with the pp and only pp refcounting is needed there. The TX path however does not use pp ref counts, thus, support for get_page/put_page equivalent is needed for netmem. Support get_netmem/put_netmem. Check the type of the netmem before passing it to page or net_iov specific code to obtain a page ref equivalent. For dmabuf net_iovs, we obtain a ref on the underlying binding. This ensures the entire binding doesn't disappear until all the net_iovs have been put_netmem'ed. We do not need to track the refcount of individual dmabuf net_iovs as we don't allocate/free them from a pool similar to what the buddy allocator does for pages. This code is written to be extensible by other net_iov implementers. get_netmem/put_netmem will check the type of the netmem and route it to the correct helper: pages -> [get|put]_page() dmabuf net_iovs -> net_devmem_[get|put]_net_iov() new net_iovs -> new helpers Signed-off-by: Mina Almasry --- v2: - Add comment on top of refcount_t ref explaining the usage in the XT path. - Fix missing definition of net_devmem_dmabuf_binding_put in this patch. --- include/linux/skbuff_ref.h | 4 ++-- include/net/netmem.h | 3 +++ net/core/devmem.c | 10 ++++++++++ net/core/devmem.h | 20 ++++++++++++++++++++ net/core/skbuff.c | 30 ++++++++++++++++++++++++++++++ 5 files changed, 65 insertions(+), 2 deletions(-) diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h index 0f3c58007488..9e49372ef1a0 100644 --- a/include/linux/skbuff_ref.h +++ b/include/linux/skbuff_ref.h @@ -17,7 +17,7 @@ */ static inline void __skb_frag_ref(skb_frag_t *frag) { - get_page(skb_frag_page(frag)); + get_netmem(skb_frag_netmem(frag)); } /** @@ -40,7 +40,7 @@ static inline void skb_page_unref(netmem_ref netmem, bool recycle) if (recycle && napi_pp_put_page(netmem)) return; #endif - put_page(netmem_to_page(netmem)); + put_netmem(netmem); } /** diff --git a/include/net/netmem.h b/include/net/netmem.h index 1b58faa4f20f..d30f31878a09 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -245,4 +245,7 @@ static inline unsigned long netmem_get_dma_addr(netmem_ref netmem) return __netmem_clear_lsb(netmem)->dma_addr; } +void get_netmem(netmem_ref netmem); +void put_netmem(netmem_ref netmem); + #endif /* _NET_NETMEM_H */ diff --git a/net/core/devmem.c b/net/core/devmem.c index 3bba3f018df0..20985a570662 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -333,6 +333,16 @@ void dev_dmabuf_uninstall(struct net_device *dev) } } +void net_devmem_get_net_iov(struct net_iov *niov) +{ + net_devmem_dmabuf_binding_get(niov->owner->binding); +} + +void net_devmem_put_net_iov(struct net_iov *niov) +{ + net_devmem_dmabuf_binding_put(niov->owner->binding); +} + /*** "Dmabuf devmem memory provider" ***/ int mp_dmabuf_devmem_init(struct page_pool *pool) diff --git a/net/core/devmem.h b/net/core/devmem.h index 76099ef9c482..8b51caff5a0e 100644 --- a/net/core/devmem.h +++ b/net/core/devmem.h @@ -27,6 +27,10 @@ struct net_devmem_dmabuf_binding { * The binding undos itself and unmaps the underlying dmabuf once all * those refs are dropped and the binding is no longer desired or in * use. + * + * net_devmem_get_net_iov() on dmabuf net_iovs will increment this + * reference, making sure that the binding remains alive until all the + * net_iovs are no longer used. */ refcount_t ref; @@ -119,6 +123,9 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding) __net_devmem_dmabuf_binding_free(binding); } +void net_devmem_get_net_iov(struct net_iov *niov); +void net_devmem_put_net_iov(struct net_iov *niov); + struct net_iov * net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding); void net_devmem_free_dmabuf(struct net_iov *ppiov); @@ -126,6 +133,19 @@ void net_devmem_free_dmabuf(struct net_iov *ppiov); #else struct net_devmem_dmabuf_binding; +static inline void +net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding) +{ +} + +static inline void net_devmem_get_net_iov(struct net_iov *niov) +{ +} + +static inline void net_devmem_put_net_iov(struct net_iov *niov) +{ +} + static inline void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) { diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a441613a1e6c..815245d5c36b 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -88,6 +88,7 @@ #include #include "dev.h" +#include "devmem.h" #include "netmem_priv.h" #include "sock_destructor.h" @@ -7290,3 +7291,32 @@ bool csum_and_copy_from_iter_full(void *addr, size_t bytes, return false; } EXPORT_SYMBOL(csum_and_copy_from_iter_full); + +void get_netmem(netmem_ref netmem) +{ + if (netmem_is_net_iov(netmem)) { + /* Assume any net_iov is devmem and route it to + * net_devmem_get_net_iov. As new net_iov types are added they + * need to be checked here. + */ + net_devmem_get_net_iov(netmem_to_net_iov(netmem)); + return; + } + get_page(netmem_to_page(netmem)); +} +EXPORT_SYMBOL(get_netmem); + +void put_netmem(netmem_ref netmem) +{ + if (netmem_is_net_iov(netmem)) { + /* Assume any net_iov is devmem and route it to + * net_devmem_put_net_iov. As new net_iov types are added they + * need to be checked here. + */ + net_devmem_put_net_iov(netmem_to_net_iov(netmem)); + return; + } + + put_page(netmem_to_page(netmem)); +} +EXPORT_SYMBOL(put_netmem); From patchwork Thu Jan 30 21:15:37 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13954984 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pl1-f201.google.com (mail-pl1-f201.google.com [209.85.214.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AF6721F2392 for ; Thu, 30 Jan 2025 21:15:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271751; cv=none; b=N0tk05RqrmKhtNqwivH9QrsxDN1zDmZ7b8Kf3w91DbS7Qfse7eW5dSL6rZZ6b3rhvkcgB2YYS/FjOVdEYEA2MszqJuuDOU2gsPeaz1K5MWBs6qPNsC4ux5ytg8PPyTegpmOouB5MYvxWzbqDKdmKzKdbGyGNxoeKJ80eKtLSNv4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271751; c=relaxed/simple; bh=k12/lbC5UqTQvuOY0XqVg2rUFY//dbOT9zo+qbv4oOk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=dJJFfLZPo/mn0F/5BvOM0iMJshHZPgsm1UdMWKkhkP73kQPvocFdb7qx4NRfHWYRvq7Sl4UknJ4RVpIQ+t7REgdEEbBO3iLYUMAvMxWI0C6LRDBSAnKXnR1bjIdPuUtPRDY8MhyLFVl6r7HFtAVpKIJ+D0XfkmqZvNGVKNsEDao= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=hgYDM2DR; arc=none smtp.client-ip=209.85.214.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="hgYDM2DR" Received: by mail-pl1-f201.google.com with SMTP id d9443c01a7336-2162259a5dcso40739705ad.3 for ; Thu, 30 Jan 2025 13:15:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1738271749; x=1738876549; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=YboRbWKoVWlemRIOH59KqylnvDeKBTmTmqLYeqMQiIE=; b=hgYDM2DRgW3Jw+KPlILiShctD6WDpJyDQTS58xlCULZVVzIQ/2GRCI7yChXl1CKMp6 JaqM1g5pqlEmd8QqMkV9y6yaZV5LDyI/jRygsP4qztOuEhoT3urE8sQdDtFVPxNjul18 8TM3cgKyFkesSRUvgeS6scGGzva+upGka+OdB+6hQ546csl+Yp/qHU6qavYJdDaV0j5+ dVRbrxTQ4Vyc7tWuQBIh+LS8oiZdDEVYef74o/D/DfXkAuCtuMNfa6r5v5M1yvx2klzj pwoTQvSKimZjbeZCJ7V1lvLLW2n4/xV+Ny4eCSsmy+IGpuqO+h2vgHltIjGbvCzBOUfm fLqw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738271749; x=1738876549; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=YboRbWKoVWlemRIOH59KqylnvDeKBTmTmqLYeqMQiIE=; b=eutGzf61mxsC0mDEIXo4nNXfoRPmqtsXfmRvTegRvFg4XNp9YCiI255buTi3BI8bdo BIrawtHk8jYT+gB9miEDNhE1e6j+vYQ8Mp6twC3/JpvCEeW2jcC8hVKf7y/7uQCFPxC3 LGIxpQETtnBrh7nPnhOKvRNZVPP5fB4kXzR5pArBnFlZnNwlOpQlqAd1jv6L1Bt0eEa1 dPkqLpAesvJ4bekiKBreakbMf9iMWZ4FYtMAXs5EmLBxy+IOo46jjxyVMSKO0ocV2IXW GThNM1tGjKbuc1TmeiQBu0KhuqdRX+DdGo6qCpkh/sLnwmP8mlxmMPNrQTEIs1yqsjg5 4u6w== X-Gm-Message-State: AOJu0Yw/PjqZdieZ6Xjg39y7Y2T4pr8iaBgab9NFz3VpKqp97HtWh92E HTMCsqylppw3DNFAm6afl4E7Di86jbQsnnA27Dt9kQ6vkg3BC5AiAPSk55Yy4v5YOvOMee2B7ZZ /zKV2KQwZd1TDvCZ8jd1ug+QOTxQ2wcZf6dwYTQtiQiIoeDHAtZevAOJQ7otZBGmXXEN/WVW171 qW5Ge6Lx2AxqZOX9OiMXFoBSh5XHqLshxrsP5/V0MgnfkUySUhluQTMJToYKk= X-Google-Smtp-Source: AGHT+IEV38TyLfSKPYHV1CRGVy4zRH5M+/rjcv1gz+qw8RgwDKGY/6aR5/6lTle+LOsExHp5nlip6aiSOGyu7Pm1WA== X-Received: from pfbds12.prod.google.com ([2002:a05:6a00:4acc:b0:72a:bc6b:89ad]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a21:670d:b0:1e1:ce7d:c0cc with SMTP id adf61e73a8af0-1ed7a61aeefmr12572453637.38.1738271748853; Thu, 30 Jan 2025 13:15:48 -0800 (PST) Date: Thu, 30 Jan 2025 21:15:37 +0000 In-Reply-To: <20250130211539.428952-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250130211539.428952-1-almasrymina@google.com> X-Mailer: git-send-email 2.48.1.362.g079036d154-goog Message-ID: <20250130211539.428952-5-almasrymina@google.com> Subject: [PATCH RFC net-next v2 4/6] net: devmem: TCP tx netlink api From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kvm@vger.kernel.org, virtualization@lists.linux.dev, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , " =?utf-8?q?Eugenio_P=C3=A9rez?= " , Shuah Khan , sdf@fomichev.me, asml.silence@gmail.com, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC From: Stanislav Fomichev Add bind-tx netlink call to attach dmabuf for TX; queue is not required, only ifindex and dmabuf fd for attachment. Signed-off-by: Stanislav Fomichev Signed-off-by: Mina Almasry --- Documentation/netlink/specs/netdev.yaml | 12 ++++++++++++ include/uapi/linux/netdev.h | 1 + net/core/netdev-genl-gen.c | 13 +++++++++++++ net/core/netdev-genl-gen.h | 1 + net/core/netdev-genl.c | 6 ++++++ tools/include/uapi/linux/netdev.h | 1 + 6 files changed, 34 insertions(+) diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml index cbb544bd6c84..93f4333e7bc6 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -711,6 +711,18 @@ operations: - defer-hard-irqs - gro-flush-timeout - irq-suspend-timeout + - + name: bind-tx + doc: Bind dmabuf to netdev for TX + attribute-set: dmabuf + do: + request: + attributes: + - ifindex + - fd + reply: + attributes: + - id kernel-family: headers: [ "linux/list.h"] diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index e4be227d3ad6..04364ef5edbe 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -203,6 +203,7 @@ enum { NETDEV_CMD_QSTATS_GET, NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_SET, + NETDEV_CMD_BIND_TX, __NETDEV_CMD_MAX, NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1) diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c index 996ac6a449eb..9e947284e42d 100644 --- a/net/core/netdev-genl-gen.c +++ b/net/core/netdev-genl-gen.c @@ -99,6 +99,12 @@ static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_IRQ_SUSPE [NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT] = { .type = NLA_UINT, }, }; +/* NETDEV_CMD_BIND_TX - do */ +static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD + 1] = { + [NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), + [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, }, +}; + /* Ops table for netdev */ static const struct genl_split_ops netdev_nl_ops[] = { { @@ -190,6 +196,13 @@ static const struct genl_split_ops netdev_nl_ops[] = { .maxattr = NETDEV_A_NAPI_IRQ_SUSPEND_TIMEOUT, .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO, }, + { + .cmd = NETDEV_CMD_BIND_TX, + .doit = netdev_nl_bind_tx_doit, + .policy = netdev_bind_tx_nl_policy, + .maxattr = NETDEV_A_DMABUF_FD, + .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO, + }, }; static const struct genl_multicast_group netdev_nl_mcgrps[] = { diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h index e09dd7539ff2..c1fed66e92b9 100644 --- a/net/core/netdev-genl-gen.h +++ b/net/core/netdev-genl-gen.h @@ -34,6 +34,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info); +int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info); enum { NETDEV_NLGRP_MGMT, diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index 715f85c6b62e..0e41699df419 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -911,6 +911,12 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) return err; } +/* stub */ +int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info) +{ + return 0; +} + void netdev_nl_sock_priv_init(struct list_head *priv) { INIT_LIST_HEAD(priv); diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h index e4be227d3ad6..04364ef5edbe 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -203,6 +203,7 @@ enum { NETDEV_CMD_QSTATS_GET, NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_SET, + NETDEV_CMD_BIND_TX, __NETDEV_CMD_MAX, NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1) From patchwork Thu Jan 30 21:15:38 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13954985 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pl1-f202.google.com (mail-pl1-f202.google.com [209.85.214.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BD0861F37C9 for ; Thu, 30 Jan 2025 21:15:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271754; cv=none; b=nriwm+0c5Ful6NnTbiTcUoheTAvPb6uA5OE9dBqWgxpldehiQKT58U4gSGkPuDn3Z6URN28HEb1cICebR5Jbza6ZQq0ZdqMYrg/8y5tOs43Mi8qMoRs2bUj9K8XdkycPAdazXJCELUliOg9YKmXegMkkwpn9RLBaaiYPNlyOxj0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271754; c=relaxed/simple; bh=/uknsmzOoUo2AdnBk1rVUhiIg4unhsT5mhFtuI1Eqi8=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=Kwh7UeyiiSKEjT/A9eUnMcywsQeCzHyykZ7/bOTu1jHP1uLOxSOaaMOcyXmDZf1UkxM0T8FUnFtG7kZfZUqtzNJbLGYI+qh1ZqH7NaqjotWM4A9YyH3OpEW6iXVerPGJiV87KrbOHnrWwnIQC/nUAmnSaTeJ/mxEMI4p2BuO+lk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=tLuOLrMu; arc=none smtp.client-ip=209.85.214.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="tLuOLrMu" Received: by mail-pl1-f202.google.com with SMTP id d9443c01a7336-216430a88b0so26482835ad.0 for ; Thu, 30 Jan 2025 13:15:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1738271751; x=1738876551; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=fDk3QDS0XhOw3yXbWbE/Vg50WGFMxpH0ic4EdkM2RAQ=; b=tLuOLrMu43KQvQDXz52IfatxY8yOF3010RusGHcdM8MnOGP44ijIUkN4PGVPXT7QmN zYybSJReV4p6XmAVAOuDVLwMAxYJXIRz3NU22rNZckL8QQ20lFh07heqQGauLIYdvMfp 4yRpHAv4Tw9qAeW7jgZJ/xy4AYNkg/KBUiEHwXmDVAqSd8J4BnAKXZ6bz2EG3FTmRQXI KTdR7p3hgZsc9Upz17M+wEn9eWMsY7kurubImzl9y7dei0twFKFpLRhxosmpY3eCfwLc rOQ1nFtxErSPcPg7QzqZSWwI0WRUPHP2YJlkSF/c0zPSOTcODFApcUbrCs99cS46eZod rRQg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738271751; x=1738876551; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=fDk3QDS0XhOw3yXbWbE/Vg50WGFMxpH0ic4EdkM2RAQ=; b=j+Kv2nX7+cSQgKCCNZlyddlsC4u2wL95r8SZjIu1OTv0/QtzEAjy/850OmMmrfwdUX p6tWGVEsnZyIw+OYDmVlV1MnhzAWkcM0+lwhVA3MibGAQCY0Dx4dezK3Tu6x2NHrKjjp Z37YzCwSR3WmS/90HUEZBfdJYZAgwqdngLbDPr2qo7rkgWznG7vqeQfelss+cApABq7I xz4Vv1BN6kThlwB/L+ufbFAYO45O07vMGSwSNc1WKDqgH5j+5vCEfYqBtu1v+/Ufr1ag a4XLIDVpUseTFwsAd3NSup2wPjHfP//K/oJUuO1Q7Hyw2maH/uTBgt7KBDK17ApKTL3s PSYw== X-Gm-Message-State: AOJu0Yw48HCYNyAFstILh+Ov1CblnAKzUJ1GM3JVVfSaGWDC26ImvPfK vTTBteWwUc0nLHolWIW/gNW+6L2tC5Gw/hUmeC3LP+GqKQYTKTqQbE1EIvPXO1YVEP9Va0zWkrw VKFqhpbFIVJqADBYQezJwbS5+hqafy74f8P8tZ3OL9KVWLrtx7Ri5DAzgBKU611+pCPO7DYBFOX togRLXgP758YIPge4mT1ymv14atvs9R42IMtGBN7GlZj5bFKSAR/lxJMjQxXE= X-Google-Smtp-Source: AGHT+IHcaz6CmOrsYbzsvMJeE/blq+/BkvNmzMyDJWRKsl8YFQuxYBw7Zu7NR1mzLDYYc6WsnZx0PxjaPPLL1jiawQ== X-Received: from plbkp5.prod.google.com ([2002:a17:903:2805:b0:216:2c51:475d]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a17:902:e84b:b0:215:b087:5d62 with SMTP id d9443c01a7336-21dd7deec73mr138184945ad.36.1738271750486; Thu, 30 Jan 2025 13:15:50 -0800 (PST) Date: Thu, 30 Jan 2025 21:15:38 +0000 In-Reply-To: <20250130211539.428952-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250130211539.428952-1-almasrymina@google.com> X-Mailer: git-send-email 2.48.1.362.g079036d154-goog Message-ID: <20250130211539.428952-6-almasrymina@google.com> Subject: [PATCH RFC net-next v2 5/6] net: devmem: Implement TX path From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kvm@vger.kernel.org, virtualization@lists.linux.dev, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , " =?utf-8?q?Eugenio_P=C3=A9rez?= " , Shuah Khan , sdf@fomichev.me, asml.silence@gmail.com, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela , Kaiyuan Zhang X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC Augment dmabuf binding to be able to handle TX. Additional to all the RX binding, we also create tx_vec needed for the TX path. Provide API for sendmsg to be able to send dmabufs bound to this device: - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from. - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf. Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying. We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems. We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages. Based on work by Stanislav Fomichev . A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov. Cc: Stanislav Fomichev Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- v2: - Remove dmabuf_offset from the dmabuf cmsg. - Update zerocopy_fill_skb_from_devmem to interpret the iov_base/iter_iov_addr as the offset into the dmabuf to send from (Stan). - Remove the confusing binding->tx_iter which is not needed if we interpret the iov_base/iter_iov_addr as offset into the dmabuf (Stan). - Remove check for binding->sgt and binding->sgt->nents in dmabuf binding. - Simplify the calculation of binding->tx_vec. - Check in net_devmem_get_binding that the binding we're returning has ifindex matching the sending socket (Willem). --- include/linux/skbuff.h | 15 +++- include/net/sock.h | 1 + include/uapi/linux/uio.h | 6 +- net/core/datagram.c | 41 ++++++++++- net/core/devmem.c | 96 +++++++++++++++++++++++-- net/core/devmem.h | 42 +++++++++-- net/core/netdev-genl.c | 65 ++++++++++++++++- net/core/skbuff.c | 6 +- net/core/sock.c | 8 +++ net/ipv4/tcp.c | 36 +++++++--- net/vmw_vsock/virtio_transport_common.c | 3 +- 11 files changed, 285 insertions(+), 34 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index bb2b751d274a..3ff8f568c382 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1711,9 +1711,12 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size, void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref); +struct net_devmem_dmabuf_binding; + int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, struct sk_buff *skb, struct iov_iter *from, - size_t length); + size_t length, + struct net_devmem_dmabuf_binding *binding); int zerocopy_fill_skb_from_iter(struct sk_buff *skb, struct iov_iter *from, size_t length); @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb, static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len) { - return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len); + return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len, + NULL); } int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, struct msghdr *msg, int len, - struct ubuf_info *uarg); + struct ubuf_info *uarg, + struct net_devmem_dmabuf_binding *binding); /* Internal */ #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB))) @@ -3697,6 +3702,10 @@ static inline dma_addr_t __skb_frag_dma_map(struct device *dev, size_t offset, size_t size, enum dma_data_direction dir) { + if (skb_frag_is_net_iov(frag)) { + return netmem_to_net_iov(frag->netmem)->dma_addr + offset + + frag->offset; + } return dma_map_page(dev, skb_frag_page(frag), skb_frag_off(frag) + offset, size, dir); } diff --git a/include/net/sock.h b/include/net/sock.h index 8036b3b79cd8..09eb918525b6 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1822,6 +1822,7 @@ struct sockcm_cookie { u32 tsflags; u32 ts_opt_id; u32 priority; + u32 dmabuf_id; }; static inline void sockcm_init(struct sockcm_cookie *sockc, diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 649739e0c404..866bd5dfe39f 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -38,10 +38,14 @@ struct dmabuf_token { __u32 token_count; }; +struct dmabuf_tx_cmsg { + __u32 dmabuf_id; +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ - + #define UIO_FASTIOV 8 #define UIO_MAXIOV 1024 diff --git a/net/core/datagram.c b/net/core/datagram.c index f0693707aece..c989606ff58d 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -63,6 +63,8 @@ #include #include +#include "devmem.h" + /* * Is a socket 'connection oriented' ? */ @@ -692,9 +694,42 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb, return 0; } +static int +zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from, + int length, + struct net_devmem_dmabuf_binding *binding) +{ + int i = skb_shinfo(skb)->nr_frags; + size_t virt_addr, size, off; + struct net_iov *niov; + + while (length && iov_iter_count(from)) { + if (i == MAX_SKB_FRAGS) + return -EMSGSIZE; + + virt_addr = (size_t)iter_iov_addr(from); + niov = net_devmem_get_niov_at(binding, virt_addr, &off, &size); + if (!niov) + return -EFAULT; + + size = min_t(size_t, size, length); + size = min_t(size_t, size, iter_iov_len(from)); + + get_netmem(net_iov_to_netmem(niov)); + skb_add_rx_frag_netmem(skb, i, net_iov_to_netmem(niov), off, + size, PAGE_SIZE); + iov_iter_advance(from, size); + length -= size; + i++; + } + + return 0; +} + int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, struct sk_buff *skb, struct iov_iter *from, - size_t length) + size_t length, + struct net_devmem_dmabuf_binding *binding) { unsigned long orig_size = skb->truesize; unsigned long truesize; @@ -702,6 +737,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, if (msg && msg->msg_ubuf && msg->sg_from_iter) ret = msg->sg_from_iter(skb, from, length); + else if (unlikely(binding)) + ret = zerocopy_fill_skb_from_devmem(skb, from, length, binding); else ret = zerocopy_fill_skb_from_iter(skb, from, length); @@ -735,7 +772,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from) if (skb_copy_datagram_from_iter(skb, 0, from, copy)) return -EFAULT; - return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U); + return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U, NULL); } EXPORT_SYMBOL(zerocopy_sg_from_iter); diff --git a/net/core/devmem.c b/net/core/devmem.c index 20985a570662..796338b1599e 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "devmem.h" @@ -64,8 +65,10 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) dma_buf_detach(binding->dmabuf, binding->attachment); dma_buf_put(binding->dmabuf); xa_destroy(&binding->bound_rxqs); + kfree(binding->tx_vec); kfree(binding); } +EXPORT_SYMBOL(__net_devmem_dmabuf_binding_free); struct net_iov * net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) @@ -110,6 +113,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) unsigned long xa_idx; unsigned int rxq_idx; + xa_erase(&net_devmem_dmabuf_bindings, binding->id); + + /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the + * erase. + */ + synchronize_net(); + if (binding->list.next) list_del(&binding->list); @@ -123,8 +133,6 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) WARN_ON(netdev_rx_queue_restart(binding->dev, rxq_idx)); } - xa_erase(&net_devmem_dmabuf_bindings, binding->id); - net_devmem_dmabuf_binding_put(binding); } @@ -185,8 +193,9 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, } struct net_devmem_dmabuf_binding * -net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, - struct netlink_ext_ack *extack) +net_devmem_bind_dmabuf(struct net_device *dev, + enum dma_data_direction direction, + unsigned int dmabuf_fd, struct netlink_ext_ack *extack) { struct net_devmem_dmabuf_binding *binding; static u32 id_alloc_next; @@ -229,7 +238,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, } binding->sgt = dma_buf_map_attachment_unlocked(binding->attachment, - DMA_FROM_DEVICE); + direction); if (IS_ERR(binding->sgt)) { err = PTR_ERR(binding->sgt); NL_SET_ERR_MSG(extack, "Failed to map dmabuf attachment"); @@ -240,13 +249,22 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, * binding can be much more flexible than that. We may be able to * allocate MTU sized chunks here. Leave that for future work... */ - binding->chunk_pool = - gen_pool_create(PAGE_SHIFT, dev_to_node(&dev->dev)); + binding->chunk_pool = gen_pool_create(PAGE_SHIFT, + dev_to_node(&dev->dev)); if (!binding->chunk_pool) { err = -ENOMEM; goto err_unmap; } + if (direction == DMA_TO_DEVICE) { + binding->tx_vec = kcalloc(dmabuf->size / PAGE_SIZE, + sizeof(struct net_iov *), GFP_KERNEL); + if (!binding->tx_vec) { + err = -ENOMEM; + goto err_free_chunks; + } + } + virtual = 0; for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) { dma_addr_t dma_addr = sg_dma_address(sg); @@ -288,6 +306,8 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, niov->owner = owner; page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov), net_devmem_get_dma_addr(niov)); + if (direction == DMA_TO_DEVICE) + binding->tx_vec[owner->base_virtual / PAGE_SIZE + i] = niov; } virtual += len; @@ -313,6 +333,21 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, return ERR_PTR(err); } +struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id) +{ + struct net_devmem_dmabuf_binding *binding; + + rcu_read_lock(); + binding = xa_load(&net_devmem_dmabuf_bindings, id); + if (binding) { + if (!net_devmem_dmabuf_binding_get(binding)) + binding = NULL; + } + rcu_read_unlock(); + + return binding; +} + void dev_dmabuf_uninstall(struct net_device *dev) { struct net_devmem_dmabuf_binding *binding; @@ -343,6 +378,53 @@ void net_devmem_put_net_iov(struct net_iov *niov) net_devmem_dmabuf_binding_put(niov->owner->binding); } +struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk, + unsigned int dmabuf_id) +{ + struct net_devmem_dmabuf_binding *binding; + struct dst_entry *dst = __sk_dst_get(sk); + int err = 0; + + binding = net_devmem_lookup_dmabuf(dmabuf_id); + if (!binding || !binding->tx_vec) { + err = -EINVAL; + goto out_err; + } + + /* The dma-addrs in this binding are only reachable to the corresponding + * net_device. + */ + if (!dst || !dst->dev || dst->dev->ifindex != binding->dev->ifindex) { + err = -ENODEV; + goto out_err; + } + + return binding; + +out_err: + if (binding) + net_devmem_dmabuf_binding_put(binding); + + return ERR_PTR(err); +} + +struct net_iov * +net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, + size_t virt_addr, size_t *off, size_t *size) +{ + size_t idx; + + if (virt_addr >= binding->dmabuf->size) + return NULL; + + idx = virt_addr / PAGE_SIZE; + + *off = virt_addr % PAGE_SIZE; + *size = PAGE_SIZE - *off; + + return binding->tx_vec[idx]; +} + /*** "Dmabuf devmem memory provider" ***/ int mp_dmabuf_devmem_init(struct page_pool *pool) diff --git a/net/core/devmem.h b/net/core/devmem.h index 8b51caff5a0e..874e891e70e0 100644 --- a/net/core/devmem.h +++ b/net/core/devmem.h @@ -46,6 +46,12 @@ struct net_devmem_dmabuf_binding { * active. */ u32 id; + + /* Array of net_iov pointers for this binding, sorted by virtual + * address. This array is convenient to map the virtual addresses to + * net_iovs in the TX path. + */ + struct net_iov **tx_vec; }; #if defined(CONFIG_NET_DEVMEM) @@ -70,12 +76,15 @@ struct dmabuf_genpool_chunk_owner { void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding); struct net_devmem_dmabuf_binding * -net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, - struct netlink_ext_ack *extack); +net_devmem_bind_dmabuf(struct net_device *dev, + enum dma_data_direction direction, + unsigned int dmabuf_fd, struct netlink_ext_ack *extack); +struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id); void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding); int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, struct net_devmem_dmabuf_binding *binding, struct netlink_ext_ack *extack); +void net_devmem_bind_tx_release(struct sock *sk); void dev_dmabuf_uninstall(struct net_device *dev); static inline struct dmabuf_genpool_chunk_owner * @@ -108,10 +117,10 @@ static inline u32 net_iov_binding_id(const struct net_iov *niov) return net_iov_owner(niov)->binding->id; } -static inline void +static inline bool net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding) { - refcount_inc(&binding->ref); + return refcount_inc_not_zero(&binding->ref); } static inline void @@ -130,6 +139,12 @@ struct net_iov * net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding); void net_devmem_free_dmabuf(struct net_iov *ppiov); +struct net_devmem_dmabuf_binding * +net_devmem_get_binding(struct sock *sk, unsigned int dmabuf_id); +struct net_iov * +net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr, + size_t *off, size_t *size); + #else struct net_devmem_dmabuf_binding; @@ -153,11 +168,17 @@ __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) static inline struct net_devmem_dmabuf_binding * net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, + enum dma_data_direction direction, struct netlink_ext_ack *extack) { return ERR_PTR(-EOPNOTSUPP); } +static inline struct net_devmem_dmabuf_binding *net_devmem_lookup_dmabuf(u32 id) +{ + return NULL; +} + static inline void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) { @@ -195,6 +216,19 @@ static inline u32 net_iov_binding_id(const struct net_iov *niov) { return 0; } + +static inline struct net_devmem_dmabuf_binding * +net_devmem_get_binding(struct sock *sk, unsigned int dmabuf_id) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +static inline struct net_iov * +net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding, size_t addr, + size_t *off, size_t *size) +{ + return NULL; +} #endif #endif /* _NET_DEVMEM_H */ diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index 0e41699df419..9ba6994e2a05 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -854,7 +854,8 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) goto err_unlock; } - binding = net_devmem_bind_dmabuf(netdev, dmabuf_fd, info->extack); + binding = net_devmem_bind_dmabuf(netdev, DMA_FROM_DEVICE, dmabuf_fd, + info->extack); if (IS_ERR(binding)) { err = PTR_ERR(binding); goto err_unlock; @@ -911,10 +912,68 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) return err; } -/* stub */ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info) { - return 0; + struct net_devmem_dmabuf_binding *binding; + struct list_head *sock_binding_list; + struct net_device *netdev; + u32 ifindex, dmabuf_fd; + struct sk_buff *rsp; + int err = 0; + void *hdr; + + if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_DEV_IFINDEX) || + GENL_REQ_ATTR_CHECK(info, NETDEV_A_DMABUF_FD)) + return -EINVAL; + + ifindex = nla_get_u32(info->attrs[NETDEV_A_DEV_IFINDEX]); + dmabuf_fd = nla_get_u32(info->attrs[NETDEV_A_DMABUF_FD]); + + sock_binding_list = + genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk); + if (IS_ERR(sock_binding_list)) + return PTR_ERR(sock_binding_list); + + rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!rsp) + return -ENOMEM; + + hdr = genlmsg_iput(rsp, info); + if (!hdr) { + err = -EMSGSIZE; + goto err_genlmsg_free; + } + + rtnl_lock(); + + netdev = __dev_get_by_index(genl_info_net(info), ifindex); + if (!netdev || !netif_device_present(netdev)) { + err = -ENODEV; + goto err_unlock; + } + + binding = net_devmem_bind_dmabuf(netdev, DMA_TO_DEVICE, dmabuf_fd, + info->extack); + if (IS_ERR(binding)) { + err = PTR_ERR(binding); + goto err_unlock; + } + + list_add(&binding->list, sock_binding_list); + + nla_put_u32(rsp, NETDEV_A_DMABUF_ID, binding->id); + genlmsg_end(rsp, hdr); + + rtnl_unlock(); + + return genlmsg_reply(rsp, info); + + net_devmem_unbind_dmabuf(binding); +err_unlock: + rtnl_unlock(); +err_genlmsg_free: + nlmsg_free(rsp); + return err; } void netdev_nl_sock_priv_init(struct list_head *priv) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 815245d5c36b..6289ffcbb20b 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1882,7 +1882,8 @@ EXPORT_SYMBOL_GPL(msg_zerocopy_ubuf_ops); int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, struct msghdr *msg, int len, - struct ubuf_info *uarg) + struct ubuf_info *uarg, + struct net_devmem_dmabuf_binding *binding) { int err, orig_len = skb->len; @@ -1901,7 +1902,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb, return -EEXIST; } - err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len); + err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len, + binding); if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) { struct sock *save_sk = skb->sk; diff --git a/net/core/sock.c b/net/core/sock.c index eae2ae70a2e0..353669f124ab 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2911,6 +2911,7 @@ EXPORT_SYMBOL(sock_alloc_send_pskb); int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg, struct sockcm_cookie *sockc) { + struct dmabuf_tx_cmsg dmabuf_tx; u32 tsflags; BUILD_BUG_ON(SOF_TIMESTAMPING_LAST == (1 << 31)); @@ -2964,6 +2965,13 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg, if (!sk_set_prio_allowed(sk, *(u32 *)CMSG_DATA(cmsg))) return -EPERM; sockc->priority = *(u32 *)CMSG_DATA(cmsg); + break; + case SCM_DEVMEM_DMABUF: + if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct dmabuf_tx_cmsg))) + return -EINVAL; + dmabuf_tx = *(struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg); + sockc->dmabuf_id = dmabuf_tx.dmabuf_id; + break; default: return -EINVAL; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 0d704bda6c41..44198ae7e44c 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1051,6 +1051,7 @@ int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, int *copied, int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) { + struct net_devmem_dmabuf_binding *binding = NULL; struct tcp_sock *tp = tcp_sk(sk); struct ubuf_info *uarg = NULL; struct sk_buff *skb; @@ -1063,6 +1064,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) flags = msg->msg_flags; + sockcm_init(&sockc, sk); + if (msg->msg_controllen) { + err = sock_cmsg_send(sk, msg, &sockc); + if (unlikely(err)) { + err = -EINVAL; + goto out_err; + } + } + if ((flags & MSG_ZEROCOPY) && size) { if (msg->msg_ubuf) { uarg = msg->msg_ubuf; @@ -1080,6 +1090,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) else uarg_to_msgzc(uarg)->zerocopy = 0; } + + if (sockc.dmabuf_id != 0) { + binding = net_devmem_get_binding(sk, sockc.dmabuf_id); + if (IS_ERR(binding)) { + err = PTR_ERR(binding); + binding = NULL; + goto out_err; + } + } } else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) { if (sk->sk_route_caps & NETIF_F_SG) zc = MSG_SPLICE_PAGES; @@ -1123,15 +1142,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) /* 'common' sending to sendq */ } - sockcm_init(&sockc, sk); - if (msg->msg_controllen) { - err = sock_cmsg_send(sk, msg, &sockc); - if (unlikely(err)) { - err = -EINVAL; - goto out_err; - } - } - /* This should be in poll */ sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); @@ -1248,7 +1258,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) goto wait_for_space; } - err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg); + err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg, + binding); if (err == -EMSGSIZE || err == -EEXIST) { tcp_mark_push(tp, skb); goto new_segment; @@ -1329,6 +1340,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) /* msg->msg_ubuf is pinned by the caller so we don't take extra refs */ if (uarg && !msg->msg_ubuf) net_zcopy_put(uarg); + if (binding) + net_devmem_dmabuf_binding_put(binding); return copied + copied_syn; do_error: @@ -1346,6 +1359,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) sk->sk_write_space(sk); tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED); } + if (binding) + net_devmem_dmabuf_binding_put(binding); + return err; } EXPORT_SYMBOL_GPL(tcp_sendmsg_locked); diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 7f7de6d88096..f6d4bb798517 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -107,8 +107,7 @@ static int virtio_transport_fill_skb(struct sk_buff *skb, { if (zcopy) return __zerocopy_sg_from_iter(info->msg, NULL, skb, - &info->msg->msg_iter, - len); + &info->msg->msg_iter, len, NULL); return memcpy_from_msg(skb_put(skb, len), info->msg, len); } From patchwork Thu Jan 30 21:15:39 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 13954986 X-Patchwork-Delegate: kuba@kernel.org Received: from mail-pj1-f74.google.com (mail-pj1-f74.google.com [209.85.216.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DB4651F3D55 for ; Thu, 30 Jan 2025 21:15:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.74 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271755; cv=none; b=AULScZbYopvfJuY41rdBM7HNoRpor7YezJzYVXjtjwVilse139PeZ6ip3rJuhXrzi+WIBNwjwJ96u6/2ORGbFFWWycnj13AJI0uWYlwNEfHgeZ1Fn04IhfPfuKN2yRYvkKOToR9ybMiN1vA61vW3sFM40TpC21y0zqVAGrq+ESY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738271755; c=relaxed/simple; bh=x4IKxvmNWD/4EDjE6znR+TcbhAdsP8iIvWcJKwYoK60=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=h6ofL9cIMjXzLJTnuvtMV5xI+MuXm8Eejx7L+cCq9Er/AtjW6waKIOLgYd306M7bNXePVe3FezAZeuxYgwIZvxNfHOoVct6QZ2Kj8X/UtFv0P+oJwlvHeeQdr+phTC5VAxhZhUy8h48MYjiKk6yDudUqBKBv2EpMNWJm2FmP0c0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=KZ7Bmtlu; arc=none smtp.client-ip=209.85.216.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="KZ7Bmtlu" Received: by mail-pj1-f74.google.com with SMTP id 98e67ed59e1d1-2ee9f66cb12so2499953a91.1 for ; Thu, 30 Jan 2025 13:15:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1738271752; x=1738876552; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=d1EKwdWQ1Gk6X/oh63eg9ipo10zb+OP7eOCU+VWvx8c=; b=KZ7BmtluVhcCJn9IUHVV8r7ob/n3rn/w2JpW+8/u2fdB+TESc9Z79iTaZnmj3LIv75 z7GUr5MJm+7U0/ob/r2Gvl9JEfezcVefaU8SMziL7NoY2OnhNmWAJMvZFJRCTPqkgIhW nBmGSnbt4aPzxLVR6qHD4jrIJv23UIdC+B5X3A8JE9ZZLqXXbPqmSDIrf0jNo0/enqXv 1g5AESwi4GMDWvk/t1nFEXYmtVMBjvJHf64QkRVN0ObL9Sk/uo6nysCvxUSCkVPb2+uo S6nP+KztP6XpIkBZSCqJRZXKmTrI96FyMa8E2MQDHzL6t95m68zl7iqw8FKd72JESeLJ P6UQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738271752; x=1738876552; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=d1EKwdWQ1Gk6X/oh63eg9ipo10zb+OP7eOCU+VWvx8c=; b=s2tA6HWGi1NutjiZawi3U3LbHu6UO7cydagpKyEC7n/k7b9RArAXT8AOWQueXsQK91 F0mwqiNTo1cT+gghzmIQDtx2OUHca7xEbbr8btBZV+xL3DhpxOUawsRQP4QtZFpVusxh T9OtTMCyQQeBzPuQTMht4lJiu6JlBx6JUW6VHBUZrtx05ojsUPVVTCdxd5+lNDPj3wt1 e44OCM/gIjCyUqL1PiWdb7DGDJHcBh5znCPdxrTxw/0iS6LmI5HzbyD4pLsSvhCrBo87 /JPLNpebOBocCXeVO+Coh25iOM5uG204E+zfqqVJ0/TIy79WGHU2MgButLWOhCGEkguY A/ww== X-Gm-Message-State: AOJu0YwJe214XZ5akpkTPgzxwwipRL082R66Wgfptm5nTwVY/UNGDjAD GHMs79leCDiuT0x7hlCuwRHjLxiCEkBzda6drBM65lVM05pn4WuzhzjwYqXYuVWrf7dt1kPmFRQ 0bW0dMjePHUyJPTtS+2JCY0az5k6gAmffAp0Vko+DsXEaWgbgioEeIrCVQhT2PrTJ8C+15pdRQA 21pgN3HbOpaHS3wo5iFYa8VYY6On/7UCQQ8/Dw22wNwZLx6FdA9Y/6F5iaP5E= X-Google-Smtp-Source: AGHT+IEu842jzz9/l2yJnwVinJKP7fFjPrUiJo2kyVUr6Y5tdahD7orUVlBr0+J+ihAQ9NwxSAcZpEbIb/SaC3F/GA== X-Received: from pjbsw11.prod.google.com ([2002:a17:90b:2c8b:b0:2ef:a732:f48d]) (user=almasrymina job=prod-delivery.src-stubby-dispatcher) by 2002:a17:90b:4ec8:b0:2ee:f677:aa14 with SMTP id 98e67ed59e1d1-2f83abea7c9mr13096491a91.13.1738271752211; Thu, 30 Jan 2025 13:15:52 -0800 (PST) Date: Thu, 30 Jan 2025 21:15:39 +0000 In-Reply-To: <20250130211539.428952-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250130211539.428952-1-almasrymina@google.com> X-Mailer: git-send-email 2.48.1.362.g079036d154-goog Message-ID: <20250130211539.428952-7-almasrymina@google.com> Subject: [PATCH RFC net-next v2 6/6] net: devmem: make dmabuf unbinding scheduled work From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kvm@vger.kernel.org, virtualization@lists.linux.dev, linux-kselftest@vger.kernel.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Donald Hunter , Jonathan Corbet , Andrew Lunn , David Ahern , Stefan Hajnoczi , Stefano Garzarella , "Michael S. Tsirkin" , Jason Wang , Xuan Zhuo , " =?utf-8?q?Eugenio_P=C3=A9rez?= " , Shuah Khan , sdf@fomichev.me, asml.silence@gmail.com, dw@davidwei.uk, Jamal Hadi Salim , Victor Nogueira , Pedro Tammela X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf, resulting in a BUG like seen by Stan: [ 1.548495] BUG: sleeping function called from invalid context at drivers/dma-buf/dma-buf.c:1255 [ 1.548741] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 149, name: ncdevmem [ 1.548926] preempt_count: 201, expected: 0 [ 1.549026] RCU nest depth: 0, expected: 0 [ 1.549197] [ 1.549237] ============================= [ 1.549331] [ BUG: Invalid wait context ] [ 1.549425] 6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15 Tainted: G W [ 1.549609] ----------------------------- [ 1.549704] ncdevmem/149 is trying to lock: [ 1.549801] ffff8880066701c0 (reservation_ww_class_mutex){+.+.}-{4:4}, at: dma_buf_unmap_attachment_unlocked+0x4b/0x90 [ 1.550051] other info that might help us debug this: [ 1.550167] context-{5:5} [ 1.550229] 3 locks held by ncdevmem/149: [ 1.550322] #0: ffff888005730208 (&sb->s_type->i_mutex_key#11){+.+.}-{4:4}, at: sock_close+0x40/0xf0 [ 1.550530] #1: ffff88800b148f98 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x19/0x80 [ 1.550731] #2: ffff88800b148f18 (slock-AF_INET6){+.-.}-{3:3}, at: __tcp_close+0x185/0x4b0 [ 1.550921] stack backtrace: [ 1.550990] CPU: 0 UID: 0 PID: 149 Comm: ncdevmem Tainted: G W 6.13.0-rc3-00770-gbc9ef9606dc9-dirty #15 [ 1.551233] Tainted: [W]=WARN [ 1.551304] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 1.551518] Call Trace: [ 1.551584] [ 1.551636] dump_stack_lvl+0x86/0xc0 [ 1.551723] __lock_acquire+0xb0f/0xc30 [ 1.551814] ? dma_buf_unmap_attachment_unlocked+0x4b/0x90 [ 1.551941] lock_acquire+0xf1/0x2a0 [ 1.552026] ? dma_buf_unmap_attachment_unlocked+0x4b/0x90 [ 1.552152] ? dma_buf_unmap_attachment_unlocked+0x4b/0x90 [ 1.552281] ? dma_buf_unmap_attachment_unlocked+0x4b/0x90 [ 1.552408] __ww_mutex_lock+0x121/0x1060 [ 1.552503] ? dma_buf_unmap_attachment_unlocked+0x4b/0x90 [ 1.552648] ww_mutex_lock+0x3d/0xa0 [ 1.552733] dma_buf_unmap_attachment_unlocked+0x4b/0x90 [ 1.552857] __net_devmem_dmabuf_binding_free+0x56/0xb0 [ 1.552979] skb_release_data+0x120/0x1f0 [ 1.553074] __kfree_skb+0x29/0xa0 [ 1.553156] tcp_write_queue_purge+0x41/0x310 [ 1.553259] tcp_v4_destroy_sock+0x127/0x320 [ 1.553363] ? __tcp_close+0x169/0x4b0 [ 1.553452] inet_csk_destroy_sock+0x53/0x130 [ 1.553560] __tcp_close+0x421/0x4b0 [ 1.553646] tcp_close+0x24/0x80 [ 1.553724] inet_release+0x5d/0x90 [ 1.553806] sock_close+0x4a/0xf0 [ 1.553886] __fput+0x9c/0x2b0 [ 1.553960] task_work_run+0x89/0xc0 [ 1.554046] do_exit+0x27f/0x980 [ 1.554125] do_group_exit+0xa4/0xb0 [ 1.554211] __x64_sys_exit_group+0x17/0x20 [ 1.554309] x64_sys_call+0x21a0/0x21a0 [ 1.554400] do_syscall_64+0xec/0x1d0 [ 1.554487] ? exc_page_fault+0x8a/0xf0 [ 1.554585] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 1.554703] RIP: 0033:0x7f2f8a27abcd Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd. Suggested-by: Stanislav Fomichev Signed-off-by: Mina Almasry --- net/core/devmem.c | 4 +++- net/core/devmem.h | 10 ++++++---- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/net/core/devmem.c b/net/core/devmem.c index 796338b1599e..58fcae0c0c69 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -46,8 +46,10 @@ static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov) ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT); } -void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) +void __net_devmem_dmabuf_binding_free(struct work_struct *wq) { + struct net_devmem_dmabuf_binding *binding = container_of(wq, typeof(*binding), unbind_w); + size_t size, avail; gen_pool_for_each_chunk(binding->chunk_pool, diff --git a/net/core/devmem.h b/net/core/devmem.h index 874e891e70e0..63d16dbaca2d 100644 --- a/net/core/devmem.h +++ b/net/core/devmem.h @@ -52,6 +52,8 @@ struct net_devmem_dmabuf_binding { * net_iovs in the TX path. */ struct net_iov **tx_vec; + + struct work_struct unbind_w; }; #if defined(CONFIG_NET_DEVMEM) @@ -74,7 +76,7 @@ struct dmabuf_genpool_chunk_owner { struct net_devmem_dmabuf_binding *binding; }; -void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding); +void __net_devmem_dmabuf_binding_free(struct work_struct *wq); struct net_devmem_dmabuf_binding * net_devmem_bind_dmabuf(struct net_device *dev, enum dma_data_direction direction, @@ -129,7 +131,8 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding) if (!refcount_dec_and_test(&binding->ref)) return; - __net_devmem_dmabuf_binding_free(binding); + INIT_WORK(&binding->unbind_w, __net_devmem_dmabuf_binding_free); + schedule_work(&binding->unbind_w); } void net_devmem_get_net_iov(struct net_iov *niov); @@ -161,8 +164,7 @@ static inline void net_devmem_put_net_iov(struct net_iov *niov) { } -static inline void -__net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) +static inline void __net_devmem_dmabuf_binding_free(struct work_struct *wq) { }