From patchwork Thu Mar 13 23:32:53 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016049
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    Xiubo Li
Subject: [RFC PATCH 01/35] ceph: Fix incorrect flush end position calculation
Date: Thu, 13 Mar 2025 23:32:53 +0000
Message-ID: <20250313233341.1675324-2-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

In ceph, in fill_fscrypt_truncate(), the end flush position is calculated
by:

	loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SHIFT - 1;

but that adds the block shift, not the block size. Fix this to use the
block size instead.

Fixes: 5c64737d2536 ("ceph: add truncate size handling support for fscrypt")
Signed-off-by: David Howells
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Xiubo Li
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/ceph/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index c15970fa240f..b060f765ad20 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2364,7 +2364,7 @@ static int fill_fscrypt_truncate(struct inode *inode,

 	/* Try to writeback the dirty pagecaches */
 	if (issued & (CEPH_CAP_FILE_BUFFER)) {
-		loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SHIFT - 1;
+		loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SIZE - 1;

 		ret = filemap_write_and_wait_range(inode->i_mapping,
 						   orig_pos, lend);
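
To make the size of the error concrete, here is a small standalone sketch
(not part of the patch), assuming the in-tree values
CEPH_FSCRYPT_BLOCK_SHIFT == 12 and CEPH_FSCRYPT_BLOCK_SIZE == 4096:

    /* Standalone sketch, not part of the patch: the constants below
     * assume the in-tree values (shift 12, size 1 << 12). */
    #include <stdio.h>

    #define CEPH_FSCRYPT_BLOCK_SHIFT 12
    #define CEPH_FSCRYPT_BLOCK_SIZE  (1 << CEPH_FSCRYPT_BLOCK_SHIFT)

    int main(void)
    {
            long long orig_pos = 8192;      /* a block-aligned position */

            /* Buggy: flushes only 11 bytes beyond orig_pos. */
            long long bad  = orig_pos + CEPH_FSCRYPT_BLOCK_SHIFT - 1;
            /* Fixed: flushes to the last byte of the 4KiB block. */
            long long good = orig_pos + CEPH_FSCRYPT_BLOCK_SIZE - 1;

            printf("bad lend=%lld, good lend=%lld\n", bad, good);
            return 0;       /* prints bad lend=8203, good lend=12287 */
    }

The buggy end position stops the writeback range 4085 bytes short of the
block boundary that the truncate actually needs flushed.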
From patchwork Thu Mar 13 23:32:54 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016050
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 02/35] libceph: Rename alignment to offset
Date: Thu, 13 Mar 2025 23:32:54 +0000
Message-ID: <20250313233341.1675324-3-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Rename 'alignment' to 'offset' in a number of places where it is really
talking about the offset into the first page of a sequence of pages.
Signed-off-by: David Howells
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/ceph/addr.c                  |  4 ++--
 include/linux/ceph/messenger.h  |  4 ++--
 include/linux/ceph/osd_client.h | 10 +++++-----
 net/ceph/messenger.c            | 10 +++++-----
 net/ceph/osd_client.c           | 24 ++++++++++++------------
 5 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 20b6bd8cd004..482a9f41a685 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -254,7 +254,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
 	if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
 		ceph_put_page_vector(osd_data->pages,
-				     calc_pages_for(osd_data->alignment,
+				     calc_pages_for(osd_data->offset,
						    osd_data->length),
				     false);
 	}
 	if (err > 0) {
@@ -918,7 +918,7 @@ static void writepages_finish(struct ceph_osd_request *req)
 		osd_data = osd_req_op_extent_osd_data(req, i);
 		BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
 		len += osd_data->length;
-		num_pages = calc_pages_for((u64)osd_data->alignment,
+		num_pages = calc_pages_for((u64)osd_data->offset,
					   (u64)osd_data->length);
 		total_pages += num_pages;
 		for (j = 0; j < num_pages; j++) {
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 1717cc57cdac..db2aba32b8a0 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -221,7 +221,7 @@ struct ceph_msg_data {
 		struct {
 			struct page	**pages;
 			size_t		length;		/* total # bytes */
-			unsigned int	alignment;	/* first page */
+			unsigned int	offset;		/* first page */
 			bool		own_pages;
 		};
 		struct ceph_pagelist	*pagelist;
@@ -602,7 +602,7 @@ extern bool ceph_con_keepalive_expired(struct ceph_connection *con,
					unsigned long interval);
 void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
-			     size_t length, size_t alignment, bool own_pages);
+			     size_t length, size_t offset, bool own_pages);
 extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
				       struct ceph_pagelist *pagelist);
 #ifdef CONFIG_BLOCK
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index d55b30057a45..8fc84f389aad 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -118,7 +118,7 @@ struct ceph_osd_data {
 		struct {
 			struct page	**pages;
 			u64		length;
-			u32		alignment;
+			u32		offset;
			bool		pages_from_pool;
			bool		own_pages;
 		};
@@ -469,7 +469,7 @@ struct ceph_osd_req_op *osd_req_op_init(struct ceph_osd_request *osd_req,
 extern void osd_req_op_raw_data_in_pages(struct ceph_osd_request *,
					unsigned int which,
					struct page **pages, u64 length,
-					u32 alignment, bool pages_from_pool,
+					u32 offset, bool pages_from_pool,
					bool own_pages);

 extern void osd_req_op_extent_init(struct ceph_osd_request *osd_req,
@@ -488,7 +488,7 @@ extern struct ceph_osd_data *osd_req_op_extent_osd_data(
 extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *,
					unsigned int which,
					struct page **pages, u64 length,
-					u32 alignment, bool pages_from_pool,
+					u32 offset, bool pages_from_pool,
					bool own_pages);
 extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *,
					unsigned int which,
@@ -515,7 +515,7 @@ extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *,
 extern void osd_req_op_cls_request_data_pages(struct ceph_osd_request *,
					unsigned int which,
					struct page **pages, u64 length,
-					u32 alignment, bool pages_from_pool,
+					u32 offset, bool pages_from_pool,
					bool own_pages);
 void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
				       unsigned int which,
@@ -524,7 +524,7 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
 extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *,
					unsigned int which,
					struct page **pages, u64 length,
-					u32 alignment, bool pages_from_pool,
+					u32 offset, bool pages_from_pool,
					bool own_pages);
 int osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which,
			const char *class, const char *method);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index d1b5705dc0c6..1df4291cc80b 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -840,8 +840,8 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
 	BUG_ON(!data->length);

 	cursor->resid = min(length, data->length);
-	page_count = calc_pages_for(data->alignment, (u64)data->length);
-	cursor->page_offset = data->alignment & ~PAGE_MASK;
+	page_count = calc_pages_for(data->offset, (u64)data->length);
+	cursor->page_offset = data->offset & ~PAGE_MASK;
 	cursor->page_index = 0;
 	BUG_ON(page_count > (int)USHRT_MAX);
 	cursor->page_count = (unsigned short)page_count;
@@ -1873,7 +1873,7 @@ static struct ceph_msg_data *ceph_msg_data_add(struct ceph_msg *msg)
 static void ceph_msg_data_destroy(struct ceph_msg_data *data)
 {
 	if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
-		int num_pages = calc_pages_for(data->alignment, data->length);
+		int num_pages = calc_pages_for(data->offset, data->length);
 		ceph_release_page_vector(data->pages, num_pages);
 	} else if (data->type == CEPH_MSG_DATA_PAGELIST) {
 		ceph_pagelist_release(data->pagelist);
@@ -1881,7 +1881,7 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
 void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
-			     size_t length, size_t alignment, bool own_pages)
+			     size_t length, size_t offset, bool own_pages)
 {
 	struct ceph_msg_data *data;

@@ -1892,7 +1892,7 @@ void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
 	data->type = CEPH_MSG_DATA_PAGES;
 	data->pages = pages;
 	data->length = length;
-	data->alignment = alignment & ~PAGE_MASK;
+	data->offset = offset & ~PAGE_MASK;
 	data->own_pages = own_pages;

 	msg->data_length += length;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b24afec24138..e359e70ad47e 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -130,13 +130,13 @@ static void ceph_osd_data_init(struct ceph_osd_data *osd_data)
  * Consumes @pages if @own_pages is true.
  */
 static void ceph_osd_data_pages_init(struct ceph_osd_data *osd_data,
-			struct page **pages, u64 length, u32 alignment,
+			struct page **pages, u64 length, u32 offset,
			bool pages_from_pool, bool own_pages)
 {
 	osd_data->type = CEPH_OSD_DATA_TYPE_PAGES;
 	osd_data->pages = pages;
 	osd_data->length = length;
-	osd_data->alignment = alignment;
+	osd_data->offset = offset;
 	osd_data->pages_from_pool = pages_from_pool;
 	osd_data->own_pages = own_pages;
 }
@@ -196,26 +196,26 @@ EXPORT_SYMBOL(osd_req_op_extent_osd_data);
 void osd_req_op_raw_data_in_pages(struct ceph_osd_request *osd_req,
			unsigned int which, struct page **pages,
-			u64 length, u32 alignment,
+			u64 length, u32 offset,
			bool pages_from_pool, bool own_pages)
 {
 	struct ceph_osd_data *osd_data;

 	osd_data = osd_req_op_raw_data_in(osd_req, which);
-	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
+	ceph_osd_data_pages_init(osd_data, pages, length, offset,
				 pages_from_pool, own_pages);
 }
 EXPORT_SYMBOL(osd_req_op_raw_data_in_pages);

 void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *osd_req,
			unsigned int which, struct page **pages,
-			u64 length, u32 alignment,
+			u64 length, u32 offset,
			bool pages_from_pool, bool own_pages)
 {
 	struct ceph_osd_data *osd_data;

 	osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
-	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
+	ceph_osd_data_pages_init(osd_data, pages, length, offset,
				 pages_from_pool, own_pages);
 }
 EXPORT_SYMBOL(osd_req_op_extent_osd_data_pages);
@@ -312,12 +312,12 @@ EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist);
 void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
			unsigned int which, struct page **pages, u64 length,
-			u32 alignment, bool pages_from_pool, bool own_pages)
+			u32 offset, bool pages_from_pool, bool own_pages)
 {
 	struct ceph_osd_data *osd_data;

 	osd_data = osd_req_op_data(osd_req, which, cls, request_data);
-	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
+	ceph_osd_data_pages_init(osd_data, pages, length, offset,
				 pages_from_pool, own_pages);
 	osd_req->r_ops[which].cls.indata_len += length;
 	osd_req->r_ops[which].indata_len += length;
@@ -344,12 +344,12 @@ EXPORT_SYMBOL(osd_req_op_cls_request_data_bvecs);
 void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req,
			unsigned int which, struct page **pages, u64 length,
-			u32 alignment, bool pages_from_pool, bool own_pages)
+			u32 offset, bool pages_from_pool, bool own_pages)
 {
 	struct ceph_osd_data *osd_data;

 	osd_data = osd_req_op_data(osd_req, which, cls, response_data);
-	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
+	ceph_osd_data_pages_init(osd_data, pages, length, offset,
				 pages_from_pool, own_pages);
 }
 EXPORT_SYMBOL(osd_req_op_cls_response_data_pages);
@@ -382,7 +382,7 @@ static void ceph_osd_data_release(struct ceph_osd_data *osd_data)
 	if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
 		int num_pages;

-		num_pages = calc_pages_for((u64)osd_data->alignment,
+		num_pages = calc_pages_for((u64)osd_data->offset,
					   (u64)osd_data->length);
 		ceph_release_page_vector(osd_data->pages, num_pages);
 	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
@@ -969,7 +969,7 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
 		BUG_ON(length > (u64) SIZE_MAX);
 		if (length)
 			ceph_msg_data_add_pages(msg, osd_data->pages,
-						length, osd_data->alignment, false);
+						length, osd_data->offset, false);
 	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
 		BUG_ON(!length);
 		ceph_msg_data_add_pagelist(msg, osd_data->pagelist);
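
For reference, both the old and the new field name feed calc_pages_for(),
which counts how many pages a byte range touches given its offset into the
first page. A small sketch of that arithmetic (mirroring calc_pages_for()
from include/linux/ceph/libceph.h; the userspace wrapper is ours):

    /* Sketch of the page-count arithmetic; mirrors calc_pages_for()
     * from include/linux/ceph/libceph.h, with 4KiB pages assumed. */
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1ULL << PAGE_SHIFT)

    static unsigned long long calc_pages_for(unsigned long long off,
                                             unsigned long long len)
    {
            return ((off + len + PAGE_SIZE - 1) >> PAGE_SHIFT) -
                   (off >> PAGE_SHIFT);
    }

    int main(void)
    {
            /* 6000 bytes starting 100 bytes into the first page
             * straddles two pages... */
            printf("%llu\n", calc_pages_for(100, 6000));  /* 2 */
            /* ...but the same 6000 bytes starting at offset 4000
             * spills into a third page. */
            printf("%llu\n", calc_pages_for(4000, 6000)); /* 3 */
            return 0;
    }

The value is an offset into the first page, not an alignment requirement,
which is what the rename is making explicit.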
From patchwork Thu Mar 13 23:32:55 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016051
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 03/35] libceph: Add a new data container type, ceph_databuf
Date: Thu, 13 Mar 2025 23:32:55 +0000
Message-ID: <20250313233341.1675324-4-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Add a new ceph data container type, ceph_databuf, that can carry a list of
pages in a bvec and use an iov_iter to describe the data to the next layer
down. The iterator can also be used to refer to other types, such as
ITER_FOLIOQ.

There are two ways of loading the bvec. One is to allocate a buffer with
space in it and then add data, expanding the space as needed; the other is
to splice in pages, expanding the bvec[] as needed.

This is intended to replace all of the other data container types.

Signed-off-by: David Howells
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 include/linux/ceph/databuf.h    | 131 +++++++++++++++++++++
 include/linux/ceph/messenger.h  |   6 +-
 include/linux/ceph/osd_client.h |   3 +
 net/ceph/Makefile               |   3 +-
 net/ceph/databuf.c              | 200 ++++++++++++++++++++++++++++++++
 net/ceph/messenger.c            |  20 +++-
 net/ceph/osd_client.c           |  11 +-
 7 files changed, 369 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/ceph/databuf.h
 create mode 100644 net/ceph/databuf.c

diff --git a/include/linux/ceph/databuf.h b/include/linux/ceph/databuf.h
new file mode 100644
index 000000000000..14c7a6449467
--- /dev/null
+++ b/include/linux/ceph/databuf.h
@@ -0,0 +1,131 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __FS_CEPH_DATABUF_H
+#define __FS_CEPH_DATABUF_H
+
+#include
+#include
+#include
+
+struct ceph_databuf {
+	struct bio_vec	*bvec;		/* List of pages */
+	struct bio_vec	inline_bvec[1];	/* Inline bvecs for small buffers */
+	struct iov_iter	iter;		/* Iterator defining occupied data */
+	size_t		limit;		/* Maximum length before expansion required */
+	size_t		nr_bvec;	/* Number of bvec[] that have pages */
+	size_t		max_bvec;	/* Size of bvec[] */
+	refcount_t	refcnt;
+	bool		put_pages;	/* T if pages in bvec[] need to be put */
+};
+
+struct ceph_databuf *ceph_databuf_alloc(size_t min_bvec, size_t space,
+					unsigned int data_source, gfp_t gfp);
+struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf);
+void ceph_databuf_release(struct ceph_databuf *dbuf);
+int ceph_databuf_append(struct ceph_databuf *dbuf, const void *d, size_t l);
+int ceph_databuf_reserve(struct ceph_databuf *dbuf, size_t space, gfp_t gfp);
+int ceph_databuf_insert_frag(struct ceph_databuf *dbuf, unsigned int ix,
+			     size_t len, gfp_t gfp);
+
+static inline
+struct ceph_databuf *ceph_databuf_req_alloc(size_t min_bvec, size_t space,
+					    gfp_t gfp)
+{
+	return ceph_databuf_alloc(min_bvec, space, ITER_SOURCE, gfp);
+}
+
+static inline
+struct ceph_databuf *ceph_databuf_reply_alloc(size_t min_bvec, size_t space,
+					      gfp_t gfp)
+{
+	struct ceph_databuf *dbuf;
+
+	dbuf = ceph_databuf_alloc(min_bvec, space, ITER_DEST, gfp);
+	if (dbuf)
+		iov_iter_reexpand(&dbuf->iter, space);
+	return dbuf;
+}
+
+static inline struct page *ceph_databuf_page(struct ceph_databuf *dbuf,
+					     unsigned int ix)
+{
+	return dbuf->bvec[ix].bv_page;
+}
+
+#define kmap_ceph_databuf_page(dbuf, ix) \
+	kmap_local_page(ceph_databuf_page(dbuf, ix))
+
+static inline int ceph_databuf_encode_64(struct ceph_databuf *dbuf, u64 v)
+{
+	__le64 ev = cpu_to_le64(v);
+	return ceph_databuf_append(dbuf, &ev, sizeof(ev));
+}
+static inline int ceph_databuf_encode_32(struct ceph_databuf *dbuf, u32 v)
+{
+	__le32 ev = cpu_to_le32(v);
+	return ceph_databuf_append(dbuf, &ev, sizeof(ev));
+}
+static inline int ceph_databuf_encode_16(struct ceph_databuf *dbuf, u16 v)
+{
+	__le16 ev = cpu_to_le16(v);
+	return ceph_databuf_append(dbuf, &ev, sizeof(ev));
+}
+static inline int ceph_databuf_encode_8(struct ceph_databuf *dbuf, u8 v)
+{
+	return ceph_databuf_append(dbuf, &v, 1);
+}
+static inline int ceph_databuf_encode_string(struct ceph_databuf *dbuf,
+					     const char *s, u32 len)
+{
+	int ret = ceph_databuf_encode_32(dbuf, len);
+	if (ret)
+		return ret;
+	if (len)
+		return ceph_databuf_append(dbuf, s, len);
+	return 0;
+}
+
+static inline size_t ceph_databuf_len(struct ceph_databuf *dbuf)
+{
+	return dbuf->iter.count;
+}
+
+static inline void ceph_databuf_added_data(struct ceph_databuf *dbuf,
+					   size_t len)
+{
+	dbuf->iter.count += len;
+}
+
+static inline void ceph_databuf_reply_ready(struct ceph_databuf *reply,
+					    size_t len)
+{
+	reply->iter.data_source = ITER_SOURCE;
+	iov_iter_truncate(&reply->iter, len);
+}
+
+static inline void ceph_databuf_reset_reply(struct ceph_databuf *reply)
+{
+	iov_iter_bvec(&reply->iter, ITER_DEST,
+		      reply->bvec, reply->nr_bvec, reply->limit);
+}
+
+static inline void ceph_databuf_append_page(struct ceph_databuf *dbuf,
+					    struct page *page,
+					    unsigned int offset,
+					    unsigned int len)
+{
+	BUG_ON(dbuf->nr_bvec >= dbuf->max_bvec);
+	bvec_set_page(&dbuf->bvec[dbuf->nr_bvec++], page, len, offset);
+	dbuf->iter.count += len;
+	dbuf->iter.nr_segs++;
+}
+
+static inline void *ceph_databuf_enc_start(struct ceph_databuf *dbuf)
+{
+	return page_address(ceph_databuf_page(dbuf, 0)) + dbuf->iter.count;
+}
+
+static inline void ceph_databuf_enc_stop(struct ceph_databuf *dbuf, void *p)
+{
+	dbuf->iter.count = p - page_address(ceph_databuf_page(dbuf, 0));
+	BUG_ON(dbuf->iter.count > dbuf->limit);
+}
+
+#endif /* __FS_CEPH_DATABUF_H */
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index db2aba32b8a0..864aad369c91 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -117,6 +117,7 @@ struct ceph_messenger {
 enum ceph_msg_data_type {
 	CEPH_MSG_DATA_NONE,	/* message contains no data payload */
+	CEPH_MSG_DATA_DATABUF,	/* data source/destination is a data buffer */
 	CEPH_MSG_DATA_PAGES,	/* data source/destination is a page array */
 	CEPH_MSG_DATA_PAGELIST,	/* data source/destination is a pagelist */
 #ifdef CONFIG_BLOCK
@@ -210,7 +211,10 @@ struct ceph_bvec_iter {

 struct ceph_msg_data {
 	enum ceph_msg_data_type	type;
+	struct iov_iter		iter;
+	bool			release_dbuf;
 	union {
+		struct ceph_databuf	*dbuf;
 #ifdef CONFIG_BLOCK
 		struct {
 			struct ceph_bio_iter	bio_pos;
@@ -225,7 +229,6 @@ struct ceph_msg_data {
			bool		own_pages;
 		};
 		struct ceph_pagelist	*pagelist;
-		struct iov_iter		iter;
 	};
 };

@@ -601,6 +604,7 @@ extern void ceph_con_keepalive(struct ceph_connection *con);
 extern bool ceph_con_keepalive_expired(struct ceph_connection *con,
					unsigned long interval);

+void ceph_msg_data_add_databuf(struct ceph_msg *msg, struct ceph_databuf *dbuf);
 void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
			     size_t length, size_t offset, bool own_pages);
 extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 8fc84f389aad..b8fb5a71dd57 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include

 struct ceph_msg;
 struct ceph_snap_context;
@@ -103,6 +104,7 @@ struct ceph_osd {
 enum ceph_osd_data_type {
 	CEPH_OSD_DATA_TYPE_NONE = 0,
+	CEPH_OSD_DATA_TYPE_DATABUF,
 	CEPH_OSD_DATA_TYPE_PAGES,
 	CEPH_OSD_DATA_TYPE_PAGELIST,
 #ifdef CONFIG_BLOCK
@@ -115,6 +117,7 @@ enum ceph_osd_data_type {
 struct ceph_osd_data {
 	enum ceph_osd_data_type	type;
 	union {
+		struct ceph_databuf	*dbuf;
 		struct {
 			struct page	**pages;
			u64		length;
diff --git a/net/ceph/Makefile b/net/ceph/Makefile
index 8802a0c0155d..4b2e0b654e45 100644
--- a/net/ceph/Makefile
+++ b/net/ceph/Makefile
@@ -15,4 +15,5 @@ libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \
	auth_x.o \
	ceph_strings.o ceph_hash.o \
	pagevec.o snapshot.o string_table.o \
-	messenger_v1.o messenger_v2.o
+	messenger_v1.o messenger_v2.o \
+	databuf.o
diff --git a/net/ceph/databuf.c b/net/ceph/databuf.c
new file mode 100644
index 000000000000..9d108fff5a4f
--- /dev/null
+++ b/net/ceph/databuf.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Data container
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct ceph_databuf *ceph_databuf_alloc(size_t min_bvec, size_t space,
+					unsigned int data_source, gfp_t gfp)
+{
+	struct ceph_databuf *dbuf;
+	size_t inl = ARRAY_SIZE(dbuf->inline_bvec);
+
+	dbuf = kzalloc(sizeof(*dbuf), gfp);
+	if (!dbuf)
+		return NULL;
+
+	refcount_set(&dbuf->refcnt, 1);
+
+	if (min_bvec == 0 && space == 0) {
+		/* Do nothing */
+	} else if (min_bvec <= inl && space <= inl * PAGE_SIZE) {
+		dbuf->bvec = dbuf->inline_bvec;
+		dbuf->max_bvec = inl;
+		dbuf->limit = space;
+	} else if (min_bvec) {
+		min_bvec = umax(min_bvec, 16);
+
+		dbuf->bvec = kcalloc(min_bvec, sizeof(struct bio_vec), gfp);
+		if (!dbuf->bvec) {
+			kfree(dbuf);
+			return NULL;
+		}
+
+		dbuf->max_bvec = min_bvec;
+	}
+
+	iov_iter_bvec(&dbuf->iter, data_source, dbuf->bvec, 0, 0);
+
+	if (space) {
+		if (ceph_databuf_reserve(dbuf, space, gfp) < 0) {
+			ceph_databuf_release(dbuf);
+			return NULL;
+		}
+	}
+	return dbuf;
+}
+EXPORT_SYMBOL(ceph_databuf_alloc);
+
+struct ceph_databuf *ceph_databuf_get(struct ceph_databuf *dbuf)
+{
+	if (!dbuf)
+		return NULL;
+	refcount_inc(&dbuf->refcnt);
+	return dbuf;
+}
+EXPORT_SYMBOL(ceph_databuf_get);
+
+void ceph_databuf_release(struct ceph_databuf *dbuf)
+{
+	size_t i;
+
+	if (!dbuf || !refcount_dec_and_test(&dbuf->refcnt))
+		return;
+
+	if (dbuf->put_pages)
+		for (i = 0; i < dbuf->nr_bvec; i++)
+			put_page(dbuf->bvec[i].bv_page);
+	if (dbuf->bvec != dbuf->inline_bvec)
+		kfree(dbuf->bvec);
+	kfree(dbuf);
+}
+EXPORT_SYMBOL(ceph_databuf_release);
+
+/*
+ * Expand the bvec[] in the dbuf.
+ */
+static int ceph_databuf_expand(struct ceph_databuf *dbuf, size_t req_bvec,
+			       gfp_t gfp)
+{
+	struct bio_vec *bvec = dbuf->bvec, *old = bvec;
+	size_t size, max_bvec, off = dbuf->iter.bvec - old;
+	size_t inl = ARRAY_SIZE(dbuf->inline_bvec);
+
+	if (req_bvec <= inl) {
+		dbuf->bvec = dbuf->inline_bvec;
+		dbuf->max_bvec = inl;
+		dbuf->iter.bvec = dbuf->inline_bvec + off;
+		return 0;
+	}
+
+	max_bvec = roundup_pow_of_two(req_bvec);
+	size = array_size(max_bvec, sizeof(struct bio_vec));
+
+	if (old == dbuf->inline_bvec) {
+		bvec = kmalloc_array(max_bvec, sizeof(struct bio_vec), gfp);
+		if (!bvec)
+			return -ENOMEM;
+		memcpy(bvec, old, inl * sizeof(struct bio_vec));
+	} else {
+		bvec = krealloc(old, size, gfp);
+		if (!bvec)
+			return -ENOMEM;
+	}
+	dbuf->bvec = bvec;
+	dbuf->max_bvec = max_bvec;
+	dbuf->iter.bvec = bvec + off;
+	return 0;
+}
+
+/* Allocate enough pages for a dbuf to be able to append the given amount
+ * of data without further allocation.
+ * Returns: 0 on success, -ENOMEM on error.
+ */
+int ceph_databuf_reserve(struct ceph_databuf *dbuf, size_t add_space,
+			 gfp_t gfp)
+{
+	struct bio_vec *bvec;
+	size_t i, req_bvec = DIV_ROUND_UP(dbuf->iter.count + add_space, PAGE_SIZE);
+	int ret;
+
+	dbuf->put_pages = true;
+	if (req_bvec > dbuf->max_bvec) {
+		ret = ceph_databuf_expand(dbuf, req_bvec, gfp);
+		if (ret < 0)
+			return ret;
+	}
+
+	bvec = dbuf->bvec;
+	while (dbuf->nr_bvec < req_bvec) {
+		struct page *pages[16];
+		size_t want = min(req_bvec - dbuf->nr_bvec,
+				  ARRAY_SIZE(pages)), got;
+
+		memset(pages, 0, sizeof(pages));
+		got = alloc_pages_bulk(gfp, want, pages);
+		if (!got)
+			return -ENOMEM;
+		for (i = 0; i < got; i++)
+			bvec_set_page(&bvec[dbuf->nr_bvec + i], pages[i],
+				      PAGE_SIZE, 0);
+		dbuf->iter.nr_segs += got;
+		dbuf->nr_bvec += got;
+		dbuf->limit = dbuf->nr_bvec * PAGE_SIZE;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(ceph_databuf_reserve);
+
+int ceph_databuf_append(struct ceph_databuf *dbuf, const void *buf, size_t len)
+{
+	struct iov_iter temp_iter;
+
+	if (!len)
+		return 0;
+	if (dbuf->limit - dbuf->iter.count < len &&
+	    ceph_databuf_reserve(dbuf, len, GFP_NOIO) < 0)
+		return -ENOMEM;
+
+	iov_iter_bvec(&temp_iter, ITER_DEST,
+		      dbuf->bvec, dbuf->nr_bvec, dbuf->limit);
+	iov_iter_advance(&temp_iter, dbuf->iter.count);
+
+	if (copy_to_iter(buf, len, &temp_iter) != len)
+		return -EFAULT;
+	dbuf->iter.count += len;
+	return 0;
+}
+EXPORT_SYMBOL(ceph_databuf_append);
+
+/*
+ * Allocate a fragment and insert it into the buffer at the specified index.
+ */
+int ceph_databuf_insert_frag(struct ceph_databuf *dbuf, unsigned int ix,
+			     size_t len, gfp_t gfp)
+{
+	struct page *page;
+
+	page = alloc_page(gfp);
+	if (!page)
+		return -ENOMEM;
+
+	bvec_set_page(&dbuf->bvec[ix], page, len, 0);
+
+	if (dbuf->nr_bvec == ix) {
+		dbuf->iter.nr_segs = ix + 1;
+		dbuf->nr_bvec = ix + 1;
+		dbuf->iter.count += len;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(ceph_databuf_insert_frag);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 1df4291cc80b..802f0b222131 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1872,7 +1872,9 @@ static struct ceph_msg_data *ceph_msg_data_add(struct ceph_msg *msg)
 static void ceph_msg_data_destroy(struct ceph_msg_data *data)
 {
-	if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
+	if (data->type == CEPH_MSG_DATA_DATABUF) {
+		ceph_databuf_release(data->dbuf);
+	} else if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
 		int num_pages = calc_pages_for(data->offset, data->length);
 		ceph_release_page_vector(data->pages, num_pages);
 	} else if (data->type == CEPH_MSG_DATA_PAGELIST) {
@@ -1880,6 +1882,22 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
 	}
 }

+void ceph_msg_data_add_databuf(struct ceph_msg *msg, struct ceph_databuf *dbuf)
+{
+	struct ceph_msg_data *data;
+
+	BUG_ON(!dbuf);
+	BUG_ON(!ceph_databuf_len(dbuf));
+
+	data = ceph_msg_data_add(msg);
+	data->type = CEPH_MSG_DATA_DATABUF;
+	data->dbuf = ceph_databuf_get(dbuf);
+	data->iter = dbuf->iter;
+
+	msg->data_length += ceph_databuf_len(dbuf);
+}
+EXPORT_SYMBOL(ceph_msg_data_add_databuf);
+
 void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
			     size_t length, size_t offset, bool own_pages)
 {
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index e359e70ad47e..c84634264377 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -359,6 +359,8 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
 	switch (osd_data->type) {
 	case CEPH_OSD_DATA_TYPE_NONE:
 		return 0;
+	case CEPH_OSD_DATA_TYPE_DATABUF:
+		return ceph_databuf_len(osd_data->dbuf);
 	case CEPH_OSD_DATA_TYPE_PAGES:
 		return osd_data->length;
 	case CEPH_OSD_DATA_TYPE_PAGELIST:
@@ -379,7 +381,9 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)

 static void ceph_osd_data_release(struct ceph_osd_data *osd_data)
 {
-	if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
+	if (osd_data->type == CEPH_OSD_DATA_TYPE_DATABUF) {
+		ceph_databuf_release(osd_data->dbuf);
+	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
 		int num_pages;

 		num_pages = calc_pages_for((u64)osd_data->offset,
@@ -965,7 +969,10 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
 {
 	u64 length = ceph_osd_data_length(osd_data);

-	if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
+	if (osd_data->type == CEPH_OSD_DATA_TYPE_DATABUF) {
+		BUG_ON(!length);
+		ceph_msg_data_add_databuf(msg, osd_data->dbuf);
+	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
 		BUG_ON(length > (u64) SIZE_MAX);
 		if (length)
 			ceph_msg_data_add_pages(msg, osd_data->pages,
						length, osd_data->offset, false);
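
As a rough usage sketch (a hypothetical caller, not taken from the series),
the intended lifecycle of a request databuf is: allocate with some reserved
space, append encoded fields, attach it to a message (which takes its own
ref), then drop the local ref:

    /* Hypothetical caller sketching the API added above; the values
     * being encoded are made up for illustration. */
    static int example_build_payload(struct ceph_msg *msg)
    {
            struct ceph_databuf *dbuf;
            int err;

            dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
            if (!dbuf)
                    return -ENOMEM;

            err = ceph_databuf_encode_32(dbuf, 2);  /* e.g. a KV count */
            if (!err)
                    err = ceph_databuf_encode_string(dbuf, "name", 4);
            if (err)
                    goto out;

            /* The message takes its own ref via ceph_databuf_get(). */
            ceph_msg_data_add_databuf(msg, dbuf);
    out:
            ceph_databuf_release(dbuf);
            return err;
    }

Because the container is refcounted, the caller can release unconditionally
and the message's copy keeps the pages pinned until I/O completes.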
From patchwork Thu Mar 13 23:32:56 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016052
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 04/35] ceph: Convert ceph_mds_request::r_pagelist to a databuf
Date: Thu, 13 Mar 2025 23:32:56 +0000
Message-ID: <20250313233341.1675324-5-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>
Convert ceph_mds_request::r_pagelist to a databuf, along with the code that
uses it, such as the setxattr ops.

Signed-off-by: David Howells
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/ceph/acl.c        | 39 ++++++++++----------
 fs/ceph/file.c       | 12 ++++---
 fs/ceph/inode.c      | 85 +++++++++++++++++++-------------------------
 fs/ceph/mds_client.c | 11 +++---
 fs/ceph/mds_client.h |  2 +-
 fs/ceph/super.h      |  2 +-
 fs/ceph/xattr.c      | 68 +++++++++++++++--------------------
 7 files changed, 96 insertions(+), 123 deletions(-)

diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index 1564eacc253d..d6da650db83e 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -171,7 +171,7 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
 {
 	struct posix_acl *acl, *default_acl;
 	size_t val_size1 = 0, val_size2 = 0;
-	struct ceph_pagelist *pagelist = NULL;
+	struct ceph_databuf *dbuf = NULL;
 	void *tmp_buf = NULL;
 	int err;

@@ -201,58 +201,55 @@ int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
 	tmp_buf = kmalloc(max(val_size1, val_size2), GFP_KERNEL);
 	if (!tmp_buf)
 		goto out_err;
-	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
-	if (!pagelist)
+	dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL);
+	if (!dbuf)
 		goto out_err;

-	err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
-	if (err)
-		goto out_err;
-
-	ceph_pagelist_encode_32(pagelist, acl && default_acl ? 2 : 1);
+	ceph_databuf_encode_32(dbuf, acl && default_acl ? 2 : 1);

 	if (acl) {
 		size_t len = strlen(XATTR_NAME_POSIX_ACL_ACCESS);

-		err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
+		err = ceph_databuf_reserve(dbuf, len + val_size1 + 8,
+					   GFP_KERNEL);
 		if (err)
 			goto out_err;
-		ceph_pagelist_encode_string(pagelist, XATTR_NAME_POSIX_ACL_ACCESS,
-					    len);
+		ceph_databuf_encode_string(dbuf, XATTR_NAME_POSIX_ACL_ACCESS,
+					   len);
 		err = posix_acl_to_xattr(&init_user_ns, acl,
					 tmp_buf, val_size1);
 		if (err < 0)
 			goto out_err;
-		ceph_pagelist_encode_32(pagelist, val_size1);
-		ceph_pagelist_append(pagelist, tmp_buf, val_size1);
+		ceph_databuf_encode_32(dbuf, val_size1);
+		ceph_databuf_append(dbuf, tmp_buf, val_size1);
 	}
 	if (default_acl) {
 		size_t len = strlen(XATTR_NAME_POSIX_ACL_DEFAULT);

-		err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8);
+		err = ceph_databuf_reserve(dbuf, len + val_size2 + 8,
+					   GFP_KERNEL);
 		if (err)
 			goto out_err;
-		ceph_pagelist_encode_string(pagelist,
-					    XATTR_NAME_POSIX_ACL_DEFAULT, len);
+		ceph_databuf_encode_string(dbuf,
+					   XATTR_NAME_POSIX_ACL_DEFAULT, len);
 		err = posix_acl_to_xattr(&init_user_ns, default_acl,
					 tmp_buf, val_size2);
 		if (err < 0)
 			goto out_err;
-		ceph_pagelist_encode_32(pagelist, val_size2);
-		ceph_pagelist_append(pagelist, tmp_buf, val_size2);
+		ceph_databuf_encode_32(dbuf, val_size2);
+		ceph_databuf_append(dbuf, tmp_buf, val_size2);
 	}

 	kfree(tmp_buf);
 	as_ctx->acl = acl;
 	as_ctx->default_acl = default_acl;
-	as_ctx->pagelist = pagelist;
+	as_ctx->dbuf = dbuf;
 	return 0;

 out_err:
 	posix_acl_release(acl);
 	posix_acl_release(default_acl);
 	kfree(tmp_buf);
-	if (pagelist)
-		ceph_pagelist_release(pagelist);
+	ceph_databuf_release(dbuf);
 	return err;
 }
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 851d70200c6b..9de2960748b9 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -679,9 +679,9 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
 	iinfo.change_attr = 1;
 	ceph_encode_timespec64(&iinfo.btime, &now);

-	if (req->r_pagelist) {
-		iinfo.xattr_len = req->r_pagelist->length;
-		iinfo.xattr_data = req->r_pagelist->mapped_tail;
+	if (req->r_dbuf) {
+		iinfo.xattr_len = ceph_databuf_len(req->r_dbuf);
+		iinfo.xattr_data = kmap_ceph_databuf_page(req->r_dbuf, 0);
 	} else {
 		/* fake it */
 		iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
@@ -731,6 +731,8 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
 	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
			      req->r_fmode, NULL);
 	up_read(&mdsc->snap_rwsem);
+	if (req->r_dbuf)
+		kunmap_local(iinfo.xattr_data);
 	if (ret) {
 		doutc(cl, "failed to fill inode: %d\n", ret);
 		ceph_dir_clear_complete(dir);
@@ -849,8 +851,8 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
			goto out_ctx;
 		}
 		/* Async create can't handle more than a page of xattrs */
-		if (as_ctx.pagelist &&
-		    !list_is_singular(&as_ctx.pagelist->head))
+		if (as_ctx.dbuf &&
+		    as_ctx.dbuf->nr_bvec > 1)
			try_async = false;
 	} else if (!d_in_lookup(dentry)) {
 		/* If it's not being looked up, it's negative */
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index b060f765ad20..ec9b80fec7be 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -112,9 +112,9 @@ struct inode *ceph_new_inode(struct inode *dir, struct dentry *dentry,
 void ceph_as_ctx_to_req(struct ceph_mds_request *req,
			struct ceph_acl_sec_ctx *as_ctx)
 {
-	if (as_ctx->pagelist) {
-		req->r_pagelist = as_ctx->pagelist;
-		as_ctx->pagelist = NULL;
+	if (as_ctx->dbuf) {
+		req->r_dbuf = as_ctx->dbuf;
+		as_ctx->dbuf = NULL;
 	}
 	ceph_fscrypt_as_ctx_to_req(req, as_ctx);
 }
@@ -2341,11 +2341,10 @@ static int fill_fscrypt_truncate(struct inode *inode,
 	loff_t pos, orig_pos = round_down(attr->ia_size,
					  CEPH_FSCRYPT_BLOCK_SIZE);
 	u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
-	struct ceph_pagelist *pagelist = NULL;
-	struct kvec iov = {0};
+	struct ceph_databuf *dbuf = NULL;
 	struct iov_iter iter;
-	struct page *page = NULL;
-	struct ceph_fscrypt_truncate_size_header header;
+	struct ceph_fscrypt_truncate_size_header *header;
+	void *p;
 	int retry_op = 0;
 	int len = CEPH_FSCRYPT_BLOCK_SIZE;
 	loff_t i_size = i_size_read(inode);
@@ -2372,37 +2371,35 @@ static int fill_fscrypt_truncate(struct inode *inode,
		goto out;
 	}

-	page = __page_cache_alloc(GFP_KERNEL);
-	if (page == NULL) {
-		ret = -ENOMEM;
+	ret = -ENOMEM;
+	dbuf = ceph_databuf_req_alloc(2, 0, GFP_KERNEL);
+	if (!dbuf)
		goto out;
-	}

-	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
-	if (!pagelist) {
-		ret = -ENOMEM;
+	if (ceph_databuf_insert_frag(dbuf, 0, sizeof(*header), GFP_KERNEL) < 0)
+		goto out;
+	if (ceph_databuf_insert_frag(dbuf, 1, PAGE_SIZE, GFP_KERNEL) < 0)
		goto out;
-	}

-	iov.iov_base = kmap_local_page(page);
-	iov.iov_len = len;
-	iov_iter_kvec(&iter, READ, &iov, 1, len);
+	iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len);

 	pos = orig_pos;
 	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objver);
 	if (ret < 0)
		goto out;

+	header = kmap_ceph_databuf_page(dbuf, 0);
+
 	/* Insert the header first */
-	header.ver = 1;
-	header.compat = 1;
-	header.change_attr = cpu_to_le64(inode_peek_iversion_raw(inode));
+	header->ver = 1;
+	header->compat = 1;
+	header->change_attr = cpu_to_le64(inode_peek_iversion_raw(inode));

 	/*
	 * Always set the block_size to CEPH_FSCRYPT_BLOCK_SIZE,
	 * because in MDS it may need this to do the truncate.
	 */
-	header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
+	header->block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);

 	/*
	 * If we hit a hole here, we should just skip filling
@@ -2417,51 +2414,41 @@ static int fill_fscrypt_truncate(struct inode *inode,
 	if (!objver) {
 		doutc(cl, "hit hole, ppos %lld < size %lld\n", pos, i_size);

-		header.data_len = cpu_to_le32(8 + 8 + 4);
-		header.file_offset = 0;
+		header->data_len = cpu_to_le32(8 + 8 + 4);
+		header->file_offset = 0;
 		ret = 0;
 	} else {
-		header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
-		header.file_offset = cpu_to_le64(orig_pos);
+		header->data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
+		header->file_offset = cpu_to_le64(orig_pos);

 		doutc(cl, "encrypt block boff/bsize %d/%lu\n", boff,
		      CEPH_FSCRYPT_BLOCK_SIZE);

 		/* truncate and zero out the extra contents for the last block */
-		memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
+		p = kmap_ceph_databuf_page(dbuf, 1);
+		memset(p + boff, 0, PAGE_SIZE - boff);
+		kunmap_local(p);

 		/* encrypt the last block */
-		ret = ceph_fscrypt_encrypt_block_inplace(inode, page,
-							 CEPH_FSCRYPT_BLOCK_SIZE,
-							 0, block,
-							 GFP_KERNEL);
+		ret = ceph_fscrypt_encrypt_block_inplace(
+			inode, ceph_databuf_page(dbuf, 1),
+			CEPH_FSCRYPT_BLOCK_SIZE, 0, block, GFP_KERNEL);
 		if (ret)
			goto out;
 	}

-	/* Insert the header */
-	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
-	if (ret)
-		goto out;
+	ceph_databuf_added_data(dbuf, sizeof(*header));
+	if (header->block_size)
+		ceph_databuf_added_data(dbuf, CEPH_FSCRYPT_BLOCK_SIZE);

-	if (header.block_size) {
-		/* Append the last block contents to pagelist */
-		ret = ceph_pagelist_append(pagelist, iov.iov_base,
-					   CEPH_FSCRYPT_BLOCK_SIZE);
-		if (ret)
-			goto out;
-	}
-	req->r_pagelist = pagelist;
+	req->r_dbuf = dbuf;
 out:
 	doutc(cl, "%p %llx.%llx size dropping cap refs on %s\n", inode,
	      ceph_vinop(inode), ceph_cap_string(got));
 	ceph_put_cap_refs(ci, got);
-	if (iov.iov_base)
-		kunmap_local(iov.iov_base);
-	if (page)
-		__free_pages(page, 0);
-	if (ret && pagelist)
-		ceph_pagelist_release(pagelist);
+	kunmap_local(header);
+	if (ret)
+		ceph_databuf_release(dbuf);
 	return ret;
 }
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 230e0c3f341f..09661a34f287 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1125,8 +1125,7 @@ void ceph_mdsc_release_request(struct kref *kref)
 	put_cred(req->r_cred);
 	if (req->r_mnt_idmap)
 		mnt_idmap_put(req->r_mnt_idmap);
-	if (req->r_pagelist)
-		ceph_pagelist_release(req->r_pagelist);
+	ceph_databuf_release(req->r_dbuf);
 	kfree(req->r_fscrypt_auth);
 	kfree(req->r_altname);
 	put_request_session(req);
@@ -3207,10 +3206,10 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
 	msg->front.iov_len = p - msg->front.iov_base;
 	msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);

-	if (req->r_pagelist) {
-		struct ceph_pagelist *pagelist = req->r_pagelist;
-		ceph_msg_data_add_pagelist(msg, pagelist);
-		msg->hdr.data_len = cpu_to_le32(pagelist->length);
+	if (req->r_dbuf) {
+		struct ceph_databuf *dbuf = req->r_dbuf;
+		ceph_msg_data_add_databuf(msg, dbuf);
+		msg->hdr.data_len = cpu_to_le32(ceph_databuf_len(dbuf));
 	} else {
 		msg->hdr.data_len = 0;
 	}
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 3e2a6fa7c19a..a7ee8da07ce7 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -333,7 +333,7 @@ struct ceph_mds_request {
 	u32 r_direct_hash;      /* choose dir frag based on this dentry hash */

 	/* data payload is used for xattr ops */
-	struct ceph_pagelist *r_pagelist;
+	struct ceph_databuf *r_dbuf;

 	/* what caps shall we drop? */
 	int r_inode_drop, r_inode_unless;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index bb0db0cc8003..984a6d2a5378 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1137,7 +1137,7 @@ struct ceph_acl_sec_ctx {
 #ifdef CONFIG_FS_ENCRYPTION
 	struct ceph_fscrypt_auth *fscrypt_auth;
 #endif
-	struct ceph_pagelist *pagelist;
+	struct ceph_databuf *dbuf;
 };

 #ifdef CONFIG_SECURITY
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 537165db4519..b083cd3b3974 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1114,17 +1114,17 @@ static int ceph_sync_setxattr(struct inode *inode, const char *name,
 	struct ceph_mds_request *req;
 	struct ceph_mds_client *mdsc = fsc->mdsc;
 	struct ceph_osd_client *osdc = &fsc->client->osdc;
-	struct ceph_pagelist *pagelist = NULL;
+	struct ceph_databuf *dbuf = NULL;
 	int op = CEPH_MDS_OP_SETXATTR;
 	int err;

 	if (size > 0) {
-		/* copy value into pagelist */
-		pagelist = ceph_pagelist_alloc(GFP_NOFS);
-		if (!pagelist)
+		/* copy value into dbuf */
+		dbuf = ceph_databuf_req_alloc(1, size, GFP_NOFS);
+		if (!dbuf)
			return -ENOMEM;

-		err = ceph_pagelist_append(pagelist, value, size);
+		err = ceph_databuf_append(dbuf, value, size);
 		if (err)
			goto out;
 	} else if (!value) {
@@ -1154,8 +1154,8 @@ static int ceph_sync_setxattr(struct inode *inode, const char *name,
 		req->r_args.setxattr.flags = cpu_to_le32(flags);
 		req->r_args.setxattr.osdmap_epoch =
			cpu_to_le32(osdc->osdmap->epoch);

-		req->r_pagelist = pagelist;
-		pagelist = NULL;
+		req->r_dbuf = dbuf;
+		dbuf = NULL;
 	}

 	req->r_inode = inode;
@@ -1169,8 +1169,7 @@ static int ceph_sync_setxattr(struct inode *inode, const char *name,
 	doutc(cl, "xattr.ver (after): %lld\n", ci->i_xattrs.version);

 out:
-	if (pagelist)
-		ceph_pagelist_release(pagelist);
+	ceph_databuf_release(dbuf);
 	return err;
 }
@@ -1377,7 +1376,7 @@ bool ceph_security_xattr_deadlock(struct inode *in)
 int ceph_security_init_secctx(struct dentry *dentry, umode_t mode,
			      struct ceph_acl_sec_ctx *as_ctx)
 {
-	struct ceph_pagelist *pagelist = as_ctx->pagelist;
+	struct ceph_databuf *dbuf = as_ctx->dbuf;
 	const char *name;
 	size_t name_len;
 	int err;
@@ -1391,14 +1390,11 @@ int ceph_security_init_secctx(struct dentry *dentry, umode_t mode,
 	}

 	err = -ENOMEM;
-	if (!pagelist) {
-		pagelist = ceph_pagelist_alloc(GFP_KERNEL);
-		if (!pagelist)
+	if (!dbuf) {
+		dbuf = ceph_databuf_req_alloc(0, PAGE_SIZE, GFP_KERNEL);
+		if (!dbuf)
			goto out;
-		err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
-		if (err)
-			goto out;
-		ceph_pagelist_encode_32(pagelist, 1);
+		ceph_databuf_encode_32(dbuf, 1);
 	}

 	/*
@@ -1407,38 +1403,31 @@ int ceph_security_init_secctx(struct dentry *dentry, umode_t mode,
	 * dentry_init_security hook.
	 */
 	name_len = strlen(name);
-	err = ceph_pagelist_reserve(pagelist,
-				    4 * 2 + name_len + as_ctx->lsmctx.len);
+	err = ceph_databuf_reserve(dbuf, 4 * 2 + name_len + as_ctx->lsmctx.len,
+				   GFP_KERNEL);
 	if (err)
		goto out;

-	if (as_ctx->pagelist) {
+	if (as_ctx->dbuf) {
 		/* update count of KV pairs */
-		BUG_ON(pagelist->length <= sizeof(__le32));
-		if (list_is_singular(&pagelist->head)) {
-			le32_add_cpu((__le32*)pagelist->mapped_tail, 1);
-		} else {
-			struct page *page = list_first_entry(&pagelist->head,
							     struct page, lru);
-			void *addr = kmap_atomic(page);
-			le32_add_cpu((__le32*)addr, 1);
-			kunmap_atomic(addr);
-		}
+		BUG_ON(ceph_databuf_len(dbuf) <= sizeof(__le32));
+		__le32 *addr = kmap_ceph_databuf_page(dbuf, 0);
+		le32_add_cpu(addr, 1);
+		kunmap_local(addr);
 	} else {
-		as_ctx->pagelist = pagelist;
+		as_ctx->dbuf = dbuf;
 	}

-	ceph_pagelist_encode_32(pagelist, name_len);
-	ceph_pagelist_append(pagelist, name, name_len);
+	ceph_databuf_encode_32(dbuf, name_len);
+	ceph_databuf_append(dbuf, name, name_len);

-	ceph_pagelist_encode_32(pagelist, as_ctx->lsmctx.len);
-	ceph_pagelist_append(pagelist, as_ctx->lsmctx.context,
-			     as_ctx->lsmctx.len);
+	ceph_databuf_encode_32(dbuf, as_ctx->lsmctx.len);
+	ceph_databuf_append(dbuf, as_ctx->lsmctx.context, as_ctx->lsmctx.len);

 	err = 0;
 out:
-	if (pagelist && !as_ctx->pagelist)
-		ceph_pagelist_release(pagelist);
+	if (dbuf && !as_ctx->dbuf)
+		ceph_databuf_release(dbuf);
 	return err;
 }
 #endif /* CONFIG_CEPH_FS_SECURITY_LABEL */
@@ -1456,8 +1445,7 @@ void ceph_release_acl_sec_ctx(struct ceph_acl_sec_ctx *as_ctx)
 #ifdef CONFIG_FS_ENCRYPTION
 	kfree(as_ctx->fscrypt_auth);
 #endif
-	if (as_ctx->pagelist)
-		ceph_pagelist_release(as_ctx->pagelist);
+	ceph_databuf_release(as_ctx->dbuf);
 }

 /*
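
The whole conversion follows one idiom: where a caller used to peek at
pagelist->length and pagelist->mapped_tail, it now asks ceph_databuf_len()
and maps page 0 of the databuf on demand. A condensed before/after,
paraphrasing the ceph_finish_async_create() hunk above:

    /* Before: the pagelist kept its tail page permanently mapped. */
    iinfo.xattr_len  = req->r_pagelist->length;
    iinfo.xattr_data = req->r_pagelist->mapped_tail;

    /* After: map the databuf's first page on demand... */
    iinfo.xattr_len  = ceph_databuf_len(req->r_dbuf);
    iinfo.xattr_data = kmap_ceph_databuf_page(req->r_dbuf, 0);
    /* ...use the data... */
    kunmap_local(iinfo.xattr_data);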
From: David Howells To: Viacheslav Dubeyko, Alex Markuze Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 05/35] libceph: Add functions to add ceph_databufs to requests Date: Thu, 13 Mar 2025 23:32:57 +0000 Message-ID: <20250313233341.1675324-6-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Add some helper functions to add ceph_databufs to ceph_osd_data structs attached to ceph_osd_request structs. The osd_data->iter is moved out of the union so that it can be set at the same time as osd_data->dbuf. Eventually, the I/O routines will only look at ->iter; ->dbuf will be used as a pin that gets released at the end of the I/O.
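As an illustrative sketch only (not part of this patch), a read path would attach a databuf to an extent op roughly as follows; this mirrors the rbd_obj_read_sync() conversion later in the series, with setup and error handling elided:

	/* Assumes req came from ceph_osdc_alloc_request() and dbuf from
	 * ceph_databuf_req_alloc(); the attach helper is added below.
	 */
	osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, len, 0, 0);
	/* Consumes the caller's ref on dbuf and copies dbuf->iter into
	 * osd_data->iter so the I/O path can use the iterator directly.
	 */
	osd_req_op_extent_osd_databuf(req, 0, dbuf);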
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- include/linux/ceph/osd_client.h | 11 +++++++- net/ceph/messenger.c | 3 ++ net/ceph/osd_client.c | 50 +++++++++++++++++++++++++++++++++ 3 files changed, 63 insertions(+), 1 deletion(-) diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index b8fb5a71dd57..172ee515a0f3 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -116,6 +116,7 @@ enum ceph_osd_data_type { struct ceph_osd_data { enum ceph_osd_data_type type; + struct iov_iter iter; union { struct ceph_databuf *dbuf; struct { @@ -136,7 +137,6 @@ struct ceph_osd_data { struct ceph_bvec_iter bvec_pos; u32 num_bvecs; }; - struct iov_iter iter; }; }; @@ -488,6 +488,9 @@ extern struct ceph_osd_data *osd_req_op_extent_osd_data( struct ceph_osd_request *osd_req, unsigned int which); +void osd_req_op_extent_osd_databuf(struct ceph_osd_request *req, + unsigned int which, + struct ceph_databuf *dbuf); extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *, unsigned int which, struct page **pages, u64 length, @@ -512,6 +515,9 @@ void osd_req_op_extent_osd_data_bvec_pos(struct ceph_osd_request *osd_req, void osd_req_op_extent_osd_iter(struct ceph_osd_request *osd_req, unsigned int which, struct iov_iter *iter); +void osd_req_op_cls_request_databuf(struct ceph_osd_request *req, + unsigned int which, + struct ceph_databuf *dbuf); extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *, unsigned int which, struct ceph_pagelist *pagelist); @@ -524,6 +530,9 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req, unsigned int which, struct bio_vec *bvecs, u32 num_bvecs, u32 bytes); +void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req, + unsigned int which, + struct ceph_databuf *dbuf); extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *, unsigned int which, struct page **pages, u64 length, diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 802f0b222131..02439b38ec94 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -1052,6 +1052,7 @@ static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor) case CEPH_MSG_DATA_BVECS: ceph_msg_data_bvecs_cursor_init(cursor, length); break; + case CEPH_MSG_DATA_DATABUF: case CEPH_MSG_DATA_ITER: ceph_msg_data_iter_cursor_init(cursor, length); break; @@ -1102,6 +1103,7 @@ struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor, case CEPH_MSG_DATA_BVECS: page = ceph_msg_data_bvecs_next(cursor, page_offset, length); break; + case CEPH_MSG_DATA_DATABUF: case CEPH_MSG_DATA_ITER: page = ceph_msg_data_iter_next(cursor, page_offset, length); break; @@ -1143,6 +1145,7 @@ void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor, size_t bytes) case CEPH_MSG_DATA_BVECS: new_piece = ceph_msg_data_bvecs_advance(cursor, bytes); break; + case CEPH_MSG_DATA_DATABUF: case CEPH_MSG_DATA_ITER: new_piece = ceph_msg_data_iter_advance(cursor, bytes); break; diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index c84634264377..720d8a605fc4 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -178,6 +178,17 @@ static void ceph_osd_iter_init(struct ceph_osd_data *osd_data, osd_data->iter = *iter; } +/* + * Consumes a ref on @dbuf. 
+ */ +static void ceph_osd_databuf_init(struct ceph_osd_data *osd_data, + struct ceph_databuf *dbuf) +{ + osd_data->type = CEPH_OSD_DATA_TYPE_DATABUF; + osd_data->dbuf = dbuf; + osd_data->iter = dbuf->iter; +} + static struct ceph_osd_data * osd_req_op_raw_data_in(struct ceph_osd_request *osd_req, unsigned int which) { @@ -207,6 +218,17 @@ void osd_req_op_raw_data_in_pages(struct ceph_osd_request *osd_req, } EXPORT_SYMBOL(osd_req_op_raw_data_in_pages); +void osd_req_op_extent_osd_databuf(struct ceph_osd_request *osd_req, + unsigned int which, + struct ceph_databuf *dbuf) +{ + struct ceph_osd_data *osd_data; + + osd_data = osd_req_op_data(osd_req, which, extent, osd_data); + ceph_osd_databuf_init(osd_data, dbuf); +} +EXPORT_SYMBOL(osd_req_op_extent_osd_databuf); + void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *osd_req, unsigned int which, struct page **pages, u64 length, u32 offset, @@ -297,6 +319,21 @@ static void osd_req_op_cls_request_info_pagelist( ceph_osd_data_pagelist_init(osd_data, pagelist); } +void osd_req_op_cls_request_databuf(struct ceph_osd_request *osd_req, + unsigned int which, + struct ceph_databuf *dbuf) +{ + struct ceph_osd_data *osd_data; + + BUG_ON(!ceph_databuf_len(dbuf)); + + osd_data = osd_req_op_data(osd_req, which, cls, request_data); + ceph_osd_databuf_init(osd_data, dbuf); + osd_req->r_ops[which].cls.indata_len += ceph_databuf_len(dbuf); + osd_req->r_ops[which].indata_len += ceph_databuf_len(dbuf); +} +EXPORT_SYMBOL(osd_req_op_cls_request_databuf); + void osd_req_op_cls_request_data_pagelist( struct ceph_osd_request *osd_req, unsigned int which, struct ceph_pagelist *pagelist) @@ -342,6 +379,19 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req, } EXPORT_SYMBOL(osd_req_op_cls_request_data_bvecs); +void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req, + unsigned int which, + struct ceph_databuf *dbuf) +{ + struct ceph_osd_data *osd_data; + + BUG_ON(!ceph_databuf_len(dbuf)); + + osd_data = osd_req_op_data(osd_req, which, cls, response_data); + ceph_osd_databuf_init(osd_data, ceph_databuf_get(dbuf)); +} +EXPORT_SYMBOL(osd_req_op_cls_response_databuf); + void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req, unsigned int which, struct page **pages, u64 length, u32 offset, bool pages_from_pool, bool own_pages) From patchwork Thu Mar 13 23:32:58 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016054
From: David Howells To: Viacheslav Dubeyko, Alex Markuze Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 06/35] rbd: Use ceph_databuf for rbd_obj_read_sync() Date: Thu, 13 Mar 2025 23:32:58 +0000 Message-ID: <20250313233341.1675324-7-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Make rbd_obj_read_sync() allocate and use a ceph_databuf object to convey the data into the operation. The buffer has some space preallocated; this space is allocated with alloc_pages() and accessed with kmap_local() rather than being kmalloc'd, which allows MSG_SPLICE_PAGES to be used.
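As a rough sketch of the resulting calling pattern (illustrative only; it condenses the rbd_dev_v1_header_info() conversion in the diff below):

	dbuf = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL);
	if (!dbuf)
		return -ENOMEM;
	ondisk = kmap_ceph_databuf_page(dbuf, 0);	/* map page 0 of the buffer */
	ret = rbd_obj_read_sync(rbd_dev, &rbd_dev->header_oid,
				&rbd_dev->header_oloc, dbuf, size);
	/* ... validate and decode the header through ondisk ... */
	kunmap_local(ondisk);
	ceph_databuf_release(dbuf);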
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- drivers/block/rbd.c | 45 ++++++++++++++++++------------------------- 1 file changed, 20 insertions(+), 25 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index faafd7ff43d6..bb953634c7cb 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4822,13 +4822,10 @@ static void rbd_free_disk(struct rbd_device *rbd_dev) static int rbd_obj_read_sync(struct rbd_device *rbd_dev, struct ceph_object_id *oid, struct ceph_object_locator *oloc, - void *buf, int buf_len) - + struct ceph_databuf *dbuf, int len) { struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; struct ceph_osd_request *req; - struct page **pages; - int num_pages = calc_pages_for(0, buf_len); int ret; req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_KERNEL); @@ -4839,15 +4836,8 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev, ceph_oloc_copy(&req->r_base_oloc, oloc); req->r_flags = CEPH_OSD_FLAG_READ; - pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL); - if (IS_ERR(pages)) { - ret = PTR_ERR(pages); - goto out_req; - } - - osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, buf_len, 0, 0); - osd_req_op_extent_osd_data_pages(req, 0, pages, buf_len, 0, false, - true); + osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, len, 0, 0); + osd_req_op_extent_osd_databuf(req, 0, dbuf); ret = ceph_osdc_alloc_messages(req, GFP_KERNEL); if (ret) @@ -4855,9 +4845,6 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev, ceph_osdc_start_request(osdc, req); ret = ceph_osdc_wait_request(osdc, req); - if (ret >= 0) - ceph_copy_from_page_vector(pages, buf, 0, ret); - out_req: ceph_osdc_put_request(req); return ret; @@ -4872,12 +4859,18 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev, struct rbd_image_header *header, bool first_time) { - struct rbd_image_header_ondisk *ondisk = NULL; + struct rbd_image_header_ondisk *ondisk; + struct ceph_databuf *dbuf = NULL; u32 snap_count = 0; u64 names_size = 0; u32 want_count; int ret; + dbuf = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL); + if (!dbuf) + return -ENOMEM; + ondisk = kmap_ceph_databuf_page(dbuf, 0); + /* * The complete header will include an array of its 64-bit * snapshot ids, followed by the names of those snapshots as @@ -4888,17 +4881,18 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev, do { size_t size; - kfree(ondisk); - size = sizeof (*ondisk); size += snap_count * sizeof (struct rbd_image_snap_ondisk); size += names_size; - ondisk = kmalloc(size, GFP_KERNEL); - if (!ondisk) - return -ENOMEM; + + ret = -ENOMEM; + if (size > dbuf->limit && + ceph_databuf_reserve(dbuf, size - dbuf->limit, + GFP_KERNEL) < 0) + goto out; ret = rbd_obj_read_sync(rbd_dev, &rbd_dev->header_oid, - &rbd_dev->header_oloc, ondisk, size); + &rbd_dev->header_oloc, dbuf, size); if (ret < 0) goto out; if ((size_t)ret < size) { @@ -4907,6 +4901,7 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev, size, ret); goto out; } + if (!rbd_dev_ondisk_valid(ondisk)) { ret = -ENXIO; rbd_warn(rbd_dev, "invalid header"); @@ -4920,8 +4915,8 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev, ret = rbd_header_from_disk(header, ondisk, first_time); out: - kfree(ondisk); - + kunmap_local(ondisk); + ceph_databuf_release(dbuf); return ret; } From patchwork Thu Mar 13 23:32:59 2025
X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016055 From: David Howells To: Viacheslav Dubeyko, Alex Markuze Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 07/35] libceph: Change ceph_osdc_call()'s reply to a ceph_databuf Date: Thu, 13 Mar 2025 23:32:59 +0000 Message-ID: <20250313233341.1675324-8-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Change the type of ceph_osdc_call()'s reply buffer to a ceph_databuf struct rather than a list of pages, and access it with kmap_local() rather than page_address(). Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- drivers/block/rbd.c | 135 ++++++++++++++++++-------------- include/linux/ceph/osd_client.h | 2 +- net/ceph/cls_lock_client.c | 41 +++++----- net/ceph/osd_client.c | 16 ++-- 4 files changed, 109 insertions(+), 85 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index bb953634c7cb..073e80d2d966 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1826,9 +1826,8 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev) { struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; CEPH_DEFINE_OID_ONSTACK(oid); - struct page **pages; - void *p, *end; - size_t reply_len; + struct ceph_databuf *reply; + void *p, *q, *end; u64 num_objects; u64 object_map_bytes; u64 object_map_size; @@ -1842,48 +1841,57 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev) object_map_bytes = DIV_ROUND_UP_ULL(num_objects * BITS_PER_OBJ, BITS_PER_BYTE); num_pages = calc_pages_for(0, object_map_bytes) + 1; - pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL); - if (IS_ERR(pages)) - return PTR_ERR(pages); - reply_len = num_pages * PAGE_SIZE; + reply = ceph_databuf_reply_alloc(num_pages, num_pages * PAGE_SIZE, + GFP_KERNEL); + if (!reply) + return -ENOMEM; + rbd_object_map_name(rbd_dev, rbd_dev->spec->snap_id, &oid); ret = ceph_osdc_call(osdc, &oid, &rbd_dev->header_oloc, "rbd", "object_map_load", CEPH_OSD_FLAG_READ, - NULL, 0, pages, &reply_len); + NULL, 0, reply); if (ret) goto out; - p = page_address(pages[0]); - end = p + min(reply_len, (size_t)PAGE_SIZE); - ret = decode_object_map_header(&p, end, &object_map_size); + p = kmap_ceph_databuf_page(reply, 0); + end = p + min(ceph_databuf_len(reply), (size_t)PAGE_SIZE); + q = p; + ret = decode_object_map_header(&q, end, &object_map_size); if (ret) - goto out; + goto out_unmap; if (object_map_size != num_objects) { rbd_warn(rbd_dev, "object map size mismatch: %llu vs %llu", object_map_size, num_objects); ret = -EINVAL; - goto out; + goto out_unmap; } + iov_iter_advance(&reply->iter, q - p); - if (offset_in_page(p) + object_map_bytes > reply_len) { + if (object_map_bytes > ceph_databuf_len(reply)) { ret = -EINVAL; - goto out; + goto out_unmap; } rbd_dev->object_map = kvmalloc(object_map_bytes, GFP_KERNEL); if (!rbd_dev->object_map) { ret = -ENOMEM; - goto out; + goto out_unmap; } rbd_dev->object_map_size = object_map_size; - ceph_copy_from_page_vector(pages, rbd_dev->object_map, - offset_in_page(p), object_map_bytes); + ret = -EIO; + if (copy_from_iter(rbd_dev->object_map, object_map_bytes, + &reply->iter) != object_map_bytes) + goto out_unmap; + + ret = 0; +out_unmap: + kunmap_local(p); out: - ceph_release_page_vector(pages, num_pages); + ceph_databuf_release(reply); return ret; } @@ -1952,6
+1960,7 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req, { struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev; struct ceph_osd_data *osd_data; + struct ceph_databuf *dbuf; u64 objno; u8 state, new_state, current_state; bool has_current_state; @@ -1971,9 +1980,10 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req, */ rbd_assert(osd_req->r_num_ops == 2); osd_data = osd_req_op_data(osd_req, 1, cls, request_data); - rbd_assert(osd_data->type == CEPH_OSD_DATA_TYPE_PAGES); + rbd_assert(osd_data->type == CEPH_OSD_DATA_TYPE_DATABUF); + dbuf = osd_data->dbuf; - p = page_address(osd_data->pages[0]); + p = kmap_ceph_databuf_page(dbuf, 0); objno = ceph_decode_64(&p); rbd_assert(objno == obj_req->ex.oe_objno); rbd_assert(ceph_decode_64(&p) == objno + 1); @@ -1981,6 +1991,7 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req, has_current_state = ceph_decode_8(&p); if (has_current_state) current_state = ceph_decode_8(&p); + kunmap_local(p); spin_lock(&rbd_dev->object_map_lock); state = __rbd_object_map_get(rbd_dev, objno); @@ -2020,7 +2031,7 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req, int which, u64 objno, u8 new_state, const u8 *current_state) { - struct page **pages; + struct ceph_databuf *dbuf; void *p, *start; int ret; @@ -2028,11 +2039,11 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req, if (ret) return ret; - pages = ceph_alloc_page_vector(1, GFP_NOIO); - if (IS_ERR(pages)) - return PTR_ERR(pages); + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO); + if (!dbuf) + return -ENOMEM; - p = start = page_address(pages[0]); + p = start = kmap_ceph_databuf_page(dbuf, 0); ceph_encode_64(&p, objno); ceph_encode_64(&p, objno + 1); ceph_encode_8(&p, new_state); @@ -2042,9 +2053,10 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req, } else { ceph_encode_8(&p, 0); } + kunmap_local(p); + ceph_databuf_added_data(dbuf, p - start); - osd_req_op_cls_request_data_pages(req, which, pages, p - start, 0, - false, true); + osd_req_op_cls_request_databuf(req, which, dbuf); return 0; } @@ -4673,8 +4685,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev, size_t inbound_size) { struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; + struct ceph_databuf *reply; struct page *req_page = NULL; - struct page *reply_page; int ret; /* @@ -4695,8 +4707,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev, memcpy(page_address(req_page), outbound, outbound_size); } - reply_page = alloc_page(GFP_KERNEL); - if (!reply_page) { + reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL); + if (!reply) { if (req_page) __free_page(req_page); return -ENOMEM; @@ -4704,15 +4716,16 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev, ret = ceph_osdc_call(osdc, oid, oloc, RBD_DRV_NAME, method_name, CEPH_OSD_FLAG_READ, req_page, outbound_size, - &reply_page, &inbound_size); + reply); if (!ret) { - memcpy(inbound, page_address(reply_page), inbound_size); - ret = inbound_size; + ret = ceph_databuf_len(reply); + if (copy_from_iter(inbound, ret, &reply->iter) != ret) + ret = -EIO; } if (req_page) __free_page(req_page); - __free_page(reply_page); + ceph_databuf_release(reply); return ret; } @@ -5633,7 +5646,7 @@ static int decode_parent_image_spec(void **p, void *end, static int __get_parent_info(struct rbd_device *rbd_dev, struct page *req_page, - struct page *reply_page, + struct ceph_databuf *reply, struct parent_image_info *pii) { struct ceph_osd_client 
*osdc = &rbd_dev->rbd_client->client->osdc; @@ -5643,27 +5656,31 @@ static int __get_parent_info(struct rbd_device *rbd_dev, ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, "rbd", "parent_get", CEPH_OSD_FLAG_READ, - req_page, sizeof(u64), &reply_page, &reply_len); + req_page, sizeof(u64), reply); if (ret) return ret == -EOPNOTSUPP ? 1 : ret; - p = page_address(reply_page); + p = kmap_ceph_databuf_page(reply, 0); end = p + reply_len; ret = decode_parent_image_spec(&p, end, pii); + kunmap_local(p); if (ret) return ret; + ceph_databuf_reset_reply(reply); + ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, "rbd", "parent_overlap_get", CEPH_OSD_FLAG_READ, - req_page, sizeof(u64), &reply_page, &reply_len); + req_page, sizeof(u64), reply); if (ret) return ret; - p = page_address(reply_page); + p = kmap_ceph_databuf_page(reply, 0); end = p + reply_len; ceph_decode_8_safe(&p, end, pii->has_overlap, e_inval); if (pii->has_overlap) ceph_decode_64_safe(&p, end, pii->overlap, e_inval); + kunmap_local(p); dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n", __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id, @@ -5679,25 +5696,25 @@ static int __get_parent_info(struct rbd_device *rbd_dev, */ static int __get_parent_info_legacy(struct rbd_device *rbd_dev, struct page *req_page, - struct page *reply_page, + struct ceph_databuf *reply, struct parent_image_info *pii) { struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; - size_t reply_len = PAGE_SIZE; void *p, *end; int ret; ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, "rbd", "get_parent", CEPH_OSD_FLAG_READ, - req_page, sizeof(u64), &reply_page, &reply_len); + req_page, sizeof(u64), reply); if (ret) return ret; - p = page_address(reply_page); - end = p + reply_len; + p = kmap_ceph_databuf_page(reply, 0); + end = p + ceph_databuf_len(reply); ceph_decode_64_safe(&p, end, pii->pool_id, e_inval); pii->image_id = ceph_extract_encoded_string(&p, end, NULL, GFP_KERNEL); if (IS_ERR(pii->image_id)) { + kunmap_local(p); ret = PTR_ERR(pii->image_id); pii->image_id = NULL; return ret; @@ -5705,6 +5722,7 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev, ceph_decode_64_safe(&p, end, pii->snap_id, e_inval); pii->has_overlap = true; ceph_decode_64_safe(&p, end, pii->overlap, e_inval); + kunmap_local(p); dout("%s pool_id %llu pool_ns %s image_id %s snap_id %llu has_overlap %d overlap %llu\n", __func__, pii->pool_id, pii->pool_ns, pii->image_id, pii->snap_id, @@ -5718,29 +5736,30 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev, static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev, struct parent_image_info *pii) { - struct page *req_page, *reply_page; + struct ceph_databuf *reply; + struct page *req_page; void *p; - int ret; + int ret = -ENOMEM; req_page = alloc_page(GFP_KERNEL); if (!req_page) - return -ENOMEM; + goto out; - reply_page = alloc_page(GFP_KERNEL); - if (!reply_page) { - __free_page(req_page); - return -ENOMEM; - } + reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_KERNEL); + if (!reply) + goto out_free; - p = page_address(req_page); + p = kmap_local_page(req_page); ceph_encode_64(&p, rbd_dev->spec->snap_id); - ret = __get_parent_info(rbd_dev, req_page, reply_page, pii); + kunmap_local(p); + ret = __get_parent_info(rbd_dev, req_page, reply, pii); if (ret > 0) - ret = __get_parent_info_legacy(rbd_dev, req_page, reply_page, - pii); + ret = __get_parent_info_legacy(rbd_dev, req_page, 
reply, pii); + ceph_databuf_release(reply); +out_free: __free_page(req_page); - __free_page(reply_page); +out: return ret; } diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 172ee515a0f3..57b8aff53f28 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -610,7 +610,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, const char *class, const char *method, unsigned int flags, struct page *req_page, size_t req_len, - struct page **resp_pages, size_t *resp_len); + struct ceph_databuf *response); /* watch/notify */ struct ceph_osd_linger_request * diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c index 66136a4c1ce7..37bb8708e8bb 100644 --- a/net/ceph/cls_lock_client.c +++ b/net/ceph/cls_lock_client.c @@ -74,7 +74,7 @@ int ceph_cls_lock(struct ceph_osd_client *osdc, __func__, lock_name, type, cookie, tag, desc, flags); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "lock", CEPH_OSD_FLAG_WRITE, lock_op_page, - lock_op_buf_size, NULL, NULL); + lock_op_buf_size, NULL); dout("%s: status %d\n", __func__, ret); __free_page(lock_op_page); @@ -124,7 +124,7 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc, dout("%s lock_name %s cookie %s\n", __func__, lock_name, cookie); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock", CEPH_OSD_FLAG_WRITE, unlock_op_page, - unlock_op_buf_size, NULL, NULL); + unlock_op_buf_size, NULL); dout("%s: status %d\n", __func__, ret); __free_page(unlock_op_page); @@ -179,7 +179,7 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc, cookie, ENTITY_NAME(*locker)); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "break_lock", CEPH_OSD_FLAG_WRITE, break_op_page, - break_op_buf_size, NULL, NULL); + break_op_buf_size, NULL); dout("%s: status %d\n", __func__, ret); __free_page(break_op_page); @@ -230,7 +230,7 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc, __func__, lock_name, type, old_cookie, tag, new_cookie); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "set_cookie", CEPH_OSD_FLAG_WRITE, cookie_op_page, - cookie_op_buf_size, NULL, NULL); + cookie_op_buf_size, NULL); dout("%s: status %d\n", __func__, ret); __free_page(cookie_op_page); @@ -337,10 +337,10 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, char *lock_name, u8 *type, char **tag, struct ceph_locker **lockers, u32 *num_lockers) { + struct ceph_databuf *reply; int get_info_op_buf_size; int name_len = strlen(lock_name); - struct page *get_info_op_page, *reply_page; - size_t reply_len = PAGE_SIZE; + struct page *get_info_op_page; void *p, *end; int ret; @@ -353,8 +353,8 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, if (!get_info_op_page) return -ENOMEM; - reply_page = alloc_page(GFP_NOIO); - if (!reply_page) { + reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO); + if (!reply) { __free_page(get_info_op_page); return -ENOMEM; } @@ -370,18 +370,19 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, dout("%s lock_name %s\n", __func__, lock_name); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info", CEPH_OSD_FLAG_READ, get_info_op_page, - get_info_op_buf_size, &reply_page, &reply_len); + get_info_op_buf_size, reply); dout("%s: status %d\n", __func__, ret); if (ret >= 0) { - p = page_address(reply_page); - end = p + reply_len; + p = kmap_ceph_databuf_page(reply, 0); + end = p + ceph_databuf_len(reply); ret = decode_lockers(&p, end, type, tag, lockers, num_lockers); + kunmap_local(p); } __free_page(get_info_op_page); - __free_page(reply_page); + ceph_databuf_release(reply); return ret; } 
EXPORT_SYMBOL(ceph_cls_lock_info); @@ -389,11 +390,11 @@ EXPORT_SYMBOL(ceph_cls_lock_info); int ceph_cls_assert_locked(struct ceph_osd_request *req, int which, char *lock_name, u8 type, char *cookie, char *tag) { + struct ceph_databuf *dbuf; int assert_op_buf_size; int name_len = strlen(lock_name); int cookie_len = strlen(cookie); int tag_len = strlen(tag); - struct page **pages; void *p, *end; int ret; @@ -408,11 +409,11 @@ int ceph_cls_assert_locked(struct ceph_osd_request *req, int which, if (ret) return ret; - pages = ceph_alloc_page_vector(1, GFP_NOIO); - if (IS_ERR(pages)) - return PTR_ERR(pages); + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO); + if (!dbuf) + return -ENOMEM; - p = page_address(pages[0]); + p = kmap_ceph_databuf_page(dbuf, 0); end = p + assert_op_buf_size; /* encode cls_lock_assert_op struct */ @@ -422,10 +423,12 @@ int ceph_cls_assert_locked(struct ceph_osd_request *req, int which, ceph_encode_8(&p, type); ceph_encode_string(&p, end, cookie, cookie_len); ceph_encode_string(&p, end, tag, tag_len); + kunmap_local(p); WARN_ON(p != end); + ceph_databuf_added_data(dbuf, assert_op_buf_size); - osd_req_op_cls_request_data_pages(req, which, pages, assert_op_buf_size, - 0, false, true); + osd_req_op_cls_request_databuf(req, which, dbuf); return 0; } EXPORT_SYMBOL(ceph_cls_assert_locked); + diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 720d8a605fc4..b6cf875d3de4 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -5195,7 +5195,10 @@ EXPORT_SYMBOL(ceph_osdc_maybe_request_map); * Execute an OSD class method on an object. * * @flags: CEPH_OSD_FLAG_* - * @resp_len: in/out param for reply length + * @response: Pointer to the storage descriptor for the reply or NULL. + * + * The size of the response buffer is set by the caller in @response->limit and + * the size of the response obtained is set in @response->iter.
*/ int ceph_osdc_call(struct ceph_osd_client *osdc, struct ceph_object_id *oid, @@ -5203,7 +5206,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, const char *class, const char *method, unsigned int flags, struct page *req_page, size_t req_len, - struct page **resp_pages, size_t *resp_len) + struct ceph_databuf *response) { struct ceph_osd_request *req; int ret; @@ -5226,9 +5229,8 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, if (req_page) osd_req_op_cls_request_data_pages(req, 0, &req_page, req_len, 0, false, false); - if (resp_pages) - osd_req_op_cls_response_data_pages(req, 0, resp_pages, - *resp_len, 0, false, false); + if (response) + osd_req_op_cls_response_databuf(req, 0, response); ret = ceph_osdc_alloc_messages(req, GFP_NOIO); if (ret) @@ -5238,8 +5240,8 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, ret = ceph_osdc_wait_request(osdc, req); if (ret >= 0) { ret = req->r_ops[0].rval; - if (resp_pages) - *resp_len = req->r_ops[0].outdata_len; + if (response) + ceph_databuf_reply_ready(response, req->r_ops[0].outdata_len); } out_put_req: From patchwork Thu Mar 13 23:33:00 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016056
From: David Howells To: Viacheslav Dubeyko, Alex Markuze Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 08/35] libceph: Unexport osd_req_op_cls_request_data_pages() Date: Thu, 13 Mar 2025 23:33:00 +0000 Message-ID: <20250313233341.1675324-9-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Unexport osd_req_op_cls_request_data_pages() as it's not used outside of the file in which it is defined and it will be replaced.
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- include/linux/ceph/osd_client.h | 5 ----- net/ceph/osd_client.c | 3 +-- 2 files changed, 1 insertion(+), 7 deletions(-) diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 57b8aff53f28..60f28fc0238b 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -521,11 +521,6 @@ void osd_req_op_cls_request_databuf(struct ceph_osd_request *req, extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *, unsigned int which, struct ceph_pagelist *pagelist); -extern void osd_req_op_cls_request_data_pages(struct ceph_osd_request *, - unsigned int which, - struct page **pages, u64 length, - u32 offset, bool pages_from_pool, - bool own_pages); void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req, unsigned int which, struct bio_vec *bvecs, u32 num_bvecs, diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index b6cf875d3de4..10827b1227e4 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -347,7 +347,7 @@ void osd_req_op_cls_request_data_pagelist( } EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist); -void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req, +static void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req, unsigned int which, struct page **pages, u64 length, u32 offset, bool pages_from_pool, bool own_pages) { @@ -359,7 +359,6 @@ void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req, osd_req->r_ops[which].cls.indata_len += length; osd_req->r_ops[which].indata_len += length; } -EXPORT_SYMBOL(osd_req_op_cls_request_data_pages); void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req, unsigned int which, From patchwork Thu Mar 13 23:33:01 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016057
From: David Howells To: Viacheslav Dubeyko, Alex Markuze Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 09/35] libceph: Remove osd_req_op_cls_response_data_pages() Date: Thu, 13 Mar 2025 23:33:01 +0000 Message-ID: <20250313233341.1675324-10-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Remove osd_req_op_cls_response_data_pages() as it's no longer used.
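For context, the conversion pattern that made this function redundant looks like this (sketch taken from the ceph_osdc_call() change earlier in the series):

	/* Before: the response landed in a caller-supplied page vector */
	osd_req_op_cls_response_data_pages(req, 0, resp_pages, *resp_len, 0,
					   false, false);

	/* After: a ceph_databuf supplies both the page pin and the iterator */
	osd_req_op_cls_response_databuf(req, 0, response);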
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- include/linux/ceph/osd_client.h | 5 ----- net/ceph/osd_client.c | 12 ------------ 2 files changed, 17 deletions(-) diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 60f28fc0238b..fe51c6ed23af 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -528,11 +528,6 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req, void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req, unsigned int which, struct ceph_databuf *dbuf); -extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *, - unsigned int which, - struct page **pages, u64 length, - u32 offset, bool pages_from_pool, - bool own_pages); int osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which, const char *class, const char *method); extern int osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned int which, diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 10827b1227e4..e1dbde4bf2b9 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -391,18 +391,6 @@ void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req, } EXPORT_SYMBOL(osd_req_op_cls_response_databuf); -void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req, - unsigned int which, struct page **pages, u64 length, - u32 offset, bool pages_from_pool, bool own_pages) -{ - struct ceph_osd_data *osd_data; - - osd_data = osd_req_op_data(osd_req, which, cls, response_data); - ceph_osd_data_pages_init(osd_data, pages, length, offset, - pages_from_pool, own_pages); -} -EXPORT_SYMBOL(osd_req_op_cls_response_data_pages); - static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data) { switch (osd_data->type) { From patchwork Thu Mar 13 23:33:02 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016058
From: David Howells To: Viacheslav Dubeyko, Alex Markuze Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 10/35] libceph: Convert notify_id_pages to a ceph_databuf Date: Thu, 13 Mar 2025 23:33:02 +0000 Message-ID: <20250313233341.1675324-11-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Convert linger->notify_id_pages to a ceph_databuf. Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- include/linux/ceph/osd_client.h | 2 +- net/ceph/osd_client.c | 24 +++++++++++++----------- 2 files changed, 14 insertions(+), 12 deletions(-) diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index fe51c6ed23af..5ac4c0b4dfcd 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -349,7 +349,7 @@ struct ceph_osd_linger_request { void *data; struct ceph_pagelist *request_pl; - struct page **notify_id_pages; + struct ceph_databuf *notify_id_buf; struct page ***preply_pages; size_t *preply_len; diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index e1dbde4bf2b9..fc5c136e793d 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -2841,9 +2841,7 @@ static void linger_release(struct kref *kref) if (lreq->request_pl) ceph_pagelist_release(lreq->request_pl);
- if (lreq->notify_id_pages) - ceph_release_page_vector(lreq->notify_id_pages, 1); - + ceph_databuf_release(lreq->notify_id_buf); ceph_osdc_put_request(lreq->reg_req); ceph_osdc_put_request(lreq->ping_req); target_destroy(&lreq->t); @@ -3128,10 +3126,13 @@ static void linger_commit_cb(struct ceph_osd_request *req) if (!lreq->is_watch) { struct ceph_osd_data *osd_data = osd_req_op_data(req, 0, notify, response_data); - void *p = page_address(osd_data->pages[0]); + struct ceph_databuf *notify_id_buf = lreq->notify_id_buf; + void *p; WARN_ON(req->r_ops[0].op != CEPH_OSD_OP_NOTIFY || - osd_data->type != CEPH_OSD_DATA_TYPE_PAGES); + osd_data->type != CEPH_OSD_DATA_TYPE_DATABUF); + + p = kmap_ceph_databuf_page(notify_id_buf, 0); /* make note of the notify_id */ if (req->r_ops[0].outdata_len >= sizeof(u64)) { @@ -3141,6 +3142,8 @@ static void linger_commit_cb(struct ceph_osd_request *req) } else { dout("lreq %p no notify_id\n", lreq); } + + kunmap_local(p); } out: @@ -3224,9 +3227,9 @@ static void send_linger(struct ceph_osd_linger_request *lreq) refcount_inc(&lreq->request_pl->refcnt); osd_req_op_notify_init(req, 0, lreq->linger_id, lreq->request_pl); - ceph_osd_data_pages_init( + ceph_osd_databuf_init( osd_req_op_data(req, 0, notify, response_data), - lreq->notify_id_pages, PAGE_SIZE, 0, false, false); + ceph_databuf_get(lreq->notify_id_buf)); } dout("lreq %p register\n", lreq); req->r_callback = linger_commit_cb; @@ -5016,10 +5019,9 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc, } /* for notify_id */ - lreq->notify_id_pages = ceph_alloc_page_vector(1, GFP_NOIO); - if (IS_ERR(lreq->notify_id_pages)) { - ret = PTR_ERR(lreq->notify_id_pages); - lreq->notify_id_pages = NULL; + lreq->notify_id_buf = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO); + if (!lreq->notify_id_buf) { + ret = -ENOMEM; goto out_put_lreq; } From patchwork Thu Mar 13 23:33:03 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016059
smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="A2a9nc+5" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1741908876; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=N3o0+rtD0ocRw1P10cTGKXFQGnE8YrSB/y3t2jOjxOM=; b=A2a9nc+5qvMZIhADNFzEw1tgOKxEP0lfHgNAYNVbaiD0FPYF1RXvF4QTrDVVxbIB3TJdM8 u+oku1mHxsFL9pLnkKUU+wUWudRDWlV9CjlcbrYY8sLwcWHDnhLjY5K9P2DY6XYZbKdFpv o/43wwjfa2mk5+MV8F5J/VcT7rfdwHc= Received: from mx-prod-mc-02.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-473-8x87g4PeOlG9B_EVVAIkNQ-1; Thu, 13 Mar 2025 19:34:33 -0400 X-MC-Unique: 8x87g4PeOlG9B_EVVAIkNQ-1 X-Mimecast-MFC-AGG-ID: 8x87g4PeOlG9B_EVVAIkNQ_1741908872 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-02.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 03CC519560AB; Thu, 13 Mar 2025 23:34:32 +0000 (UTC) Received: from warthog.procyon.org.com (unknown [10.42.28.61]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 809E0300376F; Thu, 13 Mar 2025 23:34:29 +0000 (UTC) From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang , ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 11/35] ceph: Use ceph_databuf in DIO Date: Thu, 13 Mar 2025 23:33:03 +0000 Message-ID: <20250313233341.1675324-12-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List: ceph-devel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 Stash the list of pages to be read into/written from during a ceph fs direct read/write in a ceph_databuf struct rather than using a bvec array. Eventually this will be replaced with just an iterator supplied by netfslib. 
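As a condensed sketch of the new shape (illustration only; it assumes the
ceph_databuf helpers added earlier in this series and elides the error
unwinding of already-pinned pages that the real __iter_get_bvecs() has to
do):

	static struct ceph_databuf *dio_pin_pages(struct iov_iter *iter,
						  size_t maxsize)
	{
		size_t orig_count = iov_iter_count(iter);
		struct ceph_databuf *dbuf;
		struct page *pages[16];
		size_t start;
		ssize_t bytes;
		int npages;

		/* Size the bvec table from the truncated iterator. */
		iov_iter_truncate(iter, maxsize);
		npages = iov_iter_npages(iter, INT_MAX);
		iov_iter_reexpand(iter, orig_count);

		dbuf = ceph_databuf_req_alloc(npages, 0, GFP_KERNEL);
		if (!dbuf)
			return ERR_PTR(-ENOMEM);

		while (maxsize) {
			int i = 0;

			bytes = iov_iter_get_pages2(iter, pages, maxsize,
						    ARRAY_SIZE(pages), &start);
			if (bytes <= 0)
				break;
			maxsize -= bytes;

			while (bytes) {
				size_t len = min_t(size_t, bytes,
						   PAGE_SIZE - start);

				/* The databuf takes over the page refs. */
				ceph_databuf_append_page(dbuf, pages[i++],
							 start, len);
				bytes -= len;
				start = 0;
			}
		}
		return dbuf;
	}

The point of the conversion is that the pinned pages, their offsets and
lengths all land in one refcounted container that can be handed directly
to the OSD request rather than carried around as a separate bvec array
plus count.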
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- fs/ceph/file.c | 110 +++++++++++++++++++++---------------------------- 1 file changed, 47 insertions(+), 63 deletions(-) diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 9de2960748b9..fb4024bc8274 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -82,11 +82,10 @@ static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags) */ #define ITER_GET_BVECS_PAGES 64 -static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize, - struct bio_vec *bvecs) +static int __iter_get_bvecs(struct iov_iter *iter, size_t maxsize, + struct ceph_databuf *dbuf) { size_t size = 0; - int bvec_idx = 0; if (maxsize > iov_iter_count(iter)) maxsize = iov_iter_count(iter); @@ -98,22 +97,24 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize, int idx = 0; bytes = iov_iter_get_pages2(iter, pages, maxsize - size, - ITER_GET_BVECS_PAGES, &start); - if (bytes < 0) - return size ?: bytes; - - size += bytes; + ITER_GET_BVECS_PAGES, &start); + if (bytes < 0) { + if (size == 0) + return bytes; + break; + } - for ( ; bytes; idx++, bvec_idx++) { + while (bytes) { int len = min_t(int, bytes, PAGE_SIZE - start); - bvec_set_page(&bvecs[bvec_idx], pages[idx], len, start); + ceph_databuf_append_page(dbuf, pages[idx++], start, len); bytes -= len; + size += len; start = 0; } } - return size; + return 0; } /* @@ -124,52 +125,44 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize, * Attempt to get up to @maxsize bytes worth of pages from @iter. * Return the number of bytes in the created bio_vec array, or an error. */ -static ssize_t iter_get_bvecs_alloc(struct iov_iter *iter, size_t maxsize, - struct bio_vec **bvecs, int *num_bvecs) +static struct ceph_databuf *iter_get_bvecs_alloc(struct iov_iter *iter, + size_t maxsize, bool write) { - struct bio_vec *bv; + struct ceph_databuf *dbuf; size_t orig_count = iov_iter_count(iter); - ssize_t bytes; - int npages; + int npages, ret; iov_iter_truncate(iter, maxsize); npages = iov_iter_npages(iter, INT_MAX); iov_iter_reexpand(iter, orig_count); - /* - * __iter_get_bvecs() may populate only part of the array -- zero it - * out. - */ - bv = kvmalloc_array(npages, sizeof(*bv), GFP_KERNEL | __GFP_ZERO); - if (!bv) - return -ENOMEM; + if (write) + dbuf = ceph_databuf_req_alloc(npages, 0, GFP_KERNEL); + else + dbuf = ceph_databuf_reply_alloc(npages, 0, GFP_KERNEL); + if (!dbuf) + return ERR_PTR(-ENOMEM); - bytes = __iter_get_bvecs(iter, maxsize, bv); - if (bytes < 0) { + ret = __iter_get_bvecs(iter, maxsize, dbuf); + if (ret < 0) { /* * No pages were pinned -- just free the array. 
*/ - kvfree(bv); - return bytes; + ceph_databuf_release(dbuf); + return ERR_PTR(ret); } - *bvecs = bv; - *num_bvecs = npages; - return bytes; + return dbuf; } -static void put_bvecs(struct bio_vec *bvecs, int num_bvecs, bool should_dirty) +static void ceph_dirty_pages(struct ceph_databuf *dbuf) { + struct bio_vec *bvec = dbuf->bvec; int i; - for (i = 0; i < num_bvecs; i++) { - if (bvecs[i].bv_page) { - if (should_dirty) - set_page_dirty_lock(bvecs[i].bv_page); - put_page(bvecs[i].bv_page); - } - } - kvfree(bvecs); + for (i = 0; i < dbuf->nr_bvec; i++) + if (bvec[i].bv_page) + set_page_dirty_lock(bvec[i].bv_page); } /* @@ -1338,14 +1331,11 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req) struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0); struct ceph_osd_req_op *op = &req->r_ops[0]; struct ceph_client_metric *metric = &ceph_sb_to_mdsc(inode->i_sb)->metric; - unsigned int len = osd_data->bvec_pos.iter.bi_size; + size_t len = osd_data->iter.count; bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ); struct ceph_client *cl = ceph_inode_to_client(inode); - BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_BVECS); - BUG_ON(!osd_data->num_bvecs); - - doutc(cl, "req %p inode %p %llx.%llx, rc %d bytes %u\n", req, + doutc(cl, "req %p inode %p %llx.%llx, rc %d bytes %zu\n", req, inode, ceph_vinop(inode), rc, len); if (rc == -EOLDSNAPC) { @@ -1367,7 +1357,6 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req) if (rc == -ENOENT) rc = 0; if (rc >= 0 && len > rc) { - struct iov_iter i; int zlen = len - rc; /* @@ -1384,10 +1373,8 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req) aio_req->total_len = rc + zlen; } - iov_iter_bvec(&i, ITER_DEST, osd_data->bvec_pos.bvecs, - osd_data->num_bvecs, len); - iov_iter_advance(&i, rc); - iov_iter_zero(zlen, &i); + iov_iter_advance(&osd_data->iter, rc); + iov_iter_zero(zlen, &osd_data->iter); } } @@ -1401,8 +1388,8 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req) req->r_end_latency, len, rc); } - put_bvecs(osd_data->bvec_pos.bvecs, osd_data->num_bvecs, - aio_req->should_dirty); + if (aio_req->should_dirty) + ceph_dirty_pages(osd_data->dbuf); ceph_osdc_put_request(req); if (rc < 0) @@ -1491,9 +1478,8 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter, struct ceph_client_metric *metric = &fsc->mdsc->metric; struct ceph_vino vino; struct ceph_osd_request *req; - struct bio_vec *bvecs; struct ceph_aio_request *aio_req = NULL; - int num_pages = 0; + struct ceph_databuf *dbuf = NULL; int flags; int ret = 0; struct timespec64 mtime = current_time(inode); @@ -1529,8 +1515,8 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter, while (iov_iter_count(iter) > 0) { u64 size = iov_iter_count(iter); - ssize_t len; struct ceph_osd_req_op *op; + size_t len; int readop = sparse ? 
					CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ;
		int extent_cnt;
@@ -1563,16 +1549,17 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
			}
		}
 
-		len = iter_get_bvecs_alloc(iter, size, &bvecs, &num_pages);
-		if (len < 0) {
+		dbuf = iter_get_bvecs_alloc(iter, size, write);
+		if (IS_ERR(dbuf)) {
 			ceph_osdc_put_request(req);
-			ret = len;
+			ret = PTR_ERR(dbuf);
 			break;
 		}
+		len = ceph_databuf_len(dbuf);
 		if (len != size)
 			osd_req_op_extent_update(req, 0, len);
 
-		osd_req_op_extent_osd_data_bvecs(req, 0, bvecs, num_pages, len);
+		osd_req_op_extent_osd_databuf(req, 0, dbuf);
 
 		/*
 		 * To simplify error handling, allow AIO when IO within i_size
@@ -1637,20 +1624,17 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 				ret = 0;
 			if (ret >= 0 && ret < len && pos + ret < size) {
-				struct iov_iter i;
 				int zlen = min_t(size_t, len - ret,
 						 size - pos - ret);
 
-				iov_iter_bvec(&i, ITER_DEST, bvecs, num_pages, len);
-				iov_iter_advance(&i, ret);
-				iov_iter_zero(zlen, &i);
+				iov_iter_advance(&dbuf->iter, ret);
+				iov_iter_zero(zlen, &dbuf->iter);
 				ret += zlen;
 			}
 			if (ret >= 0)
 				len = ret;
 		}
 
-		put_bvecs(bvecs, num_pages, should_dirty);
 		ceph_osdc_put_request(req);
 		if (ret < 0)
 			break;

From patchwork Thu Mar 13 23:33:04 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016060
From: David Howells
To: Viacheslav Dubeyko , Alex Markuze
Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang ,
 ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 12/35] libceph: Bypass the messenger-v1 Tx loop for
 databuf/iter data blobs
Date: Thu, 13 Mar 2025 23:33:04 +0000
Message-ID: <20250313233341.1675324-13-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Don't use the messenger-v1 Tx loop for databuf/iter data blobs, which
sends page fragments individually, but rather pass the entire iterator to
the socket in one go.  This uses the loop inside tcp_sendmsg() to do the
work and allows TCP to make better choices.
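In sketch form, the pattern is simply one sock_sendmsg() call covering the
whole iterator (a minimal illustration, not the messenger code itself; a
real caller must arm a write-space callback and retry when the send would
block, as the messenger does):

	/* Drain an iterator through a connected TCP socket.  Returns 1 when
	 * fully sent, 0 if the socket would block, or a negative error. */
	static int send_whole_iter(struct socket *sock, struct iov_iter *iter)
	{
		while (iov_iter_count(iter)) {
			struct msghdr msg = {
				.msg_iter  = *iter,
				.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL |
					     MSG_MORE,
			};
			int ret = sock_sendmsg(sock, &msg);

			if (ret == -EAGAIN)
				return 0;	/* retry when writable */
			if (ret < 0)
				return ret;
			/* Consume only what the socket accepted. */
			iov_iter_advance(iter, ret);
		}
		return 1;
	}

Handing TCP the full extent in one call lets it size its own segments
instead of being fed one page fragment at a time.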
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- include/linux/ceph/messenger.h | 1 + net/ceph/messenger.c | 1 + net/ceph/messenger_v1.c | 76 ++++++++++++++++++++++++++++------ 3 files changed, 65 insertions(+), 13 deletions(-) diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index 864aad369c91..1b646d0dff39 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -255,6 +255,7 @@ struct ceph_msg_data_cursor { }; struct { struct iov_iter iov_iter; + struct iov_iter crc_iter; unsigned int lastlen; }; }; diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 02439b38ec94..dc8082575e4f 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -975,6 +975,7 @@ static void ceph_msg_data_iter_cursor_init(struct ceph_msg_data_cursor *cursor, struct ceph_msg_data *data = cursor->data; cursor->iov_iter = data->iter; + cursor->crc_iter = data->iter; cursor->lastlen = 0; iov_iter_truncate(&cursor->iov_iter, length); cursor->resid = iov_iter_count(&cursor->iov_iter); diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c index 0cb61c76b9b8..d6464ac62b09 100644 --- a/net/ceph/messenger_v1.c +++ b/net/ceph/messenger_v1.c @@ -3,6 +3,7 @@ #include #include +#include #include #include #include @@ -74,6 +75,21 @@ static int ceph_tcp_sendmsg(struct socket *sock, struct kvec *iov, return r; } +static int ceph_tcp_sock_sendmsg(struct socket *sock, struct iov_iter *iter, + unsigned int flags) +{ + struct msghdr msg = { + .msg_iter = *iter, + .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL | flags, + }; + int r; + + r = sock_sendmsg(sock, &msg); + if (r == -EAGAIN) + r = 0; + return r; +} + /* * @more: MSG_MORE or 0. */ @@ -455,6 +471,24 @@ static int write_partial_kvec(struct ceph_connection *con) return ret; /* done! */ } +static size_t ceph_crc_from_iter(void *iter_from, size_t progress, + size_t len, void *priv, void *priv2) +{ + u32 *crc = priv; + + *crc = crc32c(*crc, iter_from, len); + return 0; +} + +static void ceph_calc_crc(struct iov_iter *iter, size_t count, u32 *crc) +{ + size_t done; + + done = iterate_and_advance_kernel(iter, count, crc, NULL, + ceph_crc_from_iter); + WARN_ON(done != count); +} + /* * Write as much message data payload as we can. If we finish, queue * up the footer. @@ -467,7 +501,7 @@ static int write_partial_message_data(struct ceph_connection *con) struct ceph_msg *msg = con->out_msg; struct ceph_msg_data_cursor *cursor = &msg->cursor; bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC); - u32 crc; + u32 crc = 0; dout("%s %p msg %p\n", __func__, con, msg); @@ -484,9 +518,6 @@ static int write_partial_message_data(struct ceph_connection *con) */ crc = do_datacrc ? 
		      le32_to_cpu(msg->footer.data_crc) : 0;
 
 	while (cursor->total_resid) {
-		struct page *page;
-		size_t page_offset;
-		size_t length;
 		int ret;
 
 		if (!cursor->resid) {
@@ -494,17 +525,36 @@ static int write_partial_message_data(struct ceph_connection *con)
 			continue;
 		}
 
-		page = ceph_msg_data_next(cursor, &page_offset, &length);
-		ret = ceph_tcp_sendpage(con->sock, page, page_offset, length,
-					MSG_MORE);
-		if (ret <= 0) {
-			if (do_datacrc)
-				msg->footer.data_crc = cpu_to_le32(crc);
+		if (cursor->data->type == CEPH_MSG_DATA_DATABUF ||
+		    cursor->data->type == CEPH_MSG_DATA_ITER) {
+			ret = ceph_tcp_sock_sendmsg(con->sock, &cursor->iov_iter,
+						    MSG_MORE);
+			if (ret <= 0) {
+				if (do_datacrc)
+					msg->footer.data_crc = cpu_to_le32(crc);
 
-			return ret;
+				return ret;
+			}
+			if (do_datacrc && cursor->need_crc)
+				ceph_calc_crc(&cursor->crc_iter, ret, &crc);
+		} else {
+			struct page *page;
+			size_t page_offset;
+			size_t length;
+
+			page = ceph_msg_data_next(cursor, &page_offset, &length);
+			ret = ceph_tcp_sendpage(con->sock, page, page_offset,
+						length, MSG_MORE);
+			if (ret <= 0) {
+				if (do_datacrc)
+					msg->footer.data_crc = cpu_to_le32(crc);
+
+				return ret;
+			}
+			if (do_datacrc && cursor->need_crc)
+				crc = ceph_crc32c_page(crc, page, page_offset,
+						       length);
 		}
-		if (do_datacrc && cursor->need_crc)
-			crc = ceph_crc32c_page(crc, page, page_offset, length);
 
 		ceph_msg_data_advance(cursor, (size_t)ret);
 	}

From patchwork Thu Mar 13 23:33:05 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016061
From: David Howells
To: Viacheslav Dubeyko , Alex Markuze
Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang ,
 ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Xiubo Li
Subject: [RFC PATCH 13/35] rbd: Switch from using bvec_iter to iov_iter
Date: Thu, 13 Mar 2025 23:33:05 +0000
Message-ID: <20250313233341.1675324-14-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Switch from using a ceph_bio_iter/ceph_bvec_iter for iterating over the
bio_vecs attached to the request to using a ceph_databuf with the bio_vecs
transcribed from the bio list.  This allows the entire bio bvec[] set to
be passed down to the socket (if unencrypted).
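In sketch form, the transcription amounts to the following (a simplified
illustration that assumes whole, un-advanced bios; the real
count_bio_bvecs()/copy_bio_bvecs() below also split fragments on
stripe-unit boundaries and resume mid-bio):

	/* Copy every bio_vec in a bio chain into a databuf.  Only the
	 * page/offset/len triples are copied; no page data is moved and no
	 * extra page references are taken here. */
	static void transcribe_bio_chain(struct ceph_databuf *dbuf,
					 struct bio *bio)
	{
		for (; bio; bio = bio->bi_next) {
			int i;

			for (i = 0; i < bio->bi_vcnt; i++) {
				const struct bio_vec *bv = &bio->bi_io_vec[i];

				ceph_databuf_append_page(dbuf, bv->bv_page,
							 bv->bv_offset,
							 bv->bv_len);
			}
		}
	}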
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: Xiubo Li cc: linux-fsdevel@vger.kernel.org --- drivers/block/rbd.c | 642 ++++++++++++++--------------------- include/linux/ceph/databuf.h | 22 ++ include/linux/ceph/striper.h | 58 +++- net/ceph/striper.c | 53 --- 4 files changed, 331 insertions(+), 444 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 073e80d2d966..dd22cea7ae89 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -46,6 +46,7 @@ #include #include #include +#include #include "rbd_types.h" @@ -214,13 +215,6 @@ struct pending_result { struct rbd_img_request; -enum obj_request_type { - OBJ_REQUEST_NODATA = 1, - OBJ_REQUEST_BIO, /* pointer into provided bio (list) */ - OBJ_REQUEST_BVECS, /* pointer into provided bio_vec array */ - OBJ_REQUEST_OWN_BVECS, /* private bio_vec array, doesn't own pages */ -}; - enum obj_operation_type { OBJ_OP_READ = 1, OBJ_OP_WRITE, @@ -295,18 +289,12 @@ struct rbd_obj_request { struct ceph_file_extent *img_extents; u32 num_img_extents; - union { - struct ceph_bio_iter bio_pos; - struct { - struct ceph_bvec_iter bvec_pos; - u32 bvec_count; - u32 bvec_idx; - }; - }; + unsigned int bvec_count; + struct iov_iter iter; + struct ceph_databuf *dbuf; enum rbd_obj_copyup_state copyup_state; - struct bio_vec *copyup_bvecs; - u32 copyup_bvec_count; + struct ceph_databuf *copyup_buf; struct list_head osd_reqs; /* w/ r_private_item */ @@ -330,7 +318,6 @@ enum rbd_img_state { struct rbd_img_request { struct rbd_device *rbd_dev; enum obj_operation_type op_type; - enum obj_request_type data_type; unsigned long flags; enum rbd_img_state state; union { @@ -1221,26 +1208,6 @@ static void rbd_dev_mapping_clear(struct rbd_device *rbd_dev) rbd_dev->mapping.size = 0; } -static void zero_bios(struct ceph_bio_iter *bio_pos, u32 off, u32 bytes) -{ - struct ceph_bio_iter it = *bio_pos; - - ceph_bio_iter_advance(&it, off); - ceph_bio_iter_advance_step(&it, bytes, ({ - memzero_bvec(&bv); - })); -} - -static void zero_bvecs(struct ceph_bvec_iter *bvec_pos, u32 off, u32 bytes) -{ - struct ceph_bvec_iter it = *bvec_pos; - - ceph_bvec_iter_advance(&it, off); - ceph_bvec_iter_advance_step(&it, bytes, ({ - memzero_bvec(&bv); - })); -} - /* * Zero a range in @obj_req data buffer defined by a bio (list) or * (private) bio_vec array. 
@@ -1252,17 +1219,9 @@ static void rbd_obj_zero_range(struct rbd_obj_request *obj_req, u32 off, { dout("%s %p data buf %u~%u\n", __func__, obj_req, off, bytes); - switch (obj_req->img_request->data_type) { - case OBJ_REQUEST_BIO: - zero_bios(&obj_req->bio_pos, off, bytes); - break; - case OBJ_REQUEST_BVECS: - case OBJ_REQUEST_OWN_BVECS: - zero_bvecs(&obj_req->bvec_pos, off, bytes); - break; - default: - BUG(); - } + iov_iter_advance(&obj_req->dbuf->iter, off); + iov_iter_zero(bytes, &obj_req->dbuf->iter); + iov_iter_revert(&obj_req->dbuf->iter, off); } static void rbd_obj_request_destroy(struct kref *kref); @@ -1487,7 +1446,6 @@ static void rbd_obj_request_destroy(struct kref *kref) { struct rbd_obj_request *obj_request; struct ceph_osd_request *osd_req; - u32 i; obj_request = container_of(kref, struct rbd_obj_request, kref); @@ -1500,27 +1458,8 @@ static void rbd_obj_request_destroy(struct kref *kref) ceph_osdc_put_request(osd_req); } - switch (obj_request->img_request->data_type) { - case OBJ_REQUEST_NODATA: - case OBJ_REQUEST_BIO: - case OBJ_REQUEST_BVECS: - break; /* Nothing to do */ - case OBJ_REQUEST_OWN_BVECS: - kfree(obj_request->bvec_pos.bvecs); - break; - default: - BUG(); - } - kfree(obj_request->img_extents); - if (obj_request->copyup_bvecs) { - for (i = 0; i < obj_request->copyup_bvec_count; i++) { - if (obj_request->copyup_bvecs[i].bv_page) - __free_page(obj_request->copyup_bvecs[i].bv_page); - } - kfree(obj_request->copyup_bvecs); - } - + ceph_databuf_release(obj_request->copyup_buf); kmem_cache_free(rbd_obj_request_cache, obj_request); } @@ -1855,7 +1794,7 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev) goto out; p = kmap_ceph_databuf_page(reply, 0); - end = p + min(ceph_databuf_len(reply), (size_t)PAGE_SIZE); + end = p + umin(ceph_databuf_len(reply), PAGE_SIZE); q = p; ret = decode_object_map_header(&q, end, &object_map_size); if (ret) @@ -2167,29 +2106,6 @@ static int rbd_obj_calc_img_extents(struct rbd_obj_request *obj_req, return 0; } -static void rbd_osd_setup_data(struct ceph_osd_request *osd_req, int which) -{ - struct rbd_obj_request *obj_req = osd_req->r_priv; - - switch (obj_req->img_request->data_type) { - case OBJ_REQUEST_BIO: - osd_req_op_extent_osd_data_bio(osd_req, which, - &obj_req->bio_pos, - obj_req->ex.oe_len); - break; - case OBJ_REQUEST_BVECS: - case OBJ_REQUEST_OWN_BVECS: - rbd_assert(obj_req->bvec_pos.iter.bi_size == - obj_req->ex.oe_len); - rbd_assert(obj_req->bvec_idx == obj_req->bvec_count); - osd_req_op_extent_osd_data_bvec_pos(osd_req, which, - &obj_req->bvec_pos); - break; - default: - BUG(); - } -} - static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which) { struct page **pages; @@ -2223,8 +2139,7 @@ static int rbd_osd_setup_copyup(struct ceph_osd_request *osd_req, int which, if (ret) return ret; - osd_req_op_cls_request_data_bvecs(osd_req, which, obj_req->copyup_bvecs, - obj_req->copyup_bvec_count, bytes); + osd_req_op_cls_request_databuf(osd_req, which, obj_req->copyup_buf); return 0; } @@ -2256,7 +2171,7 @@ static void __rbd_osd_setup_write_ops(struct ceph_osd_request *osd_req, osd_req_op_extent_init(osd_req, which, opcode, obj_req->ex.oe_off, obj_req->ex.oe_len, 0, 0); - rbd_osd_setup_data(osd_req, which); + osd_req_op_extent_osd_databuf(osd_req, which, obj_req->dbuf); } static int rbd_obj_init_write(struct rbd_obj_request *obj_req) @@ -2427,6 +2342,19 @@ static void rbd_osd_setup_write_ops(struct ceph_osd_request *osd_req, } } +static struct ceph_object_extent *alloc_object_extent(void *arg) +{ + struct 
rbd_img_request *img_req = arg; + struct rbd_obj_request *obj_req; + + obj_req = rbd_obj_request_create(); + if (!obj_req) + return NULL; + + rbd_img_obj_request_add(img_req, obj_req); + return &obj_req->ex; +} + /* * Prune the list of object requests (adjust offset and/or length, drop * redundant requests). Prepare object request state machines and image @@ -2466,104 +2394,232 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req) return 0; } -union rbd_img_fill_iter { - struct ceph_bio_iter bio_iter; - struct ceph_bvec_iter bvec_iter; -}; +/* + * Handle ranged, but dataless ops such as DISCARD and ZEROOUT. + */ +static int rbd_img_fill_nodata(struct rbd_img_request *img_req, + u64 off, u64 len) +{ + int ret; + + ret = ceph_file_to_extents(&img_req->rbd_dev->layout, off, len, + &img_req->object_extents, + alloc_object_extent, img_req, + NULL, NULL); + if (ret) + return ret; -struct rbd_img_fill_ctx { - enum obj_request_type pos_type; - union rbd_img_fill_iter *pos; - union rbd_img_fill_iter iter; - ceph_object_extent_fn_t set_pos_fn; - ceph_object_extent_fn_t count_fn; - ceph_object_extent_fn_t copy_fn; + return __rbd_img_fill_request(img_req); +} + +struct rbd_bio_iter { + const struct bio *first_bio; + const struct bio *bio; + size_t skip; + unsigned int bvix; }; -static struct ceph_object_extent *alloc_object_extent(void *arg) +static void rbd_start_bio_iteration(struct rbd_bio_iter *iter, struct bio *bio) { - struct rbd_img_request *img_req = arg; - struct rbd_obj_request *obj_req; + iter->bio = bio; + iter->bvix = 0; + iter->skip = 0; +} - obj_req = rbd_obj_request_create(); - if (!obj_req) - return NULL; +static void count_bio_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg) +{ + struct rbd_obj_request *obj_req = container_of(ex, struct rbd_obj_request, ex); + struct rbd_bio_iter *iter = arg; + const struct bio *bio; + unsigned int need_bv = obj_req->bvec_count, i = 0; + size_t skip; + + /* Count the number of bvecs we need. */ + skip = iter->skip; + bio = iter->bio; + while (bio) { + for (i = iter->bvix; i < bio->bi_vcnt; i++, skip = 0) { + const struct bio_vec *bv = bio->bi_io_vec + i; + size_t part = umin(bytes, bv->bv_len - skip); + + if (!part) + continue; - rbd_img_obj_request_add(img_req, obj_req); - return &obj_req->ex; + need_bv++; + skip += part; + bytes -= part; + if (!bytes) + goto done; + } + + bio = bio->bi_next; + iter->bvix = 0; + iter->skip = 0; + } + +done: + iter->bio = bio; + iter->bvix = i; + iter->skip = skip; + obj_req->bvec_count += need_bv; } -/* - * While su != os && sc == 1 is technically not fancy (it's the same - * layout as su == os && sc == 1), we can't use the nocopy path for it - * because ->set_pos_fn() should be called only once per object. - * ceph_file_to_extents() invokes action_fn once per stripe unit, so - * treat su != os && sc == 1 as fancy. - */ -static bool rbd_layout_is_fancy(struct ceph_file_layout *l) +static void copy_bio_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg) +{ + struct rbd_obj_request *obj_req = container_of(ex, struct rbd_obj_request, ex); + struct rbd_bio_iter *iter = arg; + struct ceph_databuf *dbuf = obj_req->dbuf; + const struct bio *bio; + unsigned int i; + size_t skip = iter->skip; + + /* Transcribe the pages to the databuf. 
*/ + for (bio = iter->bio; bio; bio = bio->bi_next) { + for (i = iter->bvix; i < bio->bi_vcnt; i++, skip = 0) { + const struct bio_vec *bv = bio->bi_io_vec + i; + size_t part = umin(bytes, bv->bv_len - skip); + + if (!part) + continue; + + ceph_databuf_append_page(dbuf, bv->bv_page, + bv->bv_offset + skip, + bv->bv_len - skip); + skip += part; + bytes -= part; + if (!bytes) + goto done; + } + + iter->bvix = 0; + iter->skip = 0; + } + +done: + iter->bio = bio; + iter->bvix = i; + iter->skip = skip; +} + +static int rbd_img_alloc_databufs(struct rbd_img_request *img_req) { - return l->stripe_unit != l->object_size; + struct rbd_obj_request *obj_req; + + for_each_obj_request(img_req, obj_req) { + if (img_req->op_type == OBJ_OP_READ) + obj_req->dbuf = ceph_databuf_reply_alloc(obj_req->bvec_count, 0, + GFP_NOIO); + else + obj_req->dbuf = ceph_databuf_req_alloc(obj_req->bvec_count, 0, + GFP_NOIO); + if (!obj_req->dbuf) + return -ENOMEM; + } + + return 0; } -static int rbd_img_fill_request_nocopy(struct rbd_img_request *img_req, - struct ceph_file_extent *img_extents, - u32 num_img_extents, - struct rbd_img_fill_ctx *fctx) +/* + * Map an image extent that is backed by a bio chain to a list of object + * extents, create the corresponding object requests (normally each to a + * different object, but not always) and add them to @img_req. For each object + * request, set up its data descriptor to point to a distilled list of page + * fragments. + * + * Because ceph_file_to_extents() will merge adjacent object extents together, + * each object request's data descriptor may point to multiple different chunks + * of the data buffer. + * + * The data buffer is assumed to be large enough. + */ +static int rbd_img_fill_from_bio(struct rbd_img_request *img_req, + u64 off, u64 len, struct bio *bio) { - u32 i; + struct rbd_bio_iter iter; + struct rbd_device *rbd_dev = img_req->rbd_dev; int ret; - img_req->data_type = fctx->pos_type; + /* + * Create object requests and determine ->bvec_count for each object + * request. Note that ->bvec_count sum over all object requests may + * be greater than the number of bio_vecs in the provided bio (list) + * or bio_vec array because when mapped, those bio_vecs can straddle + * stripe unit boundaries. + */ + rbd_start_bio_iteration(&iter, bio); + ret = ceph_file_to_extents(&rbd_dev->layout, off, len, + &img_req->object_extents, + alloc_object_extent, img_req, + count_bio_bvecs, &iter); + if (ret) + return ret; + + ret = rbd_img_alloc_databufs(img_req); + if (ret) + return ret; /* - * Create object requests and set each object request's starting - * position in the provided bio (list) or bio_vec array. + * Fill in each object request's databuf, splitting and rearranging the + * provided bio_vecs in stripe unit chunks as needed. 
*/ - fctx->iter = *fctx->pos; - for (i = 0; i < num_img_extents; i++) { - ret = ceph_file_to_extents(&img_req->rbd_dev->layout, - img_extents[i].fe_off, - img_extents[i].fe_len, - &img_req->object_extents, - alloc_object_extent, img_req, - fctx->set_pos_fn, &fctx->iter); - if (ret) - return ret; - } + rbd_start_bio_iteration(&iter, bio); + ret = ceph_iterate_extents(&rbd_dev->layout, off, len, + &img_req->object_extents, + copy_bio_bvecs, &iter); + if (ret) + return ret; return __rbd_img_fill_request(img_req); } +static void rbd_count_iter(struct ceph_object_extent *ex, u32 bytes, void *arg) +{ + struct rbd_obj_request *obj_req = container_of(ex, struct rbd_obj_request, ex); + struct iov_iter *iter = arg; + + obj_req->bvec_count += iov_iter_npages_cap(iter, INT_MAX, bytes); +} + +static size_t rbd_copy_iter_step(void *iter_base, size_t progress, size_t len, + void *priv, void *priv2) +{ + struct ceph_databuf *dbuf = priv; + struct page *page = virt_to_page(iter_base); + + ceph_databuf_append_page(dbuf, page, (unsigned long)iter_base & ~PAGE_MASK, len); + return 0; +} + +static void rbd_copy_iter(struct ceph_object_extent *ex, u32 bytes, void *arg) +{ + struct rbd_obj_request *obj_req = container_of(ex, struct rbd_obj_request, ex); + struct iov_iter *iter = arg; + + iterate_bvec(iter, bytes, obj_req->dbuf, NULL, rbd_copy_iter_step); +} + /* - * Map a list of image extents to a list of object extents, create the - * corresponding object requests (normally each to a different object, - * but not always) and add them to @img_req. For each object request, - * set up its data descriptor to point to the corresponding chunk(s) of - * @fctx->pos data buffer. + * Map a list of image extents to a list of object extents, creating the + * corresponding object requests (normally each to a different object, but not + * always) and add them to @img_req. For each object request, set up its data + * descriptor to point to the corresponding chunk(s) of the @dbuf data buffer. * * Because ceph_file_to_extents() will merge adjacent object extents * together, each object request's data descriptor may point to multiple - * different chunks of @fctx->pos data buffer. + * different chunks of the data buffer. * - * @fctx->pos data buffer is assumed to be large enough. + * The data buffer is assumed to be large enough. */ -static int rbd_img_fill_request(struct rbd_img_request *img_req, - struct ceph_file_extent *img_extents, - u32 num_img_extents, - struct rbd_img_fill_ctx *fctx) +static int rbd_img_fill_from_dbuf(struct rbd_img_request *img_req, + const struct ceph_file_extent *img_extents, + u32 num_img_extents, + const struct ceph_databuf *dbuf) { struct rbd_device *rbd_dev = img_req->rbd_dev; - struct rbd_obj_request *obj_req; - u32 i; + struct iov_iter iter; + unsigned int i; int ret; - if (fctx->pos_type == OBJ_REQUEST_NODATA || - !rbd_layout_is_fancy(&rbd_dev->layout)) - return rbd_img_fill_request_nocopy(img_req, img_extents, - num_img_extents, fctx); - - img_req->data_type = OBJ_REQUEST_OWN_BVECS; - /* * Create object requests and determine ->bvec_count for each object * request. Note that ->bvec_count sum over all object requests may @@ -2571,37 +2627,33 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req, * or bio_vec array because when mapped, those bio_vecs can straddle * stripe unit boundaries. 
*/ - fctx->iter = *fctx->pos; + iter = dbuf->iter; for (i = 0; i < num_img_extents; i++) { ret = ceph_file_to_extents(&rbd_dev->layout, img_extents[i].fe_off, img_extents[i].fe_len, &img_req->object_extents, alloc_object_extent, img_req, - fctx->count_fn, &fctx->iter); + rbd_count_iter, &iter); if (ret) return ret; } - for_each_obj_request(img_req, obj_req) { - obj_req->bvec_pos.bvecs = kmalloc_array(obj_req->bvec_count, - sizeof(*obj_req->bvec_pos.bvecs), - GFP_NOIO); - if (!obj_req->bvec_pos.bvecs) - return -ENOMEM; - } + ret = rbd_img_alloc_databufs(img_req); + if (ret) + return ret; /* - * Fill in each object request's private bio_vec array, splitting and - * rearranging the provided bio_vecs in stripe unit chunks as needed. + * Fill in each object request's databuf, splitting and rearranging the + * provided bio_vecs in stripe unit chunks as needed. */ - fctx->iter = *fctx->pos; + iter = dbuf->iter; for (i = 0; i < num_img_extents; i++) { ret = ceph_iterate_extents(&rbd_dev->layout, img_extents[i].fe_off, img_extents[i].fe_len, &img_req->object_extents, - fctx->copy_fn, &fctx->iter); + rbd_copy_iter, &iter); if (ret) return ret; } @@ -2609,148 +2661,6 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req, return __rbd_img_fill_request(img_req); } -static int rbd_img_fill_nodata(struct rbd_img_request *img_req, - u64 off, u64 len) -{ - struct ceph_file_extent ex = { off, len }; - union rbd_img_fill_iter dummy = {}; - struct rbd_img_fill_ctx fctx = { - .pos_type = OBJ_REQUEST_NODATA, - .pos = &dummy, - }; - - return rbd_img_fill_request(img_req, &ex, 1, &fctx); -} - -static void set_bio_pos(struct ceph_object_extent *ex, u32 bytes, void *arg) -{ - struct rbd_obj_request *obj_req = - container_of(ex, struct rbd_obj_request, ex); - struct ceph_bio_iter *it = arg; - - dout("%s objno %llu bytes %u\n", __func__, ex->oe_objno, bytes); - obj_req->bio_pos = *it; - ceph_bio_iter_advance(it, bytes); -} - -static void count_bio_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg) -{ - struct rbd_obj_request *obj_req = - container_of(ex, struct rbd_obj_request, ex); - struct ceph_bio_iter *it = arg; - - dout("%s objno %llu bytes %u\n", __func__, ex->oe_objno, bytes); - ceph_bio_iter_advance_step(it, bytes, ({ - obj_req->bvec_count++; - })); - -} - -static void copy_bio_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg) -{ - struct rbd_obj_request *obj_req = - container_of(ex, struct rbd_obj_request, ex); - struct ceph_bio_iter *it = arg; - - dout("%s objno %llu bytes %u\n", __func__, ex->oe_objno, bytes); - ceph_bio_iter_advance_step(it, bytes, ({ - obj_req->bvec_pos.bvecs[obj_req->bvec_idx++] = bv; - obj_req->bvec_pos.iter.bi_size += bv.bv_len; - })); -} - -static int __rbd_img_fill_from_bio(struct rbd_img_request *img_req, - struct ceph_file_extent *img_extents, - u32 num_img_extents, - struct ceph_bio_iter *bio_pos) -{ - struct rbd_img_fill_ctx fctx = { - .pos_type = OBJ_REQUEST_BIO, - .pos = (union rbd_img_fill_iter *)bio_pos, - .set_pos_fn = set_bio_pos, - .count_fn = count_bio_bvecs, - .copy_fn = copy_bio_bvecs, - }; - - return rbd_img_fill_request(img_req, img_extents, num_img_extents, - &fctx); -} - -static int rbd_img_fill_from_bio(struct rbd_img_request *img_req, - u64 off, u64 len, struct bio *bio) -{ - struct ceph_file_extent ex = { off, len }; - struct ceph_bio_iter it = { .bio = bio, .iter = bio->bi_iter }; - - return __rbd_img_fill_from_bio(img_req, &ex, 1, &it); -} - -static void set_bvec_pos(struct ceph_object_extent *ex, u32 bytes, void *arg) -{ - struct 
rbd_obj_request *obj_req = - container_of(ex, struct rbd_obj_request, ex); - struct ceph_bvec_iter *it = arg; - - obj_req->bvec_pos = *it; - ceph_bvec_iter_shorten(&obj_req->bvec_pos, bytes); - ceph_bvec_iter_advance(it, bytes); -} - -static void count_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg) -{ - struct rbd_obj_request *obj_req = - container_of(ex, struct rbd_obj_request, ex); - struct ceph_bvec_iter *it = arg; - - ceph_bvec_iter_advance_step(it, bytes, ({ - obj_req->bvec_count++; - })); -} - -static void copy_bvecs(struct ceph_object_extent *ex, u32 bytes, void *arg) -{ - struct rbd_obj_request *obj_req = - container_of(ex, struct rbd_obj_request, ex); - struct ceph_bvec_iter *it = arg; - - ceph_bvec_iter_advance_step(it, bytes, ({ - obj_req->bvec_pos.bvecs[obj_req->bvec_idx++] = bv; - obj_req->bvec_pos.iter.bi_size += bv.bv_len; - })); -} - -static int __rbd_img_fill_from_bvecs(struct rbd_img_request *img_req, - struct ceph_file_extent *img_extents, - u32 num_img_extents, - struct ceph_bvec_iter *bvec_pos) -{ - struct rbd_img_fill_ctx fctx = { - .pos_type = OBJ_REQUEST_BVECS, - .pos = (union rbd_img_fill_iter *)bvec_pos, - .set_pos_fn = set_bvec_pos, - .count_fn = count_bvecs, - .copy_fn = copy_bvecs, - }; - - return rbd_img_fill_request(img_req, img_extents, num_img_extents, - &fctx); -} - -static int rbd_img_fill_from_bvecs(struct rbd_img_request *img_req, - struct ceph_file_extent *img_extents, - u32 num_img_extents, - struct bio_vec *bvecs) -{ - struct ceph_bvec_iter it = { - .bvecs = bvecs, - .iter = { .bi_size = ceph_file_extents_bytes(img_extents, - num_img_extents) }, - }; - - return __rbd_img_fill_from_bvecs(img_req, img_extents, num_img_extents, - &it); -} - static void rbd_img_handle_request_work(struct work_struct *work) { struct rbd_img_request *img_req = @@ -2791,7 +2701,7 @@ static int rbd_obj_read_object(struct rbd_obj_request *obj_req) osd_req_op_extent_init(osd_req, 0, CEPH_OSD_OP_READ, obj_req->ex.oe_off, obj_req->ex.oe_len, 0, 0); - rbd_osd_setup_data(osd_req, 0); + osd_req_op_extent_osd_databuf(osd_req, 0, obj_req->dbuf); rbd_osd_format_read(osd_req); ret = ceph_osdc_alloc_messages(osd_req, GFP_NOIO); @@ -2802,7 +2712,13 @@ static int rbd_obj_read_object(struct rbd_obj_request *obj_req) return 0; } -static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req) +/* + * Redirect an I/O request to the parent device. Note that by the time we get + * here, the page list from the original bio chain has been decanted into a + * databuf struct that we can just take slices from. 
+ */ +static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req, + struct ceph_databuf *dbuf) { struct rbd_img_request *img_req = obj_req->img_request; struct rbd_device *parent = img_req->rbd_dev->parent; @@ -2824,30 +2740,10 @@ static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req) dout("%s child_img_req %p for obj_req %p\n", __func__, child_img_req, obj_req); - if (!rbd_img_is_write(img_req)) { - switch (img_req->data_type) { - case OBJ_REQUEST_BIO: - ret = __rbd_img_fill_from_bio(child_img_req, - obj_req->img_extents, - obj_req->num_img_extents, - &obj_req->bio_pos); - break; - case OBJ_REQUEST_BVECS: - case OBJ_REQUEST_OWN_BVECS: - ret = __rbd_img_fill_from_bvecs(child_img_req, - obj_req->img_extents, - obj_req->num_img_extents, - &obj_req->bvec_pos); - break; - default: - BUG(); - } - } else { - ret = rbd_img_fill_from_bvecs(child_img_req, - obj_req->img_extents, - obj_req->num_img_extents, - obj_req->copyup_bvecs); - } + ret = rbd_img_fill_from_dbuf(child_img_req, + obj_req->img_extents, + obj_req->num_img_extents, + dbuf); if (ret) { rbd_img_request_destroy(child_img_req); return ret; @@ -2890,7 +2786,8 @@ static bool rbd_obj_advance_read(struct rbd_obj_request *obj_req, int *result) return true; } if (obj_req->num_img_extents) { - ret = rbd_obj_read_from_parent(obj_req); + ret = rbd_obj_read_from_parent(obj_req, + obj_req->dbuf); if (ret) { *result = ret; return true; @@ -3004,23 +2901,6 @@ static int rbd_obj_write_object(struct rbd_obj_request *obj_req) return 0; } -/* - * copyup_bvecs pages are never highmem pages - */ -static bool is_zero_bvecs(struct bio_vec *bvecs, u32 bytes) -{ - struct ceph_bvec_iter it = { - .bvecs = bvecs, - .iter = { .bi_size = bytes }, - }; - - ceph_bvec_iter_advance_step(&it, bytes, ({ - if (memchr_inv(bvec_virt(&bv), 0, bv.bv_len)) - return false; - })); - return true; -} - #define MODS_ONLY U32_MAX static int rbd_obj_copyup_empty_snapc(struct rbd_obj_request *obj_req, @@ -3084,30 +2964,18 @@ static int rbd_obj_copyup_current_snapc(struct rbd_obj_request *obj_req, return 0; } -static int setup_copyup_bvecs(struct rbd_obj_request *obj_req, u64 obj_overlap) +static int setup_copyup_buf(struct rbd_obj_request *obj_req, u64 obj_overlap) { - u32 i; - - rbd_assert(!obj_req->copyup_bvecs); - obj_req->copyup_bvec_count = calc_pages_for(0, obj_overlap); - obj_req->copyup_bvecs = kcalloc(obj_req->copyup_bvec_count, - sizeof(*obj_req->copyup_bvecs), - GFP_NOIO); - if (!obj_req->copyup_bvecs) - return -ENOMEM; - - for (i = 0; i < obj_req->copyup_bvec_count; i++) { - unsigned int len = min(obj_overlap, (u64)PAGE_SIZE); - struct page *page = alloc_page(GFP_NOIO); + struct ceph_databuf *dbuf; - if (!page) - return -ENOMEM; + rbd_assert(!obj_req->copyup_buf); - bvec_set_page(&obj_req->copyup_bvecs[i], page, len, 0); - obj_overlap -= len; - } + dbuf = ceph_databuf_req_alloc(calc_pages_for(0, obj_overlap), + obj_overlap, GFP_NOIO); + if (!dbuf) + return -ENOMEM; - rbd_assert(!obj_overlap); + obj_req->copyup_buf = dbuf; return 0; } @@ -3134,11 +3002,11 @@ static int rbd_obj_copyup_read_parent(struct rbd_obj_request *obj_req) return rbd_obj_copyup_current_snapc(obj_req, MODS_ONLY); } - ret = setup_copyup_bvecs(obj_req, rbd_obj_img_extents_bytes(obj_req)); + ret = setup_copyup_buf(obj_req, rbd_obj_img_extents_bytes(obj_req)); if (ret) return ret; - return rbd_obj_read_from_parent(obj_req); + return rbd_obj_read_from_parent(obj_req, obj_req->copyup_buf); } static void rbd_obj_copyup_object_maps(struct rbd_obj_request *obj_req) @@ -3241,8 +3109,8 @@ 
static bool rbd_obj_advance_copyup(struct rbd_obj_request *obj_req, int *result) if (*result) return true; - if (is_zero_bvecs(obj_req->copyup_bvecs, - rbd_obj_img_extents_bytes(obj_req))) { + if (ceph_databuf_is_all_zero(obj_req->copyup_buf, + rbd_obj_img_extents_bytes(obj_req))) { dout("%s %p detected zeros\n", __func__, obj_req); obj_req->flags |= RBD_OBJ_FLAG_COPYUP_ZEROS; } diff --git a/include/linux/ceph/databuf.h b/include/linux/ceph/databuf.h index 14c7a6449467..54b76d0c91a0 100644 --- a/include/linux/ceph/databuf.h +++ b/include/linux/ceph/databuf.h @@ -5,6 +5,7 @@ #include #include #include +#include struct ceph_databuf { struct bio_vec *bvec; /* List of pages */ @@ -128,4 +129,25 @@ static inline void ceph_databuf_enc_stop(struct ceph_databuf *dbuf, void *p) BUG_ON(dbuf->iter.count > dbuf->limit); } +static __always_inline +size_t ceph_databuf_scan_for_nonzero(void *iter_from, size_t progress, + size_t len, void *priv, void *priv2) +{ + void *p; + + p = memchr_inv(iter_from, 0, len); + return p ? p - iter_from : 0; +} + +/* + * Scan a buffer to see if it contains only zeros. + */ +static inline bool ceph_databuf_is_all_zero(struct ceph_databuf *dbuf, size_t count) +{ + struct iov_iter iter_copy = dbuf->iter; + + return iterate_bvec(&iter_copy, count, NULL, NULL, + ceph_databuf_scan_for_nonzero) == count; +} + #endif /* __FS_CEPH_DATABUF_H */ diff --git a/include/linux/ceph/striper.h b/include/linux/ceph/striper.h index 3486636c0e6e..50bc1b88c5c4 100644 --- a/include/linux/ceph/striper.h +++ b/include/linux/ceph/striper.h @@ -4,6 +4,7 @@ #include #include +#include struct ceph_file_layout; @@ -39,10 +40,6 @@ int ceph_file_to_extents(struct ceph_file_layout *l, u64 off, u64 len, void *alloc_arg, ceph_object_extent_fn_t action_fn, void *action_arg); -int ceph_iterate_extents(struct ceph_file_layout *l, u64 off, u64 len, - struct list_head *object_extents, - ceph_object_extent_fn_t action_fn, - void *action_arg); struct ceph_file_extent { u64 fe_off; @@ -68,4 +65,57 @@ int ceph_extent_to_file(struct ceph_file_layout *l, u64 ceph_get_num_objects(struct ceph_file_layout *l, u64 size); +static __always_inline +struct ceph_object_extent *ceph_lookup_containing(struct list_head *object_extents, + u64 objno, u64 objoff, u32 xlen) +{ + struct ceph_object_extent *ex; + + list_for_each_entry(ex, object_extents, oe_item) { + if (ex->oe_objno == objno && + ex->oe_off <= objoff && + ex->oe_off + ex->oe_len >= objoff + xlen) /* paranoia */ + return ex; + + if (ex->oe_objno > objno) + break; + } + + return NULL; +} + +/* + * A stripped down, non-allocating version of ceph_file_to_extents(), + * for when @object_extents is already populated. 
+ */ +static __always_inline +int ceph_iterate_extents(struct ceph_file_layout *l, u64 off, u64 len, + struct list_head *object_extents, + ceph_object_extent_fn_t action_fn, + void *action_arg) +{ + while (len) { + struct ceph_object_extent *ex; + u64 objno, objoff; + u32 xlen; + + ceph_calc_file_object_mapping(l, off, len, &objno, &objoff, + &xlen); + + ex = ceph_lookup_containing(object_extents, objno, objoff, xlen); + if (!ex) { + WARN(1, "%s: objno %llu %llu~%u not found!\n", + __func__, objno, objoff, xlen); + return -EINVAL; + } + + action_fn(ex, xlen, action_arg); + + off += xlen; + len -= xlen; + } + + return 0; +} + #endif diff --git a/net/ceph/striper.c b/net/ceph/striper.c index 3b3fa75d1189..3dedbf018fa6 100644 --- a/net/ceph/striper.c +++ b/net/ceph/striper.c @@ -70,25 +70,6 @@ lookup_last(struct list_head *object_extents, u64 objno, return NULL; } -static struct ceph_object_extent * -lookup_containing(struct list_head *object_extents, u64 objno, - u64 objoff, u32 xlen) -{ - struct ceph_object_extent *ex; - - list_for_each_entry(ex, object_extents, oe_item) { - if (ex->oe_objno == objno && - ex->oe_off <= objoff && - ex->oe_off + ex->oe_len >= objoff + xlen) /* paranoia */ - return ex; - - if (ex->oe_objno > objno) - break; - } - - return NULL; -} - /* * Map a file extent to a sorted list of object extents. * @@ -167,40 +148,6 @@ int ceph_file_to_extents(struct ceph_file_layout *l, u64 off, u64 len, } EXPORT_SYMBOL(ceph_file_to_extents); -/* - * A stripped down, non-allocating version of ceph_file_to_extents(), - * for when @object_extents is already populated. - */ -int ceph_iterate_extents(struct ceph_file_layout *l, u64 off, u64 len, - struct list_head *object_extents, - ceph_object_extent_fn_t action_fn, - void *action_arg) -{ - while (len) { - struct ceph_object_extent *ex; - u64 objno, objoff; - u32 xlen; - - ceph_calc_file_object_mapping(l, off, len, &objno, &objoff, - &xlen); - - ex = lookup_containing(object_extents, objno, objoff, xlen); - if (!ex) { - WARN(1, "%s: objno %llu %llu~%u not found!\n", - __func__, objno, objoff, xlen); - return -EINVAL; - } - - action_fn(ex, xlen, action_arg); - - off += xlen; - len -= xlen; - } - - return 0; -} -EXPORT_SYMBOL(ceph_iterate_extents); - /* * Reverse map an object extent to a sorted list of file extents. 
 *

From patchwork Thu Mar 13 23:33:06 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016062
From: David Howells
To: Viacheslav Dubeyko , Alex Markuze
Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang ,
 ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 14/35] libceph: Remove bvec and bio data container types
Date: Thu, 13 Mar 2025 23:33:06 +0000
Message-ID: <20250313233341.1675324-15-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

The CEPH_MSG_DATA_BIO and CEPH_MSG_DATA_BVECS data types are now unused,
so remove them.

Signed-off-by: David Howells
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 include/linux/ceph/messenger.h  | 103 --------------------
 include/linux/ceph/osd_client.h |  31 ------
 net/ceph/messenger.c            | 166 --------------------------------
 net/ceph/osd_client.c           |  94 ------------------
 4 files changed, 394 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 1b646d0dff39..ff0aea6d2d31 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -120,108 +120,15 @@ enum ceph_msg_data_type {
 	CEPH_MSG_DATA_DATABUF,	/* data source/destination is a data buffer */
 	CEPH_MSG_DATA_PAGES,	/* data source/destination is a page array */
 	CEPH_MSG_DATA_PAGELIST,	/* data source/destination is a pagelist */
-#ifdef CONFIG_BLOCK
-	CEPH_MSG_DATA_BIO,	/* data source/destination is a bio list */
-#endif /* CONFIG_BLOCK */
-	CEPH_MSG_DATA_BVECS,	/* data source/destination is a bio_vec array */
 	CEPH_MSG_DATA_ITER,	/* data source/destination is an iov_iter */
 };
 
-#ifdef CONFIG_BLOCK
-
-struct ceph_bio_iter {
-	struct bio *bio;
-	struct bvec_iter iter;
-};
-
-#define __ceph_bio_iter_advance_step(it, n, STEP) do {			      \
-	unsigned int __n = (n), __cur_n;				      \
-									      \
-	while (__n) {							      \
-		BUG_ON(!(it)->iter.bi_size);				      \
-		__cur_n = min((it)->iter.bi_size, __n);			      \
-		(void)(STEP);						      \
-		bio_advance_iter((it)->bio, &(it)->iter, __cur_n);	      \
-		if (!(it)->iter.bi_size && (it)->bio->bi_next) {	      \
-			dout("__ceph_bio_iter_advance_step next bio\n");      \
-			(it)->bio = (it)->bio->bi_next;			      \
-			(it)->iter = (it)->bio->bi_iter;		      \
-		}							      \
-		__n -= __cur_n;						      \
-	}								      \
-} while (0)
-
-/*
- * Advance @it by @n bytes.
- */
-#define ceph_bio_iter_advance(it, n)					      \
-	__ceph_bio_iter_advance_step(it, n, 0)
-
-/*
- * Advance @it by @n bytes, executing BVEC_STEP for each bio_vec.
- */
-#define ceph_bio_iter_advance_step(it, n, BVEC_STEP)			      \
-	__ceph_bio_iter_advance_step(it, n, ({				      \
-		struct bio_vec bv;					      \
-		struct bvec_iter __cur_iter;				      \
-									      \
-		__cur_iter = (it)->iter;				      \
-		__cur_iter.bi_size = __cur_n;				      \
-		__bio_for_each_segment(bv, (it)->bio, __cur_iter, __cur_iter) \
-			(void)(BVEC_STEP);				      \
-	}))
-
-#endif /* CONFIG_BLOCK */
-
-struct ceph_bvec_iter {
-	struct bio_vec *bvecs;
-	struct bvec_iter iter;
-};
-
-#define __ceph_bvec_iter_advance_step(it, n, STEP) do {			      \
-	BUG_ON((n) > (it)->iter.bi_size);				      \
-	(void)(STEP);							      \
-	bvec_iter_advance((it)->bvecs, &(it)->iter, (n));		      \
-} while (0)
-
-/*
- * Advance @it by @n bytes.
- */
-#define ceph_bvec_iter_advance(it, n)					      \
-	__ceph_bvec_iter_advance_step(it, n, 0)
-
-/*
- * Advance @it by @n bytes, executing BVEC_STEP for each bio_vec.
- */ -#define ceph_bvec_iter_advance_step(it, n, BVEC_STEP) \ - __ceph_bvec_iter_advance_step(it, n, ({ \ - struct bio_vec bv; \ - struct bvec_iter __cur_iter; \ - \ - __cur_iter = (it)->iter; \ - __cur_iter.bi_size = (n); \ - for_each_bvec(bv, (it)->bvecs, __cur_iter, __cur_iter) \ - (void)(BVEC_STEP); \ - })) - -#define ceph_bvec_iter_shorten(it, n) do { \ - BUG_ON((n) > (it)->iter.bi_size); \ - (it)->iter.bi_size = (n); \ -} while (0) - struct ceph_msg_data { enum ceph_msg_data_type type; struct iov_iter iter; bool release_dbuf; union { struct ceph_databuf *dbuf; -#ifdef CONFIG_BLOCK - struct { - struct ceph_bio_iter bio_pos; - u32 bio_length; - }; -#endif /* CONFIG_BLOCK */ - struct ceph_bvec_iter bvec_pos; struct { struct page **pages; size_t length; /* total # bytes */ @@ -240,10 +147,6 @@ struct ceph_msg_data_cursor { int sr_resid; /* residual sparse_read len */ bool need_crc; /* crc update needed */ union { -#ifdef CONFIG_BLOCK - struct ceph_bio_iter bio_iter; -#endif /* CONFIG_BLOCK */ - struct bvec_iter bvec_iter; struct { /* pages */ unsigned int page_offset; /* offset in page */ unsigned short page_index; /* index in array */ @@ -610,12 +513,6 @@ void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages, size_t length, size_t offset, bool own_pages); extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg, struct ceph_pagelist *pagelist); -#ifdef CONFIG_BLOCK -void ceph_msg_data_add_bio(struct ceph_msg *msg, struct ceph_bio_iter *bio_pos, - u32 length); -#endif /* CONFIG_BLOCK */ -void ceph_msg_data_add_bvecs(struct ceph_msg *msg, - struct ceph_bvec_iter *bvec_pos); void ceph_msg_data_add_iter(struct ceph_msg *msg, struct iov_iter *iter); diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 5ac4c0b4dfcd..9182aa5075b2 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -107,10 +107,6 @@ enum ceph_osd_data_type { CEPH_OSD_DATA_TYPE_DATABUF, CEPH_OSD_DATA_TYPE_PAGES, CEPH_OSD_DATA_TYPE_PAGELIST, -#ifdef CONFIG_BLOCK - CEPH_OSD_DATA_TYPE_BIO, -#endif /* CONFIG_BLOCK */ - CEPH_OSD_DATA_TYPE_BVECS, CEPH_OSD_DATA_TYPE_ITER, }; @@ -127,16 +123,6 @@ struct ceph_osd_data { bool own_pages; }; struct ceph_pagelist *pagelist; -#ifdef CONFIG_BLOCK - struct { - struct ceph_bio_iter bio_pos; - u32 bio_length; - }; -#endif /* CONFIG_BLOCK */ - struct { - struct ceph_bvec_iter bvec_pos; - u32 num_bvecs; - }; }; }; @@ -499,19 +485,6 @@ extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *, extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *, unsigned int which, struct ceph_pagelist *pagelist); -#ifdef CONFIG_BLOCK -void osd_req_op_extent_osd_data_bio(struct ceph_osd_request *osd_req, - unsigned int which, - struct ceph_bio_iter *bio_pos, - u32 bio_length); -#endif /* CONFIG_BLOCK */ -void osd_req_op_extent_osd_data_bvecs(struct ceph_osd_request *osd_req, - unsigned int which, - struct bio_vec *bvecs, u32 num_bvecs, - u32 bytes); -void osd_req_op_extent_osd_data_bvec_pos(struct ceph_osd_request *osd_req, - unsigned int which, - struct ceph_bvec_iter *bvec_pos); void osd_req_op_extent_osd_iter(struct ceph_osd_request *osd_req, unsigned int which, struct iov_iter *iter); @@ -521,10 +494,6 @@ void osd_req_op_cls_request_databuf(struct ceph_osd_request *req, extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *, unsigned int which, struct ceph_pagelist *pagelist); -void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req, - unsigned int which, - 
struct bio_vec *bvecs, u32 num_bvecs, - u32 bytes); void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req, unsigned int which, struct ceph_databuf *dbuf); diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index dc8082575e4f..cb66a768bd7c 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -12,9 +12,6 @@ #include #include #include -#ifdef CONFIG_BLOCK -#include -#endif /* CONFIG_BLOCK */ #include #include #include @@ -714,116 +711,6 @@ void ceph_con_discard_requeued(struct ceph_connection *con, u64 reconnect_seq) } } -#ifdef CONFIG_BLOCK - -/* - * For a bio data item, a piece is whatever remains of the next - * entry in the current bio iovec, or the first entry in the next - * bio in the list. - */ -static void ceph_msg_data_bio_cursor_init(struct ceph_msg_data_cursor *cursor, - size_t length) -{ - struct ceph_msg_data *data = cursor->data; - struct ceph_bio_iter *it = &cursor->bio_iter; - - cursor->resid = min_t(size_t, length, data->bio_length); - *it = data->bio_pos; - if (cursor->resid < it->iter.bi_size) - it->iter.bi_size = cursor->resid; - - BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter)); -} - -static struct page *ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor, - size_t *page_offset, - size_t *length) -{ - struct bio_vec bv = bio_iter_iovec(cursor->bio_iter.bio, - cursor->bio_iter.iter); - - *page_offset = bv.bv_offset; - *length = bv.bv_len; - return bv.bv_page; -} - -static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor, - size_t bytes) -{ - struct ceph_bio_iter *it = &cursor->bio_iter; - struct page *page = bio_iter_page(it->bio, it->iter); - - BUG_ON(bytes > cursor->resid); - BUG_ON(bytes > bio_iter_len(it->bio, it->iter)); - cursor->resid -= bytes; - bio_advance_iter(it->bio, &it->iter, bytes); - - if (!cursor->resid) - return false; /* no more data */ - - if (!bytes || (it->iter.bi_size && it->iter.bi_bvec_done && - page == bio_iter_page(it->bio, it->iter))) - return false; /* more bytes to process in this segment */ - - if (!it->iter.bi_size) { - it->bio = it->bio->bi_next; - it->iter = it->bio->bi_iter; - if (cursor->resid < it->iter.bi_size) - it->iter.bi_size = cursor->resid; - } - - BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter)); - return true; -} -#endif /* CONFIG_BLOCK */ - -static void ceph_msg_data_bvecs_cursor_init(struct ceph_msg_data_cursor *cursor, - size_t length) -{ - struct ceph_msg_data *data = cursor->data; - struct bio_vec *bvecs = data->bvec_pos.bvecs; - - cursor->resid = min_t(size_t, length, data->bvec_pos.iter.bi_size); - cursor->bvec_iter = data->bvec_pos.iter; - cursor->bvec_iter.bi_size = cursor->resid; - - BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter)); -} - -static struct page *ceph_msg_data_bvecs_next(struct ceph_msg_data_cursor *cursor, - size_t *page_offset, - size_t *length) -{ - struct bio_vec bv = bvec_iter_bvec(cursor->data->bvec_pos.bvecs, - cursor->bvec_iter); - - *page_offset = bv.bv_offset; - *length = bv.bv_len; - return bv.bv_page; -} - -static bool ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor, - size_t bytes) -{ - struct bio_vec *bvecs = cursor->data->bvec_pos.bvecs; - struct page *page = bvec_iter_page(bvecs, cursor->bvec_iter); - - BUG_ON(bytes > cursor->resid); - BUG_ON(bytes > bvec_iter_len(bvecs, cursor->bvec_iter)); - cursor->resid -= bytes; - bvec_iter_advance(bvecs, &cursor->bvec_iter, bytes); - - if (!cursor->resid) - return false; /* no more data */ - - if (!bytes || (cursor->bvec_iter.bi_bvec_done && - page 
== bvec_iter_page(bvecs, cursor->bvec_iter))) - return false; /* more bytes to process in this segment */ - - BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter)); - return true; -} - /* * For a page array, a piece comes from the first page in the array * that has not already been fully consumed. @@ -1045,14 +932,6 @@ static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor) case CEPH_MSG_DATA_PAGES: ceph_msg_data_pages_cursor_init(cursor, length); break; -#ifdef CONFIG_BLOCK - case CEPH_MSG_DATA_BIO: - ceph_msg_data_bio_cursor_init(cursor, length); - break; -#endif /* CONFIG_BLOCK */ - case CEPH_MSG_DATA_BVECS: - ceph_msg_data_bvecs_cursor_init(cursor, length); - break; case CEPH_MSG_DATA_DATABUF: case CEPH_MSG_DATA_ITER: ceph_msg_data_iter_cursor_init(cursor, length); @@ -1096,14 +975,6 @@ struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor, case CEPH_MSG_DATA_PAGES: page = ceph_msg_data_pages_next(cursor, page_offset, length); break; -#ifdef CONFIG_BLOCK - case CEPH_MSG_DATA_BIO: - page = ceph_msg_data_bio_next(cursor, page_offset, length); - break; -#endif /* CONFIG_BLOCK */ - case CEPH_MSG_DATA_BVECS: - page = ceph_msg_data_bvecs_next(cursor, page_offset, length); - break; case CEPH_MSG_DATA_DATABUF: case CEPH_MSG_DATA_ITER: page = ceph_msg_data_iter_next(cursor, page_offset, length); @@ -1138,14 +1009,6 @@ void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor, size_t bytes) case CEPH_MSG_DATA_PAGES: new_piece = ceph_msg_data_pages_advance(cursor, bytes); break; -#ifdef CONFIG_BLOCK - case CEPH_MSG_DATA_BIO: - new_piece = ceph_msg_data_bio_advance(cursor, bytes); - break; -#endif /* CONFIG_BLOCK */ - case CEPH_MSG_DATA_BVECS: - new_piece = ceph_msg_data_bvecs_advance(cursor, bytes); - break; case CEPH_MSG_DATA_DATABUF: case CEPH_MSG_DATA_ITER: new_piece = ceph_msg_data_iter_advance(cursor, bytes); @@ -1938,35 +1801,6 @@ void ceph_msg_data_add_pagelist(struct ceph_msg *msg, } EXPORT_SYMBOL(ceph_msg_data_add_pagelist); -#ifdef CONFIG_BLOCK -void ceph_msg_data_add_bio(struct ceph_msg *msg, struct ceph_bio_iter *bio_pos, - u32 length) -{ - struct ceph_msg_data *data; - - data = ceph_msg_data_add(msg); - data->type = CEPH_MSG_DATA_BIO; - data->bio_pos = *bio_pos; - data->bio_length = length; - - msg->data_length += length; -} -EXPORT_SYMBOL(ceph_msg_data_add_bio); -#endif /* CONFIG_BLOCK */ - -void ceph_msg_data_add_bvecs(struct ceph_msg *msg, - struct ceph_bvec_iter *bvec_pos) -{ - struct ceph_msg_data *data; - - data = ceph_msg_data_add(msg); - data->type = CEPH_MSG_DATA_BVECS; - data->bvec_pos = *bvec_pos; - - msg->data_length += bvec_pos->iter.bi_size; -} -EXPORT_SYMBOL(ceph_msg_data_add_bvecs); - void ceph_msg_data_add_iter(struct ceph_msg *msg, struct iov_iter *iter) { diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index fc5c136e793d..10f65e9b1906 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -9,9 +9,6 @@ #include #include #include -#ifdef CONFIG_BLOCK -#include -#endif #include #include @@ -151,26 +148,6 @@ static void ceph_osd_data_pagelist_init(struct ceph_osd_data *osd_data, osd_data->pagelist = pagelist; } -#ifdef CONFIG_BLOCK -static void ceph_osd_data_bio_init(struct ceph_osd_data *osd_data, - struct ceph_bio_iter *bio_pos, - u32 bio_length) -{ - osd_data->type = CEPH_OSD_DATA_TYPE_BIO; - osd_data->bio_pos = *bio_pos; - osd_data->bio_length = bio_length; -} -#endif /* CONFIG_BLOCK */ - -static void ceph_osd_data_bvecs_init(struct ceph_osd_data *osd_data, - struct ceph_bvec_iter *bvec_pos, - 
u32 num_bvecs) -{ - osd_data->type = CEPH_OSD_DATA_TYPE_BVECS; - osd_data->bvec_pos = *bvec_pos; - osd_data->num_bvecs = num_bvecs; -} - static void ceph_osd_iter_init(struct ceph_osd_data *osd_data, struct iov_iter *iter) { @@ -252,47 +229,6 @@ void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *osd_req, } EXPORT_SYMBOL(osd_req_op_extent_osd_data_pagelist); -#ifdef CONFIG_BLOCK -void osd_req_op_extent_osd_data_bio(struct ceph_osd_request *osd_req, - unsigned int which, - struct ceph_bio_iter *bio_pos, - u32 bio_length) -{ - struct ceph_osd_data *osd_data; - - osd_data = osd_req_op_data(osd_req, which, extent, osd_data); - ceph_osd_data_bio_init(osd_data, bio_pos, bio_length); -} -EXPORT_SYMBOL(osd_req_op_extent_osd_data_bio); -#endif /* CONFIG_BLOCK */ - -void osd_req_op_extent_osd_data_bvecs(struct ceph_osd_request *osd_req, - unsigned int which, - struct bio_vec *bvecs, u32 num_bvecs, - u32 bytes) -{ - struct ceph_osd_data *osd_data; - struct ceph_bvec_iter it = { - .bvecs = bvecs, - .iter = { .bi_size = bytes }, - }; - - osd_data = osd_req_op_data(osd_req, which, extent, osd_data); - ceph_osd_data_bvecs_init(osd_data, &it, num_bvecs); -} -EXPORT_SYMBOL(osd_req_op_extent_osd_data_bvecs); - -void osd_req_op_extent_osd_data_bvec_pos(struct ceph_osd_request *osd_req, - unsigned int which, - struct ceph_bvec_iter *bvec_pos) -{ - struct ceph_osd_data *osd_data; - - osd_data = osd_req_op_data(osd_req, which, extent, osd_data); - ceph_osd_data_bvecs_init(osd_data, bvec_pos, 0); -} -EXPORT_SYMBOL(osd_req_op_extent_osd_data_bvec_pos); - /** * osd_req_op_extent_osd_iter - Set up an operation with an iterator buffer * @osd_req: The request to set up @@ -360,24 +296,6 @@ static void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req, osd_req->r_ops[which].indata_len += length; } -void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req, - unsigned int which, - struct bio_vec *bvecs, u32 num_bvecs, - u32 bytes) -{ - struct ceph_osd_data *osd_data; - struct ceph_bvec_iter it = { - .bvecs = bvecs, - .iter = { .bi_size = bytes }, - }; - - osd_data = osd_req_op_data(osd_req, which, cls, request_data); - ceph_osd_data_bvecs_init(osd_data, &it, num_bvecs); - osd_req->r_ops[which].cls.indata_len += bytes; - osd_req->r_ops[which].indata_len += bytes; -} -EXPORT_SYMBOL(osd_req_op_cls_request_data_bvecs); - void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req, unsigned int which, struct ceph_databuf *dbuf) @@ -402,12 +320,6 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data) return osd_data->length; case CEPH_OSD_DATA_TYPE_PAGELIST: return (u64)osd_data->pagelist->length; -#ifdef CONFIG_BLOCK - case CEPH_OSD_DATA_TYPE_BIO: - return (u64)osd_data->bio_length; -#endif /* CONFIG_BLOCK */ - case CEPH_OSD_DATA_TYPE_BVECS: - return osd_data->bvec_pos.iter.bi_size; case CEPH_OSD_DATA_TYPE_ITER: return iov_iter_count(&osd_data->iter); default: @@ -1017,12 +929,6 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg, } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) { BUG_ON(!length); ceph_msg_data_add_pagelist(msg, osd_data->pagelist); -#ifdef CONFIG_BLOCK - } else if (osd_data->type == CEPH_OSD_DATA_TYPE_BIO) { - ceph_msg_data_add_bio(msg, &osd_data->bio_pos, length); -#endif - } else if (osd_data->type == CEPH_OSD_DATA_TYPE_BVECS) { - ceph_msg_data_add_bvecs(msg, &osd_data->bvec_pos); } else if (osd_data->type == CEPH_OSD_DATA_TYPE_ITER) { ceph_msg_data_add_iter(msg, &osd_data->iter); } else { From patchwork Thu Mar 13 
23:33:07 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016063 From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang ,
ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 15/35] libceph: Make osd_req_op_cls_init() use a ceph_databuf and map it Date: Thu, 13 Mar 2025 23:33:07 +0000 Message-ID: <20250313233341.1675324-16-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List: ceph-devel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Make osd_req_op_cls_init() use a ceph_databuf to hold the request_info data and then map it and write directly into it rather than using the databuf encode functions. Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- net/ceph/osd_client.c | 55 +++++++++++++++++-------------------------- 1 file changed, 22 insertions(+), 33 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 10f65e9b1906..405ccf7e7a91 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -245,14 +245,14 @@ void osd_req_op_extent_osd_iter(struct ceph_osd_request *osd_req, } EXPORT_SYMBOL(osd_req_op_extent_osd_iter); -static void osd_req_op_cls_request_info_pagelist( - struct ceph_osd_request *osd_req, - unsigned int which, struct ceph_pagelist *pagelist) +static void osd_req_op_cls_request_info_databuf(struct ceph_osd_request *osd_req, + unsigned int which, + struct ceph_databuf *dbuf) { struct ceph_osd_data *osd_data; osd_data = osd_req_op_data(osd_req, which, cls, request_info); - ceph_osd_data_pagelist_init(osd_data, pagelist); + ceph_osd_databuf_init(osd_data, dbuf); } void osd_req_op_cls_request_databuf(struct ceph_osd_request *osd_req, @@ -778,42 +778,31 @@ int osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which, const char *class, const char *method) { struct ceph_osd_req_op *op; - struct ceph_pagelist *pagelist; - size_t payload_len = 0; - size_t size; - int ret; + struct ceph_databuf *dbuf; + size_t csize = strlen(class), msize = strlen(method); + void *p; + + BUG_ON(csize > (size_t) U8_MAX); + BUG_ON(msize > (size_t) U8_MAX); op = osd_req_op_init(osd_req, which, CEPH_OSD_OP_CALL, 0); + op->cls.class_name = class; + op->cls.class_len = csize; + op->cls.method_name = method; + op->cls.method_len = msize; - pagelist = ceph_pagelist_alloc(GFP_NOFS); - if (!pagelist) + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOFS); + if (!dbuf) return -ENOMEM; - op->cls.class_name = class; - size = strlen(class); - BUG_ON(size > (size_t) U8_MAX); - op->cls.class_len = size; - ret = ceph_pagelist_append(pagelist, class, size); - if (ret) - goto err_pagelist_free; - payload_len += size; - - op->cls.method_name = method; - size = strlen(method); - BUG_ON(size > (size_t) U8_MAX); - op->cls.method_len = size; - ret = ceph_pagelist_append(pagelist, method, size); - if (ret) - goto err_pagelist_free; - payload_len += size; + p = ceph_databuf_enc_start(dbuf); + ceph_encode_copy(&p, class, csize); + ceph_encode_copy(&p, method, msize); + ceph_databuf_enc_stop(dbuf, p); - osd_req_op_cls_request_info_pagelist(osd_req, which, pagelist); - op->indata_len = payload_len; + osd_req_op_cls_request_info_databuf(osd_req, which, dbuf); + op->indata_len = ceph_databuf_len(dbuf); return 0; - -err_pagelist_free: - ceph_pagelist_release(pagelist); - return ret; } EXPORT_SYMBOL(osd_req_op_cls_init); From 
patchwork Thu Mar 13 23:33:08 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016065 From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton
, Dongsheng Yang , ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 16/35] libceph: Convert req_page of ceph_osdc_call() to ceph_databuf Date: Thu, 13 Mar 2025 23:33:08 +0000 Message-ID: <20250313233341.1675324-17-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List: ceph-devel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111 Convert the request data (req_page) of ceph_osdc_call() to ceph_databuf. Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- drivers/block/rbd.c | 53 +++++++++++----------- include/linux/ceph/osd_client.h | 2 +- net/ceph/cls_lock_client.c | 78 ++++++++++++++++++--------------- net/ceph/osd_client.c | 25 ++--------- 4 files changed, 74 insertions(+), 84 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index dd22cea7ae89..ec09d578b0b0 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1789,7 +1789,7 @@ static int __rbd_object_map_load(struct rbd_device *rbd_dev) rbd_object_map_name(rbd_dev, rbd_dev->spec->snap_id, &oid); ret = ceph_osdc_call(osdc, &oid, &rbd_dev->header_oloc, "rbd", "object_map_load", CEPH_OSD_FLAG_READ, - NULL, 0, reply); + NULL, reply); if (ret) goto out; @@ -4553,8 +4553,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev, size_t inbound_size) { struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; - struct ceph_databuf *reply; - struct page *req_page = NULL; + struct ceph_databuf *request = NULL, *reply; + void *p; int ret; /* @@ -4568,32 +4568,33 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev, if (outbound_size > PAGE_SIZE) return -E2BIG; - req_page = alloc_page(GFP_KERNEL); - if (!req_page) + request = ceph_databuf_req_alloc(0, outbound_size, GFP_KERNEL); + if (!request) return -ENOMEM; - memcpy(page_address(req_page), outbound, outbound_size); + p = kmap_ceph_databuf_page(request, 0); + memcpy(p, outbound, outbound_size); + kunmap_local(p); + ceph_databuf_added_data(request, outbound_size); } reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL); if (!reply) { - if (req_page) - __free_page(req_page); - return -ENOMEM; + ret = -ENOMEM; + goto out; } ret = ceph_osdc_call(osdc, oid, oloc, RBD_DRV_NAME, method_name, - CEPH_OSD_FLAG_READ, req_page, outbound_size, - reply); + CEPH_OSD_FLAG_READ, request, reply); if (!ret) { ret = ceph_databuf_len(reply); if (copy_from_iter(inbound, ret, &reply->iter) != ret) ret = -EIO; } - if (req_page) - __free_page(req_page); ceph_databuf_release(reply); +out: + ceph_databuf_release(request); return ret; } @@ -5513,7 +5514,7 @@ static int decode_parent_image_spec(void **p, void *end, } static int __get_parent_info(struct rbd_device *rbd_dev, - struct page *req_page, + struct ceph_databuf *request, struct ceph_databuf *reply, struct parent_image_info *pii) { @@ -5524,7 +5525,7 @@ static int __get_parent_info(struct rbd_device *rbd_dev, ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, "rbd", "parent_get", CEPH_OSD_FLAG_READ, - req_page, sizeof(u64), reply); + request, reply); if (ret) return ret == -EOPNOTSUPP ? 
1 : ret; @@ -5539,7 +5540,7 @@ static int __get_parent_info(struct rbd_device *rbd_dev, ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, "rbd", "parent_overlap_get", CEPH_OSD_FLAG_READ, - req_page, sizeof(u64), reply); + request, reply); if (ret) return ret; @@ -5563,7 +5564,7 @@ static int __get_parent_info(struct rbd_device *rbd_dev, * The caller is responsible for @pii. */ static int __get_parent_info_legacy(struct rbd_device *rbd_dev, - struct page *req_page, + struct ceph_databuf *request, struct ceph_databuf *reply, struct parent_image_info *pii) { @@ -5573,7 +5574,7 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev, ret = ceph_osdc_call(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, "rbd", "get_parent", CEPH_OSD_FLAG_READ, - req_page, sizeof(u64), reply); + request, reply); if (ret) return ret; @@ -5604,29 +5605,29 @@ static int __get_parent_info_legacy(struct rbd_device *rbd_dev, static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev, struct parent_image_info *pii) { - struct ceph_databuf *reply; - struct page *req_page; + struct ceph_databuf *request, *reply; void *p; int ret = -ENOMEM; - req_page = alloc_page(GFP_KERNEL); - if (!req_page) + request = ceph_databuf_req_alloc(0, sizeof(__le64), GFP_KERNEL); + if (!request) goto out; reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_KERNEL); if (!reply) goto out_free; - p = kmap_local_page(req_page); + p = kmap_ceph_databuf_page(request, 0); ceph_encode_64(&p, rbd_dev->spec->snap_id); kunmap_local(p); - ret = __get_parent_info(rbd_dev, req_page, reply, pii); + ceph_databuf_added_data(request, sizeof(__le64)); + ret = __get_parent_info(rbd_dev, request, reply, pii); if (ret > 0) - ret = __get_parent_info_legacy(rbd_dev, req_page, reply, pii); + ret = __get_parent_info_legacy(rbd_dev, request, reply, pii); ceph_databuf_release(reply); out_free: - __free_page(req_page); + ceph_databuf_release(request); out: return ret; } diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 9182aa5075b2..d31e59bd128c 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -568,7 +568,7 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, struct ceph_object_locator *oloc, const char *class, const char *method, unsigned int flags, - struct page *req_page, size_t req_len, + struct ceph_databuf *request, struct ceph_databuf *response); /* watch/notify */ diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c index 37bb8708e8bb..6c8608aabe5f 100644 --- a/net/ceph/cls_lock_client.c +++ b/net/ceph/cls_lock_client.c @@ -34,7 +34,7 @@ int ceph_cls_lock(struct ceph_osd_client *osdc, int tag_len = strlen(tag); int desc_len = strlen(desc); void *p, *end; - struct page *lock_op_page; + struct ceph_databuf *lock_op_req; struct timespec64 mtime; int ret; @@ -49,11 +49,11 @@ int ceph_cls_lock(struct ceph_osd_client *osdc, if (lock_op_buf_size > PAGE_SIZE) return -E2BIG; - lock_op_page = alloc_page(GFP_NOIO); - if (!lock_op_page) + lock_op_req = ceph_databuf_req_alloc(0, lock_op_buf_size, GFP_NOIO); + if (!lock_op_req) return -ENOMEM; - p = page_address(lock_op_page); + p = kmap_ceph_databuf_page(lock_op_req, 0); end = p + lock_op_buf_size; /* encode cls_lock_lock_op struct */ @@ -69,15 +69,16 @@ int ceph_cls_lock(struct ceph_osd_client *osdc, ceph_encode_timespec64(p, &mtime); p += sizeof(struct ceph_timespec); ceph_encode_8(&p, flags); + kunmap_local(p); + ceph_databuf_added_data(lock_op_req, lock_op_buf_size); dout("%s lock_name %s type %d 
cookie %s tag %s desc %s flags 0x%x\n", __func__, lock_name, type, cookie, tag, desc, flags); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "lock", - CEPH_OSD_FLAG_WRITE, lock_op_page, - lock_op_buf_size, NULL); + CEPH_OSD_FLAG_WRITE, lock_op_req, NULL); dout("%s: status %d\n", __func__, ret); - __free_page(lock_op_page); + ceph_databuf_release(lock_op_req); return ret; } EXPORT_SYMBOL(ceph_cls_lock); @@ -99,7 +100,7 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc, int name_len = strlen(lock_name); int cookie_len = strlen(cookie); void *p, *end; - struct page *unlock_op_page; + struct ceph_databuf *unlock_op_req; int ret; unlock_op_buf_size = name_len + sizeof(__le32) + @@ -108,11 +109,11 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc, if (unlock_op_buf_size > PAGE_SIZE) return -E2BIG; - unlock_op_page = alloc_page(GFP_NOIO); - if (!unlock_op_page) + unlock_op_req = ceph_databuf_req_alloc(0, unlock_op_buf_size, GFP_NOIO); + if (!unlock_op_req) return -ENOMEM; - p = page_address(unlock_op_page); + p = kmap_ceph_databuf_page(unlock_op_req, 0); end = p + unlock_op_buf_size; /* encode cls_lock_unlock_op struct */ @@ -120,14 +121,15 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc, unlock_op_buf_size - CEPH_ENCODING_START_BLK_LEN); ceph_encode_string(&p, end, lock_name, name_len); ceph_encode_string(&p, end, cookie, cookie_len); + kunmap_local(p); + ceph_databuf_added_data(unlock_op_req, unlock_op_buf_size); dout("%s lock_name %s cookie %s\n", __func__, lock_name, cookie); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock", - CEPH_OSD_FLAG_WRITE, unlock_op_page, - unlock_op_buf_size, NULL); + CEPH_OSD_FLAG_WRITE, unlock_op_req, NULL); dout("%s: status %d\n", __func__, ret); - __free_page(unlock_op_page); + ceph_databuf_release(unlock_op_req); return ret; } EXPORT_SYMBOL(ceph_cls_unlock); @@ -150,7 +152,7 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc, int break_op_buf_size; int name_len = strlen(lock_name); int cookie_len = strlen(cookie); - struct page *break_op_page; + struct ceph_databuf *break_op_req; void *p, *end; int ret; @@ -161,11 +163,11 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc, if (break_op_buf_size > PAGE_SIZE) return -E2BIG; - break_op_page = alloc_page(GFP_NOIO); - if (!break_op_page) + break_op_req = ceph_databuf_req_alloc(0, break_op_buf_size, GFP_NOIO); + if (!break_op_req) return -ENOMEM; - p = page_address(break_op_page); + p = kmap_ceph_databuf_page(break_op_req, 0); end = p + break_op_buf_size; /* encode cls_lock_break_op struct */ @@ -174,15 +176,16 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc, ceph_encode_string(&p, end, lock_name, name_len); ceph_encode_copy(&p, locker, sizeof(*locker)); ceph_encode_string(&p, end, cookie, cookie_len); + kunmap_local(p); + ceph_databuf_added_data(break_op_req, break_op_buf_size); dout("%s lock_name %s cookie %s locker %s%llu\n", __func__, lock_name, cookie, ENTITY_NAME(*locker)); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "break_lock", - CEPH_OSD_FLAG_WRITE, break_op_page, - break_op_buf_size, NULL); + CEPH_OSD_FLAG_WRITE, break_op_req, NULL); dout("%s: status %d\n", __func__, ret); - __free_page(break_op_page); + ceph_databuf_release(break_op_req); return ret; } EXPORT_SYMBOL(ceph_cls_break_lock); @@ -199,7 +202,7 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc, int tag_len = strlen(tag); int new_cookie_len = strlen(new_cookie); void *p, *end; - struct page *cookie_op_page; + struct ceph_databuf *cookie_op_req; int ret; cookie_op_buf_size = name_len + sizeof(__le32) + @@ 
-210,11 +213,11 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc, if (cookie_op_buf_size > PAGE_SIZE) return -E2BIG; - cookie_op_page = alloc_page(GFP_NOIO); - if (!cookie_op_page) + cookie_op_req = ceph_databuf_req_alloc(0, cookie_op_buf_size, GFP_NOIO); + if (!cookie_op_req) return -ENOMEM; - p = page_address(cookie_op_page); + p = kmap_ceph_databuf_page(cookie_op_req, 0); end = p + cookie_op_buf_size; /* encode cls_lock_set_cookie_op struct */ @@ -225,15 +228,16 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc, ceph_encode_string(&p, end, old_cookie, old_cookie_len); ceph_encode_string(&p, end, tag, tag_len); ceph_encode_string(&p, end, new_cookie, new_cookie_len); + kunmap_local(p); + ceph_databuf_added_data(cookie_op_req, cookie_op_buf_size); dout("%s lock_name %s type %d old_cookie %s tag %s new_cookie %s\n", __func__, lock_name, type, old_cookie, tag, new_cookie); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "set_cookie", - CEPH_OSD_FLAG_WRITE, cookie_op_page, - cookie_op_buf_size, NULL); + CEPH_OSD_FLAG_WRITE, cookie_op_req, NULL); dout("%s: status %d\n", __func__, ret); - __free_page(cookie_op_page); + ceph_databuf_release(cookie_op_req); return ret; } EXPORT_SYMBOL(ceph_cls_set_cookie); @@ -340,7 +344,7 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, struct ceph_databuf *reply; int get_info_op_buf_size; int name_len = strlen(lock_name); - struct page *get_info_op_page; + struct ceph_databuf *get_info_op_req; void *p, *end; int ret; @@ -349,28 +353,30 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, if (get_info_op_buf_size > PAGE_SIZE) return -E2BIG; - get_info_op_page = alloc_page(GFP_NOIO); - if (!get_info_op_page) + get_info_op_req = ceph_databuf_req_alloc(0, get_info_op_buf_size, + GFP_NOIO); + if (!get_info_op_req) return -ENOMEM; reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO); if (!reply) { - __free_page(get_info_op_page); + ceph_databuf_release(get_info_op_req); return -ENOMEM; } - p = page_address(get_info_op_page); + p = kmap_ceph_databuf_page(get_info_op_req, 0); end = p + get_info_op_buf_size; /* encode cls_lock_get_info_op struct */ ceph_start_encoding(&p, 1, 1, get_info_op_buf_size - CEPH_ENCODING_START_BLK_LEN); ceph_encode_string(&p, end, lock_name, name_len); + kunmap_local(p); + ceph_databuf_added_data(get_info_op_req, get_info_op_buf_size); dout("%s lock_name %s\n", __func__, lock_name); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info", - CEPH_OSD_FLAG_READ, get_info_op_page, - get_info_op_buf_size, reply); + CEPH_OSD_FLAG_READ, get_info_op_req, reply); dout("%s: status %d\n", __func__, ret); if (ret >= 0) { @@ -381,8 +387,8 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, kunmap_local(p); } - __free_page(get_info_op_page); ceph_databuf_release(reply); + ceph_databuf_release(get_info_op_req); return ret; } EXPORT_SYMBOL(ceph_cls_lock_info); diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 405ccf7e7a91..c4525feb8e26 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -264,7 +264,7 @@ void osd_req_op_cls_request_databuf(struct ceph_osd_request *osd_req, BUG_ON(!ceph_databuf_len(dbuf)); osd_data = osd_req_op_data(osd_req, which, cls, request_data); - ceph_osd_databuf_init(osd_data, dbuf); + ceph_osd_databuf_init(osd_data, ceph_databuf_get(dbuf)); osd_req->r_ops[which].cls.indata_len += ceph_databuf_len(dbuf); osd_req->r_ops[which].indata_len += ceph_databuf_len(dbuf); } @@ -283,19 +283,6 @@ void osd_req_op_cls_request_data_pagelist( } 
EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist); -static void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req, - unsigned int which, struct page **pages, u64 length, - u32 offset, bool pages_from_pool, bool own_pages) -{ - struct ceph_osd_data *osd_data; - - osd_data = osd_req_op_data(osd_req, which, cls, request_data); - ceph_osd_data_pages_init(osd_data, pages, length, offset, - pages_from_pool, own_pages); - osd_req->r_ops[which].cls.indata_len += length; - osd_req->r_ops[which].indata_len += length; -} - void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req, unsigned int which, struct ceph_databuf *dbuf) @@ -5089,15 +5076,12 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, struct ceph_object_locator *oloc, const char *class, const char *method, unsigned int flags, - struct page *req_page, size_t req_len, + struct ceph_databuf *request, struct ceph_databuf *response) { struct ceph_osd_request *req; int ret; - if (req_len > PAGE_SIZE) - return -E2BIG; - req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_NOIO); if (!req) return -ENOMEM; @@ -5110,9 +5094,8 @@ int ceph_osdc_call(struct ceph_osd_client *osdc, if (ret) goto out_put_req; - if (req_page) - osd_req_op_cls_request_data_pages(req, 0, &req_page, req_len, - 0, false, false); + if (request) + osd_req_op_cls_request_databuf(req, 0, request); if (response) osd_req_op_cls_response_databuf(req, 0, response);
From patchwork Thu Mar 13 23:33:09 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016064 From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang , ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 17/35] libceph, rbd: Use ceph_databuf encoding start/stop Date: Thu, 13 Mar 2025 23:33:09 +0000 Message-ID: <20250313233341.1675324-18-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com>
Use ceph_databuf_enc_start() and ceph_databuf_enc_stop() to encode RPC parameter data where possible. The start function maps the buffer and returns a pointer to the point at which to start writing; the stop function updates the buffer size. The code is also made more consistent in its use of size_t for length variables and of 'request' as the name for a pointer to the request buffer. The end pointer is dropped from ceph_encode_string(): we shouldn't overrun, since the string length is already included in the buffer size precalculation, and the final pointer is checked by ceph_databuf_enc_stop().
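For illustration, the resulting encode pattern looks roughly like this (a minimal sketch modelled on the converted ceph_cls_unlock() below; buf_size, name and name_len stand in for the caller's precalculated values):

	struct ceph_databuf *request;
	void *p;

	request = ceph_databuf_req_alloc(1, buf_size, GFP_NOIO);
	if (!request)
		return -ENOMEM;

	p = ceph_databuf_enc_start(request);	/* map the buffer, get a write pointer */
	ceph_start_encoding(&p, 1, 1, buf_size - CEPH_ENCODING_START_BLK_LEN);
	ceph_encode_string(&p, name, name_len);	/* no end pointer needed any more */
	ceph_databuf_enc_stop(request, p);	/* unmap and record the length written */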
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- drivers/block/rbd.c | 3 +- include/linux/ceph/decode.h | 4 +- net/ceph/cls_lock_client.c | 195 +++++++++++++++++------------------- net/ceph/mon_client.c | 10 +- net/ceph/osd_client.c | 26 +++-- 5 files changed, 112 insertions(+), 126 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index ec09d578b0b0..078bb1e3e1da 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -5762,8 +5762,7 @@ static char *rbd_dev_image_name(struct rbd_device *rbd_dev) return NULL; p = image_id; - end = image_id + image_id_size; - ceph_encode_string(&p, end, rbd_dev->spec->image_id, (u32)len); + ceph_encode_string(&p, rbd_dev->spec->image_id, len); size = sizeof (__le32) + RBD_IMAGE_NAME_LEN_MAX; reply_buf = kmalloc(size, GFP_KERNEL); diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h index 8fc1aed64113..e2726c3152db 100644 --- a/include/linux/ceph/decode.h +++ b/include/linux/ceph/decode.h @@ -292,10 +292,8 @@ static inline void ceph_encode_filepath(void **p, void *end, *p += len; } -static inline void ceph_encode_string(void **p, void *end, - const char *s, u32 len) +static inline void ceph_encode_string(void **p, const char *s, u32 len) { - BUG_ON(*p + sizeof(len) + len > end); ceph_encode_32(p, len); if (len) memcpy(*p, s, len); diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c index 6c8608aabe5f..c91259ff8557 100644 --- a/net/ceph/cls_lock_client.c +++ b/net/ceph/cls_lock_client.c @@ -28,14 +28,14 @@ int ceph_cls_lock(struct ceph_osd_client *osdc, char *lock_name, u8 type, char *cookie, char *tag, char *desc, u8 flags) { - int lock_op_buf_size; - int name_len = strlen(lock_name); - int cookie_len = strlen(cookie); - int tag_len = strlen(tag); - int desc_len = strlen(desc); - void *p, *end; - struct ceph_databuf *lock_op_req; + struct ceph_databuf *request; struct timespec64 mtime; + size_t lock_op_buf_size; + size_t name_len = strlen(lock_name); + size_t cookie_len = strlen(cookie); + size_t tag_len = strlen(tag); + size_t desc_len = strlen(desc); + void *p; int ret; lock_op_buf_size = name_len + sizeof(__le32) + @@ -49,36 +49,34 @@ int ceph_cls_lock(struct ceph_osd_client *osdc, if (lock_op_buf_size > PAGE_SIZE) return -E2BIG; - lock_op_req = ceph_databuf_req_alloc(0, lock_op_buf_size, GFP_NOIO); - if (!lock_op_req) + request = ceph_databuf_req_alloc(1, lock_op_buf_size, GFP_NOIO); + if (!request) return -ENOMEM; - p = kmap_ceph_databuf_page(lock_op_req, 0); - end = p + lock_op_buf_size; + p = ceph_databuf_enc_start(request); /* encode cls_lock_lock_op struct */ ceph_start_encoding(&p, 1, 1, lock_op_buf_size - CEPH_ENCODING_START_BLK_LEN); - ceph_encode_string(&p, end, lock_name, name_len); + ceph_encode_string(&p, lock_name, name_len); ceph_encode_8(&p, type); - ceph_encode_string(&p, end, cookie, cookie_len); - ceph_encode_string(&p, end, tag, tag_len); - ceph_encode_string(&p, end, desc, desc_len); + ceph_encode_string(&p, cookie, cookie_len); + ceph_encode_string(&p, tag, tag_len); + ceph_encode_string(&p, desc, desc_len); /* only support infinite duration */ memset(&mtime, 0, sizeof(mtime)); ceph_encode_timespec64(p, &mtime); p += sizeof(struct ceph_timespec); ceph_encode_8(&p, flags); - kunmap_local(p); - ceph_databuf_added_data(lock_op_req, lock_op_buf_size); + ceph_databuf_enc_stop(request, p); dout("%s lock_name %s type %d cookie %s tag %s desc %s flags 0x%x\n", __func__, 
lock_name, type, cookie, tag, desc, flags); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "lock", - CEPH_OSD_FLAG_WRITE, lock_op_req, NULL); + CEPH_OSD_FLAG_WRITE, request, NULL); dout("%s: status %d\n", __func__, ret); - ceph_databuf_release(lock_op_req); + ceph_databuf_release(request); return ret; } EXPORT_SYMBOL(ceph_cls_lock); @@ -96,11 +94,11 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc, struct ceph_object_locator *oloc, char *lock_name, char *cookie) { - int unlock_op_buf_size; - int name_len = strlen(lock_name); - int cookie_len = strlen(cookie); - void *p, *end; - struct ceph_databuf *unlock_op_req; + struct ceph_databuf *request; + size_t unlock_op_buf_size; + size_t name_len = strlen(lock_name); + size_t cookie_len = strlen(cookie); + void *p; int ret; unlock_op_buf_size = name_len + sizeof(__le32) + @@ -109,27 +107,25 @@ int ceph_cls_unlock(struct ceph_osd_client *osdc, if (unlock_op_buf_size > PAGE_SIZE) return -E2BIG; - unlock_op_req = ceph_databuf_req_alloc(0, unlock_op_buf_size, GFP_NOIO); - if (!unlock_op_req) + request = ceph_databuf_req_alloc(1, unlock_op_buf_size, GFP_NOIO); + if (!request) return -ENOMEM; - p = kmap_ceph_databuf_page(unlock_op_req, 0); - end = p + unlock_op_buf_size; + p = ceph_databuf_enc_start(request); /* encode cls_lock_unlock_op struct */ ceph_start_encoding(&p, 1, 1, unlock_op_buf_size - CEPH_ENCODING_START_BLK_LEN); - ceph_encode_string(&p, end, lock_name, name_len); - ceph_encode_string(&p, end, cookie, cookie_len); - kunmap_local(p); - ceph_databuf_added_data(unlock_op_req, unlock_op_buf_size); + ceph_encode_string(&p, lock_name, name_len); + ceph_encode_string(&p, cookie, cookie_len); + ceph_databuf_enc_stop(request, p); dout("%s lock_name %s cookie %s\n", __func__, lock_name, cookie); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock", - CEPH_OSD_FLAG_WRITE, unlock_op_req, NULL); + CEPH_OSD_FLAG_WRITE, request, NULL); dout("%s: status %d\n", __func__, ret); - ceph_databuf_release(unlock_op_req); + ceph_databuf_release(request); return ret; } EXPORT_SYMBOL(ceph_cls_unlock); @@ -149,11 +145,11 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc, char *lock_name, char *cookie, struct ceph_entity_name *locker) { - int break_op_buf_size; - int name_len = strlen(lock_name); - int cookie_len = strlen(cookie); - struct ceph_databuf *break_op_req; - void *p, *end; + struct ceph_databuf *request; + size_t break_op_buf_size; + size_t name_len = strlen(lock_name); + size_t cookie_len = strlen(cookie); + void *p; int ret; break_op_buf_size = name_len + sizeof(__le32) + @@ -163,29 +159,27 @@ int ceph_cls_break_lock(struct ceph_osd_client *osdc, if (break_op_buf_size > PAGE_SIZE) return -E2BIG; - break_op_req = ceph_databuf_req_alloc(0, break_op_buf_size, GFP_NOIO); - if (!break_op_req) + request = ceph_databuf_req_alloc(1, break_op_buf_size, GFP_NOIO); + if (!request) return -ENOMEM; - p = kmap_ceph_databuf_page(break_op_req, 0); - end = p + break_op_buf_size; + p = ceph_databuf_enc_start(request); /* encode cls_lock_break_op struct */ ceph_start_encoding(&p, 1, 1, break_op_buf_size - CEPH_ENCODING_START_BLK_LEN); - ceph_encode_string(&p, end, lock_name, name_len); + ceph_encode_string(&p, lock_name, name_len); ceph_encode_copy(&p, locker, sizeof(*locker)); - ceph_encode_string(&p, end, cookie, cookie_len); - kunmap_local(p); - ceph_databuf_added_data(break_op_req, break_op_buf_size); + ceph_encode_string(&p, cookie, cookie_len); + ceph_databuf_enc_stop(request, p); dout("%s lock_name %s cookie %s locker %s%llu\n", __func__, lock_name, 
cookie, ENTITY_NAME(*locker)); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "break_lock", - CEPH_OSD_FLAG_WRITE, break_op_req, NULL); + CEPH_OSD_FLAG_WRITE, request, NULL); dout("%s: status %d\n", __func__, ret); - ceph_databuf_release(break_op_req); + ceph_databuf_release(request); return ret; } EXPORT_SYMBOL(ceph_cls_break_lock); @@ -196,13 +190,13 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc, char *lock_name, u8 type, char *old_cookie, char *tag, char *new_cookie) { - int cookie_op_buf_size; - int name_len = strlen(lock_name); - int old_cookie_len = strlen(old_cookie); - int tag_len = strlen(tag); - int new_cookie_len = strlen(new_cookie); - void *p, *end; - struct ceph_databuf *cookie_op_req; + struct ceph_databuf *request; + size_t cookie_op_buf_size; + size_t name_len = strlen(lock_name); + size_t old_cookie_len = strlen(old_cookie); + size_t tag_len = strlen(tag); + size_t new_cookie_len = strlen(new_cookie); + void *p; int ret; cookie_op_buf_size = name_len + sizeof(__le32) + @@ -213,31 +207,29 @@ int ceph_cls_set_cookie(struct ceph_osd_client *osdc, if (cookie_op_buf_size > PAGE_SIZE) return -E2BIG; - cookie_op_req = ceph_databuf_req_alloc(0, cookie_op_buf_size, GFP_NOIO); - if (!cookie_op_req) + request = ceph_databuf_req_alloc(1, cookie_op_buf_size, GFP_NOIO); + if (!request) return -ENOMEM; - p = kmap_ceph_databuf_page(cookie_op_req, 0); - end = p + cookie_op_buf_size; + p = ceph_databuf_enc_start(request); /* encode cls_lock_set_cookie_op struct */ ceph_start_encoding(&p, 1, 1, cookie_op_buf_size - CEPH_ENCODING_START_BLK_LEN); - ceph_encode_string(&p, end, lock_name, name_len); + ceph_encode_string(&p, lock_name, name_len); ceph_encode_8(&p, type); - ceph_encode_string(&p, end, old_cookie, old_cookie_len); - ceph_encode_string(&p, end, tag, tag_len); - ceph_encode_string(&p, end, new_cookie, new_cookie_len); - kunmap_local(p); - ceph_databuf_added_data(cookie_op_req, cookie_op_buf_size); + ceph_encode_string(&p, old_cookie, old_cookie_len); + ceph_encode_string(&p, tag, tag_len); + ceph_encode_string(&p, new_cookie, new_cookie_len); + ceph_databuf_enc_stop(request, p); dout("%s lock_name %s type %d old_cookie %s tag %s new_cookie %s\n", __func__, lock_name, type, old_cookie, tag, new_cookie); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "set_cookie", - CEPH_OSD_FLAG_WRITE, cookie_op_req, NULL); + CEPH_OSD_FLAG_WRITE, request, NULL); dout("%s: status %d\n", __func__, ret); - ceph_databuf_release(cookie_op_req); + ceph_databuf_release(request); return ret; } EXPORT_SYMBOL(ceph_cls_set_cookie); @@ -289,9 +281,10 @@ static int decode_locker(void **p, void *end, struct ceph_locker *locker) return 0; } -static int decode_lockers(void **p, void *end, u8 *type, char **tag, +static int decode_lockers(void **p, size_t size, u8 *type, char **tag, struct ceph_locker **lockers, u32 *num_lockers) { + void *end = *p + size; u8 struct_v; u32 struct_len; char *s; @@ -341,11 +334,10 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, char *lock_name, u8 *type, char **tag, struct ceph_locker **lockers, u32 *num_lockers) { - struct ceph_databuf *reply; - int get_info_op_buf_size; - int name_len = strlen(lock_name); - struct ceph_databuf *get_info_op_req; - void *p, *end; + struct ceph_databuf *request, *reply; + size_t get_info_op_buf_size; + size_t name_len = strlen(lock_name); + void *p; int ret; get_info_op_buf_size = name_len + sizeof(__le32) + @@ -353,42 +345,39 @@ int ceph_cls_lock_info(struct ceph_osd_client *osdc, if (get_info_op_buf_size > PAGE_SIZE) return -E2BIG; - 
get_info_op_req = ceph_databuf_req_alloc(0, get_info_op_buf_size, - GFP_NOIO); - if (!get_info_op_req) + request = ceph_databuf_req_alloc(1, get_info_op_buf_size, GFP_NOIO); + if (!request) return -ENOMEM; reply = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO); if (!reply) { - ceph_databuf_release(get_info_op_req); + ceph_databuf_release(request); return -ENOMEM; } - p = kmap_ceph_databuf_page(get_info_op_req, 0); - end = p + get_info_op_buf_size; + p = ceph_databuf_enc_start(request); /* encode cls_lock_get_info_op struct */ ceph_start_encoding(&p, 1, 1, get_info_op_buf_size - CEPH_ENCODING_START_BLK_LEN); - ceph_encode_string(&p, end, lock_name, name_len); - kunmap_local(p); - ceph_databuf_added_data(get_info_op_req, get_info_op_buf_size); + ceph_encode_string(&p, lock_name, name_len); + ceph_databuf_enc_stop(request, p); dout("%s lock_name %s\n", __func__, lock_name); ret = ceph_osdc_call(osdc, oid, oloc, "lock", "get_info", - CEPH_OSD_FLAG_READ, get_info_op_req, reply); + CEPH_OSD_FLAG_READ, request, reply); dout("%s: status %d\n", __func__, ret); if (ret >= 0) { p = kmap_ceph_databuf_page(reply, 0); - end = p + ceph_databuf_len(reply); - ret = decode_lockers(&p, end, type, tag, lockers, num_lockers); + ret = decode_lockers(&p, ceph_databuf_len(reply), + type, tag, lockers, num_lockers); kunmap_local(p); } ceph_databuf_release(reply); - ceph_databuf_release(get_info_op_req); + ceph_databuf_release(request); return ret; } EXPORT_SYMBOL(ceph_cls_lock_info); @@ -396,12 +385,12 @@ EXPORT_SYMBOL(ceph_cls_lock_info); int ceph_cls_assert_locked(struct ceph_osd_request *req, int which, char *lock_name, u8 type, char *cookie, char *tag) { - struct ceph_databuf *dbuf; - int assert_op_buf_size; - int name_len = strlen(lock_name); - int cookie_len = strlen(cookie); - int tag_len = strlen(tag); - void *p, *end; + struct ceph_databuf *request; + size_t assert_op_buf_size; + size_t name_len = strlen(lock_name); + size_t cookie_len = strlen(cookie); + size_t tag_len = strlen(tag); + void *p; int ret; assert_op_buf_size = name_len + sizeof(__le32) + @@ -415,25 +404,23 @@ int ceph_cls_assert_locked(struct ceph_osd_request *req, int which, if (ret) return ret; - dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO); - if (!dbuf) + request = ceph_databuf_req_alloc(1, assert_op_buf_size, GFP_NOIO); + if (!request) return -ENOMEM; - p = kmap_ceph_databuf_page(dbuf, 0); - end = p + assert_op_buf_size; + p = ceph_databuf_enc_start(request); /* encode cls_lock_assert_op struct */ ceph_start_encoding(&p, 1, 1, assert_op_buf_size - CEPH_ENCODING_START_BLK_LEN); - ceph_encode_string(&p, end, lock_name, name_len); + ceph_encode_string(&p, lock_name, name_len); ceph_encode_8(&p, type); - ceph_encode_string(&p, end, cookie, cookie_len); - ceph_encode_string(&p, end, tag, tag_len); - kunmap(p); - WARN_ON(p != end); - ceph_databuf_added_data(dbuf, assert_op_buf_size); + ceph_encode_string(&p, cookie, cookie_len); + ceph_encode_string(&p, tag, tag_len); + ceph_databuf_enc_stop(request, p); + WARN_ON(ceph_databuf_len(request) != assert_op_buf_size); - osd_req_op_cls_request_databuf(req, which, dbuf); + osd_req_op_cls_request_databuf(req, which, request); return 0; } EXPORT_SYMBOL(ceph_cls_assert_locked); diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c index ab66b599ac47..39103e4bb07d 100644 --- a/net/ceph/mon_client.c +++ b/net/ceph/mon_client.c @@ -367,7 +367,8 @@ static void __send_subscribe(struct ceph_mon_client *monc) dout("%s %s start %llu flags 0x%x\n", __func__, buf, 
le64_to_cpu(monc->subs[i].item.start), monc->subs[i].item.flags); - ceph_encode_string(&p, end, buf, len); + BUG_ON(p + sizeof(__le32) + len > end); + ceph_encode_string(&p, buf, len); memcpy(p, &monc->subs[i].item, sizeof(monc->subs[i].item)); p += sizeof(monc->subs[i].item); } @@ -854,13 +855,14 @@ __ceph_monc_get_version(struct ceph_mon_client *monc, const char *what, ceph_monc_callback_t cb, u64 private_data) { struct ceph_mon_generic_request *req; + size_t wsize = strlen(what); req = alloc_generic_request(monc, GFP_NOIO); if (!req) goto err_put_req; req->request = ceph_msg_new(CEPH_MSG_MON_GET_VERSION, - sizeof(u64) + sizeof(u32) + strlen(what), + sizeof(u64) + sizeof(u32) + wsize, GFP_NOIO, true); if (!req->request) goto err_put_req; @@ -873,6 +875,8 @@ __ceph_monc_get_version(struct ceph_mon_client *monc, const char *what, req->complete_cb = cb; req->private_data = private_data; + BUG_ON(sizeof(__le64) + sizeof(__le32) + wsize > req->request->front_alloc_len); + mutex_lock(&monc->mutex); register_generic_request(req); { @@ -880,7 +884,7 @@ __ceph_monc_get_version(struct ceph_mon_client *monc, const char *what, void *const end = p + req->request->front_alloc_len; ceph_encode_64(&p, req->tid); /* handle */ - ceph_encode_string(&p, end, what, strlen(what)); + ceph_encode_string(&p, what, wsize); WARN_ON(p != end); } send_generic_request(monc, req); diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index c4525feb8e26..b4adb299f9cd 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1831,15 +1831,15 @@ static int hoid_encoding_size(const struct ceph_hobject_id *hoid) 4 + hoid->key_len + 4 + hoid->oid_len + 4 + hoid->nspace_len; } -static void encode_hoid(void **p, void *end, const struct ceph_hobject_id *hoid) +static void encode_hoid(void **p, const struct ceph_hobject_id *hoid) { ceph_start_encoding(p, 4, 3, hoid_encoding_size(hoid)); - ceph_encode_string(p, end, hoid->key, hoid->key_len); - ceph_encode_string(p, end, hoid->oid, hoid->oid_len); + ceph_encode_string(p, hoid->key, hoid->key_len); + ceph_encode_string(p, hoid->oid, hoid->oid_len); ceph_encode_64(p, hoid->snapid); ceph_encode_32(p, hoid->hash); ceph_encode_8(p, hoid->is_max); - ceph_encode_string(p, end, hoid->nspace, hoid->nspace_len); + ceph_encode_string(p, hoid->nspace, hoid->nspace_len); ceph_encode_64(p, hoid->pool); } @@ -2072,16 +2072,14 @@ static void encode_spgid(void **p, const struct ceph_spg *spgid) ceph_encode_8(p, spgid->shard); } -static void encode_oloc(void **p, void *end, - const struct ceph_object_locator *oloc) +static void encode_oloc(void **p, const struct ceph_object_locator *oloc) { ceph_start_encoding(p, 5, 4, ceph_oloc_encoding_size(oloc)); ceph_encode_64(p, oloc->pool); ceph_encode_32(p, -1); /* preferred */ ceph_encode_32(p, 0); /* key len */ if (oloc->pool_ns) - ceph_encode_string(p, end, oloc->pool_ns->str, - oloc->pool_ns->len); + ceph_encode_string(p, oloc->pool_ns->str, oloc->pool_ns->len); else ceph_encode_32(p, 0); } @@ -2122,8 +2120,8 @@ static void encode_request_partial(struct ceph_osd_request *req, ceph_encode_timespec64(p, &req->r_mtime); p += sizeof(struct ceph_timespec); - encode_oloc(&p, end, &req->r_t.target_oloc); - ceph_encode_string(&p, end, req->r_t.target_oid.name, + encode_oloc(&p, &req->r_t.target_oloc); + ceph_encode_string(&p, req->r_t.target_oid.name, req->r_t.target_oid.name_len); /* ops, can imply data */ @@ -4329,8 +4327,8 @@ static struct ceph_msg *create_backoff_message( ceph_encode_32(&p, map_epoch); ceph_encode_8(&p, 
CEPH_OSD_BACKOFF_OP_ACK_BLOCK); ceph_encode_64(&p, backoff->id); - encode_hoid(&p, end, backoff->begin); - encode_hoid(&p, end, backoff->end); + encode_hoid(&p, backoff->begin); + encode_hoid(&p, backoff->end); BUG_ON(p != end); msg->front.iov_len = p - msg->front.iov_base; @@ -5264,8 +5262,8 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req, p = page_address(pages[0]); end = p + PAGE_SIZE; - ceph_encode_string(&p, end, src_oid->name, src_oid->name_len); - encode_oloc(&p, end, src_oloc); + ceph_encode_string(&p, src_oid->name, src_oid->name_len); + encode_oloc(&p, src_oloc); ceph_encode_32(&p, truncate_seq); ceph_encode_64(&p, truncate_size); op->indata_len = PAGE_SIZE - (end - p);
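The end-pointer plumbing disappears in the hunks above because the databuf encoding cursor carries its own bounds. For readers not following every hunk, here is a condensed sketch of the request-encoding pattern the converted cls_lock code settles on. The function is a hypothetical composite, not part of the series; the ceph_databuf_*() and ceph_encode_*() calls, however, use the post-conversion signatures shown above.

	/* Sketch only: issue a cls "lock.unlock" call with the new API. */
	static int example_cls_unlock(struct ceph_osd_client *osdc,
				      struct ceph_object_id *oid,
				      struct ceph_object_locator *oloc,
				      char *lock_name, char *cookie)
	{
		size_t name_len = strlen(lock_name);
		size_t cookie_len = strlen(cookie);
		size_t buf_size = name_len + sizeof(__le32) +
				  cookie_len + sizeof(__le32) +
				  CEPH_ENCODING_START_BLK_LEN;
		struct ceph_databuf *request;
		void *p;
		int ret;

		if (buf_size > PAGE_SIZE)
			return -E2BIG;

		request = ceph_databuf_req_alloc(1, buf_size, GFP_NOIO);
		if (!request)
			return -ENOMEM;

		/* enc_start hands out a cursor; enc_stop records how much
		 * was written, replacing the old kmap/end-pointer/
		 * added_data bookkeeping. */
		p = ceph_databuf_enc_start(request);
		ceph_start_encoding(&p, 1, 1,
				    buf_size - CEPH_ENCODING_START_BLK_LEN);
		ceph_encode_string(&p, lock_name, name_len);
		ceph_encode_string(&p, cookie, cookie_len);
		ceph_databuf_enc_stop(request, p);

		ret = ceph_osdc_call(osdc, oid, oloc, "lock", "unlock",
				     CEPH_OSD_FLAG_WRITE, request, NULL);
		ceph_databuf_release(request);
		return ret;
	}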
From patchwork Thu Mar 13 23:33:10 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016067 From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang , ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 18/35] libceph, rbd: Convert some page arrays to ceph_databuf Date: Thu, 13 Mar 2025 23:33:10 +0000 Message-ID: <20250313233341.1675324-19-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List: ceph-devel@vger.kernel.org Convert some miscellaneous page arrays to ceph_databuf containers. Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- drivers/block/rbd.c | 12 ++++----- include/linux/ceph/osd_client.h | 3 +++ net/ceph/osd_client.c | 43 +++++++++++++++++++++------------ 3 files changed, 36 insertions(+), 22 deletions(-)
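Before the diff, a sketch of the shape each converted call site takes. The helper below merely restates the rbd_osd_setup_stat() hunk that follows (treat it as a sketch, not new API documentation): the raw single-page vector becomes a reply databuf sized for the STAT payload, and the IS_ERR()/PTR_ERR() dance becomes a plain NULL check.

	/* Sketch: response buffer for a STAT reply (le64 length + mtime). */
	static int example_setup_stat(struct ceph_osd_request *osd_req,
				      int which)
	{
		struct ceph_databuf *dbuf;

		dbuf = ceph_databuf_reply_alloc(1,
					8 + sizeof(struct ceph_timespec),
					GFP_NOIO);
		if (!dbuf)
			return -ENOMEM;

		osd_req_op_init(osd_req, which, CEPH_OSD_OP_STAT, 0);
		osd_req_op_raw_data_in_databuf(osd_req, which, dbuf);
		return 0;
	}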
diff --git a/drivers/block/rbd.c index 078bb1e3e1da..eea12c7ab2a0 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2108,7 +2108,7 @@ static int rbd_obj_calc_img_extents(struct rbd_obj_request *obj_req, static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which) { - struct page **pages; + struct ceph_databuf *dbuf; /* * The response data for a STAT call consists of: @@ -2118,14 +2118,12 @@ static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which) * le32 tv_nsec; * } mtime; */ - pages = ceph_alloc_page_vector(1, GFP_NOIO); - if (IS_ERR(pages)) - return PTR_ERR(pages); + dbuf = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO); + if (!dbuf) + return -ENOMEM; osd_req_op_init(osd_req, which, CEPH_OSD_OP_STAT, 0); - osd_req_op_raw_data_in_pages(osd_req, which, pages, - 8 + sizeof(struct ceph_timespec), - 0, false, true); + osd_req_op_raw_data_in_databuf(osd_req, which, dbuf); return 0; } diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index d31e59bd128c..6e126e212271 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -482,6 +482,9 @@ extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *, struct page **pages, u64 length, u32 offset, bool pages_from_pool, bool own_pages); +void osd_req_op_raw_data_in_databuf(struct ceph_osd_request *osd_req, + unsigned int which, + struct ceph_databuf *databuf); extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *, unsigned int which, struct ceph_pagelist *pagelist); diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index b4adb299f9cd..64a06267e7b3 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -182,6 +182,17 @@ osd_req_op_extent_osd_data(struct ceph_osd_request *osd_req, } EXPORT_SYMBOL(osd_req_op_extent_osd_data); +void osd_req_op_raw_data_in_databuf(struct ceph_osd_request *osd_req, + unsigned int which, + struct ceph_databuf *dbuf) +{ + struct ceph_osd_data *osd_data; + + osd_data = osd_req_op_raw_data_in(osd_req, which); + ceph_osd_databuf_init(osd_data, dbuf); +} +EXPORT_SYMBOL(osd_req_op_raw_data_in_databuf); + void osd_req_op_raw_data_in_pages(struct ceph_osd_request *osd_req, unsigned int which, struct page **pages, u64 length, u32 offset, @@ -5000,7 +5011,7 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc, u32 *num_watchers) { struct ceph_osd_request *req; - struct page **pages; + struct ceph_databuf *dbuf; int ret; req = ceph_osdc_alloc_request(osdc, NULL, 1, false, GFP_NOIO); @@ -5011,16 +5022,16 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc, ceph_oloc_copy(&req->r_base_oloc, oloc); req->r_flags = CEPH_OSD_FLAG_READ; - pages = ceph_alloc_page_vector(1, GFP_NOIO); - if (IS_ERR(pages)) { - ret = PTR_ERR(pages); + dbuf = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO); + if (!dbuf) { + ret = -ENOMEM; goto out_put_req; } osd_req_op_init(req, 0, CEPH_OSD_OP_LIST_WATCHERS, 0); - ceph_osd_data_pages_init(osd_req_op_data(req, 0, list_watchers, - response_data), - pages, PAGE_SIZE, 0, false, true); + ceph_osd_databuf_init(osd_req_op_data(req, 0, list_watchers, + response_data), + dbuf); ret = ceph_osdc_alloc_messages(req, GFP_NOIO); if (ret) @@ -5029,10 +5040,11 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc, ceph_osdc_start_request(osdc, req); ret = ceph_osdc_wait_request(osdc, req); if (ret >= 0) { - void *p = page_address(pages[0]); + void *p = kmap_ceph_databuf_page(dbuf, 0); void *const end = p + req->r_ops[0].outdata_len; ret = decode_watchers(&p, end, watchers, num_watchers); + kunmap(p); } out_put_req: @@ -5246,12 +5258,12 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req, u8 copy_from_flags) { struct ceph_osd_req_op *op; - struct page **pages; + struct ceph_databuf *dbuf; void *p, *end; - pages = ceph_alloc_page_vector(1, GFP_KERNEL); - if (IS_ERR(pages)) - return PTR_ERR(pages); + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL); + if (!dbuf) + return -ENOMEM; op = osd_req_op_init(req, 0, CEPH_OSD_OP_COPY_FROM2, dst_fadvise_flags); @@ -5260,16 +5272,17 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req, op->copy_from.flags = copy_from_flags; op->copy_from.src_fadvise_flags = src_fadvise_flags; - p = page_address(pages[0]); + p = kmap_ceph_databuf_page(dbuf, 0); end = p + PAGE_SIZE; ceph_encode_string(&p, src_oid->name, src_oid->name_len); encode_oloc(&p, src_oloc); ceph_encode_32(&p, truncate_seq); ceph_encode_64(&p, truncate_size); op->indata_len = PAGE_SIZE - (end - p); + ceph_databuf_added_data(dbuf, op->indata_len); + kunmap_local(p); - ceph_osd_data_pages_init(&op->copy_from.osd_data, pages, - op->indata_len, 0, false, true); + ceph_osd_databuf_init(&op->copy_from.osd_data, dbuf); return 0; } EXPORT_SYMBOL(osd_req_op_copy_from_init);
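A short sketch of the read side used above: ceph_osdc_list_watchers() maps the first page of the reply databuf, decodes, and unmaps. The wrapper and its decode callback below are illustrative only; the mapping calls are the ones this patch introduces. (Note in passing that the list_watchers hunk pairs kmap_ceph_databuf_page() with plain kunmap() while the copy_from hunk uses kunmap_local(); the sketch assumes the local variant.)

	/* Sketch: decode a reply held in the first page of a databuf. */
	static int example_decode_reply(struct ceph_databuf *reply, size_t len,
					int (*decode)(void **p, void *end))
	{
		void *p = kmap_ceph_databuf_page(reply, 0);
		void *end = p + len;
		int ret = decode(&p, end);

		kunmap_local(p);
		return ret;
	}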
From patchwork Thu Mar 13 23:33:11 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016066 From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang , ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 19/35] libceph, ceph: Convert users of ceph_pagelist to ceph_databuf Date: Thu, 13 Mar 2025 23:33:11 +0000 Message-ID: <20250313233341.1675324-20-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List:
ceph-devel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.0 on 10.30.177.15 Convert users of ceph_pagelist to use ceph_databuf instead. ceph_pagelist is then unused and can be removed. Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- fs/ceph/locks.c | 22 +++--- fs/ceph/mds_client.c | 122 +++++++++++++++----------------- fs/ceph/super.h | 6 +- include/linux/ceph/osd_client.h | 2 +- net/ceph/osd_client.c | 61 ++++++++-------- 5 files changed, 104 insertions(+), 109 deletions(-) diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index ebf4ac0055dd..32c7b0f0d61f 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -371,8 +371,8 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl) } /* - * Fills in the passed counter variables, so you can prepare pagelist metadata - * before calling ceph_encode_locks. + * Fills in the passed counter variables, so you can prepare metadata before + * calling ceph_encode_locks. */ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count) { @@ -483,38 +483,38 @@ int ceph_encode_locks_to_buffer(struct inode *inode, } /* - * Copy the encoded flock and fcntl locks into the pagelist. + * Copy the encoded flock and fcntl locks into the data buffer. * Format is: #fcntl locks, sequential fcntl locks, #flock locks, * sequential flock locks. * Returns zero on success. */ -int ceph_locks_to_pagelist(struct ceph_filelock *flocks, - struct ceph_pagelist *pagelist, +int ceph_locks_to_databuf(struct ceph_filelock *flocks, + struct ceph_databuf *dbuf, int num_fcntl_locks, int num_flock_locks) { int err = 0; __le32 nlocks; nlocks = cpu_to_le32(num_fcntl_locks); - err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks)); + err = ceph_databuf_append(dbuf, &nlocks, sizeof(nlocks)); if (err) goto out_fail; if (num_fcntl_locks > 0) { - err = ceph_pagelist_append(pagelist, flocks, - num_fcntl_locks * sizeof(*flocks)); + err = ceph_databuf_append(dbuf, flocks, + num_fcntl_locks * sizeof(*flocks)); if (err) goto out_fail; } nlocks = cpu_to_le32(num_flock_locks); - err = ceph_pagelist_append(pagelist, &nlocks, sizeof(nlocks)); + err = ceph_databuf_append(dbuf, &nlocks, sizeof(nlocks)); if (err) goto out_fail; if (num_flock_locks > 0) { - err = ceph_pagelist_append(pagelist, &flocks[num_fcntl_locks], - num_flock_locks * sizeof(*flocks)); + err = ceph_databuf_append(dbuf, &flocks[num_fcntl_locks], + num_flock_locks * sizeof(*flocks)); } out_fail: return err; diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 09661a34f287..f1c6d0ebf548 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -55,7 +55,7 @@ struct ceph_reconnect_state { struct ceph_mds_session *session; int nr_caps, nr_realms; - struct ceph_pagelist *pagelist; + struct ceph_databuf *dbuf; unsigned msg_version; bool allow_multi; }; @@ -4456,8 +4456,7 @@ static void replay_unsafe_requests(struct ceph_mds_client *mdsc, static int send_reconnect_partial(struct ceph_reconnect_state *recon_state) { struct ceph_msg *reply; - struct ceph_pagelist *_pagelist; - struct page *page; + struct ceph_databuf *_dbuf; __le32 *addr; int err = -ENOMEM; @@ -4467,9 +4466,9 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state) /* can't handle message that contains both caps and realm */ BUG_ON(!recon_state->nr_caps == !recon_state->nr_realms); - /* pre-allocate new pagelist */ - _pagelist = 
ceph_pagelist_alloc(GFP_NOFS); - if (!_pagelist) + /* pre-allocate new databuf */ + _dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOFS); + if (!_dbuf) return -ENOMEM; reply = ceph_msg_new2(CEPH_MSG_CLIENT_RECONNECT, 0, 1, GFP_NOFS, false); @@ -4477,28 +4476,27 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state) goto fail_msg; /* placeholder for nr_caps */ - err = ceph_pagelist_encode_32(_pagelist, 0); + err = ceph_databuf_encode_32(_dbuf, 0); if (err < 0) goto fail; if (recon_state->nr_caps) { /* currently encoding caps */ - err = ceph_pagelist_encode_32(recon_state->pagelist, 0); + err = ceph_databuf_encode_32(recon_state->dbuf, 0); if (err) goto fail; } else { /* placeholder for nr_realms (currently encoding relams) */ - err = ceph_pagelist_encode_32(_pagelist, 0); + err = ceph_databuf_encode_32(_dbuf, 0); if (err < 0) goto fail; } - err = ceph_pagelist_encode_8(recon_state->pagelist, 1); + err = ceph_databuf_encode_8(recon_state->dbuf, 1); if (err) goto fail; - page = list_first_entry(&recon_state->pagelist->head, struct page, lru); - addr = kmap_atomic(page); + addr = kmap_ceph_databuf_page(recon_state->dbuf, 0); if (recon_state->nr_caps) { /* currently encoding caps */ *addr = cpu_to_le32(recon_state->nr_caps); @@ -4506,18 +4504,18 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state) /* currently encoding relams */ *(addr + 1) = cpu_to_le32(recon_state->nr_realms); } - kunmap_atomic(addr); + kunmap_local(addr); reply->hdr.version = cpu_to_le16(5); reply->hdr.compat_version = cpu_to_le16(4); - reply->hdr.data_len = cpu_to_le32(recon_state->pagelist->length); - ceph_msg_data_add_pagelist(reply, recon_state->pagelist); + reply->hdr.data_len = cpu_to_le32(ceph_databuf_len(recon_state->dbuf)); + ceph_msg_data_add_databuf(reply, recon_state->dbuf); ceph_con_send(&recon_state->session->s_con, reply); - ceph_pagelist_release(recon_state->pagelist); + ceph_databuf_release(recon_state->dbuf); - recon_state->pagelist = _pagelist; + recon_state->dbuf = _dbuf; recon_state->nr_caps = 0; recon_state->nr_realms = 0; recon_state->msg_version = 5; @@ -4525,7 +4523,7 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state) fail: ceph_msg_put(reply); fail_msg: - ceph_pagelist_release(_pagelist); + ceph_databuf_release(_dbuf); return err; } @@ -4575,7 +4573,7 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg) } rec; struct ceph_inode_info *ci = ceph_inode(inode); struct ceph_reconnect_state *recon_state = arg; - struct ceph_pagelist *pagelist = recon_state->pagelist; + struct ceph_databuf *dbuf = recon_state->dbuf; struct dentry *dentry; struct ceph_cap *cap; char *path; @@ -4698,7 +4696,7 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg) struct_v = 2; } /* - * number of encoded locks is stable, so copy to pagelist + * number of encoded locks is stable, so copy to databuf */ struct_len = 2 * sizeof(u32) + (num_fcntl_locks + num_flock_locks) * @@ -4712,41 +4710,42 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg) total_len += struct_len; - if (pagelist->length + total_len > RECONNECT_MAX_SIZE) { + if (ceph_databuf_len(dbuf) + total_len > RECONNECT_MAX_SIZE) { err = send_reconnect_partial(recon_state); if (err) goto out_freeflocks; - pagelist = recon_state->pagelist; + dbuf = recon_state->dbuf; } - err = ceph_pagelist_reserve(pagelist, total_len); + err = ceph_databuf_reserve(dbuf, total_len, GFP_NOFS); if (err) goto out_freeflocks; - 
ceph_pagelist_encode_64(pagelist, ceph_ino(inode)); + ceph_databuf_encode_64(dbuf, ceph_ino(inode)); if (recon_state->msg_version >= 3) { - ceph_pagelist_encode_8(pagelist, struct_v); - ceph_pagelist_encode_8(pagelist, 1); - ceph_pagelist_encode_32(pagelist, struct_len); + ceph_databuf_encode_8(dbuf, struct_v); + ceph_databuf_encode_8(dbuf, 1); + ceph_databuf_encode_32(dbuf, struct_len); } - ceph_pagelist_encode_string(pagelist, path, pathlen); - ceph_pagelist_append(pagelist, &rec, sizeof(rec.v2)); - ceph_locks_to_pagelist(flocks, pagelist, - num_fcntl_locks, num_flock_locks); + ceph_databuf_encode_string(dbuf, path, pathlen); + ceph_databuf_append(dbuf, &rec, sizeof(rec.v2)); + ceph_locks_to_databuf(flocks, dbuf, + num_fcntl_locks, num_flock_locks); if (struct_v >= 2) - ceph_pagelist_encode_64(pagelist, snap_follows); + ceph_databuf_encode_64(dbuf, snap_follows); out_freeflocks: kfree(flocks); } else { - err = ceph_pagelist_reserve(pagelist, - sizeof(u64) + sizeof(u32) + - pathlen + sizeof(rec.v1)); + err = ceph_databuf_reserve(dbuf, + sizeof(u64) + sizeof(u32) + + pathlen + sizeof(rec.v1), + GFP_NOFS); if (err) goto out_err; - ceph_pagelist_encode_64(pagelist, ceph_ino(inode)); - ceph_pagelist_encode_string(pagelist, path, pathlen); - ceph_pagelist_append(pagelist, &rec, sizeof(rec.v1)); + ceph_databuf_encode_64(dbuf, ceph_ino(inode)); + ceph_databuf_encode_string(dbuf, path, pathlen); + ceph_databuf_append(dbuf, &rec, sizeof(rec.v1)); } out_err: @@ -4760,12 +4759,12 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc, struct ceph_reconnect_state *recon_state) { struct rb_node *p; - struct ceph_pagelist *pagelist = recon_state->pagelist; struct ceph_client *cl = mdsc->fsc->client; + struct ceph_databuf *dbuf = recon_state->dbuf; int err = 0; if (recon_state->msg_version >= 4) { - err = ceph_pagelist_encode_32(pagelist, mdsc->num_snap_realms); + err = ceph_databuf_encode_32(dbuf, mdsc->num_snap_realms); if (err < 0) goto fail; } @@ -4784,20 +4783,20 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc, size_t need = sizeof(u8) * 2 + sizeof(u32) + sizeof(sr_rec); - if (pagelist->length + need > RECONNECT_MAX_SIZE) { + if (ceph_databuf_len(dbuf) + need > RECONNECT_MAX_SIZE) { err = send_reconnect_partial(recon_state); if (err) goto fail; - pagelist = recon_state->pagelist; + dbuf = recon_state->dbuf; } - err = ceph_pagelist_reserve(pagelist, need); + err = ceph_databuf_reserve(dbuf, need, GFP_NOFS); if (err) goto fail; - ceph_pagelist_encode_8(pagelist, 1); - ceph_pagelist_encode_8(pagelist, 1); - ceph_pagelist_encode_32(pagelist, sizeof(sr_rec)); + ceph_databuf_encode_8(dbuf, 1); + ceph_databuf_encode_8(dbuf, 1); + ceph_databuf_encode_32(dbuf, sizeof(sr_rec)); } doutc(cl, " adding snap realm %llx seq %lld parent %llx\n", @@ -4806,7 +4805,7 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc, sr_rec.seq = cpu_to_le64(realm->seq); sr_rec.parent = cpu_to_le64(realm->parent_ino); - err = ceph_pagelist_append(pagelist, &sr_rec, sizeof(sr_rec)); + err = ceph_databuf_append(dbuf, &sr_rec, sizeof(sr_rec)); if (err) goto fail; @@ -4841,9 +4840,9 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc, pr_info_client(cl, "mds%d reconnect start\n", mds); - recon_state.pagelist = ceph_pagelist_alloc(GFP_NOFS); - if (!recon_state.pagelist) - goto fail_nopagelist; + recon_state.dbuf = ceph_databuf_req_alloc(1, 0, GFP_NOFS); + if (!recon_state.dbuf) + goto fail_nodatabuf; reply = ceph_msg_new2(CEPH_MSG_CLIENT_RECONNECT, 0, 1, GFP_NOFS, false); if (!reply) @@ -4891,7 
+4890,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc, down_read(&mdsc->snap_rwsem); /* placeholder for nr_caps */ - err = ceph_pagelist_encode_32(recon_state.pagelist, 0); + err = ceph_databuf_encode_32(recon_state.dbuf, 0); if (err) goto fail; @@ -4916,7 +4915,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc, /* check if all realms can be encoded into current message */ if (mdsc->num_snap_realms) { size_t total_len = - recon_state.pagelist->length + + ceph_databuf_len(recon_state.dbuf) + mdsc->num_snap_realms * sizeof(struct ceph_mds_snaprealm_reconnect); if (recon_state.msg_version >= 4) { @@ -4945,31 +4944,28 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc, goto fail; if (recon_state.msg_version >= 5) { - err = ceph_pagelist_encode_8(recon_state.pagelist, 0); + err = ceph_databuf_encode_8(recon_state.dbuf, 0); if (err < 0) goto fail; } if (recon_state.nr_caps || recon_state.nr_realms) { - struct page *page = - list_first_entry(&recon_state.pagelist->head, - struct page, lru); - __le32 *addr = kmap_atomic(page); + __le32 *addr = kmap_ceph_databuf_page(recon_state.dbuf, 0); if (recon_state.nr_caps) { WARN_ON(recon_state.nr_realms != mdsc->num_snap_realms); *addr = cpu_to_le32(recon_state.nr_caps); } else if (recon_state.msg_version >= 4) { *(addr + 1) = cpu_to_le32(recon_state.nr_realms); } - kunmap_atomic(addr); + kunmap_local(addr); } reply->hdr.version = cpu_to_le16(recon_state.msg_version); if (recon_state.msg_version >= 4) reply->hdr.compat_version = cpu_to_le16(4); - reply->hdr.data_len = cpu_to_le32(recon_state.pagelist->length); - ceph_msg_data_add_pagelist(reply, recon_state.pagelist); + reply->hdr.data_len = cpu_to_le32(ceph_databuf_len(recon_state.dbuf)); + ceph_msg_data_add_databuf(reply, recon_state.dbuf); ceph_con_send(&session->s_con, reply); @@ -4980,7 +4976,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc, mutex_unlock(&mdsc->mutex); up_read(&mdsc->snap_rwsem); - ceph_pagelist_release(recon_state.pagelist); + ceph_databuf_release(recon_state.dbuf); return; fail: @@ -4988,8 +4984,8 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc, up_read(&mdsc->snap_rwsem); mutex_unlock(&session->s_mutex); fail_nomsg: - ceph_pagelist_release(recon_state.pagelist); -fail_nopagelist: + ceph_databuf_release(recon_state.dbuf); +fail_nodatabuf: pr_err_client(cl, "error %d preparing reconnect for mds%d\n", err, mds); return; diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 984a6d2a5378..b072572e2cf4 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -1351,9 +1351,9 @@ extern int ceph_encode_locks_to_buffer(struct inode *inode, struct ceph_filelock *flocks, int num_fcntl_locks, int num_flock_locks); -extern int ceph_locks_to_pagelist(struct ceph_filelock *flocks, - struct ceph_pagelist *pagelist, - int num_fcntl_locks, int num_flock_locks); +extern int ceph_locks_to_databuf(struct ceph_filelock *flocks, + struct ceph_databuf *dbuf, + int num_fcntl_locks, int num_flock_locks); /* debugfs.c */ extern void ceph_fs_debugfs_init(struct ceph_fs_client *client); diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 6e126e212271..ce04205b8143 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -334,7 +334,7 @@ struct ceph_osd_linger_request { rados_watcherrcb_t errcb; void *data; - struct ceph_pagelist *request_pl; + struct ceph_databuf *request_pl; struct ceph_databuf *notify_id_buf; struct page ***preply_pages; diff --git a/net/ceph/osd_client.c 
b/net/ceph/osd_client.c index 64a06267e7b3..a967309d01a7 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -810,37 +810,37 @@ int osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned int which, { struct ceph_osd_req_op *op = osd_req_op_init(osd_req, which, opcode, 0); - struct ceph_pagelist *pagelist; + struct ceph_databuf *dbuf; size_t payload_len; int ret; BUG_ON(opcode != CEPH_OSD_OP_SETXATTR && opcode != CEPH_OSD_OP_CMPXATTR); - pagelist = ceph_pagelist_alloc(GFP_NOFS); - if (!pagelist) + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOFS); + if (!dbuf) return -ENOMEM; payload_len = strlen(name); op->xattr.name_len = payload_len; - ret = ceph_pagelist_append(pagelist, name, payload_len); + ret = ceph_databuf_append(dbuf, name, payload_len); if (ret) - goto err_pagelist_free; + goto err_databuf_free; op->xattr.value_len = size; - ret = ceph_pagelist_append(pagelist, value, size); + ret = ceph_databuf_append(dbuf, value, size); if (ret) - goto err_pagelist_free; + goto err_databuf_free; payload_len += size; op->xattr.cmp_op = cmp_op; op->xattr.cmp_mode = cmp_mode; - ceph_osd_data_pagelist_init(&op->xattr.osd_data, pagelist); + ceph_osd_databuf_init(&op->xattr.osd_data, dbuf); op->indata_len = payload_len; return 0; -err_pagelist_free: - ceph_pagelist_release(pagelist); +err_databuf_free: + ceph_databuf_release(dbuf); return ret; } EXPORT_SYMBOL(osd_req_op_xattr_init); @@ -864,15 +864,15 @@ static void osd_req_op_watch_init(struct ceph_osd_request *req, int which, * encoded in @request_pl */ static void osd_req_op_notify_init(struct ceph_osd_request *req, int which, - u64 cookie, struct ceph_pagelist *request_pl) + u64 cookie, struct ceph_databuf *request_pl) { struct ceph_osd_req_op *op; op = osd_req_op_init(req, which, CEPH_OSD_OP_NOTIFY, 0); op->notify.cookie = cookie; - ceph_osd_data_pagelist_init(&op->notify.request_data, request_pl); - op->indata_len = request_pl->length; + ceph_osd_databuf_init(&op->notify.request_data, request_pl); + op->indata_len = ceph_databuf_len(request_pl); } /* @@ -2730,8 +2730,7 @@ static void linger_release(struct kref *kref) WARN_ON(!list_empty(&lreq->pending_lworks)); WARN_ON(lreq->osd); - if (lreq->request_pl) - ceph_pagelist_release(lreq->request_pl); + ceph_databuf_release(lreq->request_pl); ceph_databuf_release(lreq->notify_id_buf); ceph_osdc_put_request(lreq->reg_req); ceph_osdc_put_request(lreq->ping_req); @@ -4800,30 +4799,30 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which, u32 payload_len) { struct ceph_osd_req_op *op; - struct ceph_pagelist *pl; + struct ceph_databuf *dbuf; int ret; op = osd_req_op_init(req, which, CEPH_OSD_OP_NOTIFY_ACK, 0); - pl = ceph_pagelist_alloc(GFP_NOIO); - if (!pl) + dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO); + if (!dbuf) return -ENOMEM; - ret = ceph_pagelist_encode_64(pl, notify_id); - ret |= ceph_pagelist_encode_64(pl, cookie); + ret = ceph_databuf_encode_64(dbuf, notify_id); + ret |= ceph_databuf_encode_64(dbuf, cookie); if (payload) { - ret |= ceph_pagelist_encode_32(pl, payload_len); - ret |= ceph_pagelist_append(pl, payload, payload_len); + ret |= ceph_databuf_encode_32(dbuf, payload_len); + ret |= ceph_databuf_append(dbuf, payload, payload_len); } else { - ret |= ceph_pagelist_encode_32(pl, 0); + ret |= ceph_databuf_encode_32(dbuf, 0); } if (ret) { - ceph_pagelist_release(pl); + ceph_databuf_release(dbuf); return -ENOMEM; } - ceph_osd_data_pagelist_init(&op->notify_ack.request_data, pl); - op->indata_len = pl->length; + 
ceph_osd_databuf_init(&op->notify_ack.request_data, dbuf); + op->indata_len = ceph_databuf_len(dbuf); return 0; } @@ -4894,16 +4893,16 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc, if (!lreq) return -ENOMEM; - lreq->request_pl = ceph_pagelist_alloc(GFP_NOIO); + lreq->request_pl = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO); if (!lreq->request_pl) { ret = -ENOMEM; goto out_put_lreq; } - ret = ceph_pagelist_encode_32(lreq->request_pl, 1); /* prot_ver */ - ret |= ceph_pagelist_encode_32(lreq->request_pl, timeout); - ret |= ceph_pagelist_encode_32(lreq->request_pl, payload_len); - ret |= ceph_pagelist_append(lreq->request_pl, payload, payload_len); + ret = ceph_databuf_encode_32(lreq->request_pl, 1); /* prot_ver */ + ret |= ceph_databuf_encode_32(lreq->request_pl, timeout); + ret |= ceph_databuf_encode_32(lreq->request_pl, payload_len); + ret |= ceph_databuf_append(lreq->request_pl, payload, payload_len); if (ret) { ret = -ENOMEM; goto out_put_lreq; }
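The ceph_pagelist API maps onto ceph_databuf nearly one-for-one, which is what makes the wholesale removal in the next patch possible. Below is a hedged summary of the correspondence this patch relies on; the sketch function is illustrative, but each call appears with this signature in the converted code above.

	/*
	 * Rough mapping used throughout this patch:
	 *   ceph_pagelist_alloc(gfp)        -> ceph_databuf_req_alloc(1, size, gfp)
	 *   ceph_pagelist_append(pl, d, l)  -> ceph_databuf_append(dbuf, d, l)
	 *   ceph_pagelist_encode_32(pl, v)  -> ceph_databuf_encode_32(dbuf, v)
	 *   ceph_pagelist_reserve(pl, n)    -> ceph_databuf_reserve(dbuf, n, gfp)
	 *   pl->length                      -> ceph_databuf_len(dbuf)
	 *   ceph_pagelist_release(pl)       -> ceph_databuf_release(dbuf)
	 */
	static int example_append_locks(struct ceph_databuf *dbuf,
					struct ceph_filelock *flocks, u32 nr)
	{
		__le32 nlocks = cpu_to_le32(nr);
		int err;

		/* Mirrors ceph_locks_to_databuf(): a count, then the array. */
		err = ceph_databuf_append(dbuf, &nlocks, sizeof(nlocks));
		if (!err && nr)
			err = ceph_databuf_append(dbuf, flocks,
						  nr * sizeof(*flocks));
		return err;
	}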
From patchwork Thu Mar 13 23:33:12 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016068 From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang , ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 20/35] libceph: Remove ceph_pagelist Date: Thu, 13 Mar 2025 23:33:12 +0000 Message-ID: <20250313233341.1675324-21-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List: ceph-devel@vger.kernel.org Remove ceph_pagelist and its helpers. Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- fs/ceph/locks.c | 1 - fs/ceph/mds_client.c | 1 - fs/ceph/xattr.c | 1 - include/linux/ceph/messenger.h | 8 -- include/linux/ceph/osd_client.h | 9 --- include/linux/ceph/pagelist.h | 60 -------------- net/ceph/Makefile | 2 +- net/ceph/messenger.c | 110 -------------------------- net/ceph/osd_client.c | 41 ---------- net/ceph/pagelist.c | 133 -------------------------- 10 files changed, 1 insertion(+), 365 deletions(-) delete mode 100644 include/linux/ceph/pagelist.h delete mode 100644 net/ceph/pagelist.c diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index 32c7b0f0d61f..451b92d99cf1 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -8,7 +8,6 @@ #include "super.h" #include "mds_client.h" #include -#include static u64 lock_secret; static int ceph_lock_wait_for_completion(struct ceph_mds_client *mdsc, diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index f1c6d0ebf548..26fa39d07ef0 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -21,7 +21,6 @@ #include #include #include -#include #include #include diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c index b083cd3b3974..de7b1c364bec 100644 --- a/fs/ceph/xattr.c +++ b/fs/ceph/xattr.c @@ -1,6 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 #include -#include #include "super.h" #include "mds_client.h" diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index ff0aea6d2d31..36896a71291c 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -119,7 +119,6 @@ enum ceph_msg_data_type { CEPH_MSG_DATA_NONE, /* message contains no data payload */ CEPH_MSG_DATA_DATABUF, /* data source/destination is a data buffer */ CEPH_MSG_DATA_PAGES, /* data source/destination is a page array */ - CEPH_MSG_DATA_PAGELIST, /* data source/destination is a pagelist */
CEPH_MSG_DATA_ITER, /* data source/destination is an iov_iter */ }; @@ -135,7 +134,6 @@ struct ceph_msg_data { unsigned int offset; /* first page */ bool own_pages; }; - struct ceph_pagelist *pagelist; }; }; @@ -152,10 +150,6 @@ struct ceph_msg_data_cursor { unsigned short page_index; /* index in array */ unsigned short page_count; /* pages in array */ }; - struct { /* pagelist */ - struct page *page; /* page from list */ - size_t offset; /* bytes from list */ - }; struct { struct iov_iter iov_iter; struct iov_iter crc_iter; @@ -511,8 +505,6 @@ extern bool ceph_con_keepalive_expired(struct ceph_connection *con, void ceph_msg_data_add_databuf(struct ceph_msg *msg, struct ceph_databuf *dbuf); void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages, size_t length, size_t offset, bool own_pages); -extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg, - struct ceph_pagelist *pagelist); void ceph_msg_data_add_iter(struct ceph_msg *msg, struct iov_iter *iter); diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index ce04205b8143..5a1ee66ca216 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -15,7 +15,6 @@ #include #include #include -#include #include struct ceph_msg; @@ -106,7 +105,6 @@ enum ceph_osd_data_type { CEPH_OSD_DATA_TYPE_NONE = 0, CEPH_OSD_DATA_TYPE_DATABUF, CEPH_OSD_DATA_TYPE_PAGES, - CEPH_OSD_DATA_TYPE_PAGELIST, CEPH_OSD_DATA_TYPE_ITER, }; @@ -122,7 +120,6 @@ struct ceph_osd_data { bool pages_from_pool; bool own_pages; }; - struct ceph_pagelist *pagelist; }; }; @@ -485,18 +482,12 @@ extern void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *, void osd_req_op_raw_data_in_databuf(struct ceph_osd_request *osd_req, unsigned int which, struct ceph_databuf *databuf); -extern void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *, - unsigned int which, - struct ceph_pagelist *pagelist); void osd_req_op_extent_osd_iter(struct ceph_osd_request *osd_req, unsigned int which, struct iov_iter *iter); void osd_req_op_cls_request_databuf(struct ceph_osd_request *req, unsigned int which, struct ceph_databuf *dbuf); -extern void osd_req_op_cls_request_data_pagelist(struct ceph_osd_request *, - unsigned int which, - struct ceph_pagelist *pagelist); void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req, unsigned int which, struct ceph_databuf *dbuf); diff --git a/include/linux/ceph/pagelist.h b/include/linux/ceph/pagelist.h deleted file mode 100644 index 879bec0863aa..000000000000 --- a/include/linux/ceph/pagelist.h +++ /dev/null @@ -1,60 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef __FS_CEPH_PAGELIST_H -#define __FS_CEPH_PAGELIST_H - -#include -#include -#include -#include - -struct ceph_pagelist { - struct list_head head; - void *mapped_tail; - size_t length; - size_t room; - struct list_head free_list; - size_t num_pages_free; - refcount_t refcnt; -}; - -struct ceph_pagelist *ceph_pagelist_alloc(gfp_t gfp_flags); - -extern void ceph_pagelist_release(struct ceph_pagelist *pl); - -extern int ceph_pagelist_append(struct ceph_pagelist *pl, const void *d, size_t l); - -extern int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space); - -extern int ceph_pagelist_free_reserve(struct ceph_pagelist *pl); - -static inline int ceph_pagelist_encode_64(struct ceph_pagelist *pl, u64 v) -{ - __le64 ev = cpu_to_le64(v); - return ceph_pagelist_append(pl, &ev, sizeof(ev)); -} -static inline int ceph_pagelist_encode_32(struct ceph_pagelist *pl, u32 v) -{ - __le32 ev = 
cpu_to_le32(v); - return ceph_pagelist_append(pl, &ev, sizeof(ev)); -} -static inline int ceph_pagelist_encode_16(struct ceph_pagelist *pl, u16 v) -{ - __le16 ev = cpu_to_le16(v); - return ceph_pagelist_append(pl, &ev, sizeof(ev)); -} -static inline int ceph_pagelist_encode_8(struct ceph_pagelist *pl, u8 v) -{ - return ceph_pagelist_append(pl, &v, 1); -} -static inline int ceph_pagelist_encode_string(struct ceph_pagelist *pl, - char *s, u32 len) -{ - int ret = ceph_pagelist_encode_32(pl, len); - if (ret) - return ret; - if (len) - return ceph_pagelist_append(pl, s, len); - return 0; -} - -#endif diff --git a/net/ceph/Makefile b/net/ceph/Makefile index 4b2e0b654e45..0c8787e2e733 100644 --- a/net/ceph/Makefile +++ b/net/ceph/Makefile @@ -4,7 +4,7 @@ # obj-$(CONFIG_CEPH_LIB) += libceph.o -libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \ +libceph-y := ceph_common.o messenger.o msgpool.o buffer.o \ mon_client.o decode.o \ cls_lock_client.o \ osd_client.o osdmap.o crush/crush.o crush/mapper.o crush/hash.o \ diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index cb66a768bd7c..4b20df1ab8e4 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -20,7 +20,6 @@ #include #include #include -#include #include /* @@ -775,87 +774,6 @@ static bool ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor, return true; } -/* - * For a pagelist, a piece is whatever remains to be consumed in the - * first page in the list, or the front of the next page. - */ -static void -ceph_msg_data_pagelist_cursor_init(struct ceph_msg_data_cursor *cursor, - size_t length) -{ - struct ceph_msg_data *data = cursor->data; - struct ceph_pagelist *pagelist; - struct page *page; - - BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST); - - pagelist = data->pagelist; - BUG_ON(!pagelist); - - if (!length) - return; /* pagelist can be assigned but empty */ - - BUG_ON(list_empty(&pagelist->head)); - page = list_first_entry(&pagelist->head, struct page, lru); - - cursor->resid = min(length, pagelist->length); - cursor->page = page; - cursor->offset = 0; -} - -static struct page * -ceph_msg_data_pagelist_next(struct ceph_msg_data_cursor *cursor, - size_t *page_offset, size_t *length) -{ - struct ceph_msg_data *data = cursor->data; - struct ceph_pagelist *pagelist; - - BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST); - - pagelist = data->pagelist; - BUG_ON(!pagelist); - - BUG_ON(!cursor->page); - BUG_ON(cursor->offset + cursor->resid != pagelist->length); - - /* offset of first page in pagelist is always 0 */ - *page_offset = cursor->offset & ~PAGE_MASK; - *length = min_t(size_t, cursor->resid, PAGE_SIZE - *page_offset); - return cursor->page; -} - -static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor, - size_t bytes) -{ - struct ceph_msg_data *data = cursor->data; - struct ceph_pagelist *pagelist; - - BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST); - - pagelist = data->pagelist; - BUG_ON(!pagelist); - - BUG_ON(cursor->offset + cursor->resid != pagelist->length); - BUG_ON((cursor->offset & ~PAGE_MASK) + bytes > PAGE_SIZE); - - /* Advance the cursor offset */ - - cursor->resid -= bytes; - cursor->offset += bytes; - /* offset of first page in pagelist is always 0 */ - if (!bytes || cursor->offset & ~PAGE_MASK) - return false; /* more bytes to process in the current page */ - - if (!cursor->resid) - return false; /* no more data */ - - /* Move on to the next page */ - - BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head)); - cursor->page = list_next_entry(cursor->page, lru); 
- return true; -} - static void ceph_msg_data_iter_cursor_init(struct ceph_msg_data_cursor *cursor, size_t length) { @@ -926,9 +844,6 @@ static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor) size_t length = cursor->total_resid; switch (cursor->data->type) { - case CEPH_MSG_DATA_PAGELIST: - ceph_msg_data_pagelist_cursor_init(cursor, length); - break; case CEPH_MSG_DATA_PAGES: ceph_msg_data_pages_cursor_init(cursor, length); break; @@ -969,9 +884,6 @@ struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor, struct page *page; switch (cursor->data->type) { - case CEPH_MSG_DATA_PAGELIST: - page = ceph_msg_data_pagelist_next(cursor, page_offset, length); - break; case CEPH_MSG_DATA_PAGES: page = ceph_msg_data_pages_next(cursor, page_offset, length); break; @@ -1003,9 +915,6 @@ void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor, size_t bytes) BUG_ON(bytes > cursor->resid); switch (cursor->data->type) { - case CEPH_MSG_DATA_PAGELIST: - new_piece = ceph_msg_data_pagelist_advance(cursor, bytes); - break; case CEPH_MSG_DATA_PAGES: new_piece = ceph_msg_data_pages_advance(cursor, bytes); break; @@ -1744,8 +1653,6 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data) } else if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) { int num_pages = calc_pages_for(data->offset, data->length); ceph_release_page_vector(data->pages, num_pages); - } else if (data->type == CEPH_MSG_DATA_PAGELIST) { - ceph_pagelist_release(data->pagelist); } } @@ -1784,23 +1691,6 @@ void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages, } EXPORT_SYMBOL(ceph_msg_data_add_pages); -void ceph_msg_data_add_pagelist(struct ceph_msg *msg, - struct ceph_pagelist *pagelist) -{ - struct ceph_msg_data *data; - - BUG_ON(!pagelist); - BUG_ON(!pagelist->length); - - data = ceph_msg_data_add(msg); - data->type = CEPH_MSG_DATA_PAGELIST; - refcount_inc(&pagelist->refcnt); - data->pagelist = pagelist; - - msg->data_length += pagelist->length; -} -EXPORT_SYMBOL(ceph_msg_data_add_pagelist); - void ceph_msg_data_add_iter(struct ceph_msg *msg, struct iov_iter *iter) { diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index a967309d01a7..0ac439e7e730 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -16,7 +16,6 @@ #include #include #include -#include #include #define OSD_OPREPLY_FRONT_LEN 512 @@ -138,16 +137,6 @@ static void ceph_osd_data_pages_init(struct ceph_osd_data *osd_data, osd_data->own_pages = own_pages; } -/* - * Consumes a ref on @pagelist. 
- */ -static void ceph_osd_data_pagelist_init(struct ceph_osd_data *osd_data, - struct ceph_pagelist *pagelist) -{ - osd_data->type = CEPH_OSD_DATA_TYPE_PAGELIST; - osd_data->pagelist = pagelist; -} - static void ceph_osd_iter_init(struct ceph_osd_data *osd_data, struct iov_iter *iter) { @@ -230,16 +219,6 @@ void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *osd_req, } EXPORT_SYMBOL(osd_req_op_extent_osd_data_pages); -void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *osd_req, - unsigned int which, struct ceph_pagelist *pagelist) -{ - struct ceph_osd_data *osd_data; - - osd_data = osd_req_op_data(osd_req, which, extent, osd_data); - ceph_osd_data_pagelist_init(osd_data, pagelist); -} -EXPORT_SYMBOL(osd_req_op_extent_osd_data_pagelist); - /** * osd_req_op_extent_osd_iter - Set up an operation with an iterator buffer * @osd_req: The request to set up @@ -281,19 +260,6 @@ void osd_req_op_cls_request_databuf(struct ceph_osd_request *osd_req, } EXPORT_SYMBOL(osd_req_op_cls_request_databuf); -void osd_req_op_cls_request_data_pagelist( - struct ceph_osd_request *osd_req, - unsigned int which, struct ceph_pagelist *pagelist) -{ - struct ceph_osd_data *osd_data; - - osd_data = osd_req_op_data(osd_req, which, cls, request_data); - ceph_osd_data_pagelist_init(osd_data, pagelist); - osd_req->r_ops[which].cls.indata_len += pagelist->length; - osd_req->r_ops[which].indata_len += pagelist->length; -} -EXPORT_SYMBOL(osd_req_op_cls_request_data_pagelist); - void osd_req_op_cls_response_databuf(struct ceph_osd_request *osd_req, unsigned int which, struct ceph_databuf *dbuf) @@ -316,8 +282,6 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data) return ceph_databuf_len(osd_data->dbuf); case CEPH_OSD_DATA_TYPE_PAGES: return osd_data->length; - case CEPH_OSD_DATA_TYPE_PAGELIST: - return (u64)osd_data->pagelist->length; case CEPH_OSD_DATA_TYPE_ITER: return iov_iter_count(&osd_data->iter); default: @@ -336,8 +300,6 @@ static void ceph_osd_data_release(struct ceph_osd_data *osd_data) num_pages = calc_pages_for((u64)osd_data->offset, (u64)osd_data->length); ceph_release_page_vector(osd_data->pages, num_pages); - } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) { - ceph_pagelist_release(osd_data->pagelist); } ceph_osd_data_init(osd_data); } @@ -913,9 +875,6 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg, if (length) ceph_msg_data_add_pages(msg, osd_data->pages, length, osd_data->offset, false); - } else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) { - BUG_ON(!length); - ceph_msg_data_add_pagelist(msg, osd_data->pagelist); } else if (osd_data->type == CEPH_OSD_DATA_TYPE_ITER) { ceph_msg_data_add_iter(msg, &osd_data->iter); } else { diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c deleted file mode 100644 index 5a9c4be5f222..000000000000 --- a/net/ceph/pagelist.c +++ /dev/null @@ -1,133 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -#include -#include -#include -#include -#include -#include - -struct ceph_pagelist *ceph_pagelist_alloc(gfp_t gfp_flags) -{ - struct ceph_pagelist *pl; - - pl = kmalloc(sizeof(*pl), gfp_flags); - if (!pl) - return NULL; - - INIT_LIST_HEAD(&pl->head); - pl->mapped_tail = NULL; - pl->length = 0; - pl->room = 0; - INIT_LIST_HEAD(&pl->free_list); - pl->num_pages_free = 0; - refcount_set(&pl->refcnt, 1); - - return pl; -} -EXPORT_SYMBOL(ceph_pagelist_alloc); - -static void ceph_pagelist_unmap_tail(struct ceph_pagelist *pl) -{ - if (pl->mapped_tail) { - struct page *page = list_entry(pl->head.prev, struct page, lru); - 
kunmap(page); - pl->mapped_tail = NULL; - } -} - -void ceph_pagelist_release(struct ceph_pagelist *pl) -{ - if (!refcount_dec_and_test(&pl->refcnt)) - return; - ceph_pagelist_unmap_tail(pl); - while (!list_empty(&pl->head)) { - struct page *page = list_first_entry(&pl->head, struct page, - lru); - list_del(&page->lru); - __free_page(page); - } - ceph_pagelist_free_reserve(pl); - kfree(pl); -} -EXPORT_SYMBOL(ceph_pagelist_release); - -static int ceph_pagelist_addpage(struct ceph_pagelist *pl) -{ - struct page *page; - - if (!pl->num_pages_free) { - page = __page_cache_alloc(GFP_NOFS); - } else { - page = list_first_entry(&pl->free_list, struct page, lru); - list_del(&page->lru); - --pl->num_pages_free; - } - if (!page) - return -ENOMEM; - pl->room += PAGE_SIZE; - ceph_pagelist_unmap_tail(pl); - list_add_tail(&page->lru, &pl->head); - pl->mapped_tail = kmap(page); - return 0; -} - -int ceph_pagelist_append(struct ceph_pagelist *pl, const void *buf, size_t len) -{ - while (pl->room < len) { - size_t bit = pl->room; - int ret; - - memcpy(pl->mapped_tail + (pl->length & ~PAGE_MASK), - buf, bit); - pl->length += bit; - pl->room -= bit; - buf += bit; - len -= bit; - ret = ceph_pagelist_addpage(pl); - if (ret) - return ret; - } - - memcpy(pl->mapped_tail + (pl->length & ~PAGE_MASK), buf, len); - pl->length += len; - pl->room -= len; - return 0; -} -EXPORT_SYMBOL(ceph_pagelist_append); - -/* Allocate enough pages for a pagelist to append the given amount - * of data without allocating. - * Returns: 0 on success, -ENOMEM on error. - */ -int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space) -{ - if (space <= pl->room) - return 0; - space -= pl->room; - space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT; /* conv to num pages */ - - while (space > pl->num_pages_free) { - struct page *page = __page_cache_alloc(GFP_NOFS); - if (!page) - return -ENOMEM; - list_add_tail(&page->lru, &pl->free_list); - ++pl->num_pages_free; - } - return 0; -} -EXPORT_SYMBOL(ceph_pagelist_reserve); - -/* Free any pages that have been preallocated. 
*/ -int ceph_pagelist_free_reserve(struct ceph_pagelist *pl) -{ - while (!list_empty(&pl->free_list)) { - struct page *page = list_first_entry(&pl->free_list, - struct page, lru); - list_del(&page->lru); - __free_page(page); - --pl->num_pages_free; - } - BUG_ON(pl->num_pages_free); - return 0; -} -EXPORT_SYMBOL(ceph_pagelist_free_reserve);

From patchwork Thu Mar 13 23:33:13 2025
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 21/35] libceph: Make notify code use ceph_databuf_enc_start/stop
Date: Thu, 13 Mar 2025 23:33:13 +0000
Message-ID: <20250313233341.1675324-22-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Make the ceph_osdc_notify*() functions use ceph_databuf_enc_start() and
ceph_databuf_enc_stop() when filling out the request data.  Also use
ceph_encode_*() rather than ceph_databuf_encode_*(), as the databuf
variants do an iterator copy to deal with page crossing and misalignment
(misalignment being something that the CPU will handle anyway on some
arches).

Signed-off-by: David Howells
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
--- net/ceph/osd_client.c | 55 +++++++++++++++++++++---------------------- 1 file changed, 27 insertions(+), 28 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 0ac439e7e730..1a0cb2cdcc52 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -4759,7 +4759,10 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which, { struct ceph_osd_req_op *op; struct ceph_databuf *dbuf; - int ret; + void *p; + + if (!payload) + payload_len = 0; op = osd_req_op_init(req, which, CEPH_OSD_OP_NOTIFY_ACK, 0); @@ -4767,18 +4770,13 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which, if (!dbuf) return -ENOMEM; - ret = ceph_databuf_encode_64(dbuf, notify_id); - ret |= ceph_databuf_encode_64(dbuf, cookie); - if (payload) { - ret |= ceph_databuf_encode_32(dbuf, payload_len); - ret |= ceph_databuf_append(dbuf, payload, payload_len); - } else { - ret |= ceph_databuf_encode_32(dbuf, 0); - } - if (ret) { - ceph_databuf_release(dbuf); - return -ENOMEM; - } + p = ceph_databuf_enc_start(dbuf); + ceph_encode_64(&p, notify_id); + ceph_encode_64(&p, cookie); + ceph_encode_32(&p, payload_len); + if (payload) + ceph_encode_copy(&p, payload, payload_len); + ceph_databuf_enc_stop(dbuf, p); ceph_osd_databuf_init(&op->notify_ack.request_data, dbuf); op->indata_len = ceph_databuf_len(dbuf); @@ -4840,8 +4838,12 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc, size_t *preply_len) { struct ceph_osd_linger_request *lreq; + void *p; int ret; + + if (WARN_ON_ONCE(payload_len > PAGE_SIZE - 3 * 4)) + return -EIO; + WARN_ON(!timeout); if (preply_pages) { *preply_pages = NULL; @@ -4852,20 +4854,19 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc, if (!lreq) return -ENOMEM; - lreq->request_pl = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO); + lreq->request_pl = ceph_databuf_req_alloc(0, 3 * 4 + payload_len, + GFP_NOIO); if (!lreq->request_pl) { ret = -ENOMEM; goto out_put_lreq; } - ret = ceph_databuf_encode_32(lreq->request_pl, 1); /* prot_ver */ - ret |= ceph_databuf_encode_32(lreq->request_pl, timeout); - ret |=
ceph_databuf_encode_32(lreq->request_pl, payload_len); - ret |= ceph_databuf_append(lreq->request_pl, payload, payload_len); - if (ret) { - ret = -ENOMEM; - goto out_put_lreq; - } + p = ceph_databuf_enc_start(lreq->request_pl); + ceph_encode_32(&p, 1); /* prot_ver */ + ceph_encode_32(&p, timeout); + ceph_encode_32(&p, payload_len); + ceph_encode_copy(&p, payload, payload_len); + ceph_databuf_enc_stop(lreq->request_pl, p); /* for notify_id */ lreq->notify_id_buf = ceph_databuf_reply_alloc(1, PAGE_SIZE, GFP_NOIO); @@ -5217,7 +5218,7 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req, { struct ceph_osd_req_op *op; struct ceph_databuf *dbuf; - void *p, *end; + void *p; dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_KERNEL); if (!dbuf) @@ -5230,15 +5231,13 @@ int osd_req_op_copy_from_init(struct ceph_osd_request *req, op->copy_from.flags = copy_from_flags; op->copy_from.src_fadvise_flags = src_fadvise_flags; - p = kmap_ceph_databuf_page(dbuf, 0); - end = p + PAGE_SIZE; + p = ceph_databuf_enc_start(dbuf); ceph_encode_string(&p, src_oid->name, src_oid->name_len); encode_oloc(&p, src_oloc); ceph_encode_32(&p, truncate_seq); ceph_encode_64(&p, truncate_size); - op->indata_len = PAGE_SIZE - (end - p); - ceph_databuf_added_data(dbuf, op->indata_len); - kunmap_local(p); + ceph_databuf_enc_stop(dbuf, p); + op->indata_len = ceph_databuf_len(dbuf); ceph_osd_databuf_init(&op->copy_from.osd_data, dbuf); return 0;

From patchwork Thu Mar 13 23:33:14 2025
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 22/35] libceph, rbd: Convert ceph_osdc_notify() reply to ceph_databuf
Date: Thu, 13 Mar 2025 23:33:14 +0000
Message-ID: <20250313233341.1675324-23-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Convert the reply buffer of ceph_osdc_notify() to ceph_databuf rather than
an array of pages.
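For illustration, a caller that used to get a page vector back now passes
in an empty databuf and decodes the reply from that instead.  A minimal
sketch, modelled on the converted rbd_request_lock() below (error handling
mostly elided):

	struct ceph_databuf *reply;
	void *p;

	/* Empty databuf; the reply pages are attached by the message
	 * receive path when the notify completes. */
	reply = ceph_databuf_reply_alloc(0, 0, GFP_KERNEL);
	if (!reply)
		return -ENOMEM;

	ret = ceph_osdc_notify(osdc, oid, oloc, buf, buf_size, timeout, reply);
	if (!ret && ceph_databuf_len(reply) > 0) {
		p = kmap_ceph_databuf_page(reply, 0);
		/* ... decode the reply from p ... */
		kunmap_local(p);
	}
	ceph_databuf_release(reply);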
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- drivers/block/rbd.c | 36 +++++++++++++++++---------- include/linux/ceph/databuf.h | 16 ++++++++++++ include/linux/ceph/osd_client.h | 7 ++---- net/ceph/osd_client.c | 44 +++++++++++---------------------- 4 files changed, 55 insertions(+), 48 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index eea12c7ab2a0..a2674077edea 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3585,8 +3585,7 @@ static void rbd_unlock(struct rbd_device *rbd_dev) static int __rbd_notify_op_lock(struct rbd_device *rbd_dev, enum rbd_notify_op notify_op, - struct page ***preply_pages, - size_t *preply_len) + struct ceph_databuf *reply) { struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; struct rbd_client_id cid = rbd_get_cid(rbd_dev); @@ -3604,13 +3603,13 @@ static int __rbd_notify_op_lock(struct rbd_device *rbd_dev, return ceph_osdc_notify(osdc, &rbd_dev->header_oid, &rbd_dev->header_oloc, buf, buf_size, - RBD_NOTIFY_TIMEOUT, preply_pages, preply_len); + RBD_NOTIFY_TIMEOUT, reply); } static void rbd_notify_op_lock(struct rbd_device *rbd_dev, enum rbd_notify_op notify_op) { - __rbd_notify_op_lock(rbd_dev, notify_op, NULL, NULL); + __rbd_notify_op_lock(rbd_dev, notify_op, NULL); } static void rbd_notify_acquired_lock(struct work_struct *work) @@ -3631,23 +3630,29 @@ static void rbd_notify_released_lock(struct work_struct *work) static int rbd_request_lock(struct rbd_device *rbd_dev) { - struct page **reply_pages; - size_t reply_len; + struct ceph_databuf *reply; bool lock_owner_responded = false; int ret; dout("%s rbd_dev %p\n", __func__, rbd_dev); - ret = __rbd_notify_op_lock(rbd_dev, RBD_NOTIFY_OP_REQUEST_LOCK, - &reply_pages, &reply_len); + /* The actual reply pages will be allocated in the read path and then + * pasted in in handle_watch_notify(). 
+ */ + reply = ceph_databuf_reply_alloc(0, 0, GFP_KERNEL); + if (!reply) + return -ENOMEM; + + ret = __rbd_notify_op_lock(rbd_dev, RBD_NOTIFY_OP_REQUEST_LOCK, reply); if (ret && ret != -ETIMEDOUT) { rbd_warn(rbd_dev, "failed to request lock: %d", ret); goto out; } - if (reply_len > 0 && reply_len <= PAGE_SIZE) { - void *p = page_address(reply_pages[0]); - void *const end = p + reply_len; + if (ceph_databuf_len(reply) > 0 && ceph_databuf_len(reply) <= PAGE_SIZE) { + void *s = kmap_ceph_databuf_page(reply, 0); + void *p = s; + void *const end = p + ceph_databuf_len(reply); u32 n; ceph_decode_32_safe(&p, end, n, e_inval); /* num_acks */ @@ -3659,10 +3664,12 @@ static int rbd_request_lock(struct rbd_device *rbd_dev) p += 8 + 8; /* skip gid and cookie */ ceph_decode_32_safe(&p, end, len, e_inval); - if (!len) + if (!len) { continue; + } if (lock_owner_responded) { + kunmap_local(s); rbd_warn(rbd_dev, "duplicate lock owners detected"); ret = -EIO; @@ -3673,6 +3680,7 @@ static int rbd_request_lock(struct rbd_device *rbd_dev) ret = ceph_start_decoding(&p, end, 1, "ResponseMessage", &struct_v, &len); if (ret) { + kunmap_local(s); rbd_warn(rbd_dev, "failed to decode ResponseMessage: %d", ret); @@ -3681,6 +3689,8 @@ static int rbd_request_lock(struct rbd_device *rbd_dev) ret = ceph_decode_32(&p); } + + kunmap_local(s); } if (!lock_owner_responded) { @@ -3689,7 +3699,7 @@ static int rbd_request_lock(struct rbd_device *rbd_dev) } out: - ceph_release_page_vector(reply_pages, calc_pages_for(0, reply_len)); + ceph_databuf_release(reply); return ret; e_inval: diff --git a/include/linux/ceph/databuf.h b/include/linux/ceph/databuf.h index 54b76d0c91a0..25154b3d08fa 100644 --- a/include/linux/ceph/databuf.h +++ b/include/linux/ceph/databuf.h @@ -150,4 +150,20 @@ static inline bool ceph_databuf_is_all_zero(struct ceph_databuf *dbuf, size_t co ceph_databuf_scan_for_nonzero) == count; } +static inline void ceph_databuf_transfer(struct ceph_databuf *to, + struct ceph_databuf *from) +{ + BUG_ON(to->nr_bvec || to->bvec); + to->bvec = from->bvec; + to->nr_bvec = from->nr_bvec; + to->max_bvec = from->max_bvec; + to->limit = from->limit; + to->iter = from->iter; + + from->bvec = NULL; + from->nr_bvec = from->max_bvec = 0; + from->limit = 0; + iov_iter_discard(&from->iter, ITER_DEST, 0); +} + #endif /* __FS_CEPH_DATABUF_H */ diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 5a1ee66ca216..7eff589711cc 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -333,9 +333,7 @@ struct ceph_osd_linger_request { struct ceph_databuf *request_pl; struct ceph_databuf *notify_id_buf; - - struct page ***preply_pages; - size_t *preply_len; + struct ceph_databuf *reply; }; struct ceph_watch_item { @@ -589,8 +587,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc, void *payload, u32 payload_len, u32 timeout, - struct page ***preply_pages, - size_t *preply_len); + struct ceph_databuf *reply); int ceph_osdc_list_watchers(struct ceph_osd_client *osdc, struct ceph_object_id *oid, struct ceph_object_locator *oloc, diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 1a0cb2cdcc52..92aaa5ed9145 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -4523,17 +4523,11 @@ static void handle_watch_notify(struct ceph_osd_client *osdc, dout("lreq %p notify_id %llu != %llu, ignoring\n", lreq, lreq->notify_id, notify_id); } else if (!completion_done(&lreq->notify_finish_wait)) { - struct ceph_msg_data *data = - msg->num_data_items ? 
&msg->data[0] : NULL; - - if (data) { - if (lreq->preply_pages) { - WARN_ON(data->type != - CEPH_MSG_DATA_PAGES); - *lreq->preply_pages = data->pages; - *lreq->preply_len = data->length; - data->own_pages = false; - } + if (msg->num_data_items && lreq->reply) { + struct ceph_msg_data *data = &msg->data[0]; + + WARN_ON(data->type != CEPH_MSG_DATA_DATABUF); + ceph_databuf_transfer(lreq->reply, data->dbuf); } lreq->notify_finish_error = return_code; complete_all(&lreq->notify_finish_wait); @@ -4823,10 +4817,7 @@ EXPORT_SYMBOL(ceph_osdc_notify_ack); /* * @timeout: in seconds * - * @preply_{pages,len} are initialized both on success and error. - * The caller is responsible for: - * - * ceph_release_page_vector(reply_pages, calc_pages_for(0, reply_len)) + * @reply should be an empty ceph_databuf. */ int ceph_osdc_notify(struct ceph_osd_client *osdc, struct ceph_object_id *oid, @@ -4834,8 +4825,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc, void *payload, u32 payload_len, u32 timeout, - struct page ***preply_pages, - size_t *preply_len) + struct ceph_databuf *reply) { struct ceph_osd_linger_request *lreq; void *p; @@ -4845,10 +4835,6 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc, return -EIO; WARN_ON(!timeout); - if (preply_pages) { - *preply_pages = NULL; - *preply_len = 0; - } lreq = linger_alloc(osdc); if (!lreq) @@ -4875,8 +4861,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc, goto out_put_lreq; } - lreq->preply_pages = preply_pages; - lreq->preply_len = preply_len; + lreq->reply = reply; ceph_oid_copy(&lreq->t.base_oid, oid); ceph_oloc_copy(&lreq->t.base_oloc, oloc); @@ -5383,7 +5368,7 @@ static struct ceph_msg *get_reply(struct ceph_connection *con, return m; } -static struct ceph_msg *alloc_msg_with_page_vector(struct ceph_msg_header *hdr) +static struct ceph_msg *alloc_msg_with_data_buffer(struct ceph_msg_header *hdr) { struct ceph_msg *m; int type = le16_to_cpu(hdr->type); @@ -5395,16 +5380,15 @@ static struct ceph_msg *alloc_msg_with_page_vector(struct ceph_msg_header *hdr) return NULL; if (data_len) { - struct page **pages; + struct ceph_databuf *dbuf; - pages = ceph_alloc_page_vector(calc_pages_for(0, data_len), - GFP_NOIO); - if (IS_ERR(pages)) { + dbuf = ceph_databuf_reply_alloc(0, data_len, GFP_NOIO); + if (!dbuf) { ceph_msg_put(m); return NULL; } - ceph_msg_data_add_pages(m, pages, data_len, 0, true); + ceph_msg_data_add_databuf(m, dbuf); } return m; @@ -5422,7 +5406,7 @@ static struct ceph_msg *osd_alloc_msg(struct ceph_connection *con, case CEPH_MSG_OSD_MAP: case CEPH_MSG_OSD_BACKOFF: case CEPH_MSG_WATCH_NOTIFY: - return alloc_msg_with_page_vector(hdr); + return alloc_msg_with_data_buffer(hdr); case CEPH_MSG_OSD_OPREPLY: return get_reply(con, hdr, skip); default:

From patchwork Thu Mar 13 23:33:15 2025
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 23/35] rbd: Use ceph_databuf_enc_start/stop()
Date: Thu, 13 Mar 2025 23:33:15 +0000
Message-ID: <20250313233341.1675324-24-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Make rbd use ceph_databuf_enc_start() and ceph_databuf_enc_stop() when filling out the request
data. Also use ceph_encode_*() rather than ceph_databuf_encode_*() as the latter will do an iterator copy to deal with page crossing and misalignment (the latter being something that the CPU will handle on some arches). Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- drivers/block/rbd.c | 64 ++++++++++++++++++++++----------------------- 1 file changed, 31 insertions(+), 33 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index a2674077edea..956fc4a8f1da 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1970,19 +1970,19 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req, int which, u64 objno, u8 new_state, const u8 *current_state) { - struct ceph_databuf *dbuf; - void *p, *start; + struct ceph_databuf *request; + void *p; int ret; ret = osd_req_op_cls_init(req, which, "rbd", "object_map_update"); if (ret) return ret; - dbuf = ceph_databuf_req_alloc(1, PAGE_SIZE, GFP_NOIO); - if (!dbuf) + request = ceph_databuf_req_alloc(1, 8 * 2 + 3 * 1, GFP_NOIO); + if (!request) return -ENOMEM; - p = start = kmap_ceph_databuf_page(dbuf, 0); + p = ceph_databuf_enc_start(request); ceph_encode_64(&p, objno); ceph_encode_64(&p, objno + 1); ceph_encode_8(&p, new_state); @@ -1992,10 +1992,9 @@ static int rbd_cls_object_map_update(struct ceph_osd_request *req, } else { ceph_encode_8(&p, 0); } - kunmap_local(p); - ceph_databuf_added_data(dbuf, p - start); + ceph_databuf_enc_stop(request, p); - osd_req_op_cls_request_databuf(req, which, dbuf); + osd_req_op_cls_request_databuf(req, which, request); return 0; } @@ -2108,7 +2107,7 @@ static int rbd_obj_calc_img_extents(struct rbd_obj_request *obj_req, static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which) { - struct ceph_databuf *dbuf; + struct ceph_databuf *request; /* * The response data for a STAT call consists of: @@ -2118,12 +2117,12 @@ static int rbd_osd_setup_stat(struct ceph_osd_request *osd_req, int which) * le32 tv_nsec; * } mtime; */ - dbuf = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO); - if (!dbuf) + request = ceph_databuf_reply_alloc(1, 8 + sizeof(struct ceph_timespec), GFP_NOIO); + if (!request) return -ENOMEM; osd_req_op_init(osd_req, which, CEPH_OSD_OP_STAT, 0); - osd_req_op_raw_data_in_databuf(osd_req, which, dbuf); + osd_req_op_raw_data_in_databuf(osd_req, which, request); return 0; } @@ -2964,16 +2963,16 @@ static int rbd_obj_copyup_current_snapc(struct rbd_obj_request *obj_req, static int setup_copyup_buf(struct rbd_obj_request *obj_req, u64 obj_overlap) { - struct ceph_databuf *dbuf; + struct ceph_databuf *request; rbd_assert(!obj_req->copyup_buf); - dbuf = ceph_databuf_req_alloc(calc_pages_for(0, obj_overlap), + request = ceph_databuf_req_alloc(calc_pages_for(0, obj_overlap), obj_overlap, GFP_NOIO); - if (!dbuf) + if (!request) return -ENOMEM; - obj_req->copyup_buf = dbuf; + obj_req->copyup_buf = request; return 0; } @@ -4580,10 +4579,9 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev, if (!request) return -ENOMEM; - p = kmap_ceph_databuf_page(request, 0); - memcpy(p, outbound, outbound_size); - kunmap_local(p); - ceph_databuf_added_data(request, outbound_size); + p = ceph_databuf_enc_start(request); + ceph_encode_copy(&p, outbound, outbound_size); + ceph_databuf_enc_stop(request, p); } reply = ceph_databuf_reply_alloc(1, inbound_size, GFP_KERNEL); @@ -4712,7 +4710,7 @@ static void rbd_free_disk(struct rbd_device *rbd_dev) static 
int rbd_obj_read_sync(struct rbd_device *rbd_dev, struct ceph_object_id *oid, struct ceph_object_locator *oloc, - struct ceph_databuf *dbuf, int len) + struct ceph_databuf *request, int len) { struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc; struct ceph_osd_request *req; @@ -4727,7 +4725,7 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev, req->r_flags = CEPH_OSD_FLAG_READ; osd_req_op_extent_init(req, 0, CEPH_OSD_OP_READ, 0, len, 0, 0); - osd_req_op_extent_osd_databuf(req, 0, dbuf); + osd_req_op_extent_osd_databuf(req, 0, request); ret = ceph_osdc_alloc_messages(req, GFP_KERNEL); if (ret) @@ -4750,16 +4748,16 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev, bool first_time) { struct rbd_image_header_ondisk *ondisk; - struct ceph_databuf *dbuf = NULL; + struct ceph_databuf *request = NULL; u32 snap_count = 0; u64 names_size = 0; u32 want_count; int ret; - dbuf = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL); - if (!dbuf) + request = ceph_databuf_req_alloc(1, sizeof(*ondisk), GFP_KERNEL); + if (!request) return -ENOMEM; - ondisk = kmap_ceph_databuf_page(dbuf, 0); + ondisk = kmap_ceph_databuf_page(request, 0); /* * The complete header will include an array of its 64-bit @@ -4776,13 +4774,13 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev, size += names_size; ret = -ENOMEM; - if (size > dbuf->limit && - ceph_databuf_reserve(dbuf, size - dbuf->limit, + if (size > request->limit && + ceph_databuf_reserve(request, size - request->limit, GFP_KERNEL) < 0) goto out; ret = rbd_obj_read_sync(rbd_dev, &rbd_dev->header_oid, - &rbd_dev->header_oloc, dbuf, size); + &rbd_dev->header_oloc, request, size); if (ret < 0) goto out; if ((size_t)ret < size) { @@ -4806,7 +4804,7 @@ static int rbd_dev_v1_header_info(struct rbd_device *rbd_dev, ret = rbd_header_from_disk(header, ondisk, first_time); out: kunmap_local(ondisk); - ceph_databuf_release(dbuf); + ceph_databuf_release(request); return ret; } @@ -5625,10 +5623,10 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev, if (!reply) goto out_free; - p = kmap_ceph_databuf_page(request, 0); + p = ceph_databuf_enc_start(request); ceph_encode_64(&p, rbd_dev->spec->snap_id); - kunmap_local(p); - ceph_databuf_added_data(request, sizeof(__le64)); + ceph_databuf_enc_stop(request, p); + ret = __get_parent_info(rbd_dev, request, reply, pii); if (ret > 0) ret = __get_parent_info_legacy(rbd_dev, request, reply, pii);

From patchwork Thu Mar 13 23:33:16 2025
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 24/35] ceph: Make ceph_calc_file_object_mapping() return size as size_t
Date: Thu, 13 Mar 2025 23:33:16 +0000
Message-ID: <20250313233341.1675324-25-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Make ceph_calc_file_object_mapping() return the size as a size_t.
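For example, call sites now declare the extent length as a size_t and
print it with %zu rather than %u.  A minimal sketch, mirroring the
converted callers in this patch (ci here being the ceph_inode_info of the
file):

	u64 objno, objoff;
	size_t xlen;	/* was u32 */

	ceph_calc_file_object_mapping(&ci->i_layout, off, len,
				      &objno, &objoff, &xlen);
	pr_debug("objno %llu %llu~%zu\n", objno, objoff, xlen);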
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- fs/ceph/addr.c | 4 ++-- fs/ceph/crypto.c | 2 +- fs/ceph/file.c | 9 ++++----- fs/ceph/ioctl.c | 2 +- include/linux/ceph/striper.h | 6 +++--- net/ceph/osd_client.c | 2 +- net/ceph/striper.c | 4 ++-- 7 files changed, 14 insertions(+), 15 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 482a9f41a685..7c89cafcb91a 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -335,8 +335,8 @@ static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq) struct inode *inode = rreq->inode; struct ceph_inode_info *ci = ceph_inode(inode); struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); + size_t xlen; u64 objno, objoff; - u32 xlen; /* Truncate the extent at the end of the current block */ ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len, @@ -1205,9 +1205,9 @@ void ceph_allocate_page_array(struct address_space *mapping, { struct inode *inode = mapping->host; struct ceph_inode_info *ci = ceph_inode(inode); + size_t xlen; u64 objnum; u64 objoff; - u32 xlen; /* prepare async write request */ ceph_wbc->offset = (u64)folio_pos(folio); diff --git a/fs/ceph/crypto.c b/fs/ceph/crypto.c index 3b3c4d8d401e..a28dea74ca6f 100644 --- a/fs/ceph/crypto.c +++ b/fs/ceph/crypto.c @@ -594,8 +594,8 @@ int ceph_fscrypt_decrypt_extents(struct inode *inode, struct page **page, struct ceph_client *cl = ceph_inode_to_client(inode); int i, ret = 0; struct ceph_inode_info *ci = ceph_inode(inode); + size_t xlen; u64 objno, objoff; - u32 xlen; /* Nothing to do for empty array */ if (ext_cnt == 0) { diff --git a/fs/ceph/file.c b/fs/ceph/file.c index fb4024bc8274..ffd36e00b0de 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -1731,12 +1731,11 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos, u64 write_pos = pos; u64 write_len = len; u64 objnum, objoff; - u32 xlen; u64 assert_ver = 0; bool rmw; bool first, last; struct iov_iter saved_iter = *from; - size_t off; + size_t off, xlen; ceph_fscrypt_adjust_off_and_len(inode, &write_pos, &write_len); @@ -2870,8 +2869,8 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off struct ceph_osd_client *osdc; struct ceph_osd_request *req; size_t bytes = 0; + size_t src_objlen, dst_objlen; u64 src_objnum, src_objoff, dst_objnum, dst_objoff; - u32 src_objlen, dst_objlen; u32 object_size = src_ci->i_layout.object_size; struct ceph_client *cl = fsc->client; int ret; @@ -2948,8 +2947,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, struct ceph_client *cl = src_fsc->client; loff_t size; ssize_t ret = -EIO, bytes; + size_t src_objlen, dst_objlen; u64 src_objnum, dst_objnum, src_objoff, dst_objoff; - u32 src_objlen, dst_objlen; int src_got = 0, dst_got = 0, err, dirty; if (src_inode->i_sb != dst_inode->i_sb) { @@ -3060,7 +3059,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, * starting at the src_off */ if (src_objoff) { - doutc(cl, "Initial partial copy of %u bytes\n", src_objlen); + doutc(cl, "Initial partial copy of %zu bytes\n", src_objlen); /* * we need to temporarily drop all caps as we'll be calling diff --git a/fs/ceph/ioctl.c b/fs/ceph/ioctl.c index e861de3c79b9..fab0e89ad7b4 100644 --- a/fs/ceph/ioctl.c +++ b/fs/ceph/ioctl.c @@ -186,7 +186,7 @@ static long ceph_ioctl_get_dataloc(struct file *file, void __user *arg) &ceph_sb_to_fs_client(inode->i_sb)->client->osdc; struct 
ceph_object_locator oloc; CEPH_DEFINE_OID_ONSTACK(oid); - u32 xlen; + size_t xlen; u64 tmp; struct ceph_pg pgid; int r; diff --git a/include/linux/ceph/striper.h b/include/linux/ceph/striper.h index 50bc1b88c5c4..e1036e953d7b 100644 --- a/include/linux/ceph/striper.h +++ b/include/linux/ceph/striper.h @@ -10,7 +10,7 @@ struct ceph_file_layout; void ceph_calc_file_object_mapping(struct ceph_file_layout *l, u64 off, u64 len, - u64 *objno, u64 *objoff, u32 *xlen); + u64 *objno, u64 *objoff, size_t *xlen); struct ceph_object_extent { struct list_head oe_item; @@ -97,14 +97,14 @@ int ceph_iterate_extents(struct ceph_file_layout *l, u64 off, u64 len, while (len) { struct ceph_object_extent *ex; u64 objno, objoff; - u32 xlen; + size_t xlen; ceph_calc_file_object_mapping(l, off, len, &objno, &objoff, &xlen); ex = ceph_lookup_containing(object_extents, objno, objoff, xlen); if (!ex) { - WARN(1, "%s: objno %llu %llu~%u not found!\n", + WARN(1, "%s: objno %llu %llu~%zu not found!\n", __func__, objno, objoff, xlen); return -EINVAL; } diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 92aaa5ed9145..f943d4e85a13 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -100,7 +100,7 @@ static int calc_layout(struct ceph_file_layout *layout, u64 off, u64 *plen, u64 *objnum, u64 *objoff, u64 *objlen) { u64 orig_len = *plen; - u32 xlen; + size_t xlen; /* object extent? */ ceph_calc_file_object_mapping(layout, off, orig_len, objnum, diff --git a/net/ceph/striper.c b/net/ceph/striper.c index 3dedbf018fa6..c934c9addc9d 100644 --- a/net/ceph/striper.c +++ b/net/ceph/striper.c @@ -23,7 +23,7 @@ */ void ceph_calc_file_object_mapping(struct ceph_file_layout *l, u64 off, u64 len, - u64 *objno, u64 *objoff, u32 *xlen) + u64 *objno, u64 *objoff, size_t *xlen) { u32 stripes_per_object = l->object_size / l->stripe_unit; u64 blockno; /* which su in the file (i.e. 
globally) */ @@ -100,7 +100,7 @@ int ceph_file_to_extents(struct ceph_file_layout *l, u64 off, u64 len, while (len) { struct list_head *add_pos = NULL; u64 objno, objoff; - u32 xlen; + size_t xlen; ceph_calc_file_object_mapping(l, off, len, &objno, &objoff, &xlen);

From patchwork Thu Mar 13 23:33:17 2025
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 25/35] ceph: Wrap POSIX_FADV_WILLNEED to get caps
Date: Thu, 13 Mar 2025 23:33:17 +0000
Message-ID: <20250313233341.1675324-26-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Wrap the handling of fadvise(POSIX_FADV_WILLNEED) so that we get the
appropriate caps needed to do it.

Signed-off-by: David Howells
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
--- fs/ceph/file.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/fs/ceph/file.c b/fs/ceph/file.c index ffd36e00b0de..b876cecbaba5 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "super.h" #include "mds_client.h" @@ -3150,6 +3151,49 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off, return ret; } +/* + * If the user wants to manually trigger readahead, we have to get a cap to + * allow that. + */ +static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice) +{ + struct inode *inode = file_inode(file); + struct ceph_file_info *fi = file->private_data; + struct ceph_client *cl = ceph_inode_to_client(inode); + int want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO, got = 0; + int ret; + + if (advice != POSIX_FADV_WILLNEED) + return generic_fadvise(file, offset, len, advice); + + if (!(fi->flags & CEPH_F_SYNC)) + return -EACCES; + if (fi->fmode & CEPH_FILE_MODE_LAZY) + return -EACCES; + + ret = ceph_get_caps(file, CEPH_CAP_FILE_RD, want, -1, &got); + if (ret < 0) { + doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode)); + goto out; + } + + if ((got & want) == want) { + doutc(cl, "fadvise(WILLNEED) %p %llx.%llx %llu~%llu got cap refs on %s\n", + inode, ceph_vinop(inode), offset, len, + ceph_cap_string(got)); + ret = generic_fadvise(file, offset, len, advice); + } else { + doutc(cl, "%llx.%llx, no cache cap\n", ceph_vinop(inode)); + ret = -EACCES; + } + + doutc(cl, "%p %llx.%llx dropping cap refs on %s = %d\n", + inode, ceph_vinop(inode), ceph_cap_string(got), ret); + ceph_put_cap_refs(ceph_inode(inode), got); +out: + return ret; +} + const struct file_operations ceph_file_fops = { .open = ceph_open, .release = ceph_release, @@ -3167,4 +3211,5 @@ const struct file_operations ceph_file_fops = { .compat_ioctl = compat_ptr_ioctl, .fallocate = ceph_fallocate, .copy_file_range = ceph_copy_file_range, + .fadvise = ceph_fadvise, };

From patchwork Thu Mar 13 23:33:18 2025
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang, ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 26/35] ceph: Kill ceph_rw_context
Date: Thu, 13 Mar 2025 23:33:18 +0000
Message-ID: <20250313233341.1675324-27-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

With all invokers of readahead:

 - read() and co.
 - splice()
 - fadvise(POSIX_FADV_WILLNEED)
 - madvise(MADV_WILLNEED)
 - fault-in

now getting the FILE_CACHE cap or the LAZYIO cap and holding it across the
readahead invocation, there's no need for the ceph_rw_context.  It can be
assumed that we have one or the other cap - and apparently it doesn't
matter which, as we don't actually check rw_ctx->caps.

Signed-off-by: David Howells
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
--- fs/ceph/addr.c | 19 +++++++------------ fs/ceph/file.c | 13 +------------ fs/ceph/super.h | 47 ----------------------------------------------- 3 files changed, 8 insertions(+), 71 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 7c89cafcb91a..27f27ab24446 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -473,18 +473,16 @@ static int ceph_init_request(struct netfs_io_request *rreq, struct file *file) if (!priv) return -ENOMEM; + /* + * If we are doing readahead triggered by a read, fault-in or + * MADV/FADV_WILLNEED, someone higher up the stack must be holding the + * FILE_CACHE and/or LAZYIO caps. + */ if (file) { - struct ceph_rw_context *rw_ctx; - struct ceph_file_info *fi = file->private_data; - priv->file_ra_pages = file->f_ra.ra_pages; priv->file_ra_disabled = file->f_mode & FMODE_RANDOM; - - rw_ctx = ceph_find_rw_context(fi); - if (rw_ctx) { - rreq->netfs_priv = priv; - return 0; - } + rreq->netfs_priv = priv; + return 0; } /* @@ -1982,10 +1980,7 @@ static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf) if ((got & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO)) || !ceph_has_inline_data(ci)) { - CEPH_DEFINE_RW_CONTEXT(rw_ctx, got); - ceph_add_rw_context(fi, &rw_ctx); ret = filemap_fault(vmf); - ceph_del_rw_context(fi, &rw_ctx); doutc(cl, "%llx.%llx %llu drop cap refs %s ret %x\n", ceph_vinop(inode), off, ceph_cap_string(got), ret); } else diff --git a/fs/ceph/file.c b/fs/ceph/file.c index b876cecbaba5..4512215cccc6 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -229,8 +229,6 @@ static int ceph_init_file_info(struct inode *inode, struct file *file, ceph_get_fmode(ci, fmode, 1); fi->fmode = fmode; - spin_lock_init(&fi->rw_contexts_lock); - INIT_LIST_HEAD(&fi->rw_contexts); fi->filp_gen = READ_ONCE(ceph_inode_to_fs_client(inode)->filp_gen); if ((file->f_mode & FMODE_WRITE) && ceph_has_inline_data(ci)) { @@ -999,7 +997,6 @@ int ceph_release(struct inode *inode, struct file *file) struct ceph_dir_file_info *dfi = file->private_data; doutc(cl, "%p %llx.%llx dir file %p\n", inode, ceph_vinop(inode), file); - WARN_ON(!list_empty(&dfi->file_info.rw_contexts)); ceph_put_fmode(ci, dfi->file_info.fmode, 1); @@ -1012,7 +1009,6 @@ int ceph_release(struct inode *inode, struct file *file) struct ceph_file_info *fi = file->private_data; doutc(cl, "%p %llx.%llx regular file %p\n", inode, ceph_vinop(inode), file); - WARN_ON(!list_empty(&fi->rw_contexts)); ceph_fscache_unuse_cookie(inode, file->f_mode & FMODE_WRITE); ceph_put_fmode(ci, fi->fmode, 1); @@ -2154,13 +2150,10 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to) retry_op = READ_INLINE; } } else { - CEPH_DEFINE_RW_CONTEXT(rw_ctx, got); doutc(cl, "async %p %llx.%llx %llu~%u got cap refs on %s\n", inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
ceph_cap_string(got)); - ceph_add_rw_context(fi, &rw_ctx); ret = generic_file_read_iter(iocb, to); - ceph_del_rw_context(fi, &rw_ctx); } doutc(cl, "%p %llx.%llx dropping cap refs on %s = %d\n", @@ -2256,7 +2249,6 @@ static ssize_t ceph_splice_read(struct file *in, loff_t *ppos, struct ceph_inode_info *ci = ceph_inode(inode); ssize_t ret; int want = 0, got = 0; - CEPH_DEFINE_RW_CONTEXT(rw_ctx, 0); dout("splice_read %p %llx.%llx %llu~%zu trying to get caps on %p\n", inode, ceph_vinop(inode), *ppos, len, inode); @@ -2291,10 +2283,7 @@ static ssize_t ceph_splice_read(struct file *in, loff_t *ppos, dout("splice_read %p %llx.%llx %llu~%zu got cap refs on %s\n", inode, ceph_vinop(inode), *ppos, len, ceph_cap_string(got)); - rw_ctx.caps = got; - ceph_add_rw_context(fi, &rw_ctx); ret = filemap_splice_read(in, ppos, pipe, len, flags); - ceph_del_rw_context(fi, &rw_ctx); dout("splice_read %p %llx.%llx dropping cap refs on %s = %zd\n", inode, ceph_vinop(inode), ceph_cap_string(got), ret); @@ -3177,7 +3166,7 @@ static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice goto out; } - if ((got & want) == want) { + if (got & want) { doutc(cl, "fadvise(WILLNEED) %p %llx.%llx %llu~%llu got cap refs on %s\n", inode, ceph_vinop(inode), offset, len, ceph_cap_string(got)); diff --git a/fs/ceph/super.h b/fs/ceph/super.h index b072572e2cf4..14784ad86670 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -833,10 +833,6 @@ extern void change_auth_cap_ses(struct ceph_inode_info *ci, struct ceph_file_info { short fmode; /* initialized on open */ short flags; /* CEPH_F_* */ - - spinlock_t rw_contexts_lock; - struct list_head rw_contexts; - u32 filp_gen; }; @@ -859,49 +855,6 @@ struct ceph_dir_file_info { int dir_info_len; }; -struct ceph_rw_context { - struct list_head list; - struct task_struct *thread; - int caps; -}; - -#define CEPH_DEFINE_RW_CONTEXT(_name, _caps) \ - struct ceph_rw_context _name = { \ - .thread = current, \ - .caps = _caps, \ - } - -static inline void ceph_add_rw_context(struct ceph_file_info *cf, - struct ceph_rw_context *ctx) -{ - spin_lock(&cf->rw_contexts_lock); - list_add(&ctx->list, &cf->rw_contexts); - spin_unlock(&cf->rw_contexts_lock); -} - -static inline void ceph_del_rw_context(struct ceph_file_info *cf, - struct ceph_rw_context *ctx) -{ - spin_lock(&cf->rw_contexts_lock); - list_del(&ctx->list); - spin_unlock(&cf->rw_contexts_lock); -} - -static inline struct ceph_rw_context* -ceph_find_rw_context(struct ceph_file_info *cf) -{ - struct ceph_rw_context *ctx, *found = NULL; - spin_lock(&cf->rw_contexts_lock); - list_for_each_entry(ctx, &cf->rw_contexts, list) { - if (ctx->thread == current) { - found = ctx; - break; - } - } - spin_unlock(&cf->rw_contexts_lock); - return found; -} - struct ceph_readdir_cache_control { struct folio *folio; struct dentry **dentries; From patchwork Thu Mar 13 23:33:19 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016075 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 14EA920B1FD for ; Thu, 13 Mar 2025 23:35:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1741908937; cv=none; 
From patchwork Thu Mar 13 23:33:19 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016075
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Xiubo Li
Subject: [RFC PATCH 27/35] netfs: Pass extra write context to write functions
Date: Thu, 13 Mar 2025 23:33:19 +0000
Message-ID: <20250313233341.1675324-28-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Allow the filesystem to pass in an extra bit of context to certain write
functions so that netfs_page_mkwrite() and netfs_perform_write() can pass it
back to the filesystem's ->post_modify() function.  This can be used by ceph
to pass in a preallocated ceph_cap_flush record.

Signed-off-by: David Howells
cc: Jeff Layton
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Xiubo Li
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/9p/vfs_file.c          |  2 +-
 fs/afs/write.c            |  2 +-
 fs/netfs/buffered_write.c | 21 ++++++++++++---------
 fs/smb/client/file.c      |  4 ++--
 include/linux/netfs.h     |  9 +++++----
 5 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 348cc90bf9c5..838332d5372c 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -477,7 +477,7 @@ v9fs_file_mmap(struct file *filp, struct vm_area_struct *vma)
 static vm_fault_t
 v9fs_vm_page_mkwrite(struct vm_fault *vmf)
 {
-	return netfs_page_mkwrite(vmf, NULL);
+	return netfs_page_mkwrite(vmf, NULL, NULL);
 }

 static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 18b0a9f1615e..054f3a07d2a5 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -276,7 +276,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
 	if (afs_validate(AFS_FS_I(file_inode(file)), afs_file_key(file)) < 0)
 		return VM_FAULT_SIGBUS;

-	return netfs_page_mkwrite(vmf, NULL);
+	return netfs_page_mkwrite(vmf, NULL, NULL);
 }

 /*
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index f3370846ba18..0245449b93e3 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -86,7 +86,8 @@ static void netfs_update_i_size(struct netfs_inode *ctx, struct inode *inode,
  * netfs_perform_write - Copy data into the pagecache.
  * @iocb: The operation parameters
  * @iter: The source buffer
- * @netfs_group: Grouping for dirty folios (eg. ceph snaps).
+ * @netfs_group: Grouping for dirty folios (eg. ceph snaps)
+ * @fs_priv: Private data to be passed to ->post_modify()
  *
  * Copy data into pagecache folios attached to the inode specified by @iocb.
  * The caller must hold appropriate inode locks.
@@ -97,7 +98,7 @@ static void netfs_update_i_size(struct netfs_inode *ctx, struct inode *inode,
  * a new one is started.
  */
 ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
-			    struct netfs_group *netfs_group)
+			    struct netfs_group *netfs_group, void *fs_priv)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
@@ -382,7 +383,7 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 		 */
 		set_bit(NETFS_ICTX_MODIFIED_ATTR, &ctx->flags);
 		if (unlikely(ctx->ops->post_modify))
-			ctx->ops->post_modify(inode);
+			ctx->ops->post_modify(inode, fs_priv);
 	}

 	if (unlikely(wreq)) {
@@ -411,7 +412,8 @@ EXPORT_SYMBOL(netfs_perform_write);
 * netfs_buffered_write_iter_locked - write data to a file
 * @iocb: IO state structure (file, offset, etc.)
 * @from: iov_iter with data to write
- * @netfs_group: Grouping for dirty folios (eg. ceph snaps).
+ * @netfs_group: Grouping for dirty folios (eg. ceph snaps)
+ * @fs_priv: Private data to be passed to ->post_modify()
 *
 * This function does all the work needed for actually writing data to a
 * file.  It does all basic checks, removes SUID from the file, updates
@@ -431,7 +433,7 @@ EXPORT_SYMBOL(netfs_perform_write);
 * * negative error code if no data has been written at all
 */
 ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
-					 struct netfs_group *netfs_group)
+					 struct netfs_group *netfs_group, void *fs_priv)
 {
 	struct file *file = iocb->ki_filp;
 	ssize_t ret;
@@ -446,7 +448,7 @@ ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *fr
 	if (ret)
 		return ret;

-	return netfs_perform_write(iocb, from, netfs_group);
+	return netfs_perform_write(iocb, from, netfs_group, fs_priv);
 }
 EXPORT_SYMBOL(netfs_buffered_write_iter_locked);

@@ -485,7 +487,7 @@ ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)

 	ret = generic_write_checks(iocb, from);
 	if (ret > 0)
-		ret = netfs_buffered_write_iter_locked(iocb, from, NULL);
+		ret = netfs_buffered_write_iter_locked(iocb, from, NULL, NULL);
 	netfs_end_io_write(inode);
 	if (ret > 0)
 		ret = generic_write_sync(iocb, ret);
@@ -499,7 +501,8 @@ EXPORT_SYMBOL(netfs_file_write_iter);
 * we only track group on a per-folio basis, so we block more often than
 * we might otherwise.
 */
-vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group)
+vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group,
+			      void *fs_priv)
 {
 	struct netfs_group *group;
 	struct folio *folio = page_folio(vmf->page);
@@ -554,7 +557,7 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_gr
 	file_update_time(file);
 	set_bit(NETFS_ICTX_MODIFIED_ATTR, &ictx->flags);
 	if (ictx->ops->post_modify)
-		ictx->ops->post_modify(inode);
+		ictx->ops->post_modify(inode, fs_priv);
 	ret = VM_FAULT_LOCKED;
 out:
 	sb_end_pagefault(inode->i_sb);
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index 8582cf61242c..4329c2bbf74f 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -2779,7 +2779,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
 		goto out;
 	}

-	rc = netfs_buffered_write_iter_locked(iocb, from, NULL);
+	rc = netfs_buffered_write_iter_locked(iocb, from, NULL, NULL);

 out:
 	up_read(&cinode->lock_sem);
@@ -2955,7 +2955,7 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 static vm_fault_t cifs_page_mkwrite(struct vm_fault *vmf)
 {
-	return netfs_page_mkwrite(vmf, NULL);
+	return netfs_page_mkwrite(vmf, NULL, NULL);
 }

 static const struct vm_operations_struct cifs_file_vm_ops = {
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index ec1c51697c04..a67297de8a20 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -335,7 +335,7 @@ struct netfs_request_ops {

 	/* Modification handling */
 	void (*update_i_size)(struct inode *inode, loff_t i_size);
-	void (*post_modify)(struct inode *inode);
+	void (*post_modify)(struct inode *inode, void *fs_priv);

 	/* Write request handling */
 	void (*begin_writeback)(struct netfs_io_request *wreq);
@@ -435,9 +435,9 @@ ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);

 /* High-level write API */
 ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
-			    struct netfs_group *netfs_group);
+			    struct netfs_group *netfs_group, void *fs_priv);
 ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
-					 struct netfs_group *netfs_group);
+					 struct netfs_group *netfs_group, void *fs_priv);
 ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);
 ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb,
					   struct iov_iter *iter,
					   struct netfs_group *netfs_group);
@@ -466,7 +466,8 @@ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
 bool netfs_release_folio(struct folio *folio, gfp_t gfp);

 /* VMA operations API. */
-vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
+vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group,
+			      void *fs_priv);

 /* (Sub)request management API. */
 void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);
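[A rough sketch of how the plumbing is meant to be consumed - the hook body
and the example_mark_dirty() helper are invented for illustration, though
ceph_alloc_cap_flush() and ceph_cap_flush are real ceph names:

static void example_post_modify(struct inode *inode, void *fs_priv)
{
	struct ceph_cap_flush *prealloc_cf = fs_priv;

	/* Dirty the caps here, consuming the preallocated flush record
	 * rather than allocating in a context where that could block. */
	example_mark_dirty(inode, prealloc_cf);	/* hypothetical helper */
}

static ssize_t example_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct ceph_cap_flush *cf = ceph_alloc_cap_flush();

	if (!cf)
		return -ENOMEM;
	/* cf comes back to example_post_modify() as fs_priv. */
	return netfs_perform_write(iocb, from, NULL, cf);
}
]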
From patchwork Thu Mar 13 23:33:20 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016076
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 28/35] netfs: Adjust group handling
Date: Thu, 13 Mar 2025 23:33:20 +0000
Message-ID: <20250313233341.1675324-29-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Make some adjustments to the handling of netfs groups so that ceph can use
them for snap contexts:

 - Move netfs_get_group(), netfs_put_group() and netfs_put_group_many() to
   linux/netfs.h so that ceph can build its snap context on netfs groups.

 - Move netfs_set_group() and __netfs_set_group() to linux/netfs.h so that
   ceph_dirty_folio() can call them from inside the locked section in which
   it finds the snap context to attach.

 - Provide a netfs_writepages_group() that takes a group as a parameter and
   attaches it to the request, and make netfs_free_request() drop the ref on
   it.  netfs_writepages() then becomes a wrapper that passes in a NULL
   group.

 - In netfs_perform_write(), only consider a folio to have a conflicting
   group if the folio's group pointer isn't NULL and the folio is dirty.

 - In netfs_perform_write(), interject a small 10ms sleep after every 16
   attempts to flush a folio within a single call.

Signed-off-by: David Howells
cc: Jeff Layton
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/buffered_write.c | 25 ++++------------
 fs/netfs/internal.h       | 32 ---------------------
 fs/netfs/objects.c        |  1 +
 fs/netfs/write_issue.c    | 38 +++++++++++++++++++++----
 include/linux/netfs.h     | 59 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 98 insertions(+), 57 deletions(-)

diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 0245449b93e3..12ddbe9bc78b 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -11,26 +11,9 @@
 #include
 #include
 #include
+#include
 #include "internal.h"

-static void __netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
-{
-	if (netfs_group)
-		folio_attach_private(folio, netfs_get_group(netfs_group));
-}
-
-static void netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
-{
-	void *priv = folio_get_private(folio);
-
-	if (unlikely(priv != netfs_group)) {
-		if (netfs_group && (!priv || priv == NETFS_FOLIO_COPY_TO_CACHE))
-			folio_attach_private(folio, netfs_get_group(netfs_group));
-		else if (!netfs_group && priv == NETFS_FOLIO_COPY_TO_CACHE)
-			folio_detach_private(folio);
-	}
-}
-
 /*
  * Grab a folio for writing and lock it.  Attempt to allocate as large a folio
  * as possible to hold as much of the remaining length as possible in one go.
@@ -113,6 +96,7 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 	};
 	struct netfs_io_request *wreq = NULL;
 	struct folio *folio = NULL, *writethrough = NULL;
+	unsigned int flush_counter = 0;
 	unsigned int bdp_flags = (iocb->ki_flags & IOCB_NOWAIT) ? BDP_ASYNC : 0;
 	ssize_t written = 0, ret, ret2;
 	loff_t i_size, pos = iocb->ki_pos;
@@ -208,7 +192,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,

 		group = netfs_folio_group(folio);
 		if (unlikely(group != netfs_group) &&
-		    group != NETFS_FOLIO_COPY_TO_CACHE)
+		    group != NETFS_FOLIO_COPY_TO_CACHE &&
+		    (group || folio_test_dirty(folio)))
 			goto flush_content;

 		if (folio_test_uptodate(folio)) {
@@ -341,6 +326,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 		trace_netfs_folio(folio, netfs_flush_content);
 		folio_unlock(folio);
 		folio_put(folio);
+		if ((++flush_counter & 0xf) == 0xf)
+			msleep(10);
 		ret = filemap_write_and_wait_range(mapping, fpos, fpos + flen - 1);
 		if (ret < 0)
 			goto error_folio_unlock;
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index eebb4f0f660e..2a6123c4da35 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -261,38 +261,6 @@ static inline bool netfs_is_cache_enabled(struct netfs_inode *ctx)
 #endif
 }

-/*
- * Get a ref on a netfs group attached to a dirty page (e.g. a ceph snap).
- */
-static inline struct netfs_group *netfs_get_group(struct netfs_group *netfs_group)
-{
-	if (netfs_group && netfs_group != NETFS_FOLIO_COPY_TO_CACHE)
-		refcount_inc(&netfs_group->ref);
-	return netfs_group;
-}
-
-/*
- * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
- */
-static inline void netfs_put_group(struct netfs_group *netfs_group)
-{
-	if (netfs_group &&
-	    netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
-	    refcount_dec_and_test(&netfs_group->ref))
-		netfs_group->free(netfs_group);
-}
-
-/*
- * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
- */
-static inline void netfs_put_group_many(struct netfs_group *netfs_group, int nr)
-{
-	if (netfs_group &&
-	    netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
-	    refcount_sub_and_test(nr, &netfs_group->ref))
-		netfs_group->free(netfs_group);
-}
-
 /*
  * Check to see if a buffer aligns with the crypto block size.  If it doesn't
  * the crypto layer is going to copy all the data - in which case relying on
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 52d6fce70837..7fdbaa5c5cab 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -153,6 +153,7 @@ static void netfs_free_request(struct work_struct *work)
 		kvfree(rreq->direct_bv);
 	}

+	netfs_put_group(rreq->group);
 	rolling_buffer_clear(&rreq->buffer);
 	rolling_buffer_clear(&rreq->bounce);
 	if (test_bit(NETFS_RREQ_PUT_RMW_TAIL, &rreq->flags))
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 93601033ba08..3921fcf4f859 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -418,7 +418,7 @@ static int netfs_write_folio(struct netfs_io_request *wreq,
 		netfs_issue_write(wreq, upload);
 	} else if (fgroup != wreq->group) {
 		/* We can't write this page to the server yet. */
-		kdebug("wrong group");
+		kdebug("wrong group %px != %px", fgroup, wreq->group);
 		folio_redirty_for_writepage(wbc, folio);
 		folio_unlock(folio);
 		netfs_issue_write(wreq, upload);
@@ -593,11 +593,19 @@ static void netfs_end_issue_write(struct netfs_io_request *wreq)
 		netfs_wake_write_collector(wreq, false);
 }

-/*
- * Write some of the pending data back to the server
+/**
+ * netfs_writepages_group - Flush data from the pagecache for a file
+ * @mapping: The file to flush from
+ * @wbc: Details of what should be flushed
+ * @group: The write grouping to flush (or NULL)
+ *
+ * Start asynchronous write back operations to flush dirty data belonging to a
+ * particular group in a file's pagecache back to the server and to the local
+ * cache.
 */
-int netfs_writepages(struct address_space *mapping,
-		     struct writeback_control *wbc)
+int netfs_writepages_group(struct address_space *mapping,
+			   struct writeback_control *wbc,
+			   struct netfs_group *group)
 {
 	struct netfs_inode *ictx = netfs_inode(mapping->host);
 	struct netfs_io_request *wreq = NULL;
@@ -618,12 +626,15 @@ int netfs_writepages(struct address_space *mapping,
 	if (!folio)
 		goto out;

-	wreq = netfs_create_write_req(mapping, NULL, folio_pos(folio), NETFS_WRITEBACK);
+	wreq = netfs_create_write_req(mapping, NULL, folio_pos(folio),
+				      NETFS_WRITEBACK);
 	if (IS_ERR(wreq)) {
 		error = PTR_ERR(wreq);
 		goto couldnt_start;
 	}

+	wreq->group = netfs_get_group(group);
+
 	trace_netfs_write(wreq, netfs_write_trace_writeback);
 	netfs_stat(&netfs_n_wh_writepages);

@@ -659,6 +670,21 @@ int netfs_writepages(struct address_space *mapping,
 	_leave(" = %d", error);
 	return error;
 }
+EXPORT_SYMBOL(netfs_writepages_group);
+
+/**
+ * netfs_writepages - Flush data from the pagecache for a file
+ * @mapping: The file to flush from
+ * @wbc: Details of what should be flushed
+ *
+ * Start asynchronous write back operations to flush dirty data in a file's
+ * pagecache back to the server and to the local cache.
+ */
+int netfs_writepages(struct address_space *mapping,
+		     struct writeback_control *wbc)
+{
+	return netfs_writepages_group(mapping, wbc, NULL);
+}
 EXPORT_SYMBOL(netfs_writepages);

 /*
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index a67297de8a20..69052ac47ab1 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -457,6 +457,9 @@ int netfs_read_folio(struct file *, struct folio *);
 int netfs_write_begin(struct netfs_inode *, struct file *,
 		      struct address_space *, loff_t pos, unsigned int len,
 		      struct folio **, void **fsdata);
+int netfs_writepages_group(struct address_space *mapping,
+			   struct writeback_control *wbc,
+			   struct netfs_group *group);
 int netfs_writepages(struct address_space *mapping,
 		     struct writeback_control *wbc);
 bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
@@ -597,4 +600,60 @@ static inline void netfs_wait_for_outstanding_io(struct inode *inode)
 	wait_var_event(&ictx->io_count, atomic_read(&ictx->io_count) == 0);
 }

+/*
+ * Get a ref on a netfs group attached to a dirty page (e.g. a ceph snap).
+ */
+static inline struct netfs_group *netfs_get_group(struct netfs_group *netfs_group)
+{
+	if (netfs_group && netfs_group != NETFS_FOLIO_COPY_TO_CACHE)
+		refcount_inc(&netfs_group->ref);
+	return netfs_group;
+}
+
+/*
+ * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
+ */
+static inline void netfs_put_group(struct netfs_group *netfs_group)
+{
+	if (netfs_group &&
+	    netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
+	    refcount_dec_and_test(&netfs_group->ref))
+		netfs_group->free(netfs_group);
+}
+
+/*
+ * Dispose of a netfs group attached to a dirty page (e.g. a ceph snap).
+ */
+static inline void netfs_put_group_many(struct netfs_group *netfs_group, int nr)
+{
+	if (netfs_group &&
+	    netfs_group != NETFS_FOLIO_COPY_TO_CACHE &&
+	    refcount_sub_and_test(nr, &netfs_group->ref))
+		netfs_group->free(netfs_group);
+}
+
+/*
+ * Set the group pointer directly on a folio.
+ */
+static inline void __netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
+{
+	if (netfs_group)
+		folio_attach_private(folio, netfs_get_group(netfs_group));
+}
+
+/*
+ * Set the group pointer on a folio or the folio info record.
+ */
+static inline void netfs_set_group(struct folio *folio, struct netfs_group *netfs_group)
+{
+	void *priv = folio_get_private(folio);
+
+	if (unlikely(priv != netfs_group)) {
+		if (netfs_group && (!priv || priv == NETFS_FOLIO_COPY_TO_CACHE))
+			folio_attach_private(folio, netfs_get_group(netfs_group));
+		else if (!netfs_group && priv == NETFS_FOLIO_COPY_TO_CACHE)
+			folio_detach_private(folio);
+	}
+}
+
 #endif /* _LINUX_NETFS_H */
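[With the helpers exposed, a filesystem can embed struct netfs_group in a
structure of its own, much as ceph is expected to do for its snap contexts.
A minimal sketch with invented names - not code from the series:

struct example_snap_group {
	struct netfs_group group;	/* embedded refcount + free op */
	u64 seq;			/* e.g. a snapshot sequence number */
};

static void example_snap_group_free(struct netfs_group *ng)
{
	kfree(container_of(ng, struct example_snap_group, group));
}

static struct example_snap_group *example_snap_group_alloc(u64 seq)
{
	struct example_snap_group *sg = kzalloc(sizeof(*sg), GFP_KERNEL);

	if (!sg)
		return NULL;
	refcount_set(&sg->group.ref, 1);
	sg->group.free = example_snap_group_free;
	sg->seq = seq;
	return sg;
}

Dirty folios would be tagged with netfs_set_group(folio, &sg->group), and
flushing just that group becomes netfs_writepages_group(mapping, wbc,
&sg->group); netfs_free_request() later drops the ref the request took.
]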
From patchwork Thu Mar 13 23:33:21 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016077
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 29/35] netfs: Allow fs-private data to be handed through to request alloc
Date: Thu, 13 Mar 2025 23:33:21 +0000
Message-ID: <20250313233341.1675324-30-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Allow an fs-private pointer to be handed through to request allocation and
stashed in the netfs_io_request struct for the filesystem to retrieve.  This
will be used by ceph to pass a pointer to the ceph_writeback_ctl to the netfs
operation functions.

Signed-off-by: David Howells
cc: Jeff Layton
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/buffered_read.c | 11 ++++++-----
 fs/netfs/direct_read.c   |  2 +-
 fs/netfs/direct_write.c  |  2 +-
 fs/netfs/internal.h      |  2 ++
 fs/netfs/objects.c       |  2 ++
 fs/netfs/read_pgpriv2.c  |  2 +-
 fs/netfs/read_single.c   |  2 +-
 fs/netfs/write_issue.c   | 17 +++++++++++------
 fs/netfs/write_retry.c   |  2 +-
 include/linux/netfs.h    |  3 ++-
 10 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 4dd505053fba..10daf2452324 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -343,7 +343,7 @@ void netfs_readahead(struct readahead_control *ractl)
 	int ret;

 	rreq = netfs_alloc_request(ractl->mapping, ractl->file, start, size,
-				   NETFS_READAHEAD);
+				   NULL, NETFS_READAHEAD);
 	if (IS_ERR(rreq))
 		return;

@@ -414,7 +414,8 @@ static int netfs_read_gaps(struct file *file, struct folio *folio)

 	_enter("%lx", folio->index);

-	rreq = netfs_alloc_request(mapping, file, folio_pos(folio), flen, NETFS_READ_GAPS);
+	rreq = netfs_alloc_request(mapping, file, folio_pos(folio), flen,
+				   NULL, NETFS_READ_GAPS);
 	if (IS_ERR(rreq)) {
 		ret = PTR_ERR(rreq);
 		goto alloc_error;
@@ -510,7 +511,7 @@ int netfs_read_folio(struct file *file, struct folio *folio)

 	rreq = netfs_alloc_request(mapping, file,
 				   folio_pos(folio), folio_size(folio),
-				   NETFS_READPAGE);
+				   NULL, NETFS_READPAGE);
 	if (IS_ERR(rreq)) {
 		ret = PTR_ERR(rreq);
 		goto alloc_error;
@@ -665,7 +666,7 @@ int netfs_write_begin(struct netfs_inode *ctx,

 	rreq = netfs_alloc_request(mapping, file,
 				   folio_pos(folio), folio_size(folio),
-				   NETFS_READ_FOR_WRITE);
+				   NULL, NETFS_READ_FOR_WRITE);
 	if (IS_ERR(rreq)) {
 		ret = PTR_ERR(rreq);
 		goto error;
@@ -730,7 +731,7 @@ int netfs_prefetch_for_write(struct file *file, struct folio *folio,

 	ret = -ENOMEM;
 	rreq = netfs_alloc_request(mapping, file, start, flen,
-				   NETFS_READ_FOR_WRITE);
+				   NULL, NETFS_READ_FOR_WRITE);
 	if (IS_ERR(rreq)) {
 		ret = PTR_ERR(rreq);
 		goto error;
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index fc0a053ad5a8..15a6923a92ca 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -264,7 +264,7 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *i

 	rreq = netfs_alloc_request(iocb->ki_filp->f_mapping, iocb->ki_filp,
 				   iocb->ki_pos, orig_count,
-				   NETFS_DIO_READ);
+				   NULL, NETFS_DIO_READ);
 	if (IS_ERR(rreq))
 		return PTR_ERR(rreq);
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index e41614687e49..83c5c06c4710 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -300,7 +300,7 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *

 	_debug("uw %llx-%llx", start, end);

-	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp, start,
+	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp, start, NULL,
 				      iocb->ki_flags & IOCB_DIRECT ?
 				      NETFS_DIO_WRITE : NETFS_UNBUFFERED_WRITE);
 	if (IS_ERR(wreq))
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 2a6123c4da35..9724d5a1ddc7 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -101,6 +101,7 @@ int netfs_alloc_bounce(struct netfs_io_request *wreq, unsigned long long to, gfp
 struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
 					     struct file *file,
 					     loff_t start, size_t len,
+					     void *netfs_priv2,
 					     enum netfs_io_origin origin);
 void netfs_get_request(struct netfs_io_request *rreq, enum netfs_rreq_ref_trace what);
 void netfs_clear_subrequests(struct netfs_io_request *rreq, bool was_async);
@@ -218,6 +219,7 @@ void netfs_wake_write_collector(struct netfs_io_request *wreq, bool was_async);
 struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
 						struct file *file,
 						loff_t start,
+						void *netfs_priv2,
 						enum netfs_io_origin origin);
 void netfs_prepare_write(struct netfs_io_request *wreq,
 			 struct netfs_io_stream *stream,
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 7fdbaa5c5cab..4606e830c116 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -16,6 +16,7 @@
 struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
 					     struct file *file,
 					     loff_t start, size_t len,
+					     void *netfs_priv2,
 					     enum netfs_io_origin origin)
 {
 	static atomic_t debug_ids;
@@ -38,6 +39,7 @@ struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
 	rreq->len = len;
 	rreq->origin = origin;
 	rreq->netfs_ops = ctx->ops;
+	rreq->netfs_priv2 = netfs_priv2;
 	rreq->mapping = mapping;
 	rreq->inode = inode;
 	rreq->i_size = i_size_read(inode);
diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index cf7727060215..e94140ebc6fb 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -103,7 +103,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
 		goto cancel;

 	creq = netfs_create_write_req(rreq->mapping, NULL, folio_pos(folio),
-				      NETFS_PGPRIV2_COPY_TO_CACHE);
+				      NULL, NETFS_PGPRIV2_COPY_TO_CACHE);
 	if (IS_ERR(creq))
 		goto cancel;
diff --git a/fs/netfs/read_single.c b/fs/netfs/read_single.c
index b36a3020bb90..3a20e8340e06 100644
--- a/fs/netfs/read_single.c
+++ b/fs/netfs/read_single.c
@@ -169,7 +169,7 @@ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_ite
 	ssize_t ret;

 	rreq = netfs_alloc_request(inode->i_mapping, file, 0, iov_iter_count(iter),
-				   NETFS_READ_SINGLE);
+				   NULL, NETFS_READ_SINGLE);
 	if (IS_ERR(rreq))
 		return PTR_ERR(rreq);
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 3921fcf4f859..9b8d99477405 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -90,6 +90,7 @@ static void netfs_kill_dirty_pages(struct address_space *mapping,
 struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
 						struct file *file,
 						loff_t start,
+						void *netfs_priv2,
 						enum netfs_io_origin origin)
 {
 	struct netfs_io_request *wreq;
@@ -99,7 +100,7 @@ struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
 	       origin == NETFS_WRITETHROUGH ||
 	       origin == NETFS_PGPRIV2_COPY_TO_CACHE);

-	wreq = netfs_alloc_request(mapping, file, start, 0, origin);
+	wreq = netfs_alloc_request(mapping, file, start, 0, netfs_priv2, origin);
 	if (IS_ERR(wreq))
 		return wreq;

@@ -598,14 +599,18 @@ static void netfs_end_issue_write(struct netfs_io_request *wreq)
 * @mapping: The file to flush from
 * @wbc: Details of what should be flushed
 * @group: The write grouping to flush (or NULL)
+ * @netfs_priv2: Private data specific to the netfs (or NULL)
 *
 * Start asynchronous write back operations to flush dirty data belonging to a
 * particular group in a file's pagecache back to the server and to the local
 * cache.
+ *
+ * If not NULL, @netfs_priv2 will be stored in wreq->netfs_priv2.
 */
 int netfs_writepages_group(struct address_space *mapping,
 			   struct writeback_control *wbc,
-			   struct netfs_group *group)
+			   struct netfs_group *group,
+			   void *netfs_priv2)
 {
 	struct netfs_inode *ictx = netfs_inode(mapping->host);
 	struct netfs_io_request *wreq = NULL;
@@ -627,7 +632,7 @@ int netfs_writepages_group(struct address_space *mapping,
 		goto out;

 	wreq = netfs_create_write_req(mapping, NULL, folio_pos(folio),
-				      NETFS_WRITEBACK);
+				      netfs_priv2, NETFS_WRITEBACK);
 	if (IS_ERR(wreq)) {
 		error = PTR_ERR(wreq);
 		goto couldnt_start;
@@ -683,7 +688,7 @@ EXPORT_SYMBOL(netfs_writepages_group);
 int netfs_writepages(struct address_space *mapping,
 		     struct writeback_control *wbc)
 {
-	return netfs_writepages_group(mapping, wbc, NULL);
+	return netfs_writepages_group(mapping, wbc, NULL, NULL);
 }
 EXPORT_SYMBOL(netfs_writepages);

@@ -698,7 +703,7 @@ struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len
 	mutex_lock(&ictx->wb_lock);

 	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
-				      iocb->ki_pos, NETFS_WRITETHROUGH);
+				      iocb->ki_pos, NULL, NETFS_WRITETHROUGH);
 	if (IS_ERR(wreq)) {
 		mutex_unlock(&ictx->wb_lock);
 		return wreq;
@@ -953,7 +958,7 @@ int netfs_writeback_single(struct address_space *mapping,
 		mutex_lock(&ictx->wb_lock);
 	}

-	wreq = netfs_create_write_req(mapping, NULL, 0, NETFS_WRITEBACK_SINGLE);
+	wreq = netfs_create_write_req(mapping, NULL, 0, NULL, NETFS_WRITEBACK_SINGLE);
 	if (IS_ERR(wreq)) {
 		ret = PTR_ERR(wreq);
 		goto couldnt_start;
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index 187882801d57..f727b48e2bfe 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -328,7 +328,7 @@ ssize_t netfs_rmw_read(struct netfs_io_request *wreq, struct file *file,
 		bufsize = bsize * 2;
 	}

-	rreq = netfs_alloc_request(wreq->mapping, file, start, len, NETFS_RMW_READ);
+	rreq = netfs_alloc_request(wreq->mapping, file, start, len, NULL, NETFS_RMW_READ);
 	if (IS_ERR(rreq))
 		return PTR_ERR(rreq);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 69052ac47ab1..9d17d4bd9753 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -459,7 +459,8 @@ int netfs_write_begin(struct netfs_inode *, struct file *,
 		      struct folio **, void **fsdata);
 int netfs_writepages_group(struct address_space *mapping,
 			   struct writeback_control *wbc,
-			   struct netfs_group *group);
+			   struct netfs_group *group,
+			   void *netfs_priv2);
 int netfs_writepages(struct address_space *mapping,
 		     struct writeback_control *wbc);
 bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
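[A sketch of the intended consumer, based on the commit message - the wrapper
below is invented for illustration, though ceph_writeback_ctl is a real ceph
type:

static int example_writepages(struct address_space *mapping,
			      struct writeback_control *wbc)
{
	struct ceph_writeback_ctl ceph_wbc = {};

	/* &ceph_wbc becomes visible to the netfs operation functions
	 * as wreq->netfs_priv2 for the lifetime of the request. */
	return netfs_writepages_group(mapping, wbc, NULL, &ceph_wbc);
}
]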
From patchwork Thu Mar 13 23:33:22 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016078
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 30/35] netfs: Make netfs_page_mkwrite() use folio_mkwrite_check_truncate()
Date: Thu, 13 Mar 2025 23:33:22 +0000
Message-ID: <20250313233341.1675324-31-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Make netfs_page_mkwrite() use folio_mkwrite_check_truncate() rather than
open-coding the checks itself (it doesn't currently perform all of them).

Signed-off-by: David Howells
cc: Jeff Layton
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/buffered_write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 12ddbe9bc78b..64a0f0620399 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -506,7 +506,7 @@ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_gr
 	if (folio_lock_killable(folio) < 0)
 		goto out;
-	if (folio->mapping != mapping)
+	if (folio_mkwrite_check_truncate(folio, inode) < 0)
 		goto unlock;
 	if (folio_wait_writeback_killable(folio) < 0)
 		goto unlock;
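[For reference, folio_mkwrite_check_truncate() (from linux/pagemap.h) returns
the number of bytes of the folio that lie within i_size, or -EFAULT if the
folio has been truncated out of the mapping - so the single call subsumes the
folio->mapping test that was previously open-coded.  Roughly:

	ssize_t ret = folio_mkwrite_check_truncate(folio, inode);

	if (ret < 0)	/* truncated or no longer in this mapping */
		goto unlock;
	/* otherwise ret = number of folio bytes below EOF */
]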
From patchwork Thu Mar 13 23:33:23 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016079
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 31/35] netfs: Fix netfs_unbuffered_read() to return ssize_t rather than int
Date: Thu, 13 Mar 2025 23:33:23 +0000
Message-ID: <20250313233341.1675324-32-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Fix netfs_unbuffered_read() to return an ssize_t rather than an int, as
netfs_wait_for_read() returns ssize_t and the value otherwise gets implicitly
truncated.

Signed-off-by: David Howells
cc: Jeff Layton
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/direct_read.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index 15a6923a92ca..5e4bd1e5a378 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -201,9 +201,9 @@ static int netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
 * Perform a read to an application buffer, bypassing the pagecache and the
 * local disk cache.
 */
-static int netfs_unbuffered_read(struct netfs_io_request *rreq, bool sync)
+static ssize_t netfs_unbuffered_read(struct netfs_io_request *rreq, bool sync)
 {
-	int ret;
+	ssize_t ret;

 	_enter("R=%x %llx-%llx", rreq->debug_id, rreq->start, rreq->start + rreq->len - 1);

@@ -231,7 +231,7 @@ static int netfs_unbuffered_read(struct netfs_io_request *rreq, bool sync)
 	else
 		ret = -EIOCBQUEUED;
 out:
-	_leave(" = %d", ret);
+	_leave(" = %zd", ret);
 	return ret;
 }
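[The hazard being fixed is plain integer truncation: on 64-bit builds ssize_t
is 64 bits wide and int is 32, so funnelling a large transfer count through an
int silently discards the high bits.  For example:

	ssize_t transferred = 6LL << 30;	/* a 6 GiB transfer */
	int ret = transferred;			/* high bits lost; may even
						 * look like a negative errno */
]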
From patchwork Thu Mar 13 23:33:24 2025
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 14016080
From: David Howells
To: Viacheslav Dubeyko, Alex Markuze
Cc: David Howells, Ilya Dryomov, Jeff Layton, Dongsheng Yang,
    ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 32/35] netfs: Add some more RMW support for ceph
Date: Thu, 13 Mar 2025 23:33:24 +0000
Message-ID: <20250313233341.1675324-33-dhowells@redhat.com>
In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com>
References: <20250313233341.1675324-1-dhowells@redhat.com>

Add some support for RMW in ceph:

 (1) Add netfs_unbuffered_read_from_inode() to allow reading from an inode
     without having a file pointer so that truncate can modify a now-partial
     tail block of a content-encrypted file.

     This takes an additional argument to cause it to fail or give a short
     read if a hole is encountered.  This is noted on the request with
     NETFS_RREQ_NO_READ_HOLE for the filesystem to pick up (see the usage
     sketch below).

 (2) Set NETFS_RREQ_RMW when doing an RMW as part of a request.

 (3) Provide a ->rmw_read_done() op for netfslib to tell the filesystem that
     it has completed the read required for RMW.

Signed-off-by: David Howells
cc: Jeff Layton
cc: Viacheslav Dubeyko
cc: Alex Markuze
cc: Ilya Dryomov
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/direct_read.c       | 75 ++++++++++++++++++++++++++++++++++++
 fs/netfs/direct_write.c      |  1 +
 fs/netfs/main.c              |  1 +
 fs/netfs/objects.c           |  1 +
 fs/netfs/read_collect.c      |  2 +
 fs/netfs/write_retry.c       |  3 ++
 include/linux/netfs.h        |  7 ++++
 include/trace/events/netfs.h |  3 ++
 8 files changed, 93 insertions(+)
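[As a usage illustration only - this caller is hypothetical and the real ceph
wiring appears later in the series - a truncate path might pull in the block
that has become partial roughly like so:

static int example_read_tail_block(struct inode *inode, loff_t block_start,
				   void *buf, size_t block_size)
{
	struct kvec kv = { .iov_base = buf, .iov_len = block_size };
	struct iov_iter iter;
	ssize_t ret;

	iov_iter_kvec(&iter, ITER_DEST, &kv, 1, block_size);

	/* nohole == true: if the tail block was never written there is
	 * nothing to re-encrypt, so we get -ENODATA rather than zeroes. */
	ret = netfs_unbuffered_read_from_inode(inode, block_start, &iter, true);
	if (ret == -ENODATA)
		return 0;	/* whole block is a hole */
	return ret < 0 ? ret : 0;
}
]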
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index 5e4bd1e5a378..4061f934dfe6 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -373,3 +373,78 @@ ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 EXPORT_SYMBOL(netfs_unbuffered_read_iter);
+
+/**
+ * netfs_unbuffered_read_from_inode - Perform an unbuffered sync I/O read
+ * @inode: The inode being accessed
+ * @pos: The file position to read from
+ * @iter: The output buffer (also specifies read length)
+ * @nohole: True to return short/ENODATA if hole encountered
+ *
+ * Perform a synchronous unbuffered I/O from the inode to the output buffer.
+ * No use is made of the pagecache.  The output buffer must be suitably
+ * aligned if content encryption is to be used.  If @nohole is true then the
+ * read will stop short if a hole is encountered and return -ENODATA if the
+ * read begins with a hole.
+ *
+ * The caller must hold any appropriate locks.
+ */
+ssize_t netfs_unbuffered_read_from_inode(struct inode *inode, loff_t pos,
+					 struct iov_iter *iter, bool nohole)
+{
+	struct netfs_io_request *rreq;
+	ssize_t ret;
+	size_t orig_count = iov_iter_count(iter);
+
+	_enter("");
+
+	if (WARN_ON(user_backed_iter(iter)))
+		return -EIO;
+
+	if (!orig_count)
+		return 0; /* Don't update atime */
+
+	ret = filemap_write_and_wait_range(inode->i_mapping, pos,
+					   pos + orig_count - 1);
+	if (ret < 0)
+		return ret;
+	inode_update_time(inode, S_ATIME);
+
+	rreq = netfs_alloc_request(inode->i_mapping, NULL, pos, orig_count,
+				   NULL, NETFS_UNBUFFERED_READ);
+	if (IS_ERR(rreq))
+		return PTR_ERR(rreq);
+
+	ret = -EIO;
+	if (test_bit(NETFS_RREQ_CONTENT_ENCRYPTION, &rreq->flags) &&
+	    WARN_ON(!netfs_is_crypto_aligned(rreq, iter)))
+		goto out;
+
+	netfs_stat(&netfs_n_rh_dio_read);
+	trace_netfs_read(rreq, rreq->start, rreq->len,
+			 netfs_read_trace_unbuffered_read_from_inode);
+
+	rreq->buffer.iter = *iter;
+	rreq->len = orig_count;
+	rreq->direct_bv_unpin = false;
+	iov_iter_advance(iter, orig_count);
+
+	if (nohole)
+		__set_bit(NETFS_RREQ_NO_READ_HOLE, &rreq->flags);
+
+	/* We're going to do the crypto in place in the destination buffer. */
+	if (test_bit(NETFS_RREQ_CONTENT_ENCRYPTION, &rreq->flags))
+		__set_bit(NETFS_RREQ_CRYPT_IN_PLACE, &rreq->flags);
+
+	ret = netfs_dispatch_unbuffered_reads(rreq);
+
+	if (!rreq->submitted) {
+		netfs_put_request(rreq, false, netfs_rreq_trace_put_no_submit);
+		goto out;
+	}
+
+	ret = netfs_wait_for_read(rreq);
+out:
+	netfs_put_request(rreq, false, netfs_rreq_trace_put_return);
+	return ret;
+}
+EXPORT_SYMBOL(netfs_unbuffered_read_from_inode);
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index 83c5c06c4710..a99722f90c71 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -145,6 +145,7 @@ static ssize_t netfs_write_through_bounce_buffer(struct netfs_io_request *wreq,

 	wreq->start = gstart;
 	wreq->len = gend - gstart;
+	__set_bit(NETFS_RREQ_RMW, &ictx->flags);

 	if (gstart >= end) {
 		/* At or after EOF, nothing to read. */
 	} else {
diff --git a/fs/netfs/main.c b/fs/netfs/main.c
index 07f8cffbda8c..0900dea53e4a 100644
--- a/fs/netfs/main.c
+++ b/fs/netfs/main.c
@@ -39,6 +39,7 @@ static const char *netfs_origins[nr__netfs_io_origin] = {
 	[NETFS_READ_GAPS]		= "RG",
 	[NETFS_READ_SINGLE]		= "R1",
 	[NETFS_READ_FOR_WRITE]		= "RW",
+	[NETFS_UNBUFFERED_READ]		= "UR",
 	[NETFS_DIO_READ]		= "DR",
 	[NETFS_WRITEBACK]		= "WB",
 	[NETFS_WRITEBACK_SINGLE]	= "W1",
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 4606e830c116..958c4d460d07 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -60,6 +60,7 @@ struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
 	    origin == NETFS_READ_GAPS ||
 	    origin == NETFS_READ_SINGLE ||
 	    origin == NETFS_READ_FOR_WRITE ||
+	    origin == NETFS_UNBUFFERED_READ ||
 	    origin == NETFS_DIO_READ) {
 		INIT_WORK(&rreq->work, netfs_read_collection_worker);
 		rreq->io_streams[0].avail = true;
diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index 0a0bff90ca9e..013a90738dcd 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -462,6 +462,7 @@ static void netfs_read_collection(struct netfs_io_request *rreq)
 	//netfs_rreq_is_still_valid(rreq);

 	switch (rreq->origin) {
+	case NETFS_UNBUFFERED_READ:
 	case NETFS_DIO_READ:
 	case NETFS_READ_GAPS:
 	case NETFS_RMW_READ:
@@ -681,6 +682,7 @@ ssize_t netfs_wait_for_read(struct netfs_io_request *rreq)
 	if (ret == 0) {
 		ret = rreq->transferred;
 		switch (rreq->origin) {
+		case NETFS_UNBUFFERED_READ:
 		case NETFS_DIO_READ:
 		case NETFS_READ_SINGLE:
 			ret = rreq->transferred;
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index f727b48e2bfe..9e4e79d5a403 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -386,6 +386,9 @@ ssize_t netfs_rmw_read(struct netfs_io_request *wreq, struct file *file,
 		ret = 0;
 	}

+	if (ret == 0 && rreq->netfs_ops->rmw_read_done)
+		rreq->netfs_ops->rmw_read_done(wreq, rreq);
+
 error:
 	netfs_put_request(rreq, false, netfs_rreq_trace_put_return);
 	return ret;
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 9d17d4bd9753..4049c985b9b4 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -220,6 +220,7 @@ enum netfs_io_origin {
 	NETFS_READ_GAPS,		/* This read is a synchronous read to fill gaps */
 	NETFS_READ_SINGLE,		/* This read should be treated as a single object */
 	NETFS_READ_FOR_WRITE,		/* This read is to prepare a write */
+	NETFS_UNBUFFERED_READ,		/* This is an unbuffered I/O read */
 	NETFS_DIO_READ,			/* This is a direct I/O read */
 	NETFS_WRITEBACK,		/* This write was triggered by writepages */
 	NETFS_WRITEBACK_SINGLE,		/* This monolithic write was triggered by writepages */
@@ -308,6 +309,9 @@ struct netfs_io_request {
 #define NETFS_RREQ_CONTENT_ENCRYPTION	16	/* Content encryption is in use */
 #define NETFS_RREQ_CRYPT_IN_PLACE	17	/* Do decryption in place */
 #define NETFS_RREQ_PUT_RMW_TAIL		18	/* Need to put ->rmw_tail */
+#define NETFS_RREQ_RMW			19	/* Performing RMW cycle */
+#define NETFS_RREQ_REPEAT_RMW		20	/* Need to perform an RMW cycle */
+#define NETFS_RREQ_NO_READ_HOLE		21	/* Give short read/error if hole encountered */
 #define NETFS_RREQ_USE_PGPRIV2		31	/* [DEPRECATED] Use PG_private_2 to mark
 						 * write to cache on read */
 	const struct netfs_request_ops *netfs_ops;
@@ -336,6 +340,7 @@ struct netfs_request_ops {
 	/* Modification handling */
 	void (*update_i_size)(struct inode *inode, loff_t i_size);
 	void (*post_modify)(struct inode *inode, void *fs_priv);
+	void (*rmw_read_done)(struct netfs_io_request *wreq, struct netfs_io_request *rreq);

 	/* Write request handling */
void (*begin_writeback)(struct netfs_io_request *wreq); @@ -432,6 +437,8 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *i ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter); ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter); ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter); +ssize_t netfs_unbuffered_read_from_inode(struct inode *inode, loff_t pos, + struct iov_iter *iter, bool nohole); /* High-level write API */ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter, diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h index 74af82d773bd..9254c6f0e604 100644 --- a/include/trace/events/netfs.h +++ b/include/trace/events/netfs.h @@ -23,6 +23,7 @@ EM(netfs_read_trace_read_gaps, "READ-GAPS") \ EM(netfs_read_trace_read_single, "READ-SNGL") \ EM(netfs_read_trace_prefetch_for_write, "PREFETCHW") \ + EM(netfs_read_trace_unbuffered_read_from_inode, "READ-INOD") \ E_(netfs_read_trace_write_begin, "WRITEBEGN") #define netfs_write_traces \ @@ -38,6 +39,7 @@ EM(NETFS_READ_GAPS, "RG") \ EM(NETFS_READ_SINGLE, "R1") \ EM(NETFS_READ_FOR_WRITE, "RW") \ + EM(NETFS_UNBUFFERED_READ, "UR") \ EM(NETFS_DIO_READ, "DR") \ EM(NETFS_WRITEBACK, "WB") \ EM(NETFS_WRITEBACK_SINGLE, "W1") \ @@ -104,6 +106,7 @@ EM(netfs_sreq_trace_io_progress, "IO ") \ EM(netfs_sreq_trace_limited, "LIMIT") \ EM(netfs_sreq_trace_need_clear, "N-CLR") \ + EM(netfs_sreq_trace_need_rmw, "N-RMW") \ EM(netfs_sreq_trace_partial_read, "PARTR") \ EM(netfs_sreq_trace_need_retry, "ND-RT") \ EM(netfs_sreq_trace_pending, "PEND ") \ From patchwork Thu Mar 13 23:33:25 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016091 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E7D511FBC86 for ; Thu, 13 Mar 2025 23:37:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.129.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1741909071; cv=none; b=YsOWdlp1mgBe8ttLhCWcBYmX1Bh6Jx2GkPRViVFRsn8qSeChk8sNZxUlXncO891AXI4twTcOVSWU2eLSCiBOuRW81PUGDynmmrVubrioB+RzEGzv3sVzVBXZVhd6zS5N8zfgeQ++xasGbMiiTszBF2Ippc8Zz/lCsLKJq4JmW8A= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1741909071; c=relaxed/simple; bh=8djWi7SuWKOGXUwbQY4fSWBD+ETv7VH/9upz2BW8Py4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=rnOvuQJaEKWhwMffFGHzRur1qka1kPmygqnw8BapLnnN9Vy9v5XuX5NGlplHoo6cmXTpXsxSsOCSJ4dy0WgUQeZbS37OEBitzX9Xox3E2e+/Ggxe/1CXAn1nodFYiMYmG4AD0eULnn/mZ16CUNyx+P5YE/YLUM1WlOepVbtLmh0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=i8N1V+3b; arc=none smtp.client-ip=170.10.129.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="i8N1V+3b" DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1741909065; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=7NFx2gstzGhKTf0+VpOC9RFirA8o4q6xxXCeal6JkFo=; b=i8N1V+3b+E88X1X4NaecLfxlzS9mUNnIVq6SmbVnZV8+74VauzgXukmeguKnmGPqFUe565 i62GlsqyYVB6iFSEL/SVsVX8FlxVn1A0xGIwtn/YvVpLIwOSolyyixpW45IvyBy08z1pEt qD/27Xc8YjF5f3xftpaYjvTpk7V2jv4= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-661-mnQ529y7Pp-V0Pt3_TBoog-1; Thu, 13 Mar 2025 19:35:53 -0400 X-MC-Unique: mnQ529y7Pp-V0Pt3_TBoog-1 X-Mimecast-MFC-AGG-ID: mnQ529y7Pp-V0Pt3_TBoog_1741908952 Received: from mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 5B0F3180899B; Thu, 13 Mar 2025 23:35:52 +0000 (UTC) Received: from warthog.procyon.org.com (unknown [10.42.28.61]) by mx-prod-int-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id A4D95300376F; Thu, 13 Mar 2025 23:35:49 +0000 (UTC) From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang , ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 33/35] ceph: Use netfslib [INCOMPLETE] Date: Thu, 13 Mar 2025 23:33:25 +0000 Message-ID: <20250313233341.1675324-34-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List: ceph-devel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4 Implement netfslib support for ceph. Note that I've put the new code into its own file for now rather than attempting to modify the old code or putting it into an existing file. The old code is just #if'd out for removal in a subsequent patch, to make this patch easier to review. Note also that this is incomplete: sparse map support and content crypto support are currently non-functional, but plain I/O should work. There may also be an inode ref leak due to the way ceph sometimes takes and holds an extra inode ref under some circumstances; I'm not sure these extra refs are actually necessary.
For instance, ceph_dirty_folio() will ihold the inode if ci->i_wrbuffer_ref is 0 Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- drivers/block/rbd.c | 2 +- fs/ceph/Makefile | 2 +- fs/ceph/addr.c | 46 +- fs/ceph/cache.h | 5 + fs/ceph/caps.c | 2 +- fs/ceph/crypto.c | 54 ++ fs/ceph/file.c | 15 +- fs/ceph/inode.c | 30 +- fs/ceph/rdwr.c | 1006 +++++++++++++++++++++++++++++++ fs/ceph/super.h | 39 +- fs/netfs/internal.h | 6 +- fs/netfs/main.c | 4 +- fs/netfs/write_issue.c | 6 +- include/linux/ceph/libceph.h | 3 +- include/linux/ceph/osd_client.h | 1 + include/linux/netfs.h | 13 +- net/ceph/snapshot.c | 20 +- 17 files changed, 1190 insertions(+), 64 deletions(-) create mode 100644 fs/ceph/rdwr.c diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 956fc4a8f1da..94bb29c95b0d 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -468,7 +468,7 @@ static DEFINE_IDA(rbd_dev_id_ida); static struct workqueue_struct *rbd_wq; static struct ceph_snap_context rbd_empty_snapc = { - .nref = REFCOUNT_INIT(1), + .group.ref = REFCOUNT_INIT(1), }; /* diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile index 1f77ca04c426..e4d3c2d6e9c2 100644 --- a/fs/ceph/Makefile +++ b/fs/ceph/Makefile @@ -5,7 +5,7 @@ obj-$(CONFIG_CEPH_FS) += ceph.o -ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \ +ceph-y := super.o inode.o dir.o file.o locks.o addr.o rdwr.o ioctl.o \ export.o caps.o snap.o xattr.o quota.o io.o \ mds_client.o mdsmap.o strings.o ceph_frag.o \ debugfs.o util.o metric.o diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 27f27ab24446..325fbbce1eaa 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -64,27 +64,30 @@ (CONGESTION_ON_THRESH(congestion_kb) - \ (CONGESTION_ON_THRESH(congestion_kb) >> 2)) +#if 0 // TODO: Remove after netfs conversion static int ceph_netfs_check_write_begin(struct file *file, loff_t pos, unsigned int len, struct folio **foliop, void **_fsdata); -static inline struct ceph_snap_context *page_snap_context(struct page *page) +static struct ceph_snap_context *page_snap_context(struct page *page) { if (PagePrivate(page)) return (void *)page->private; return NULL; } +#endif // TODO: Remove after netfs conversion /* * Dirty a page. Optimistically adjust accounting, on the assumption * that we won't race with invalidate. If we do, readjust. 
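
A note on the accounting being adjusted here: the fs/ceph/rdwr.c header comment later in this patch states that, in the absence of snapshots, i_wrbuffer_ref == i_wrbuffer_ref_head == the dirty page count. As an editorial sketch (this debug helper is hypothetical and not part of the patch), that invariant could be spot-checked as follows:

    /* Hypothetical debug check of the dirty-page accounting. */
    static void ceph_check_wrbuffer_accounting(struct ceph_inode_info *ci)
    {
            lockdep_assert_held(&ci->i_ceph_lock);

            /*
             * With no cap snaps in flight, the head count should cover
             * every dirty page on the inode.
             */
            if (list_empty(&ci->i_cap_snaps))
                    WARN_ON_ONCE(ci->i_wrbuffer_ref != ci->i_wrbuffer_ref_head);
    }

Once cap snaps exist, the difference between the two counters is the number of dirty pages pinned by older snap contexts, which is what forces the oldest-first writeback ordering implemented further down in this patch.
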
*/ -static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio) +bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio) { struct inode *inode = mapping->host; struct ceph_client *cl = ceph_inode_to_client(inode); struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb); struct ceph_inode_info *ci; struct ceph_snap_context *snapc; + struct netfs_group *group; if (folio_test_dirty(folio)) { doutc(cl, "%llx.%llx %p idx %lu -- already dirty\n", @@ -101,16 +104,28 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio) spin_lock(&ci->i_ceph_lock); if (__ceph_have_pending_cap_snap(ci)) { struct ceph_cap_snap *capsnap = - list_last_entry(&ci->i_cap_snaps, - struct ceph_cap_snap, - ci_item); - snapc = ceph_get_snap_context(capsnap->context); + list_last_entry(&ci->i_cap_snaps, + struct ceph_cap_snap, + ci_item); + snapc = capsnap->context; capsnap->dirty_pages++; } else { - BUG_ON(!ci->i_head_snapc); - snapc = ceph_get_snap_context(ci->i_head_snapc); + snapc = ci->i_head_snapc; + BUG_ON(!snapc); ++ci->i_wrbuffer_ref_head; } + + /* Attach a reference to the snap/group to the folio. */ + group = netfs_folio_group(folio); + if (group != &snapc->group) { + netfs_set_group(folio, &snapc->group); + if (group) { + doutc(cl, "Different group %px != %px\n", + group, &snapc->group); + netfs_put_group(group); + } + } + if (ci->i_wrbuffer_ref == 0) ihold(inode); ++ci->i_wrbuffer_ref; @@ -122,16 +137,10 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio) snapc, snapc->seq, snapc->num_snaps); spin_unlock(&ci->i_ceph_lock); - /* - * Reference snap context in folio->private. Also set - * PagePrivate so that we get invalidate_folio callback. - */ - VM_WARN_ON_FOLIO(folio->private, folio); - folio_attach_private(folio, snapc); - - return ceph_fscache_dirty_folio(mapping, folio); + return netfs_dirty_folio(mapping, folio); } +#if 0 // TODO: Remove after netfs conversion /* * If we are truncating the full folio (i.e. offset == 0), adjust the * dirty folio counters appropriately. Only called if there is private @@ -1236,6 +1245,7 @@ bool is_num_ops_too_big(struct ceph_writeback_ctl *ceph_wbc) return ceph_wbc->num_ops >= (ceph_wbc->from_pool ? 
CEPH_OSD_SLAB_OPS : CEPH_OSD_MAX_OPS); } +#endif // TODO: Remove after netfs conversion static inline bool is_write_congestion_happened(struct ceph_fs_client *fsc) @@ -1244,6 +1254,7 @@ bool is_write_congestion_happened(struct ceph_fs_client *fsc) CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb); } +#if 0 // TODO: Remove after netfs conversion static inline int move_dirty_folio_in_page_array(struct address_space *mapping, struct writeback_control *wbc, struct ceph_writeback_ctl *ceph_wbc, struct folio *folio) @@ -1930,6 +1941,7 @@ const struct address_space_operations ceph_aops = { .direct_IO = noop_direct_IO, .migrate_folio = filemap_migrate_folio, }; +#endif // TODO: Remove after netfs conversion static void ceph_block_sigs(sigset_t *oldset) { @@ -2034,6 +2046,7 @@ static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf) return ret; } +#if 0 // TODO: Remove after netfs conversion static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -2137,6 +2150,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf) ret = vmf_error(err); return ret; } +#endif // TODO: Remove after netfs conversion void ceph_fill_inline_data(struct inode *inode, struct page *locked_page, char *data, size_t len) diff --git a/fs/ceph/cache.h b/fs/ceph/cache.h index 20efac020394..d6afca292f08 100644 --- a/fs/ceph/cache.h +++ b/fs/ceph/cache.h @@ -43,6 +43,8 @@ static inline void ceph_fscache_resize(struct inode *inode, loff_t to) } } +#if 0 // TODO: Remove after netfs conversion + static inline int ceph_fscache_unpin_writeback(struct inode *inode, struct writeback_control *wbc) { @@ -50,6 +52,7 @@ static inline int ceph_fscache_unpin_writeback(struct inode *inode, } #define ceph_fscache_dirty_folio netfs_dirty_folio +#endif // TODO: Remove after netfs conversion static inline bool ceph_is_cache_enabled(struct inode *inode) { @@ -100,6 +103,7 @@ static inline void ceph_fscache_resize(struct inode *inode, loff_t to) { } +#if 0 // TODO: Remove after netfs conversion static inline int ceph_fscache_unpin_writeback(struct inode *inode, struct writeback_control *wbc) { @@ -107,6 +111,7 @@ static inline int ceph_fscache_unpin_writeback(struct inode *inode, } #define ceph_fscache_dirty_folio filemap_dirty_folio +#endif // TODO: Remove after netfs conversion static inline bool ceph_is_cache_enabled(struct inode *inode) { diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index a8d8b56cf9d2..53f23f351003 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2536,7 +2536,7 @@ int ceph_write_inode(struct inode *inode, struct writeback_control *wbc) int wait = (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync); doutc(cl, "%p %llx.%llx wait=%d\n", inode, ceph_vinop(inode), wait); - ceph_fscache_unpin_writeback(inode, wbc); + netfs_unpin_writeback(inode, wbc); if (wait) { err = ceph_wait_on_async_create(inode); if (err) diff --git a/fs/ceph/crypto.c b/fs/ceph/crypto.c index a28dea74ca6f..8d4e908da7d8 100644 --- a/fs/ceph/crypto.c +++ b/fs/ceph/crypto.c @@ -636,6 +636,60 @@ int ceph_fscrypt_decrypt_extents(struct inode *inode, struct page **page, return ret; } +#if 0 +int ceph_decrypt_block(struct netfs_io_request *rreq, loff_t pos, size_t len, + struct scatterlist *source_sg, unsigned int n_source, + struct scatterlist *dest_sg, unsigned int n_dest) +{ + struct ceph_sparse_extent *map = op->extent.sparse_ext; + struct ceph_inode_info *ci = ceph_inode(inode); + size_t xlen; + u64 objno, objoff; + u32 ext_cnt = op->extent.sparse_ext_cnt; + int i, ret = 0; + + /* Nothing to do for empty 
array */ + if (ext_cnt == 0) { + dout("%s: empty array, ret 0\n", __func__); + return 0; + } + + ceph_calc_file_object_mapping(&ci->i_layout, pos, map[0].len, + &objno, &objoff, &xlen); + + for (i = 0; i < ext_cnt; ++i) { + struct ceph_sparse_extent *ext = &map[i]; + int pgsoff = ext->off - objoff; + int pgidx = pgsoff >> PAGE_SHIFT; + int fret; + + if ((ext->off | ext->len) & ~CEPH_FSCRYPT_BLOCK_MASK) { + pr_warn("%s: bad encrypted sparse extent idx %d off %llx len %llx\n", + __func__, i, ext->off, ext->len); + return -EIO; + } + fret = ceph_fscrypt_decrypt_pages(inode, &page[pgidx], + off + pgsoff, ext->len); + dout("%s: [%d] 0x%llx~0x%llx fret %d\n", __func__, i, + ext->off, ext->len, fret); + if (fret < 0) { + if (ret == 0) + ret = fret; + break; + } + ret = pgsoff + fret; + } + dout("%s: ret %d\n", __func__, ret); + return ret; +} + +int ceph_encrypt_block(struct netfs_io_request *wreq, loff_t pos, size_t len, + struct scatterlist *source_sg, unsigned int n_source, + struct scatterlist *dest_sg, unsigned int n_dest) +{ +} +#endif + /** * ceph_fscrypt_encrypt_pages - encrypt an array of pages * @inode: pointer to inode associated with these pages diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 4512215cccc6..94b91b5bc843 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -77,6 +77,7 @@ static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags) * need to wait for MDS acknowledgement. */ +#if 0 // TODO: Remove after netfs conversion /* * How many pages to get in one call to iov_iter_get_pages(). This * determines the size of the on-stack array used as a buffer. @@ -165,6 +166,7 @@ static void ceph_dirty_pages(struct ceph_databuf *dbuf) if (bvec[i].bv_page) set_page_dirty_lock(bvec[i].bv_page); } +#endif // TODO: Remove after netfs conversion /* * Prepare an open request. Preallocate ceph_cap to avoid an @@ -1021,6 +1023,7 @@ int ceph_release(struct inode *inode, struct file *file) return 0; } +#if 0 // TODO: Remove after netfs conversion enum { HAVE_RETRIED = 1, CHECK_EOF = 2, @@ -2234,6 +2237,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to) return ret; } +#endif // TODO: Remove after netfs conversion /* * Wrap filemap_splice_read with checks for cap bits on the inode. @@ -2294,6 +2298,7 @@ static ssize_t ceph_splice_read(struct file *in, loff_t *ppos, return ret; } +#if 0 // TODO: Remove after netfs conversion /* * Take cap references to avoid releasing caps to MDS mid-write. * @@ -2488,6 +2493,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from) ceph_free_cap_flush(prealloc_cf); return written ? written : err; } +#endif // TODO: Remove after netfs conversion /* * llseek. be sure to verify file size on SEEK_END. 
@@ -3160,6 +3166,10 @@ static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice if (fi->fmode & CEPH_FILE_MODE_LAZY) return -EACCES; + ret = netfs_start_io_read(inode); + if (ret < 0) + return ret; + ret = ceph_get_caps(file, CEPH_CAP_FILE_RD, want, -1, &got); if (ret < 0) { doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode)); @@ -3180,6 +3190,7 @@ static int ceph_fadvise(struct file *file, loff_t offset, loff_t len, int advice inode, ceph_vinop(inode), ceph_cap_string(got), ret); ceph_put_cap_refs(ceph_inode(inode), got); out: + netfs_end_io_read(inode); return ret; } @@ -3187,8 +3198,8 @@ const struct file_operations ceph_file_fops = { .open = ceph_open, .release = ceph_release, .llseek = ceph_llseek, - .read_iter = ceph_read_iter, - .write_iter = ceph_write_iter, + .read_iter = ceph_netfs_read_iter, + .write_iter = ceph_netfs_write_iter, .mmap = ceph_mmap, .fsync = ceph_fsync, .lock = ceph_lock, diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index ec9b80fec7be..8f73f3a55a3e 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -2345,11 +2345,9 @@ static int fill_fscrypt_truncate(struct inode *inode, struct iov_iter iter; struct ceph_fscrypt_truncate_size_header *header; void *p; - int retry_op = 0; int len = CEPH_FSCRYPT_BLOCK_SIZE; loff_t i_size = i_size_read(inode); int got, ret, issued; - u64 objver; ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got); if (ret < 0) @@ -2361,16 +2359,6 @@ static int fill_fscrypt_truncate(struct inode *inode, i_size, attr->ia_size, ceph_cap_string(got), ceph_cap_string(issued)); - /* Try to writeback the dirty pagecaches */ - if (issued & (CEPH_CAP_FILE_BUFFER)) { - loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SIZE - 1; - - ret = filemap_write_and_wait_range(inode->i_mapping, - orig_pos, lend); - if (ret < 0) - goto out; - } - ret = -ENOMEM; dbuf = ceph_databuf_req_alloc(2, 0, GFP_KERNEL); if (!dbuf) @@ -2382,10 +2370,8 @@ static int fill_fscrypt_truncate(struct inode *inode, goto out; iov_iter_bvec(&iter, ITER_DEST, &dbuf->bvec[1], 1, len); - - pos = orig_pos; - ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objver); - if (ret < 0) + ret = netfs_unbuffered_read_from_inode(inode, orig_pos, &iter, true); + if (ret < 0 && ret != -ENODATA) goto out; header = kmap_ceph_databuf_page(dbuf, 0); @@ -2402,16 +2388,14 @@ static int fill_fscrypt_truncate(struct inode *inode, header->block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE); /* - * If we hit a hole here, we should just skip filling - * the fscrypt for the request, because once the fscrypt - * is enabled, the file will be split into many blocks - * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there - * has a hole, the hole size should be multiple of block - * size. + * If we hit a hole here, we should just skip filling the fscrypt for + * the request, because once the fscrypt is enabled, the file will be + * split into many blocks with the size of CEPH_FSCRYPT_BLOCK_SIZE. If + * there was a hole, the hole size should be multiple of block size. * * If the Rados object doesn't exist, it will be set to 0. */ - if (!objver) { + if (ret != -ENODATA) { doutc(cl, "hit hole, ppos %lld < size %lld\n", pos, i_size); header->data_len = cpu_to_le32(8 + 8 + 4); diff --git a/fs/ceph/rdwr.c b/fs/ceph/rdwr.c new file mode 100644 index 000000000000..952c36be2cd9 --- /dev/null +++ b/fs/ceph/rdwr.c @@ -0,0 +1,1006 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Ceph netfs-based file read-write operations. + * + * There are a few funny things going on here. 
+ * + * The page->private field is used to reference a struct ceph_snap_context for + * _every_ dirty page. This indicates which snapshot the page was logically + * dirtied in, and thus which snap context needs to be associated with the osd + * write during writeback. + * + * Similarly, struct ceph_inode_info maintains a set of counters to count dirty + * pages on the inode. In the absence of snapshots, i_wrbuffer_ref == + * i_wrbuffer_ref_head == the dirty page count. + * + * When a snapshot is taken (that is, when the client receives notification + * that a snapshot was taken), each inode with caps and with dirty pages (dirty + * pages implies there is a cap) gets a new ceph_cap_snap in the i_cap_snaps + * list (which is sorted in ascending order, new snaps go to the tail). The + * i_wrbuffer_ref_head count is moved to capsnap->dirty. (Unless a sync write + * is currently in progress. In that case, the capsnap is said to be + * "pending", new writes cannot start, and the capsnap isn't "finalized" until + * the write completes (or fails) and a final size/mtime for the inode for that + * snap can be settled upon.) i_wrbuffer_ref_head is reset to 0. + * + * On writeback, we must submit writes to the osd IN SNAP ORDER. So, we look + * for the first capsnap in i_cap_snaps and write out pages in that snap + * context _only_. Then we move on to the next capsnap, eventually reaching + * the "live" or "head" context (i.e., pages that are not yet snapped) and are + * writing the most recently dirtied pages. + * + * Invalidate and so forth must take care to ensure the dirty page accounting + * is preserved. + * + * Copyright (C) 2025 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowells@redhat.com) + */ +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "super.h" +#include "mds_client.h" +#include "cache.h" +#include "metric.h" +#include "crypto.h" +#include +#include + +struct ceph_writeback_ctl +{ + loff_t i_size; + u64 truncate_size; + u32 truncate_seq; + bool size_stable; + bool head_snapc; +}; + +struct kmem_cache *ceph_io_request_cachep; +struct kmem_cache *ceph_io_subrequest_cachep; + +static struct ceph_io_subrequest *ceph_sreq2io(struct netfs_io_subrequest *subreq) +{ + BUILD_BUG_ON(sizeof(struct ceph_io_request) > NETFS_DEF_IO_REQUEST_SIZE); + BUILD_BUG_ON(sizeof(struct ceph_io_subrequest) > NETFS_DEF_IO_SUBREQUEST_SIZE); + + return container_of(subreq, struct ceph_io_subrequest, sreq); +} + +/* + * Get the snapc from the group attached to a request + */ +static struct ceph_snap_context *ceph_wreq_snapc(struct netfs_io_request *wreq) +{ + struct ceph_snap_context *snapc = + container_of(wreq->group, struct ceph_snap_context, group); + return snapc; +} + +#if 0 +static void ceph_put_many_snap_context(struct ceph_snap_context *sc, unsigned int nr) +{ + if (sc) + netfs_put_group_many(&sc->group, nr); +} +#endif + +/* + * Handle the termination of a write to the server. + */ +static void ceph_netfs_write_callback(struct ceph_osd_request *req) +{ + struct netfs_io_subrequest *subreq = req->r_subreq; + struct ceph_io_subrequest *csub = ceph_sreq2io(subreq); + struct ceph_io_request *creq = csub->creq; + struct inode *inode = creq->rreq.inode; + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); + struct ceph_client *cl = ceph_inode_to_client(inode); + size_t wrote = req->r_result ? 
0 : subreq->len; + int err = req->r_result; + + trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress); + + ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency, + req->r_end_latency, wrote, err); + + if (err) { + doutc(cl, "sync_write osd write returned %d\n", err); + /* Version changed! Must re-do the rmw cycle */ + if ((creq->rmw_assert_version && (err == -ERANGE || err == -EOVERFLOW)) || + (!creq->rmw_assert_version && err == -EEXIST)) { + /* We should only ever see this on a rmw */ + WARN_ON_ONCE(!test_bit(NETFS_RREQ_RMW, &ci->netfs.flags)); + + /* The version should never go backward */ + WARN_ON_ONCE(err == -EOVERFLOW); + + /* FIXME: limit number of times we loop? */ + set_bit(NETFS_RREQ_REPEAT_RMW, &creq->rreq.flags); + trace_netfs_sreq(subreq, netfs_sreq_trace_need_rmw); + } + ceph_set_error_write(ci); + } else { + ceph_clear_error_write(ci); + } + + csub->req = NULL; + ceph_osdc_put_request(req); + netfs_write_subrequest_terminated(subreq, err ?: wrote, true); +} + +/* + * Issue a subrequest to upload to the server. + */ +static void ceph_issue_write(struct netfs_io_subrequest *subreq) +{ + struct ceph_io_subrequest *csub = ceph_sreq2io(subreq); + struct ceph_snap_context *snapc = ceph_wreq_snapc(subreq->rreq); + struct ceph_osd_request *req; + struct ceph_io_request *creq = csub->creq; + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(subreq->rreq->inode); + struct ceph_osd_client *osdc = &fsc->client->osdc; + struct inode *inode = subreq->rreq->inode; + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_client *cl = ceph_inode_to_client(inode); + unsigned long long len; + unsigned int rmw = test_bit(NETFS_RREQ_RMW, &ci->netfs.flags) ? 1 : 0; + + doutc(cl, "issue_write R=%08x[%x] ino %llx %lld~%zu -- %srmw\n", + subreq->rreq->debug_id, subreq->debug_index, ci->i_vino.ino, + subreq->start, subreq->len, + rmw ? "" : "no "); + + len = subreq->len; + req = ceph_osdc_new_request(osdc, &ci->i_layout, ci->i_vino, + subreq->start, &len, + rmw, /* which: 0 or 1 */ + rmw + 1, /* num_ops: 1 or 2 */ + CEPH_OSD_OP_WRITE, + CEPH_OSD_FLAG_WRITE, + snapc, + ci->i_truncate_seq, + ci->i_truncate_size, false); + if (IS_ERR(req)) { + netfs_write_subrequest_terminated(subreq, PTR_ERR(req), false); + return netfs_prepare_write_failed(subreq); + } + + subreq->len = len; + doutc(cl, "write op %lld~%zu\n", subreq->start, subreq->len); + iov_iter_truncate(&subreq->io_iter, len); + osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter); + req->r_inode = inode; + req->r_mtime = current_time(inode); + req->r_callback = ceph_netfs_write_callback; + req->r_subreq = subreq; + csub->req = req; + + /* + * If we're doing an RMW cycle, set up an assertion that the remote + * data hasn't changed. If we don't have a version number, then the + * object doesn't exist yet. Use an exclusive create instead of a + * version assertion in that case. + */ + if (rmw) { + if (creq->rmw_assert_version) { + osd_req_op_init(req, 0, CEPH_OSD_OP_ASSERT_VER, 0); + req->r_ops[0].assert_ver.ver = creq->rmw_assert_version; + } else { + osd_req_op_init(req, 0, CEPH_OSD_OP_CREATE, + CEPH_OSD_OP_FLAG_EXCL); + } + } + + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + ceph_osdc_start_request(osdc, req); +} + +/* + * Prepare a subrequest to upload to the server. 
+ */ +static void ceph_prepare_write(struct netfs_io_subrequest *subreq) +{ + struct ceph_inode_info *ci = ceph_inode(subreq->rreq->inode); + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(subreq->rreq->inode); + u64 objnum, objoff; + + /* Clamp the length to the next object boundary. */ + ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, + fsc->mount_options->wsize, + &objnum, &objoff, + &subreq->rreq->io_streams[0].sreq_max_len); +} + +/* + * Mark the caps as dirty + */ +static void ceph_netfs_post_modify(struct inode *inode, void *fs_priv) +{ + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_cap_flush **prealloc_cf = fs_priv; + int dirty; + + spin_lock(&ci->i_ceph_lock); + dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, prealloc_cf); + spin_unlock(&ci->i_ceph_lock); + if (dirty) + __mark_inode_dirty(inode, dirty); +} + +static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq) +{ + struct inode *inode = rreq->inode; + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_file_layout *lo = &ci->i_layout; + unsigned long max_pages = inode->i_sb->s_bdi->ra_pages; + loff_t end = rreq->start + rreq->len, new_end; + struct ceph_io_request *priv = container_of(rreq, struct ceph_io_request, rreq); + unsigned long max_len; + u32 blockoff; + + if (priv) { + /* Readahead is disabled by posix_fadvise POSIX_FADV_RANDOM */ + if (priv->file_ra_disabled) + max_pages = 0; + else + max_pages = priv->file_ra_pages; + + } + + /* Readahead is disabled */ + if (!max_pages) + return; + + max_len = max_pages << PAGE_SHIFT; + + /* + * Try to expand the length forward by rounding up it to the next + * block, but do not exceed the file size, unless the original + * request already exceeds it. + */ + new_end = umin(round_up(end, lo->stripe_unit), rreq->i_size); + if (new_end > end && new_end <= rreq->start + max_len) + rreq->len = new_end - rreq->start; + + /* Try to expand the start downward */ + div_u64_rem(rreq->start, lo->stripe_unit, &blockoff); + if (rreq->len + blockoff <= max_len) { + rreq->start -= blockoff; + rreq->len += blockoff; + } +} + +static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq) +{ + struct netfs_io_request *rreq = subreq->rreq; + struct ceph_inode_info *ci = ceph_inode(rreq->inode); + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(rreq->inode); + size_t xlen; + u64 objno, objoff; + + /* Truncate the extent at the end of the current block */ + ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len, + &objno, &objoff, &xlen); + rreq->io_streams[0].sreq_max_len = umin(xlen, fsc->mount_options->rsize); + return 0; +} + +static void ceph_netfs_read_callback(struct ceph_osd_request *req) +{ + struct inode *inode = req->r_inode; + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); + struct ceph_client *cl = fsc->client; + struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0); + struct netfs_io_subrequest *subreq = req->r_priv; + struct ceph_osd_req_op *op = &req->r_ops[0]; + bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ); + int err = req->r_result; + + ceph_update_read_metrics(&fsc->mdsc->metric, req->r_start_latency, + req->r_end_latency, osd_data->iter.count, err); + + doutc(cl, "result %d subreq->len=%zu i_size=%lld\n", req->r_result, + subreq->len, i_size_read(req->r_inode)); + + /* no object means success but no data */ + if (err == -ENOENT) + err = 0; + else if (err == -EBLOCKLISTED) + fsc->blocklisted = true; + + if (err >= 0) { + if (sparse && err > 0) + err = 
ceph_sparse_ext_map_end(op); + if (err < subreq->len && + subreq->rreq->origin != NETFS_DIO_READ) + __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); + if (IS_ENCRYPTED(inode) && err > 0) { +#if 0 + err = ceph_fscrypt_decrypt_extents(inode, osd_data->dbuf, + subreq->start, + op->extent.sparse_ext, + op->extent.sparse_ext_cnt); + if (err > subreq->len) + err = subreq->len; +#else + pr_err("TODO: Content-decrypt currently disabled\n"); + err = -EOPNOTSUPP; +#endif + } + } + + if (err > 0) { + subreq->transferred = err; + err = 0; + } + + subreq->error = err; + trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress); + ceph_dec_osd_stopping_blocker(fsc->mdsc); + netfs_read_subreq_terminated(subreq); +} + +static void ceph_rmw_read_done(struct netfs_io_request *wreq, struct netfs_io_request *rreq) +{ + struct ceph_io_request *cwreq = container_of(wreq, struct ceph_io_request, rreq); + struct ceph_io_request *crreq = container_of(rreq, struct ceph_io_request, rreq); + + cwreq->rmw_assert_version = crreq->rmw_assert_version; +} + +static bool ceph_netfs_issue_read_inline(struct netfs_io_subrequest *subreq) +{ + struct netfs_io_request *rreq = subreq->rreq; + struct inode *inode = rreq->inode; + struct ceph_mds_reply_info_parsed *rinfo; + struct ceph_mds_reply_info_in *iinfo; + struct ceph_mds_request *req; + struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb); + struct ceph_inode_info *ci = ceph_inode(inode); + ssize_t err = 0; + size_t len, copied; + int mode; + + __clear_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags); + + if (subreq->start >= inode->i_size) + goto out; + + /* We need to fetch the inline data. */ + mode = ceph_try_to_choose_auth_mds(inode, CEPH_STAT_CAP_INLINE_DATA); + req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode); + if (IS_ERR(req)) { + err = PTR_ERR(req); + goto out; + } + req->r_ino1 = ci->i_vino; + req->r_args.getattr.mask = cpu_to_le32(CEPH_STAT_CAP_INLINE_DATA); + req->r_num_caps = 2; + + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + err = ceph_mdsc_do_request(mdsc, NULL, req); + if (err < 0) + goto out; + + rinfo = &req->r_reply_info; + iinfo = &rinfo->targeti; + if (iinfo->inline_version == CEPH_INLINE_NONE) { + /* The data got uninlined */ + ceph_mdsc_put_request(req); + return false; + } + + len = umin(iinfo->inline_len - subreq->start, subreq->len); + copied = copy_to_iter(iinfo->inline_data + subreq->start, len, &subreq->io_iter); + if (copied) { + subreq->transferred += copied; + if (copied == len) + __set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags); + subreq->error = 0; + } else { + subreq->error = -EFAULT; + } + + ceph_mdsc_put_request(req); +out: + netfs_read_subreq_terminated(subreq); + return true; +} + +static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq) +{ + struct netfs_io_request *rreq = subreq->rreq; + struct inode *inode = rreq->inode; + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); + struct ceph_client *cl = fsc->client; + struct ceph_osd_request *req = NULL; + struct ceph_vino vino = ceph_vino(inode); + int extent_cnt; + bool sparse = IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREAD); + u64 off = subreq->start, len = subreq->len; + int err = 0; + + if (ceph_inode_is_shutdown(inode)) { + err = -EIO; + goto out; + } + + if (ceph_has_inline_data(ci) && ceph_netfs_issue_read_inline(subreq)) + return; + + req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino, + off, &len, 0, 1, + sparse ? 
CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ, + CEPH_OSD_FLAG_READ, /* read_from_replica will be or'd in */ + NULL, ci->i_truncate_seq, ci->i_truncate_size, false); + if (IS_ERR(req)) { + err = PTR_ERR(req); + req = NULL; + goto out; + } + + if (sparse) { + extent_cnt = __ceph_sparse_read_ext_count(inode, len); + err = ceph_alloc_sparse_ext_map(&req->r_ops[0], extent_cnt); + if (err) + goto out; + } + + doutc(cl, "%llx.%llx pos=%llu orig_len=%zu len=%llu\n", + ceph_vinop(inode), subreq->start, subreq->len, len); + + osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter); + if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) { + err = -EIO; + goto out; + } + req->r_callback = ceph_netfs_read_callback; + req->r_priv = subreq; + req->r_inode = inode; + + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + ceph_osdc_start_request(req->r_osdc, req); +out: + ceph_osdc_put_request(req); + doutc(cl, "%llx.%llx result %d\n", ceph_vinop(inode), err); + if (err) { + subreq->error = err; + netfs_read_subreq_terminated(subreq); + } +} + +static int ceph_init_request(struct netfs_io_request *rreq, struct file *file) +{ + struct ceph_io_request *priv = container_of(rreq, struct ceph_io_request, rreq); + struct inode *inode = rreq->inode; + struct ceph_client *cl = ceph_inode_to_client(inode); + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); + int got = 0, want = CEPH_CAP_FILE_CACHE; + int ret = 0; + + rreq->rsize = 1024 * 1024; + rreq->wsize = umin(i_blocksize(inode), fsc->mount_options->wsize); + + switch (rreq->origin) { + case NETFS_READAHEAD: + goto init_readahead; + case NETFS_WRITEBACK: + case NETFS_WRITETHROUGH: + case NETFS_UNBUFFERED_WRITE: + case NETFS_DIO_WRITE: + if (S_ISREG(rreq->inode->i_mode)) + rreq->io_streams[0].avail = true; + return 0; + default: + return 0; + } + +init_readahead: + /* + * If we are doing readahead triggered by a read, fault-in or + * MADV/FADV_WILLNEED, someone higher up the stack must be holding the + * FILE_CACHE and/or LAZYIO caps. + */ + if (file) { + priv->file_ra_pages = file->f_ra.ra_pages; + priv->file_ra_disabled = file->f_mode & FMODE_RANDOM; + rreq->netfs_priv = priv; + return 0; + } + + /* + * readahead callers do not necessarily hold Fcb caps + * (e.g. fadvise, madvise). + */ + ret = ceph_try_get_caps(inode, CEPH_CAP_FILE_RD, want, true, &got); + if (ret < 0) { + doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode)); + goto out; + } + + if (!(got & want)) { + doutc(cl, "%llx.%llx, no cache cap\n", ceph_vinop(inode)); + ret = -EACCES; + goto out; + } + if (ret > 0) + priv->caps = got; + else + ret = -EACCES; + + rreq->io_streams[0].sreq_max_len = fsc->mount_options->rsize; +out: + return ret; +} + +static void ceph_netfs_free_request(struct netfs_io_request *rreq) +{ + struct ceph_io_request *creq = container_of(rreq, struct ceph_io_request, rreq); + + if (creq->caps) + ceph_put_cap_refs(ceph_inode(rreq->inode), creq->caps); +} + +const struct netfs_request_ops ceph_netfs_ops = { + .init_request = ceph_init_request, + .free_request = ceph_netfs_free_request, + .expand_readahead = ceph_netfs_expand_readahead, + .prepare_read = ceph_netfs_prepare_read, + .issue_read = ceph_netfs_issue_read, + .rmw_read_done = ceph_rmw_read_done, + .post_modify = ceph_netfs_post_modify, + .prepare_write = ceph_prepare_write, + .issue_write = ceph_issue_write, +}; + +/* + * Get ref for the oldest snapc for an inode with dirty data... that is, the + * only snap context we are allowed to write back. 
+ */ +static struct ceph_snap_context * +ceph_get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl, + struct ceph_snap_context *folio_snapc) +{ + struct ceph_snap_context *snapc = NULL; + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_cap_snap *capsnap = NULL; + struct ceph_client *cl = ceph_inode_to_client(inode); + + spin_lock(&ci->i_ceph_lock); + list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) { + doutc(cl, " capsnap %p snapc %p has %d dirty pages\n", + capsnap, capsnap->context, capsnap->dirty_pages); + if (!capsnap->dirty_pages) + continue; + + /* get i_size, truncate_{seq,size} for folio_snapc? */ + if (snapc && capsnap->context != folio_snapc) + continue; + + if (ctl) { + if (capsnap->writing) { + ctl->i_size = i_size_read(inode); + ctl->size_stable = false; + } else { + ctl->i_size = capsnap->size; + ctl->size_stable = true; + } + ctl->truncate_size = capsnap->truncate_size; + ctl->truncate_seq = capsnap->truncate_seq; + ctl->head_snapc = false; + } + + if (snapc) + break; + + snapc = ceph_get_snap_context(capsnap->context); + if (!folio_snapc || + folio_snapc == snapc || + folio_snapc->seq > snapc->seq) + break; + } + if (!snapc && ci->i_wrbuffer_ref_head) { + snapc = ceph_get_snap_context(ci->i_head_snapc); + doutc(cl, " head snapc %p has %d dirty pages\n", snapc, + ci->i_wrbuffer_ref_head); + if (ctl) { + ctl->i_size = i_size_read(inode); + ctl->truncate_size = ci->i_truncate_size; + ctl->truncate_seq = ci->i_truncate_seq; + ctl->size_stable = false; + ctl->head_snapc = true; + } + } + spin_unlock(&ci->i_ceph_lock); + return snapc; +} + +/* + * Flush dirty data. We have to start with the oldest snap as that's the only + * one we're allowed to write back. + */ +static int ceph_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct ceph_writeback_ctl ceph_wbc; + struct ceph_snap_context *snapc; + struct ceph_inode_info *ci = ceph_inode(mapping->host); + loff_t actual_start = wbc->range_start, actual_end = wbc->range_end; + int ret; + + do { + snapc = ceph_get_oldest_context(mapping->host, &ceph_wbc, NULL); + if (snapc == ci->i_head_snapc) { + wbc->range_start = actual_start; + wbc->range_end = actual_end; + } else { + /* Do not respect wbc->range_{start,end}. Dirty pages + * in that range can be associated with newer snapc. + * They are not writeable until we write all dirty + * pages associated with an older snapc get written. + */ + wbc->range_start = 0; + wbc->range_end = LLONG_MAX; + } + + ret = netfs_writepages_group(mapping, wbc, &snapc->group, &ceph_wbc); + ceph_put_snap_context(snapc); + if (snapc == ci->i_head_snapc) + break; + } while (ret == 0 && wbc->nr_to_write > 0); + + return ret; +} + +const struct address_space_operations ceph_aops = { + .read_folio = netfs_read_folio, + .readahead = netfs_readahead, + .writepages = ceph_writepages, + .dirty_folio = ceph_dirty_folio, + .invalidate_folio = netfs_invalidate_folio, + .release_folio = netfs_release_folio, + .direct_IO = noop_direct_IO, + .migrate_folio = filemap_migrate_folio, +}; + +/* + * Wrap generic_file_aio_read with checks for cap bits on the inode. + * Atomically grab references, so that those bits are not released + * back to the MDS mid-read. + * + * Hmm, the sync read case isn't actually async... should it be? 
+ */ +ssize_t ceph_netfs_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + struct file *filp = iocb->ki_filp; + struct inode *inode = file_inode(filp); + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_file_info *fi = filp->private_data; + struct ceph_client *cl = ceph_inode_to_client(inode); + ssize_t ret; + size_t len = iov_iter_count(to); + bool dio = iocb->ki_flags & IOCB_DIRECT; + int want = 0, got = 0; + + doutc(cl, "%llu~%zu trying to get caps on %p %llx.%llx\n", + iocb->ki_pos, len, inode, ceph_vinop(inode)); + + if (ceph_inode_is_shutdown(inode)) + return -ESTALE; + + if (dio) + ret = netfs_start_io_direct(inode); + else + ret = netfs_start_io_read(inode); + if (ret < 0) + return ret; + + if (!(fi->flags & CEPH_F_SYNC) && !dio) + want |= CEPH_CAP_FILE_CACHE; + if (fi->fmode & CEPH_FILE_MODE_LAZY) + want |= CEPH_CAP_FILE_LAZYIO; + + ret = ceph_get_caps(filp, CEPH_CAP_FILE_RD, want, -1, &got); + if (ret < 0) + goto out; + + if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 || + dio || + (fi->flags & CEPH_F_SYNC)) { + doutc(cl, "sync %p %llx.%llx %llu~%zu got cap refs on %s\n", + inode, ceph_vinop(inode), iocb->ki_pos, len, + ceph_cap_string(got)); + + ret = netfs_unbuffered_read_iter(iocb, to); + } else { + doutc(cl, "async %p %llx.%llx %llu~%zu got cap refs on %s\n", + inode, ceph_vinop(inode), iocb->ki_pos, len, + ceph_cap_string(got)); + ret = filemap_read(iocb, to, 0); + } + + doutc(cl, "%p %llx.%llx dropping cap refs on %s = %zd\n", + inode, ceph_vinop(inode), ceph_cap_string(got), ret); + ceph_put_cap_refs(ci, got); + +out: + if (dio) + netfs_end_io_direct(inode); + else + netfs_end_io_read(inode); + return ret; +} + +/* + * Get the most recent snap context in the list to which the inode subscribes. + * This is the only one we are allowed to modify. If a folio points to an + * earlier snapshot, it must be flushed first. + */ +static struct ceph_snap_context *ceph_get_most_recent_snapc(struct inode *inode) +{ + struct ceph_snap_context *snapc; + struct ceph_inode_info *ci = ceph_inode(inode); + + /* Get the snap this write is going to belong to. */ + spin_lock(&ci->i_ceph_lock); + if (__ceph_have_pending_cap_snap(ci)) { + struct ceph_cap_snap *capsnap = + list_last_entry(&ci->i_cap_snaps, + struct ceph_cap_snap, ci_item); + + snapc = ceph_get_snap_context(capsnap->context); + } else { + BUG_ON(!ci->i_head_snapc); + snapc = ceph_get_snap_context(ci->i_head_snapc); + } + spin_unlock(&ci->i_ceph_lock); + + return snapc; +} + +/* + * Take cap references to avoid releasing caps to MDS mid-write. + * + * If we are synchronous, and write with an old snap context, the OSD + * may return EOLDSNAPC. In that case, retry the write.. _after_ + * dropping our cap refs and allowing the pending snap to logically + * complete _before_ this write occurs. + * + * If we are near ENOSPC, write synchronously. 
+ */ +ssize_t ceph_netfs_write_iter(struct kiocb *iocb, struct iov_iter *from) +{ + struct file *file = iocb->ki_filp; + struct inode *inode = file_inode(file); + struct ceph_snap_context *snapc; + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); + struct ceph_file_info *fi = file->private_data; + struct ceph_osd_client *osdc = &fsc->client->osdc; + struct ceph_cap_flush *prealloc_cf; + struct ceph_client *cl = fsc->client; + ssize_t count, written = 0; + loff_t limit = max(i_size_read(inode), fsc->max_file_size); + loff_t pos; + bool direct_lock = false; + u64 pool_flags; + u32 map_flags; + int err, want = 0, got; + + if (ceph_inode_is_shutdown(inode)) + return -ESTALE; + + if (ceph_snap(inode) != CEPH_NOSNAP) + return -EROFS; + + prealloc_cf = ceph_alloc_cap_flush(); + if (!prealloc_cf) + return -ENOMEM; + + if ((iocb->ki_flags & (IOCB_DIRECT | IOCB_APPEND)) == IOCB_DIRECT) + direct_lock = true; + +retry_snap: + if (direct_lock) + netfs_start_io_direct(inode); + else + netfs_start_io_write(inode); + + if (iocb->ki_flags & IOCB_APPEND) { + err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false); + if (err < 0) + goto out; + } + + err = generic_write_checks(iocb, from); + if (err <= 0) + goto out; + + pos = iocb->ki_pos; + if (unlikely(pos >= limit)) { + err = -EFBIG; + goto out; + } else { + iov_iter_truncate(from, limit - pos); + } + + count = iov_iter_count(from); + if (ceph_quota_is_max_bytes_exceeded(inode, pos + count)) { + err = -EDQUOT; + goto out; + } + + down_read(&osdc->lock); + map_flags = osdc->osdmap->flags; + pool_flags = ceph_pg_pool_flags(osdc->osdmap, ci->i_layout.pool_id); + up_read(&osdc->lock); + if ((map_flags & CEPH_OSDMAP_FULL) || + (pool_flags & CEPH_POOL_FLAG_FULL)) { + err = -ENOSPC; + goto out; + } + + err = file_remove_privs(file); + if (err) + goto out; + + doutc(cl, "%p %llx.%llx %llu~%zd getting caps. i_size %llu\n", + inode, ceph_vinop(inode), pos, count, + i_size_read(inode)); + if (!(fi->flags & CEPH_F_SYNC) && !direct_lock) + want |= CEPH_CAP_FILE_BUFFER; + if (fi->fmode & CEPH_FILE_MODE_LAZY) + want |= CEPH_CAP_FILE_LAZYIO; + got = 0; + err = ceph_get_caps(file, CEPH_CAP_FILE_WR, want, pos + count, &got); + if (err < 0) + goto out; + + err = file_update_time(file); + if (err) + goto out_caps; + + inode_inc_iversion_raw(inode); + + doutc(cl, "%p %llx.%llx %llu~%zd got cap refs on %s\n", + inode, ceph_vinop(inode), pos, count, ceph_cap_string(got)); + + /* Get the snap this write is going to belong to. */ + snapc = ceph_get_most_recent_snapc(inode); + + if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 || + (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) || + (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) { + struct iov_iter data; + + /* we might need to revert back to that point */ + data = *from; + written = netfs_unbuffered_write_iter_locked(iocb, &data, &snapc->group); + if (direct_lock) + netfs_end_io_direct(inode); + else + netfs_end_io_write(inode); + if (written > 0) + iov_iter_advance(from, written); + ceph_put_snap_context(snapc); + } else { + /* + * No need to acquire the i_truncate_mutex. Because the MDS + * revokes Fwb caps before sending truncate message to us. We + * can't get Fwb cap while there are pending vmtruncate. 
So + * write and vmtruncate can not run at the same time + */ + written = netfs_perform_write(iocb, from, &snapc->group, &prealloc_cf); + netfs_end_io_write(inode); + } + + if (written >= 0) { + int dirty; + + spin_lock(&ci->i_ceph_lock); + dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, + &prealloc_cf); + spin_unlock(&ci->i_ceph_lock); + if (dirty) + __mark_inode_dirty(inode, dirty); + if (ceph_quota_is_max_bytes_approaching(inode, iocb->ki_pos)) + ceph_check_caps(ci, CHECK_CAPS_FLUSH); + } + + doutc(cl, "%p %llx.%llx %llu~%u dropping cap refs on %s\n", + inode, ceph_vinop(inode), pos, (unsigned)count, + ceph_cap_string(got)); + ceph_put_cap_refs(ci, got); + + if (written == -EOLDSNAPC) { + doutc(cl, "%p %llx.%llx %llu~%u" "got EOLDSNAPC, retrying\n", + inode, ceph_vinop(inode), pos, (unsigned)count); + goto retry_snap; + } + + if (written >= 0) { + if ((map_flags & CEPH_OSDMAP_NEARFULL) || + (pool_flags & CEPH_POOL_FLAG_NEARFULL)) + iocb->ki_flags |= IOCB_DSYNC; + written = generic_write_sync(iocb, written); + } + + goto out_unlocked; +out_caps: + ceph_put_cap_refs(ci, got); +out: + if (direct_lock) + netfs_end_io_direct(inode); + else + netfs_end_io_write(inode); +out_unlocked: + ceph_free_cap_flush(prealloc_cf); + return written ? written : err; +} + +vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf) +{ + struct ceph_snap_context *snapc; + struct vm_area_struct *vma = vmf->vma; + struct inode *inode = file_inode(vma->vm_file); + struct ceph_client *cl = ceph_inode_to_client(inode); + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_file_info *fi = vma->vm_file->private_data; + struct ceph_cap_flush *prealloc_cf; + struct folio *folio = page_folio(vmf->page); + loff_t size = i_size_read(inode); + loff_t off = folio_pos(folio); + size_t len = folio_size(folio); + int want, got, err; + vm_fault_t ret = VM_FAULT_SIGBUS; + + if (ceph_inode_is_shutdown(inode)) + return ret; + + prealloc_cf = ceph_alloc_cap_flush(); + if (!prealloc_cf) + return -ENOMEM; + + doutc(cl, "%llx.%llx %llu~%zd getting caps i_size %llu\n", + ceph_vinop(inode), off, len, size); + if (fi->fmode & CEPH_FILE_MODE_LAZY) + want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO; + else + want = CEPH_CAP_FILE_BUFFER; + + got = 0; + err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got); + if (err < 0) + goto out_free; + + doutc(cl, "%llx.%llx %llu~%zd got cap refs on %s\n", ceph_vinop(inode), + off, len, ceph_cap_string(got)); + + /* Get the snap this write is going to belong to. */ + snapc = ceph_get_most_recent_snapc(inode); + + ret = netfs_page_mkwrite(vmf, &snapc->group, &prealloc_cf); + + doutc(cl, "%llx.%llx %llu~%zd dropping cap refs on %s ret %x\n", + ceph_vinop(inode), off, len, ceph_cap_string(got), ret); + ceph_put_cap_refs_async(ci, got); +out_free: + ceph_free_cap_flush(prealloc_cf); + if (err < 0) + ret = vmf_error(err); + return ret; +} diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 14784ad86670..acd5c4821ded 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -470,7 +470,7 @@ struct ceph_inode_info { #endif }; -struct ceph_netfs_request_data { +struct ceph_netfs_request_data { // TODO: Remove int caps; /* @@ -483,6 +483,29 @@ struct ceph_netfs_request_data { bool file_ra_disabled; }; +struct ceph_io_request { + struct netfs_io_request rreq; + u64 rmw_assert_version; + int caps; + + /* + * Maximum size of a file readahead request. + * The fadvise could update the bdi's default ra_pages. 
+ */ + unsigned int file_ra_pages; + + /* Set it if fadvise disables file readahead entirely */ + bool file_ra_disabled; +}; + +struct ceph_io_subrequest { + union { + struct netfs_io_subrequest sreq; + struct ceph_io_request *creq; + }; + struct ceph_osd_request *req; +}; + static inline struct ceph_inode_info * ceph_inode(const struct inode *inode) { @@ -1237,8 +1260,10 @@ extern void __ceph_touch_fmode(struct ceph_inode_info *ci, struct ceph_mds_client *mdsc, int fmode); /* addr.c */ -extern const struct address_space_operations ceph_aops; +#if 0 // TODO: Remove after netfs conversion extern const struct netfs_request_ops ceph_netfs_ops; +#endif // TODO: Remove after netfs conversion +bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio); extern int ceph_mmap(struct file *file, struct vm_area_struct *vma); extern int ceph_uninline_data(struct file *file); extern int ceph_pool_perm_check(struct inode *inode, int need); @@ -1253,6 +1278,14 @@ static inline bool ceph_has_inline_data(struct ceph_inode_info *ci) return true; } +/* rdwr.c */ +extern const struct netfs_request_ops ceph_netfs_ops; +extern const struct address_space_operations ceph_aops; + +ssize_t ceph_netfs_read_iter(struct kiocb *iocb, struct iov_iter *to); +ssize_t ceph_netfs_write_iter(struct kiocb *iocb, struct iov_iter *from); +vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf); + /* file.c */ extern const struct file_operations ceph_file_fops; @@ -1260,9 +1293,11 @@ extern int ceph_renew_caps(struct inode *inode, int fmode); extern int ceph_open(struct inode *inode, struct file *file); extern int ceph_atomic_open(struct inode *dir, struct dentry *dentry, struct file *file, unsigned flags, umode_t mode); +#if 0 // TODO: Remove after netfs conversion extern ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos, struct iov_iter *to, int *retry_op, u64 *last_objver); +#endif extern int ceph_release(struct inode *inode, struct file *filp); extern void ceph_fill_inline_data(struct inode *inode, struct page *locked_page, char *data, size_t len); diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h index 9724d5a1ddc7..a82eb3be9737 100644 --- a/fs/netfs/internal.h +++ b/fs/netfs/internal.h @@ -264,9 +264,9 @@ static inline bool netfs_is_cache_enabled(struct netfs_inode *ctx) } /* - * Check to see if a buffer aligns with the crypto block size. If it doesn't - * the crypto layer is going to copy all the data - in which case relying on - * the crypto op for a free copy is pointless. + * Check to see if a buffer aligns with the crypto unit block size. If it + * doesn't the crypto layer is going to copy all the data - in which case + * relying on the crypto op for a free copy is pointless. 
*/ static inline bool netfs_is_crypto_aligned(struct netfs_io_request *rreq, struct iov_iter *iter) diff --git a/fs/netfs/main.c b/fs/netfs/main.c index 0900dea53e4a..d431ba261920 100644 --- a/fs/netfs/main.c +++ b/fs/netfs/main.c @@ -139,7 +139,7 @@ static int __init netfs_init(void) goto error_folio_pool; netfs_request_slab = kmem_cache_create("netfs_request", - sizeof(struct netfs_io_request), 0, + NETFS_DEF_IO_REQUEST_SIZE, 0, SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL); if (!netfs_request_slab) @@ -149,7 +149,7 @@ static int __init netfs_init(void) goto error_reqpool; netfs_subrequest_slab = kmem_cache_create("netfs_subrequest", - sizeof(struct netfs_io_subrequest) + 16, 0, + NETFS_DEF_IO_SUBREQUEST_SIZE, 0, SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL); if (!netfs_subrequest_slab) diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c index 9b8d99477405..091328596533 100644 --- a/fs/netfs/write_issue.c +++ b/fs/netfs/write_issue.c @@ -652,7 +652,8 @@ int netfs_writepages_group(struct address_space *mapping, if (netfs_folio_group(folio) != NETFS_FOLIO_COPY_TO_CACHE && unlikely(!test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))) { set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags); - wreq->netfs_ops->begin_writeback(wreq); + if (wreq->netfs_ops->begin_writeback) + wreq->netfs_ops->begin_writeback(wreq); } error = netfs_write_folio(wreq, wbc, folio); @@ -967,7 +968,8 @@ int netfs_writeback_single(struct address_space *mapping, trace_netfs_write(wreq, netfs_write_trace_writeback); netfs_stat(&netfs_n_wh_writepages); - if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags)) + if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags) && + wreq->netfs_ops->begin_writeback) wreq->netfs_ops->begin_writeback(wreq); for (fq = (struct folio_queue *)iter->folioq; fq; fq = fq->next) { diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h index 733e7f93db66..0c626a7d32f4 100644 --- a/include/linux/ceph/libceph.h +++ b/include/linux/ceph/libceph.h @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -161,7 +162,7 @@ static inline bool ceph_msgr2(struct ceph_client *client) * dirtied. */ struct ceph_snap_context { - refcount_t nref; + struct netfs_group group; u64 seq; u32 num_snaps; u64 snaps[]; diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 7eff589711cc..7f8d28b2c41b 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -246,6 +246,7 @@ struct ceph_osd_request { struct completion r_completion; /* private to osd_client.c */ ceph_osdc_callback_t r_callback; + struct netfs_io_subrequest *r_subreq; struct inode *r_inode; /* for use by callbacks */ struct list_head r_private_item; /* ditto */ void *r_priv; /* ditto */ diff --git a/include/linux/netfs.h b/include/linux/netfs.h index 4049c985b9b4..3253352fcbfa 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -26,6 +26,14 @@ enum netfs_sreq_ref_trace; typedef struct mempool_s mempool_t; struct folio_queue; +/* + * Size of allocations for default netfs_io_(sub)request object slabs and + * mempools. If a filesystem's request and subrequest objects fit within this + * size, they can use these otherwise they must provide their own. + */ +#define NETFS_DEF_IO_REQUEST_SIZE (sizeof(struct netfs_io_request) + 24) +#define NETFS_DEF_IO_SUBREQUEST_SIZE (sizeof(struct netfs_io_subrequest) + 16) + /** * folio_start_private_2 - Start an fscache write on a folio. [DEPRECATED] * @folio: The folio. 
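[Editor's note: the 24- and 16-byte slack in NETFS_DEF_IO_REQUEST_SIZE and NETFS_DEF_IO_SUBREQUEST_SIZE above is what allows a filesystem to wrap the netfs request types and still use the shared slabs sized in netfs_init() above, as struct ceph_io_request and struct ceph_io_subrequest do earlier in this series. A minimal sketch of the embedding pattern, assuming the wrapper places the netfs struct as its first member; ceph_rreq_from() is a hypothetical helper named only for illustration:

	/* Recover the fs-private wrapper from an embedded netfs request. */
	static inline struct ceph_io_request *
	ceph_rreq_from(struct netfs_io_request *rreq)
	{
		return container_of(rreq, struct ceph_io_request, rreq);
	}

	/* Only use the default netfs slab if the wrapper actually fits. */
	static_assert(sizeof(struct ceph_io_request) <= NETFS_DEF_IO_REQUEST_SIZE);

A wrapper larger than the default size would have to provide its own slab/mempool rather than the shared netfs_request slab.]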
@@ -184,7 +192,10 @@ struct netfs_io_subrequest { struct list_head rreq_link; /* Link in req/stream::subrequests */ struct list_head ioq_link; /* Link in io_stream::io_queue */ union { - struct iov_iter io_iter; /* Iterator for this subrequest */ + struct { + struct iov_iter io_iter; /* Iterator for this subrequest */ + void *fs_private; /* Filesystem specific */ + }; struct { struct scatterlist src_sg; /* Source for crypto subreq */ struct scatterlist dst_sg; /* Dest for crypto subreq */ diff --git a/net/ceph/snapshot.c b/net/ceph/snapshot.c index e24315937c45..92f63cbca183 100644 --- a/net/ceph/snapshot.c +++ b/net/ceph/snapshot.c @@ -17,6 +17,11 @@ * the entire structure is freed. */ +static void ceph_snap_context_kfree(struct netfs_group *group) +{ + kfree(group); +} + /* * Create a new ceph snapshot context large enough to hold the * indicated number of snapshot ids (which can be 0). Caller has @@ -36,8 +41,9 @@ struct ceph_snap_context *ceph_create_snap_context(u32 snap_count, if (!snapc) return NULL; - refcount_set(&snapc->nref, 1); - snapc->num_snaps = snap_count; + refcount_set(&snapc->group.ref, 1); + snapc->group.free = ceph_snap_context_kfree; + snapc->num_snaps = snap_count; return snapc; } @@ -46,18 +52,14 @@ EXPORT_SYMBOL(ceph_create_snap_context); struct ceph_snap_context *ceph_get_snap_context(struct ceph_snap_context *sc) { if (sc) - refcount_inc(&sc->nref); + netfs_get_group(&sc->group); return sc; } EXPORT_SYMBOL(ceph_get_snap_context); void ceph_put_snap_context(struct ceph_snap_context *sc) { - if (!sc) - return; - if (refcount_dec_and_test(&sc->nref)) { - /*printk(" deleting snap_context %p\n", sc);*/ - kfree(sc); - } + if (sc) + netfs_put_group(&sc->group); } EXPORT_SYMBOL(ceph_put_snap_context); From patchwork Thu Mar 13 23:33:26 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016081 From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang , ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 34/35] ceph: Enable multipage folios for ceph files Date: Thu, 13 Mar 2025 23:33:26 +0000 Message-ID: <20250313233341.1675324-35-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Enable multipage folio support for ceph regular files.
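[Editor's note: mapping_set_large_folios() only marks the address_space as capable of holding folios bigger than a single page; the folio sizes actually used are then chosen by the page cache and netfslib at run time, so per-folio code must query folio_size()/folio_pos() rather than assume PAGE_SIZE, as the new ceph_page_mkwrite() earlier in the series already does. A minimal sketch of the opt-in this one-line patch performs; ceph_init_file_mapping() is a hypothetical wrapper named only for illustration:

	#include <linux/pagemap.h>

	/* Opt a regular file's mapping in to multi-page (large) folios. */
	static inline void ceph_init_file_mapping(struct inode *inode)
	{
		mapping_set_large_folios(inode->i_mapping);
	}
]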
Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- fs/ceph/inode.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 8f73f3a55a3e..d9215423e011 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -1184,6 +1184,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page, case S_IFREG: inode->i_op = &ceph_file_iops; inode->i_fop = &ceph_file_fops; + mapping_set_large_folios(inode->i_mapping); break; case S_IFLNK: if (!ci->i_symlink) { From patchwork Thu Mar 13 23:33:27 2025 X-Patchwork-Submitter: David Howells X-Patchwork-Id: 14016082 From: David Howells To: Viacheslav Dubeyko , Alex Markuze Cc: David Howells , Ilya Dryomov , Jeff Layton , Dongsheng Yang , ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH 35/35] ceph: Remove old I/O API bits Date: Thu, 13 Mar 2025 23:33:27 +0000 Message-ID: <20250313233341.1675324-36-dhowells@redhat.com> In-Reply-To: <20250313233341.1675324-1-dhowells@redhat.com> References: <20250313233341.1675324-1-dhowells@redhat.com> Remove the #if'd-out bits of the old I/O API. This is kept separate from the implementation patches to reduce the size of the patch to be reviewed. Signed-off-by: David Howells cc: Viacheslav Dubeyko cc: Alex Markuze cc: Ilya Dryomov cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org --- fs/ceph/addr.c | 2018 ++--------------------------------------------- fs/ceph/file.c | 1504 ----------------------------------- fs/ceph/super.h | 21 - 3 files changed, 46 insertions(+), 3497 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 325fbbce1eaa..b3ba102af60b 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -59,1890 +59,70 @@ * accounting is preserved. */ -#define CONGESTION_ON_THRESH(congestion_kb) (congestion_kb >> (PAGE_SHIFT-10)) -#define CONGESTION_OFF_THRESH(congestion_kb) \ - (CONGESTION_ON_THRESH(congestion_kb) - \ - (CONGESTION_ON_THRESH(congestion_kb) >> 2)) - -#if 0 // TODO: Remove after netfs conversion -static int ceph_netfs_check_write_begin(struct file *file, loff_t pos, unsigned int len, - struct folio **foliop, void **_fsdata); - -static struct ceph_snap_context *page_snap_context(struct page *page) -{ - if (PagePrivate(page)) - return (void *)page->private; - return NULL; -} -#endif // TODO: Remove after netfs conversion - -/* - * Dirty a page. Optimistically adjust accounting, on the assumption - * that we won't race with invalidate. If we do, readjust.
- */ -bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio) -{ - struct inode *inode = mapping->host; - struct ceph_client *cl = ceph_inode_to_client(inode); - struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb); - struct ceph_inode_info *ci; - struct ceph_snap_context *snapc; - struct netfs_group *group; - - if (folio_test_dirty(folio)) { - doutc(cl, "%llx.%llx %p idx %lu -- already dirty\n", - ceph_vinop(inode), folio, folio->index); - VM_BUG_ON_FOLIO(!folio_test_private(folio), folio); - return false; - } - - atomic64_inc(&mdsc->dirty_folios); - - ci = ceph_inode(inode); - - /* dirty the head */ - spin_lock(&ci->i_ceph_lock); - if (__ceph_have_pending_cap_snap(ci)) { - struct ceph_cap_snap *capsnap = - list_last_entry(&ci->i_cap_snaps, - struct ceph_cap_snap, - ci_item); - snapc = capsnap->context; - capsnap->dirty_pages++; - } else { - snapc = ci->i_head_snapc; - BUG_ON(!snapc); - ++ci->i_wrbuffer_ref_head; - } - - /* Attach a reference to the snap/group to the folio. */ - group = netfs_folio_group(folio); - if (group != &snapc->group) { - netfs_set_group(folio, &snapc->group); - if (group) { - doutc(cl, "Different group %px != %px\n", - group, &snapc->group); - netfs_put_group(group); - } - } - - if (ci->i_wrbuffer_ref == 0) - ihold(inode); - ++ci->i_wrbuffer_ref; - doutc(cl, "%llx.%llx %p idx %lu head %d/%d -> %d/%d " - "snapc %p seq %lld (%d snaps)\n", - ceph_vinop(inode), folio, folio->index, - ci->i_wrbuffer_ref-1, ci->i_wrbuffer_ref_head-1, - ci->i_wrbuffer_ref, ci->i_wrbuffer_ref_head, - snapc, snapc->seq, snapc->num_snaps); - spin_unlock(&ci->i_ceph_lock); - - return netfs_dirty_folio(mapping, folio); -} - -#if 0 // TODO: Remove after netfs conversion -/* - * If we are truncating the full folio (i.e. offset == 0), adjust the - * dirty folio counters appropriately. Only called if there is private - * data on the folio. 
- */ -static void ceph_invalidate_folio(struct folio *folio, size_t offset, - size_t length) -{ - struct inode *inode = folio->mapping->host; - struct ceph_client *cl = ceph_inode_to_client(inode); - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_snap_context *snapc; - - - if (offset != 0 || length != folio_size(folio)) { - doutc(cl, "%llx.%llx idx %lu partial dirty page %zu~%zu\n", - ceph_vinop(inode), folio->index, offset, length); - return; - } - - WARN_ON(!folio_test_locked(folio)); - if (folio_test_private(folio)) { - doutc(cl, "%llx.%llx idx %lu full dirty page\n", - ceph_vinop(inode), folio->index); - - snapc = folio_detach_private(folio); - ceph_put_wrbuffer_cap_refs(ci, 1, snapc); - ceph_put_snap_context(snapc); - } - - netfs_invalidate_folio(folio, offset, length); -} - -static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq) -{ - struct inode *inode = rreq->inode; - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_file_layout *lo = &ci->i_layout; - unsigned long max_pages = inode->i_sb->s_bdi->ra_pages; - loff_t end = rreq->start + rreq->len, new_end; - struct ceph_netfs_request_data *priv = rreq->netfs_priv; - unsigned long max_len; - u32 blockoff; - - if (priv) { - /* Readahead is disabled by posix_fadvise POSIX_FADV_RANDOM */ - if (priv->file_ra_disabled) - max_pages = 0; - else - max_pages = priv->file_ra_pages; - - } - - /* Readahead is disabled */ - if (!max_pages) - return; - - max_len = max_pages << PAGE_SHIFT; - - /* - * Try to expand the length forward by rounding up it to the next - * block, but do not exceed the file size, unless the original - * request already exceeds it. - */ - new_end = umin(round_up(end, lo->stripe_unit), rreq->i_size); - if (new_end > end && new_end <= rreq->start + max_len) - rreq->len = new_end - rreq->start; - - /* Try to expand the start downward */ - div_u64_rem(rreq->start, lo->stripe_unit, &blockoff); - if (rreq->len + blockoff <= max_len) { - rreq->start -= blockoff; - rreq->len += blockoff; - } -} - -static void finish_netfs_read(struct ceph_osd_request *req) -{ - struct inode *inode = req->r_inode; - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0); - struct netfs_io_subrequest *subreq = req->r_priv; - struct ceph_osd_req_op *op = &req->r_ops[0]; - int err = req->r_result; - bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ); - - ceph_update_read_metrics(&fsc->mdsc->metric, req->r_start_latency, - req->r_end_latency, osd_data->length, err); - - doutc(cl, "result %d subreq->len=%zu i_size=%lld\n", req->r_result, - subreq->len, i_size_read(req->r_inode)); - - /* no object means success but no data */ - if (err == -ENOENT) { - __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); - __set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags); - err = 0; - } else if (err == -EBLOCKLISTED) { - fsc->blocklisted = true; - } - - if (err >= 0) { - if (sparse && err > 0) - err = ceph_sparse_ext_map_end(op); - if (err < subreq->len && - subreq->rreq->origin != NETFS_DIO_READ) - __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); - if (IS_ENCRYPTED(inode) && err > 0) { - err = ceph_fscrypt_decrypt_extents(inode, - osd_data->pages, subreq->start, - op->extent.sparse_ext, - op->extent.sparse_ext_cnt); - if (err > subreq->len) - err = subreq->len; - } - if (err > 0) - __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); - } - - if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) { - 
ceph_put_page_vector(osd_data->pages, - calc_pages_for(osd_data->offset, - osd_data->length), false); - } - if (err > 0) { - subreq->transferred = err; - err = 0; - } - subreq->error = err; - trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress); - netfs_read_subreq_terminated(subreq); - iput(req->r_inode); - ceph_dec_osd_stopping_blocker(fsc->mdsc); -} - -static bool ceph_netfs_issue_op_inline(struct netfs_io_subrequest *subreq) -{ - struct netfs_io_request *rreq = subreq->rreq; - struct inode *inode = rreq->inode; - struct ceph_mds_reply_info_parsed *rinfo; - struct ceph_mds_reply_info_in *iinfo; - struct ceph_mds_request *req; - struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb); - struct ceph_inode_info *ci = ceph_inode(inode); - ssize_t err = 0; - size_t len; - int mode; - - if (rreq->origin != NETFS_DIO_READ) - __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); - __clear_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags); - - if (subreq->start >= inode->i_size) - goto out; - - /* We need to fetch the inline data. */ - mode = ceph_try_to_choose_auth_mds(inode, CEPH_STAT_CAP_INLINE_DATA); - req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode); - if (IS_ERR(req)) { - err = PTR_ERR(req); - goto out; - } - req->r_ino1 = ci->i_vino; - req->r_args.getattr.mask = cpu_to_le32(CEPH_STAT_CAP_INLINE_DATA); - req->r_num_caps = 2; - - trace_netfs_sreq(subreq, netfs_sreq_trace_submit); - err = ceph_mdsc_do_request(mdsc, NULL, req); - if (err < 0) - goto out; - - rinfo = &req->r_reply_info; - iinfo = &rinfo->targeti; - if (iinfo->inline_version == CEPH_INLINE_NONE) { - /* The data got uninlined */ - ceph_mdsc_put_request(req); - return false; - } - - len = min_t(size_t, iinfo->inline_len - subreq->start, subreq->len); - err = copy_to_iter(iinfo->inline_data + subreq->start, len, &subreq->io_iter); - if (err == 0) { - err = -EFAULT; - } else { - subreq->transferred += err; - err = 0; - } - - ceph_mdsc_put_request(req); -out: - subreq->error = err; - trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress); - netfs_read_subreq_terminated(subreq); - return true; -} - -static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq) -{ - struct netfs_io_request *rreq = subreq->rreq; - struct inode *inode = rreq->inode; - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - size_t xlen; - u64 objno, objoff; - - /* Truncate the extent at the end of the current block */ - ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len, - &objno, &objoff, &xlen); - rreq->io_streams[0].sreq_max_len = umin(xlen, fsc->mount_options->rsize); - return 0; -} - -static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq) -{ - struct netfs_io_request *rreq = subreq->rreq; - struct inode *inode = rreq->inode; - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_osd_request *req = NULL; - struct ceph_vino vino = ceph_vino(inode); - int err; - u64 len; - bool sparse = IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREAD); - u64 off = subreq->start; - int extent_cnt; - - if (ceph_inode_is_shutdown(inode)) { - err = -EIO; - goto out; - } - - if (ceph_has_inline_data(ci) && ceph_netfs_issue_op_inline(subreq)) - return; - - // TODO: This rounding here is slightly dodgy. 
It *should* work, for - // now, as the cache only deals in blocks that are a multiple of - // PAGE_SIZE and fscrypt blocks are at most PAGE_SIZE. What needs to - // happen is for the fscrypt driving to be moved into netfslib and the - // data in the cache also to be stored encrypted. - len = subreq->len; - ceph_fscrypt_adjust_off_and_len(inode, &off, &len); - - req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino, - off, &len, 0, 1, sparse ? CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ, - CEPH_OSD_FLAG_READ, NULL, ci->i_truncate_seq, - ci->i_truncate_size, false); - if (IS_ERR(req)) { - err = PTR_ERR(req); - req = NULL; - goto out; - } - - if (sparse) { - extent_cnt = __ceph_sparse_read_ext_count(inode, len); - err = ceph_alloc_sparse_ext_map(&req->r_ops[0], extent_cnt); - if (err) - goto out; - } - - doutc(cl, "%llx.%llx pos=%llu orig_len=%zu len=%llu\n", - ceph_vinop(inode), subreq->start, subreq->len, len); - - /* - * FIXME: For now, use CEPH_OSD_DATA_TYPE_PAGES instead of _ITER for - * encrypted inodes. We'd need infrastructure that handles an iov_iter - * instead of page arrays, and we don't have that as of yet. Once the - * dust settles on the write helpers and encrypt/decrypt routines for - * netfs, we should be able to rework this. - */ - if (IS_ENCRYPTED(inode)) { - struct page **pages; - size_t page_off; - - /* - * The io_iter.count needs to be corrected to aligned length. - * Otherwise, iov_iter_get_pages_alloc2() operates with - * the initial unaligned length value. As a result, - * ceph_msg_data_cursor_init() triggers BUG_ON() in the case - * if msg->sparse_read_total > msg->data_length. - */ - subreq->io_iter.count = len; - - err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_off); - if (err < 0) { - doutc(cl, "%llx.%llx failed to allocate pages, %d\n", - ceph_vinop(inode), err); - goto out; - } - - /* should always give us a page-aligned read */ - WARN_ON_ONCE(page_off); - - len = err; - err = 0; - - osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false, - false); - } else { - osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter); - } - if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) { - err = -EIO; - goto out; - } - req->r_callback = finish_netfs_read; - req->r_priv = subreq; - req->r_inode = inode; - ihold(inode); - - trace_netfs_sreq(subreq, netfs_sreq_trace_submit); - ceph_osdc_start_request(req->r_osdc, req); -out: - ceph_osdc_put_request(req); - if (err) { - subreq->error = err; - netfs_read_subreq_terminated(subreq); - } - doutc(cl, "%llx.%llx result %d\n", ceph_vinop(inode), err); -} - -static int ceph_init_request(struct netfs_io_request *rreq, struct file *file) -{ - struct inode *inode = rreq->inode; - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = ceph_inode_to_client(inode); - int got = 0, want = CEPH_CAP_FILE_CACHE; - struct ceph_netfs_request_data *priv; - int ret = 0; - - /* [DEPRECATED] Use PG_private_2 to mark folio being written to the cache. */ - __set_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags); - - if (rreq->origin != NETFS_READAHEAD) - return 0; - - priv = kzalloc(sizeof(*priv), GFP_NOFS); - if (!priv) - return -ENOMEM; - - /* - * If we are doing readahead triggered by a read, fault-in or - * MADV/FADV_WILLNEED, someone higher up the stack must be holding the - * FILE_CACHE and/or LAZYIO caps. 
- */ - if (file) { - priv->file_ra_pages = file->f_ra.ra_pages; - priv->file_ra_disabled = file->f_mode & FMODE_RANDOM; - rreq->netfs_priv = priv; - return 0; - } - - /* - * readahead callers do not necessarily hold Fcb caps - * (e.g. fadvise, madvise). - */ - ret = ceph_try_get_caps(inode, CEPH_CAP_FILE_RD, want, true, &got); - if (ret < 0) { - doutc(cl, "%llx.%llx, error getting cap\n", ceph_vinop(inode)); - goto out; - } - - if (!(got & want)) { - doutc(cl, "%llx.%llx, no cache cap\n", ceph_vinop(inode)); - ret = -EACCES; - goto out; - } - if (ret == 0) { - ret = -EACCES; - goto out; - } - - priv->caps = got; - rreq->netfs_priv = priv; - rreq->io_streams[0].sreq_max_len = fsc->mount_options->rsize; - -out: - if (ret < 0) { - if (got) - ceph_put_cap_refs(ceph_inode(inode), got); - kfree(priv); - } - - return ret; -} - -static void ceph_netfs_free_request(struct netfs_io_request *rreq) -{ - struct ceph_netfs_request_data *priv = rreq->netfs_priv; - - if (!priv) - return; - - if (priv->caps) - ceph_put_cap_refs(ceph_inode(rreq->inode), priv->caps); - kfree(priv); - rreq->netfs_priv = NULL; -} - -const struct netfs_request_ops ceph_netfs_ops = { - .init_request = ceph_init_request, - .free_request = ceph_netfs_free_request, - .prepare_read = ceph_netfs_prepare_read, - .issue_read = ceph_netfs_issue_read, - .expand_readahead = ceph_netfs_expand_readahead, - .check_write_begin = ceph_netfs_check_write_begin, -}; - -#ifdef CONFIG_CEPH_FSCACHE -static void ceph_set_page_fscache(struct page *page) -{ - folio_start_private_2(page_folio(page)); /* [DEPRECATED] */ -} - -static void ceph_fscache_write_terminated(void *priv, ssize_t error, bool was_async) -{ - struct inode *inode = priv; - - if (IS_ERR_VALUE(error) && error != -ENOBUFS) - ceph_fscache_invalidate(inode, false); -} - -static void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64 len, bool caching) -{ - struct ceph_inode_info *ci = ceph_inode(inode); - struct fscache_cookie *cookie = ceph_fscache_cookie(ci); - - fscache_write_to_cache(cookie, inode->i_mapping, off, len, i_size_read(inode), - ceph_fscache_write_terminated, inode, true, caching); -} -#else -static inline void ceph_set_page_fscache(struct page *page) -{ -} - -static inline void ceph_fscache_write_to_cache(struct inode *inode, u64 off, u64 len, bool caching) -{ -} -#endif /* CONFIG_CEPH_FSCACHE */ - -struct ceph_writeback_ctl -{ - loff_t i_size; - u64 truncate_size; - u32 truncate_seq; - bool size_stable; - - bool head_snapc; - struct ceph_snap_context *snapc; - struct ceph_snap_context *last_snapc; - - bool done; - bool should_loop; - bool range_whole; - pgoff_t start_index; - pgoff_t index; - pgoff_t end; - xa_mark_t tag; - - pgoff_t strip_unit_end; - unsigned int wsize; - unsigned int nr_folios; - unsigned int max_pages; - unsigned int locked_pages; - - int op_idx; - int num_ops; - u64 offset; - u64 len; - - struct folio_batch fbatch; - unsigned int processed_in_fbatch; - - bool from_pool; - struct page **pages; - struct page **data_pages; -}; - -/* - * Get ref for the oldest snapc for an inode with dirty data... that is, the - * only snap context we are allowed to write back. 
- */ -static struct ceph_snap_context * -get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl, - struct ceph_snap_context *page_snapc) -{ - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_client *cl = ceph_inode_to_client(inode); - struct ceph_snap_context *snapc = NULL; - struct ceph_cap_snap *capsnap = NULL; - - spin_lock(&ci->i_ceph_lock); - list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) { - doutc(cl, " capsnap %p snapc %p has %d dirty pages\n", - capsnap, capsnap->context, capsnap->dirty_pages); - if (!capsnap->dirty_pages) - continue; - - /* get i_size, truncate_{seq,size} for page_snapc? */ - if (snapc && capsnap->context != page_snapc) - continue; - - if (ctl) { - if (capsnap->writing) { - ctl->i_size = i_size_read(inode); - ctl->size_stable = false; - } else { - ctl->i_size = capsnap->size; - ctl->size_stable = true; - } - ctl->truncate_size = capsnap->truncate_size; - ctl->truncate_seq = capsnap->truncate_seq; - ctl->head_snapc = false; - } - - if (snapc) - break; - - snapc = ceph_get_snap_context(capsnap->context); - if (!page_snapc || - page_snapc == snapc || - page_snapc->seq > snapc->seq) - break; - } - if (!snapc && ci->i_wrbuffer_ref_head) { - snapc = ceph_get_snap_context(ci->i_head_snapc); - doutc(cl, " head snapc %p has %d dirty pages\n", snapc, - ci->i_wrbuffer_ref_head); - if (ctl) { - ctl->i_size = i_size_read(inode); - ctl->truncate_size = ci->i_truncate_size; - ctl->truncate_seq = ci->i_truncate_seq; - ctl->size_stable = false; - ctl->head_snapc = true; - } - } - spin_unlock(&ci->i_ceph_lock); - return snapc; -} - -static u64 get_writepages_data_length(struct inode *inode, - struct page *page, u64 start) -{ - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_snap_context *snapc; - struct ceph_cap_snap *capsnap = NULL; - u64 end = i_size_read(inode); - u64 ret; - - snapc = page_snap_context(ceph_fscrypt_pagecache_page(page)); - if (snapc != ci->i_head_snapc) { - bool found = false; - spin_lock(&ci->i_ceph_lock); - list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) { - if (capsnap->context == snapc) { - if (!capsnap->writing) - end = capsnap->size; - found = true; - break; - } - } - spin_unlock(&ci->i_ceph_lock); - WARN_ON(!found); - } - if (end > ceph_fscrypt_page_offset(page) + thp_size(page)) - end = ceph_fscrypt_page_offset(page) + thp_size(page); - ret = end > start ? end - start : 0; - if (ret && fscrypt_is_bounce_page(page)) - ret = round_up(ret, CEPH_FSCRYPT_BLOCK_SIZE); - return ret; -} - -/* - * Write a folio, but leave it locked. - * - * If we get a write error, mark the mapping for error, but still adjust the - * dirty page accounting (i.e., folio is no longer dirty). 
- */ -static int write_folio_nounlock(struct folio *folio, - struct writeback_control *wbc) -{ - struct page *page = &folio->page; - struct inode *inode = folio->mapping->host; - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_snap_context *snapc, *oldest; - loff_t page_off = folio_pos(folio); - int err; - loff_t len = folio_size(folio); - loff_t wlen; - struct ceph_writeback_ctl ceph_wbc; - struct ceph_osd_client *osdc = &fsc->client->osdc; - struct ceph_osd_request *req; - bool caching = ceph_is_cache_enabled(inode); - struct page *bounce_page = NULL; - - doutc(cl, "%llx.%llx folio %p idx %lu\n", ceph_vinop(inode), folio, - folio->index); - - if (ceph_inode_is_shutdown(inode)) - return -EIO; - - /* verify this is a writeable snap context */ - snapc = page_snap_context(&folio->page); - if (!snapc) { - doutc(cl, "%llx.%llx folio %p not dirty?\n", ceph_vinop(inode), - folio); - return 0; - } - oldest = get_oldest_context(inode, &ceph_wbc, snapc); - if (snapc->seq > oldest->seq) { - doutc(cl, "%llx.%llx folio %p snapc %p not writeable - noop\n", - ceph_vinop(inode), folio, snapc); - /* we should only noop if called by kswapd */ - WARN_ON(!(current->flags & PF_MEMALLOC)); - ceph_put_snap_context(oldest); - folio_redirty_for_writepage(wbc, folio); - return 0; - } - ceph_put_snap_context(oldest); - - /* is this a partial page at end of file? */ - if (page_off >= ceph_wbc.i_size) { - doutc(cl, "%llx.%llx folio at %lu beyond eof %llu\n", - ceph_vinop(inode), folio->index, ceph_wbc.i_size); - folio_invalidate(folio, 0, folio_size(folio)); - return 0; - } - - if (ceph_wbc.i_size < page_off + len) - len = ceph_wbc.i_size - page_off; - - wlen = IS_ENCRYPTED(inode) ? round_up(len, CEPH_FSCRYPT_BLOCK_SIZE) : len; - doutc(cl, "%llx.%llx folio %p index %lu on %llu~%llu snapc %p seq %lld\n", - ceph_vinop(inode), folio, folio->index, page_off, wlen, snapc, - snapc->seq); - - if (atomic_long_inc_return(&fsc->writeback_count) > - CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb)) - fsc->write_congested = true; - - req = ceph_osdc_new_request(osdc, &ci->i_layout, ceph_vino(inode), - page_off, &wlen, 0, 1, CEPH_OSD_OP_WRITE, - CEPH_OSD_FLAG_WRITE, snapc, - ceph_wbc.truncate_seq, - ceph_wbc.truncate_size, true); - if (IS_ERR(req)) { - folio_redirty_for_writepage(wbc, folio); - return PTR_ERR(req); - } - - if (wlen < len) - len = wlen; - - folio_start_writeback(folio); - if (caching) - ceph_set_page_fscache(&folio->page); - ceph_fscache_write_to_cache(inode, page_off, len, caching); - - if (IS_ENCRYPTED(inode)) { - bounce_page = fscrypt_encrypt_pagecache_blocks(&folio->page, - CEPH_FSCRYPT_BLOCK_SIZE, 0, - GFP_NOFS); - if (IS_ERR(bounce_page)) { - folio_redirty_for_writepage(wbc, folio); - folio_end_writeback(folio); - ceph_osdc_put_request(req); - return PTR_ERR(bounce_page); - } - } - - /* it may be a short write due to an object boundary */ - WARN_ON_ONCE(len > folio_size(folio)); - osd_req_op_extent_osd_data_pages(req, 0, - bounce_page ? &bounce_page : &page, wlen, 0, - false, false); - doutc(cl, "%llx.%llx %llu~%llu (%llu bytes, %sencrypted)\n", - ceph_vinop(inode), page_off, len, wlen, - IS_ENCRYPTED(inode) ? 
"" : "not "); - - req->r_mtime = inode_get_mtime(inode); - ceph_osdc_start_request(osdc, req); - err = ceph_osdc_wait_request(osdc, req); - - ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency, - req->r_end_latency, len, err); - fscrypt_free_bounce_page(bounce_page); - ceph_osdc_put_request(req); - if (err == 0) - err = len; - - if (err < 0) { - struct writeback_control tmp_wbc; - if (!wbc) - wbc = &tmp_wbc; - if (err == -ERESTARTSYS) { - /* killed by SIGKILL */ - doutc(cl, "%llx.%llx interrupted page %p\n", - ceph_vinop(inode), folio); - folio_redirty_for_writepage(wbc, folio); - folio_end_writeback(folio); - return err; - } - if (err == -EBLOCKLISTED) - fsc->blocklisted = true; - doutc(cl, "%llx.%llx setting mapping error %d %p\n", - ceph_vinop(inode), err, folio); - mapping_set_error(&inode->i_data, err); - wbc->pages_skipped++; - } else { - doutc(cl, "%llx.%llx cleaned page %p\n", - ceph_vinop(inode), folio); - err = 0; /* vfs expects us to return 0 */ - } - oldest = folio_detach_private(folio); - WARN_ON_ONCE(oldest != snapc); - folio_end_writeback(folio); - ceph_put_wrbuffer_cap_refs(ci, 1, snapc); - ceph_put_snap_context(snapc); /* page's reference */ - - if (atomic_long_dec_return(&fsc->writeback_count) < - CONGESTION_OFF_THRESH(fsc->mount_options->congestion_kb)) - fsc->write_congested = false; - - return err; -} - -/* - * async writeback completion handler. - * - * If we get an error, set the mapping error bit, but not the individual - * page error bits. - */ -static void writepages_finish(struct ceph_osd_request *req) -{ - struct inode *inode = req->r_inode; - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_client *cl = ceph_inode_to_client(inode); - struct ceph_osd_data *osd_data; - struct page *page; - int num_pages, total_pages = 0; - int i, j; - int rc = req->r_result; - struct ceph_snap_context *snapc = req->r_snapc; - struct address_space *mapping = inode->i_mapping; - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb); - unsigned int len = 0; - bool remove_page; - - doutc(cl, "%llx.%llx rc %d\n", ceph_vinop(inode), rc); - if (rc < 0) { - mapping_set_error(mapping, rc); - ceph_set_error_write(ci); - if (rc == -EBLOCKLISTED) - fsc->blocklisted = true; - } else { - ceph_clear_error_write(ci); - } - - /* - * We lost the cache cap, need to truncate the page before - * it is unlocked, otherwise we'd truncate it later in the - * page truncation thread, possibly losing some data that - * raced its way in - */ - remove_page = !(ceph_caps_issued(ci) & - (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)); - - /* clean all pages */ - for (i = 0; i < req->r_num_ops; i++) { - if (req->r_ops[i].op != CEPH_OSD_OP_WRITE) { - pr_warn_client(cl, - "%llx.%llx incorrect op %d req %p index %d tid %llu\n", - ceph_vinop(inode), req->r_ops[i].op, req, i, - req->r_tid); - break; - } - - osd_data = osd_req_op_extent_osd_data(req, i); - BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES); - len += osd_data->length; - num_pages = calc_pages_for((u64)osd_data->offset, - (u64)osd_data->length); - total_pages += num_pages; - for (j = 0; j < num_pages; j++) { - page = osd_data->pages[j]; - if (fscrypt_is_bounce_page(page)) { - page = fscrypt_pagecache_page(page); - fscrypt_free_bounce_page(osd_data->pages[j]); - osd_data->pages[j] = page; - } - BUG_ON(!page); - WARN_ON(!PageUptodate(page)); - - if (atomic_long_dec_return(&fsc->writeback_count) < - CONGESTION_OFF_THRESH( - fsc->mount_options->congestion_kb)) - 
fsc->write_congested = false; - - ceph_put_snap_context(detach_page_private(page)); - end_page_writeback(page); - - if (atomic64_dec_return(&mdsc->dirty_folios) <= 0) { - wake_up_all(&mdsc->flush_end_wq); - WARN_ON(atomic64_read(&mdsc->dirty_folios) < 0); - } - - doutc(cl, "unlocking %p\n", page); - - if (remove_page) - generic_error_remove_folio(inode->i_mapping, - page_folio(page)); - - unlock_page(page); - } - doutc(cl, "%llx.%llx wrote %llu bytes cleaned %d pages\n", - ceph_vinop(inode), osd_data->length, - rc >= 0 ? num_pages : 0); - - release_pages(osd_data->pages, num_pages); - } - - ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency, - req->r_end_latency, len, rc); - - ceph_put_wrbuffer_cap_refs(ci, total_pages, snapc); - - osd_data = osd_req_op_extent_osd_data(req, 0); - if (osd_data->pages_from_pool) - mempool_free(osd_data->pages, ceph_wb_pagevec_pool); - else - kfree(osd_data->pages); - ceph_osdc_put_request(req); - ceph_dec_osd_stopping_blocker(fsc->mdsc); -} - -static inline -bool is_forced_umount(struct address_space *mapping) -{ - struct inode *inode = mapping->host; - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - - if (ceph_inode_is_shutdown(inode)) { - if (ci->i_wrbuffer_ref > 0) { - pr_warn_ratelimited_client(cl, - "%llx.%llx %lld forced umount\n", - ceph_vinop(inode), ceph_ino(inode)); - } - mapping_set_error(mapping, -EIO); - return true; - } - - return false; -} - -static inline -unsigned int ceph_define_write_size(struct address_space *mapping) -{ - struct inode *inode = mapping->host; - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - unsigned int wsize = i_blocksize(inode); - - if (fsc->mount_options->wsize < wsize) - wsize = fsc->mount_options->wsize; - - return wsize; -} - -static inline -void ceph_folio_batch_init(struct ceph_writeback_ctl *ceph_wbc) -{ - folio_batch_init(&ceph_wbc->fbatch); - ceph_wbc->processed_in_fbatch = 0; -} - -static inline -void ceph_folio_batch_reinit(struct ceph_writeback_ctl *ceph_wbc) -{ - folio_batch_release(&ceph_wbc->fbatch); - ceph_folio_batch_init(ceph_wbc); -} - -static inline -void ceph_init_writeback_ctl(struct address_space *mapping, - struct writeback_control *wbc, - struct ceph_writeback_ctl *ceph_wbc) -{ - ceph_wbc->snapc = NULL; - ceph_wbc->last_snapc = NULL; - - ceph_wbc->strip_unit_end = 0; - ceph_wbc->wsize = ceph_define_write_size(mapping); - - ceph_wbc->nr_folios = 0; - ceph_wbc->max_pages = 0; - ceph_wbc->locked_pages = 0; - - ceph_wbc->done = false; - ceph_wbc->should_loop = false; - ceph_wbc->range_whole = false; - - ceph_wbc->start_index = wbc->range_cyclic ? 
mapping->writeback_index : 0; - ceph_wbc->index = ceph_wbc->start_index; - ceph_wbc->end = -1; - - if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) { - ceph_wbc->tag = PAGECACHE_TAG_TOWRITE; - } else { - ceph_wbc->tag = PAGECACHE_TAG_DIRTY; - } - - ceph_wbc->op_idx = -1; - ceph_wbc->num_ops = 0; - ceph_wbc->offset = 0; - ceph_wbc->len = 0; - ceph_wbc->from_pool = false; - - ceph_folio_batch_init(ceph_wbc); - - ceph_wbc->pages = NULL; - ceph_wbc->data_pages = NULL; -} - -static inline -int ceph_define_writeback_range(struct address_space *mapping, - struct writeback_control *wbc, - struct ceph_writeback_ctl *ceph_wbc) -{ - struct inode *inode = mapping->host; - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - - /* find oldest snap context with dirty data */ - ceph_wbc->snapc = get_oldest_context(inode, ceph_wbc, NULL); - if (!ceph_wbc->snapc) { - /* hmm, why does writepages get called when there - is no dirty data? */ - doutc(cl, " no snap context with dirty data?\n"); - return -ENODATA; - } - - doutc(cl, " oldest snapc is %p seq %lld (%d snaps)\n", - ceph_wbc->snapc, ceph_wbc->snapc->seq, - ceph_wbc->snapc->num_snaps); - - ceph_wbc->should_loop = false; - - if (ceph_wbc->head_snapc && ceph_wbc->snapc != ceph_wbc->last_snapc) { - /* where to start/end? */ - if (wbc->range_cyclic) { - ceph_wbc->index = ceph_wbc->start_index; - ceph_wbc->end = -1; - if (ceph_wbc->index > 0) - ceph_wbc->should_loop = true; - doutc(cl, " cyclic, start at %lu\n", ceph_wbc->index); - } else { - ceph_wbc->index = wbc->range_start >> PAGE_SHIFT; - ceph_wbc->end = wbc->range_end >> PAGE_SHIFT; - if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) - ceph_wbc->range_whole = true; - doutc(cl, " not cyclic, %lu to %lu\n", - ceph_wbc->index, ceph_wbc->end); - } - } else if (!ceph_wbc->head_snapc) { - /* Do not respect wbc->range_{start,end}. Dirty pages - * in that range can be associated with newer snapc. 
- * They are not writeable until we write all dirty pages - * associated with 'snapc' get written */ - if (ceph_wbc->index > 0) - ceph_wbc->should_loop = true; - doutc(cl, " non-head snapc, range whole\n"); - } - - ceph_put_snap_context(ceph_wbc->last_snapc); - ceph_wbc->last_snapc = ceph_wbc->snapc; - - return 0; -} - -static inline -bool has_writeback_done(struct ceph_writeback_ctl *ceph_wbc) -{ - return ceph_wbc->done && ceph_wbc->index > ceph_wbc->end; -} - -static inline -bool can_next_page_be_processed(struct ceph_writeback_ctl *ceph_wbc, - unsigned index) -{ - return index < ceph_wbc->nr_folios && - ceph_wbc->locked_pages < ceph_wbc->max_pages; -} - -static -int ceph_check_page_before_write(struct address_space *mapping, - struct writeback_control *wbc, - struct ceph_writeback_ctl *ceph_wbc, - struct folio *folio) -{ - struct inode *inode = mapping->host; - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_snap_context *pgsnapc; - - /* only dirty folios, or our accounting breaks */ - if (unlikely(!folio_test_dirty(folio) || folio->mapping != mapping)) { - doutc(cl, "!dirty or !mapping %p\n", folio); - return -ENODATA; - } - - /* only if matching snap context */ - pgsnapc = page_snap_context(&folio->page); - if (pgsnapc != ceph_wbc->snapc) { - doutc(cl, "folio snapc %p %lld != oldest %p %lld\n", - pgsnapc, pgsnapc->seq, - ceph_wbc->snapc, ceph_wbc->snapc->seq); - - if (!ceph_wbc->should_loop && !ceph_wbc->head_snapc && - wbc->sync_mode != WB_SYNC_NONE) - ceph_wbc->should_loop = true; - - return -ENODATA; - } - - if (folio_pos(folio) >= ceph_wbc->i_size) { - doutc(cl, "folio at %lu beyond eof %llu\n", - folio->index, ceph_wbc->i_size); - - if ((ceph_wbc->size_stable || - folio_pos(folio) >= i_size_read(inode)) && - folio_clear_dirty_for_io(folio)) - folio_invalidate(folio, 0, folio_size(folio)); - - return -ENODATA; - } - - if (ceph_wbc->strip_unit_end && - (folio->index > ceph_wbc->strip_unit_end)) { - doutc(cl, "end of strip unit %p\n", folio); - return -E2BIG; - } - - return 0; -} - -static inline -void __ceph_allocate_page_array(struct ceph_writeback_ctl *ceph_wbc, - unsigned int max_pages) -{ - ceph_wbc->pages = kmalloc_array(max_pages, - sizeof(*ceph_wbc->pages), - GFP_NOFS); - if (!ceph_wbc->pages) { - ceph_wbc->from_pool = true; - ceph_wbc->pages = mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS); - BUG_ON(!ceph_wbc->pages); - } -} - -static inline -void ceph_allocate_page_array(struct address_space *mapping, - struct ceph_writeback_ctl *ceph_wbc, - struct folio *folio) -{ - struct inode *inode = mapping->host; - struct ceph_inode_info *ci = ceph_inode(inode); - size_t xlen; - u64 objnum; - u64 objoff; - - /* prepare async write request */ - ceph_wbc->offset = (u64)folio_pos(folio); - ceph_calc_file_object_mapping(&ci->i_layout, - ceph_wbc->offset, ceph_wbc->wsize, - &objnum, &objoff, &xlen); - - ceph_wbc->num_ops = 1; - ceph_wbc->strip_unit_end = folio->index + ((xlen - 1) >> PAGE_SHIFT); - - BUG_ON(ceph_wbc->pages); - ceph_wbc->max_pages = calc_pages_for(0, (u64)xlen); - __ceph_allocate_page_array(ceph_wbc, ceph_wbc->max_pages); - - ceph_wbc->len = 0; -} - -static inline -bool is_folio_index_contiguous(const struct ceph_writeback_ctl *ceph_wbc, - const struct folio *folio) -{ - return folio->index == (ceph_wbc->offset + ceph_wbc->len) >> PAGE_SHIFT; -} - -static inline -bool is_num_ops_too_big(struct ceph_writeback_ctl *ceph_wbc) -{ - return ceph_wbc->num_ops >= - (ceph_wbc->from_pool ? 
CEPH_OSD_SLAB_OPS : CEPH_OSD_MAX_OPS); -} -#endif // TODO: Remove after netfs conversion - -static inline -bool is_write_congestion_happened(struct ceph_fs_client *fsc) -{ - return atomic_long_inc_return(&fsc->writeback_count) > - CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb); -} - -#if 0 // TODO: Remove after netfs conversion -static inline int move_dirty_folio_in_page_array(struct address_space *mapping, - struct writeback_control *wbc, - struct ceph_writeback_ctl *ceph_wbc, struct folio *folio) -{ - struct inode *inode = mapping->host; - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct page **pages = ceph_wbc->pages; - unsigned int index = ceph_wbc->locked_pages; - gfp_t gfp_flags = ceph_wbc->locked_pages ? GFP_NOWAIT : GFP_NOFS; - - if (IS_ENCRYPTED(inode)) { - pages[index] = fscrypt_encrypt_pagecache_blocks(&folio->page, - PAGE_SIZE, - 0, - gfp_flags); - if (IS_ERR(pages[index])) { - if (PTR_ERR(pages[index]) == -EINVAL) { - pr_err_client(cl, "inode->i_blkbits=%hhu\n", - inode->i_blkbits); - } - - /* better not fail on first page! */ - BUG_ON(ceph_wbc->locked_pages == 0); - - pages[index] = NULL; - return PTR_ERR(pages[index]); - } - } else { - pages[index] = &folio->page; - } - - ceph_wbc->locked_pages++; - - return 0; -} - -static -int ceph_process_folio_batch(struct address_space *mapping, - struct writeback_control *wbc, - struct ceph_writeback_ctl *ceph_wbc) -{ - struct inode *inode = mapping->host; - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct folio *folio = NULL; - unsigned i; - int rc = 0; - - for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) { - folio = ceph_wbc->fbatch.folios[i]; - - if (!folio) - continue; - - doutc(cl, "? %p idx %lu, folio_test_writeback %#x, " - "folio_test_dirty %#x, folio_test_locked %#x\n", - folio, folio->index, folio_test_writeback(folio), - folio_test_dirty(folio), - folio_test_locked(folio)); - - if (folio_test_writeback(folio) || - folio_test_private_2(folio) /* [DEPRECATED] */) { - doutc(cl, "waiting on writeback %p\n", folio); - folio_wait_writeback(folio); - folio_wait_private_2(folio); /* [DEPRECATED] */ - continue; - } - - if (ceph_wbc->locked_pages == 0) - folio_lock(folio); - else if (!folio_trylock(folio)) - break; - - rc = ceph_check_page_before_write(mapping, wbc, - ceph_wbc, folio); - if (rc == -ENODATA) { - rc = 0; - folio_unlock(folio); - ceph_wbc->fbatch.folios[i] = NULL; - continue; - } else if (rc == -E2BIG) { - rc = 0; - folio_unlock(folio); - ceph_wbc->fbatch.folios[i] = NULL; - break; - } - - if (!folio_clear_dirty_for_io(folio)) { - doutc(cl, "%p !folio_clear_dirty_for_io\n", folio); - folio_unlock(folio); - ceph_wbc->fbatch.folios[i] = NULL; - continue; - } - - /* - * We have something to write. 
If this is - * the first locked page this time through, - * calculate max possible write size and - * allocate a page array - */ - if (ceph_wbc->locked_pages == 0) { - ceph_allocate_page_array(mapping, ceph_wbc, folio); - } else if (!is_folio_index_contiguous(ceph_wbc, folio)) { - if (is_num_ops_too_big(ceph_wbc)) { - folio_redirty_for_writepage(wbc, folio); - folio_unlock(folio); - break; - } - - ceph_wbc->num_ops++; - ceph_wbc->offset = (u64)folio_pos(folio); - ceph_wbc->len = 0; - } - - /* note position of first page in fbatch */ - doutc(cl, "%llx.%llx will write folio %p idx %lu\n", - ceph_vinop(inode), folio, folio->index); - - fsc->write_congested = is_write_congestion_happened(fsc); - - rc = move_dirty_folio_in_page_array(mapping, wbc, ceph_wbc, - folio); - if (rc) { - folio_redirty_for_writepage(wbc, folio); - folio_unlock(folio); - break; - } - - ceph_wbc->fbatch.folios[i] = NULL; - ceph_wbc->len += folio_size(folio); - } - - ceph_wbc->processed_in_fbatch = i; - - return rc; -} - -static inline -void ceph_shift_unused_folios_left(struct folio_batch *fbatch) -{ - unsigned j, n = 0; - - /* shift unused page to beginning of fbatch */ - for (j = 0; j < folio_batch_count(fbatch); j++) { - if (!fbatch->folios[j]) - continue; - - if (n < j) { - fbatch->folios[n] = fbatch->folios[j]; - } - - n++; - } - - fbatch->nr = n; -} - -static -int ceph_submit_write(struct address_space *mapping, - struct writeback_control *wbc, - struct ceph_writeback_ctl *ceph_wbc) -{ - struct inode *inode = mapping->host; - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_vino vino = ceph_vino(inode); - struct ceph_osd_request *req = NULL; - struct page *page = NULL; - bool caching = ceph_is_cache_enabled(inode); - u64 offset; - u64 len; - unsigned i; - -new_request: - offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]); - len = ceph_wbc->wsize; - - req = ceph_osdc_new_request(&fsc->client->osdc, - &ci->i_layout, vino, - offset, &len, 0, ceph_wbc->num_ops, - CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE, - ceph_wbc->snapc, ceph_wbc->truncate_seq, - ceph_wbc->truncate_size, false); - if (IS_ERR(req)) { - req = ceph_osdc_new_request(&fsc->client->osdc, - &ci->i_layout, vino, - offset, &len, 0, - min(ceph_wbc->num_ops, - CEPH_OSD_SLAB_OPS), - CEPH_OSD_OP_WRITE, - CEPH_OSD_FLAG_WRITE, - ceph_wbc->snapc, - ceph_wbc->truncate_seq, - ceph_wbc->truncate_size, - true); - BUG_ON(IS_ERR(req)); - } - - page = ceph_wbc->pages[ceph_wbc->locked_pages - 1]; - BUG_ON(len < ceph_fscrypt_page_offset(page) + thp_size(page) - offset); - - if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) { - for (i = 0; i < folio_batch_count(&ceph_wbc->fbatch); i++) { - struct folio *folio = ceph_wbc->fbatch.folios[i]; - - if (!folio) - continue; - - page = &folio->page; - redirty_page_for_writepage(wbc, page); - unlock_page(page); - } - - for (i = 0; i < ceph_wbc->locked_pages; i++) { - page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]); - - if (!page) - continue; - - redirty_page_for_writepage(wbc, page); - unlock_page(page); - } - - ceph_osdc_put_request(req); - return -EIO; - } - - req->r_callback = writepages_finish; - req->r_inode = inode; - - /* Format the osd request message and submit the write */ - len = 0; - ceph_wbc->data_pages = ceph_wbc->pages; - ceph_wbc->op_idx = 0; - for (i = 0; i < ceph_wbc->locked_pages; i++) { - u64 cur_offset; - - page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]); - cur_offset = page_offset(page); 
- - /* - * Discontinuity in page range? Ceph can handle that by just passing - * multiple extents in the write op. - */ - if (offset + len != cur_offset) { - /* If it's full, stop here */ - if (ceph_wbc->op_idx + 1 == req->r_num_ops) - break; - - /* Kick off an fscache write with what we have so far. */ - ceph_fscache_write_to_cache(inode, offset, len, caching); - - /* Start a new extent */ - osd_req_op_extent_dup_last(req, ceph_wbc->op_idx, - cur_offset - offset); - - doutc(cl, "got pages at %llu~%llu\n", offset, len); - - osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx, - ceph_wbc->data_pages, - len, 0, - ceph_wbc->from_pool, - false); - osd_req_op_extent_update(req, ceph_wbc->op_idx, len); - - len = 0; - offset = cur_offset; - ceph_wbc->data_pages = ceph_wbc->pages + i; - ceph_wbc->op_idx++; - } - - set_page_writeback(page); - - if (caching) - ceph_set_page_fscache(page); - - len += thp_size(page); - } - - ceph_fscache_write_to_cache(inode, offset, len, caching); - - if (ceph_wbc->size_stable) { - len = min(len, ceph_wbc->i_size - offset); - } else if (i == ceph_wbc->locked_pages) { - /* writepages_finish() clears writeback pages - * according to the data length, so make sure - * data length covers all locked pages */ - u64 min_len = len + 1 - thp_size(page); - len = get_writepages_data_length(inode, - ceph_wbc->pages[i - 1], - offset); - len = max(len, min_len); - } - - if (IS_ENCRYPTED(inode)) - len = round_up(len, CEPH_FSCRYPT_BLOCK_SIZE); - - doutc(cl, "got pages at %llu~%llu\n", offset, len); - - if (IS_ENCRYPTED(inode) && - ((offset | len) & ~CEPH_FSCRYPT_BLOCK_MASK)) { - pr_warn_client(cl, - "bad encrypted write offset=%lld len=%llu\n", - offset, len); - } - - osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx, - ceph_wbc->data_pages, len, - 0, ceph_wbc->from_pool, false); - osd_req_op_extent_update(req, ceph_wbc->op_idx, len); - - BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops); - - ceph_wbc->from_pool = false; - if (i < ceph_wbc->locked_pages) { - BUG_ON(ceph_wbc->num_ops <= req->r_num_ops); - ceph_wbc->num_ops -= req->r_num_ops; - ceph_wbc->locked_pages -= i; - - /* allocate new pages array for next request */ - ceph_wbc->data_pages = ceph_wbc->pages; - __ceph_allocate_page_array(ceph_wbc, ceph_wbc->locked_pages); - memcpy(ceph_wbc->pages, ceph_wbc->data_pages + i, - ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages)); - memset(ceph_wbc->data_pages + i, 0, - ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages)); - } else { - BUG_ON(ceph_wbc->num_ops != req->r_num_ops); - /* request message now owns the pages array */ - ceph_wbc->pages = NULL; - } - - req->r_mtime = inode_get_mtime(inode); - ceph_osdc_start_request(&fsc->client->osdc, req); - req = NULL; - - wbc->nr_to_write -= i; - if (ceph_wbc->pages) - goto new_request; - - return 0; -} - -static -void ceph_wait_until_current_writes_complete(struct address_space *mapping, - struct writeback_control *wbc, - struct ceph_writeback_ctl *ceph_wbc) -{ - struct page *page; - unsigned i, nr; - - if (wbc->sync_mode != WB_SYNC_NONE && - ceph_wbc->start_index == 0 && /* all dirty pages were checked */ - !ceph_wbc->head_snapc) { - ceph_wbc->index = 0; - - while ((ceph_wbc->index <= ceph_wbc->end) && - (nr = filemap_get_folios_tag(mapping, - &ceph_wbc->index, - (pgoff_t)-1, - PAGECACHE_TAG_WRITEBACK, - &ceph_wbc->fbatch))) { - for (i = 0; i < nr; i++) { - page = &ceph_wbc->fbatch.folios[i]->page; - if (page_snap_context(page) != ceph_wbc->snapc) - continue; - wait_on_page_writeback(page); - } - - 
folio_batch_release(&ceph_wbc->fbatch); - cond_resched(); - } - } -} - /* - * initiate async writeback + * Dirty a page. Optimistically adjust accounting, on the assumption + * that we won't race with invalidate. If we do, readjust. */ -static int ceph_writepages_start(struct address_space *mapping, - struct writeback_control *wbc) +bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio) { struct inode *inode = mapping->host; - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_writeback_ctl ceph_wbc; - int rc = 0; - - if (wbc->sync_mode == WB_SYNC_NONE && fsc->write_congested) - return 0; - - doutc(cl, "%llx.%llx (mode=%s)\n", ceph_vinop(inode), - wbc->sync_mode == WB_SYNC_NONE ? "NONE" : - (wbc->sync_mode == WB_SYNC_ALL ? "ALL" : "HOLD")); - - if (is_forced_umount(mapping)) { - /* we're in a forced umount, don't write! */ - return -EIO; - } - - ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc); - - if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) { - rc = -EIO; - goto out; - } - -retry: - rc = ceph_define_writeback_range(mapping, wbc, &ceph_wbc); - if (rc == -ENODATA) { - /* hmm, why does writepages get called when there - is no dirty data? */ - rc = 0; - goto dec_osd_stopping_blocker; - } - - if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) - tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end); - - while (!has_writeback_done(&ceph_wbc)) { - ceph_wbc.locked_pages = 0; - ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT; - -get_more_pages: - ceph_folio_batch_reinit(&ceph_wbc); - - ceph_wbc.nr_folios = filemap_get_folios_tag(mapping, - &ceph_wbc.index, - ceph_wbc.end, - ceph_wbc.tag, - &ceph_wbc.fbatch); - doutc(cl, "pagevec_lookup_range_tag for tag %#x got %d\n", - ceph_wbc.tag, ceph_wbc.nr_folios); - - if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages) - break; - -process_folio_batch: - rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc); - if (rc) - goto release_folios; - - /* did we get anything? */ - if (!ceph_wbc.locked_pages) - goto release_folios; - - if (ceph_wbc.processed_in_fbatch) { - ceph_shift_unused_folios_left(&ceph_wbc.fbatch); - - if (folio_batch_count(&ceph_wbc.fbatch) == 0 && - ceph_wbc.locked_pages < ceph_wbc.max_pages) { - doutc(cl, "reached end fbatch, trying for more\n"); - goto get_more_pages; - } - } - - rc = ceph_submit_write(mapping, wbc, &ceph_wbc); - if (rc) - goto release_folios; - - ceph_wbc.locked_pages = 0; - ceph_wbc.strip_unit_end = 0; - - if (folio_batch_count(&ceph_wbc.fbatch) > 0) { - ceph_wbc.nr_folios = - folio_batch_count(&ceph_wbc.fbatch); - goto process_folio_batch; - } - - /* - * We stop writing back only if we are not doing - * integrity sync. In case of integrity sync we have to - * keep going until we have written all the pages - * we tagged for writeback prior to entering this loop. - */ - if (wbc->nr_to_write <= 0 && wbc->sync_mode == WB_SYNC_NONE) - ceph_wbc.done = true; - -release_folios: - doutc(cl, "folio_batch release on %d folios (%p)\n", - (int)ceph_wbc.fbatch.nr, - ceph_wbc.fbatch.nr ? 
ceph_wbc.fbatch.folios[0] : NULL); - folio_batch_release(&ceph_wbc.fbatch); - } - - if (ceph_wbc.should_loop && !ceph_wbc.done) { - /* more to do; loop back to beginning of file */ - doutc(cl, "looping back to beginning of file\n"); - /* OK even when start_index == 0 */ - ceph_wbc.end = ceph_wbc.start_index - 1; - - /* to write dirty pages associated with next snapc, - * we need to wait until current writes complete */ - ceph_wait_until_current_writes_complete(mapping, wbc, &ceph_wbc); - - ceph_wbc.start_index = 0; - ceph_wbc.index = 0; - goto retry; - } - - if (wbc->range_cyclic || (ceph_wbc.range_whole && wbc->nr_to_write > 0)) - mapping->writeback_index = ceph_wbc.index; - -dec_osd_stopping_blocker: - ceph_dec_osd_stopping_blocker(fsc->mdsc); - -out: - ceph_put_snap_context(ceph_wbc.last_snapc); - doutc(cl, "%llx.%llx dend - startone, rc = %d\n", ceph_vinop(inode), - rc); - - return rc; -} - -/* - * See if a given @snapc is either writeable, or already written. - */ -static int context_is_writeable_or_written(struct inode *inode, - struct ceph_snap_context *snapc) -{ - struct ceph_snap_context *oldest = get_oldest_context(inode, NULL, NULL); - int ret = !oldest || snapc->seq <= oldest->seq; - - ceph_put_snap_context(oldest); - return ret; -} - -/** - * ceph_find_incompatible - find an incompatible context and return it - * @folio: folio being dirtied - * - * We are only allowed to write into/dirty a folio if the folio is - * clean, or already dirty within the same snap context. Returns a - * conflicting context if there is one, NULL if there isn't, or a - * negative error code on other errors. - * - * Must be called with folio lock held. - */ -static struct ceph_snap_context * -ceph_find_incompatible(struct folio *folio) -{ - struct inode *inode = folio->mapping->host; struct ceph_client *cl = ceph_inode_to_client(inode); - struct ceph_inode_info *ci = ceph_inode(inode); - - if (ceph_inode_is_shutdown(inode)) { - doutc(cl, " %llx.%llx folio %p is shutdown\n", - ceph_vinop(inode), folio); - return ERR_PTR(-ESTALE); - } - - for (;;) { - struct ceph_snap_context *snapc, *oldest; - - folio_wait_writeback(folio); - - snapc = page_snap_context(&folio->page); - if (!snapc || snapc == ci->i_head_snapc) - break; - - /* - * this folio is already dirty in another (older) snap - * context! is it writeable now? 
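 * (Writeable here means the folio's snap context is no newer than
 *  the oldest context still outstanding, i.e. nothing older still
 *  has to be flushed first; otherwise the caller must wait for the
 *  older snap's writeback to complete.)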
- */ - oldest = get_oldest_context(inode, NULL, NULL); - if (snapc->seq > oldest->seq) { - /* not writeable -- return it for the caller to deal with */ - ceph_put_snap_context(oldest); - doutc(cl, " %llx.%llx folio %p snapc %p not current or oldest\n", - ceph_vinop(inode), folio, snapc); - return ceph_get_snap_context(snapc); - } - ceph_put_snap_context(oldest); - - /* yay, writeable, do it now (without dropping folio lock) */ - doutc(cl, " %llx.%llx folio %p snapc %p not current, but oldest\n", - ceph_vinop(inode), folio, snapc); - if (folio_clear_dirty_for_io(folio)) { - int r = write_folio_nounlock(folio, NULL); - if (r < 0) - return ERR_PTR(r); - } - } - return NULL; -} - -static int ceph_netfs_check_write_begin(struct file *file, loff_t pos, unsigned int len, - struct folio **foliop, void **_fsdata) -{ - struct inode *inode = file_inode(file); - struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb); + struct ceph_inode_info *ci; struct ceph_snap_context *snapc; + struct netfs_group *group; - snapc = ceph_find_incompatible(*foliop); - if (snapc) { - int r; - - folio_unlock(*foliop); - folio_put(*foliop); - *foliop = NULL; - if (IS_ERR(snapc)) - return PTR_ERR(snapc); - - ceph_queue_writeback(inode); - r = wait_event_killable(ci->i_cap_wq, - context_is_writeable_or_written(inode, snapc)); - ceph_put_snap_context(snapc); - return r == 0 ? -EAGAIN : r; + if (folio_test_dirty(folio)) { + doutc(cl, "%llx.%llx %p idx %lu -- already dirty\n", + ceph_vinop(inode), folio, folio->index); + VM_BUG_ON_FOLIO(!folio_test_private(folio), folio); + return false; } - return 0; -} - -/* - * We are only allowed to write into/dirty the page if the page is - * clean, or already dirty within the same snap context. - */ -static int ceph_write_begin(struct file *file, struct address_space *mapping, - loff_t pos, unsigned len, - struct folio **foliop, void **fsdata) -{ - struct inode *inode = file_inode(file); - struct ceph_inode_info *ci = ceph_inode(inode); - int r; - - r = netfs_write_begin(&ci->netfs, file, inode->i_mapping, pos, len, foliop, NULL); - if (r < 0) - return r; - folio_wait_private_2(*foliop); /* [DEPRECATED] */ - WARN_ON_ONCE(!folio_test_locked(*foliop)); - return 0; -} + atomic64_inc(&mdsc->dirty_folios); -/* - * we don't do anything in here that simple_write_end doesn't do - * except adjust dirty page accounting - */ -static int ceph_write_end(struct file *file, struct address_space *mapping, - loff_t pos, unsigned len, unsigned copied, - struct folio *folio, void *fsdata) -{ - struct inode *inode = file_inode(file); - struct ceph_client *cl = ceph_inode_to_client(inode); - bool check_cap = false; + ci = ceph_inode(inode); - doutc(cl, "%llx.%llx file %p folio %p %d~%d (%d)\n", ceph_vinop(inode), - file, folio, (int)pos, (int)copied, (int)len); + /* dirty the head */ + spin_lock(&ci->i_ceph_lock); + if (__ceph_have_pending_cap_snap(ci)) { + struct ceph_cap_snap *capsnap = + list_last_entry(&ci->i_cap_snaps, + struct ceph_cap_snap, + ci_item); + snapc = capsnap->context; + capsnap->dirty_pages++; + } else { + snapc = ci->i_head_snapc; + BUG_ON(!snapc); + ++ci->i_wrbuffer_ref_head; + } - if (!folio_test_uptodate(folio)) { - /* just return that nothing was copied on a short copy */ - if (copied < len) { - copied = 0; - goto out; + /* Attach a reference to the snap/group to the folio. 
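 * (netfs keeps the group pointer in the folio's private data;
 *  tagging each dirty folio with its snap context's group lets
 *  writeback later gather the folios belonging to one context.
 *  A folio belongs to at most one group, so any stale group
 *  reference is dropped below.)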
*/ + group = netfs_folio_group(folio); + if (group != &snapc->group) { + netfs_set_group(folio, &snapc->group); + if (group) { + doutc(cl, "Different group %px != %px\n", + group, &snapc->group); + netfs_put_group(group); } - folio_mark_uptodate(folio); } - /* did file size increase? */ - if (pos+copied > i_size_read(inode)) - check_cap = ceph_inode_set_size(inode, pos+copied); - - folio_mark_dirty(folio); - -out: - folio_unlock(folio); - folio_put(folio); - - if (check_cap) - ceph_check_caps(ceph_inode(inode), CHECK_CAPS_AUTHONLY); + if (ci->i_wrbuffer_ref == 0) + ihold(inode); + ++ci->i_wrbuffer_ref; + doutc(cl, "%llx.%llx %p idx %lu head %d/%d -> %d/%d " + "snapc %p seq %lld (%d snaps)\n", + ceph_vinop(inode), folio, folio->index, + ci->i_wrbuffer_ref-1, ci->i_wrbuffer_ref_head-1, + ci->i_wrbuffer_ref, ci->i_wrbuffer_ref_head, + snapc, snapc->seq, snapc->num_snaps); + spin_unlock(&ci->i_ceph_lock); - return copied; + return netfs_dirty_folio(mapping, folio); } -const struct address_space_operations ceph_aops = { - .read_folio = netfs_read_folio, - .readahead = netfs_readahead, - .writepages = ceph_writepages_start, - .write_begin = ceph_write_begin, - .write_end = ceph_write_end, - .dirty_folio = ceph_dirty_folio, - .invalidate_folio = ceph_invalidate_folio, - .release_folio = netfs_release_folio, - .direct_IO = noop_direct_IO, - .migrate_folio = filemap_migrate_folio, -}; -#endif // TODO: Remove after netfs conversion - static void ceph_block_sigs(sigset_t *oldset) { sigset_t mask; @@ -2046,112 +226,6 @@ static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf) return ret; } -#if 0 // TODO: Remove after netfs conversion -static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - struct inode *inode = file_inode(vma->vm_file); - struct ceph_client *cl = ceph_inode_to_client(inode); - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_file_info *fi = vma->vm_file->private_data; - struct ceph_cap_flush *prealloc_cf; - struct folio *folio = page_folio(vmf->page); - loff_t off = folio_pos(folio); - loff_t size = i_size_read(inode); - size_t len; - int want, got, err; - sigset_t oldset; - vm_fault_t ret = VM_FAULT_SIGBUS; - - if (ceph_inode_is_shutdown(inode)) - return ret; - - prealloc_cf = ceph_alloc_cap_flush(); - if (!prealloc_cf) - return VM_FAULT_OOM; - - sb_start_pagefault(inode->i_sb); - ceph_block_sigs(&oldset); - - if (off + folio_size(folio) <= size) - len = folio_size(folio); - else - len = offset_in_folio(folio, size); - - doutc(cl, "%llx.%llx %llu~%zd getting caps i_size %llu\n", - ceph_vinop(inode), off, len, size); - if (fi->fmode & CEPH_FILE_MODE_LAZY) - want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO; - else - want = CEPH_CAP_FILE_BUFFER; - - got = 0; - err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got); - if (err < 0) - goto out_free; - - doutc(cl, "%llx.%llx %llu~%zd got cap refs on %s\n", ceph_vinop(inode), - off, len, ceph_cap_string(got)); - - /* Update time before taking folio lock */ - file_update_time(vma->vm_file); - inode_inc_iversion_raw(inode); - - do { - struct ceph_snap_context *snapc; - - folio_lock(folio); - - if (folio_mkwrite_check_truncate(folio, inode) < 0) { - folio_unlock(folio); - ret = VM_FAULT_NOPAGE; - break; - } - - snapc = ceph_find_incompatible(folio); - if (!snapc) { - /* success. we'll keep the folio locked. 
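 * (Returning VM_FAULT_LOCKED tells the fault handler the folio is
 *  handed back still locked; the block further below then marks
 *  the CEPH_CAP_FILE_WR cap dirty before the fault completes.)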
*/ - folio_mark_dirty(folio); - ret = VM_FAULT_LOCKED; - break; - } - - folio_unlock(folio); - - if (IS_ERR(snapc)) { - ret = VM_FAULT_SIGBUS; - break; - } - - ceph_queue_writeback(inode); - err = wait_event_killable(ci->i_cap_wq, - context_is_writeable_or_written(inode, snapc)); - ceph_put_snap_context(snapc); - } while (err == 0); - - if (ret == VM_FAULT_LOCKED) { - int dirty; - spin_lock(&ci->i_ceph_lock); - dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, - &prealloc_cf); - spin_unlock(&ci->i_ceph_lock); - if (dirty) - __mark_inode_dirty(inode, dirty); - } - - doutc(cl, "%llx.%llx %llu~%zd dropping cap refs on %s ret %x\n", - ceph_vinop(inode), off, len, ceph_cap_string(got), ret); - ceph_put_cap_refs_async(ci, got); -out_free: - ceph_restore_sigs(&oldset); - sb_end_pagefault(inode->i_sb); - ceph_free_cap_flush(prealloc_cf); - if (err < 0) - ret = vmf_error(err); - return ret; -} -#endif // TODO: Remove after netfs conversion - void ceph_fill_inline_data(struct inode *inode, struct page *locked_page, char *data, size_t len) { diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 94b91b5bc843..d7684f4b2e10 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -77,97 +77,6 @@ static __le32 ceph_flags_sys2wire(struct ceph_mds_client *mdsc, u32 flags) * need to wait for MDS acknowledgement. */ -#if 0 // TODO: Remove after netfs conversion -/* - * How many pages to get in one call to iov_iter_get_pages(). This - * determines the size of the on-stack array used as a buffer. - */ -#define ITER_GET_BVECS_PAGES 64 - -static int __iter_get_bvecs(struct iov_iter *iter, size_t maxsize, - struct ceph_databuf *dbuf) -{ - size_t size = 0; - - if (maxsize > iov_iter_count(iter)) - maxsize = iov_iter_count(iter); - - while (size < maxsize) { - struct page *pages[ITER_GET_BVECS_PAGES]; - ssize_t bytes; - size_t start; - int idx = 0; - - bytes = iov_iter_get_pages2(iter, pages, maxsize - size, - ITER_GET_BVECS_PAGES, &start); - if (bytes < 0) { - if (size == 0) - return bytes; - break; - } - - while (bytes) { - int len = min_t(int, bytes, PAGE_SIZE - start); - - ceph_databuf_append_page(dbuf, pages[idx++], start, len); - bytes -= len; - size += len; - start = 0; - } - } - - return 0; -} - -/* - * iov_iter_get_pages() only considers one iov_iter segment, no matter - * what maxsize or maxpages are given. For ITER_BVEC that is a single - * page. - * - * Attempt to get up to @maxsize bytes worth of pages from @iter. - * Return the number of bytes in the created bio_vec array, or an error. - */ -static struct ceph_databuf *iter_get_bvecs_alloc(struct iov_iter *iter, - size_t maxsize, bool write) -{ - struct ceph_databuf *dbuf; - size_t orig_count = iov_iter_count(iter); - int npages, ret; - - iov_iter_truncate(iter, maxsize); - npages = iov_iter_npages(iter, INT_MAX); - iov_iter_reexpand(iter, orig_count); - - if (write) - dbuf = ceph_databuf_req_alloc(npages, 0, GFP_KERNEL); - else - dbuf = ceph_databuf_reply_alloc(npages, 0, GFP_KERNEL); - if (!dbuf) - return ERR_PTR(-ENOMEM); - - ret = __iter_get_bvecs(iter, maxsize, dbuf); - if (ret < 0) { - /* - * No pages were pinned -- just free the array. - */ - ceph_databuf_release(dbuf); - return ERR_PTR(ret); - } - - return dbuf; -} - -static void ceph_dirty_pages(struct ceph_databuf *dbuf) -{ - struct bio_vec *bvec = dbuf->bvec; - int i; - - for (i = 0; i < dbuf->nr_bvec; i++) - if (bvec[i].bv_page) - set_page_dirty_lock(bvec[i].bv_page); -} -#endif // TODO: Remove after netfs conversion - /* * Prepare an open request. 
Preallocate ceph_cap to avoid an * inopportune ENOMEM later. @@ -1023,1222 +932,6 @@ int ceph_release(struct inode *inode, struct file *file) return 0; } -#if 0 // TODO: Remove after netfs conversion -enum { - HAVE_RETRIED = 1, - CHECK_EOF = 2, - READ_INLINE = 3, -}; - -/* - * Completely synchronous read and write methods. Direct from __user - * buffer to osd, or directly to user pages (if O_DIRECT). - * - * If the read spans object boundary, just do multiple reads. (That's not - * atomic, but good enough for now.) - * - * If we get a short result from the OSD, check against i_size; we need to - * only return a short read to the caller if we hit EOF. - */ -ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos, - struct iov_iter *to, int *retry_op, - u64 *last_objver) -{ - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_osd_client *osdc = &fsc->client->osdc; - ssize_t ret; - u64 off = *ki_pos; - u64 len = iov_iter_count(to); - u64 i_size = i_size_read(inode); - bool sparse = IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREAD); - u64 objver = 0; - - doutc(cl, "on inode %p %llx.%llx %llx~%llx\n", inode, - ceph_vinop(inode), *ki_pos, len); - - if (ceph_inode_is_shutdown(inode)) - return -EIO; - - if (!len || !i_size) - return 0; - /* - * flush any page cache pages in this range. this - * will make concurrent normal and sync io slow, - * but it will at least behave sensibly when they are - * in sequence. - */ - ret = filemap_write_and_wait_range(inode->i_mapping, - off, off + len - 1); - if (ret < 0) - return ret; - - ret = 0; - while ((len = iov_iter_count(to)) > 0) { - struct ceph_osd_request *req; - struct page **pages; - int num_pages; - size_t page_off; - bool more; - int idx = 0; - size_t left; - struct ceph_osd_req_op *op; - u64 read_off = off; - u64 read_len = len; - int extent_cnt; - - /* determine new offset/length if encrypted */ - ceph_fscrypt_adjust_off_and_len(inode, &read_off, &read_len); - - doutc(cl, "orig %llu~%llu reading %llu~%llu", off, len, - read_off, read_len); - - req = ceph_osdc_new_request(osdc, &ci->i_layout, - ci->i_vino, read_off, &read_len, 0, 1, - sparse ? CEPH_OSD_OP_SPARSE_READ : - CEPH_OSD_OP_READ, - CEPH_OSD_FLAG_READ, - NULL, ci->i_truncate_seq, - ci->i_truncate_size, false); - if (IS_ERR(req)) { - ret = PTR_ERR(req); - break; - } - - /* adjust len downward if the request truncated the len */ - if (off + len > read_off + read_len) - len = read_off + read_len - off; - more = len < iov_iter_count(to); - - op = &req->r_ops[0]; - if (sparse) { - extent_cnt = __ceph_sparse_read_ext_count(inode, read_len); - ret = ceph_alloc_sparse_ext_map(op, extent_cnt); - if (ret) { - ceph_osdc_put_request(req); - break; - } - } - - num_pages = calc_pages_for(read_off, read_len); - page_off = offset_in_page(off); - pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL); - if (IS_ERR(pages)) { - ceph_osdc_put_request(req); - ret = PTR_ERR(pages); - break; - } - - osd_req_op_extent_osd_data_pages(req, 0, pages, read_len, - offset_in_page(read_off), - false, true); - - ceph_osdc_start_request(osdc, req); - ret = ceph_osdc_wait_request(osdc, req); - - ceph_update_read_metrics(&fsc->mdsc->metric, - req->r_start_latency, - req->r_end_latency, - read_len, ret); - - if (ret > 0) - objver = req->r_version; - - i_size = i_size_read(inode); - doutc(cl, "%llu~%llu got %zd i_size %llu%s\n", off, len, - ret, i_size, (more ? 
" MORE" : "")); - - /* Fix it to go to end of extent map */ - if (sparse && ret >= 0) - ret = ceph_sparse_ext_map_end(op); - else if (ret == -ENOENT) - ret = 0; - - if (ret < 0) { - ceph_osdc_put_request(req); - if (ret == -EBLOCKLISTED) - fsc->blocklisted = true; - break; - } - - if (IS_ENCRYPTED(inode)) { - int fret; - - fret = ceph_fscrypt_decrypt_extents(inode, pages, - read_off, op->extent.sparse_ext, - op->extent.sparse_ext_cnt); - if (fret < 0) { - ret = fret; - ceph_osdc_put_request(req); - break; - } - - /* account for any partial block at the beginning */ - fret -= (off - read_off); - - /* - * Short read after big offset adjustment? - * Nothing is usable, just call it a zero - * len read. - */ - fret = max(fret, 0); - - /* account for partial block at the end */ - ret = min_t(ssize_t, fret, len); - } - - /* Short read but not EOF? Zero out the remainder. */ - if (ret < len && (off + ret < i_size)) { - int zlen = min(len - ret, i_size - off - ret); - int zoff = page_off + ret; - - doutc(cl, "zero gap %llu~%llu\n", off + ret, - off + ret + zlen); - ceph_zero_page_vector_range(zoff, zlen, pages); - ret += zlen; - } - - if (off + ret > i_size) - left = (i_size > off) ? i_size - off : 0; - else - left = ret; - - while (left > 0) { - size_t plen, copied; - - plen = min_t(size_t, left, PAGE_SIZE - page_off); - SetPageUptodate(pages[idx]); - copied = copy_page_to_iter(pages[idx++], - page_off, plen, to); - off += copied; - left -= copied; - page_off = 0; - if (copied < plen) { - ret = -EFAULT; - break; - } - } - - ceph_osdc_put_request(req); - - if (off >= i_size || !more) - break; - } - - if (ret > 0) { - if (off >= i_size) { - *retry_op = CHECK_EOF; - ret = i_size - *ki_pos; - *ki_pos = i_size; - } else { - ret = off - *ki_pos; - *ki_pos = off; - } - - if (last_objver) - *last_objver = objver; - } - doutc(cl, "result %zd retry_op %d\n", ret, *retry_op); - return ret; -} - -static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to, - int *retry_op) -{ - struct file *file = iocb->ki_filp; - struct inode *inode = file_inode(file); - struct ceph_client *cl = ceph_inode_to_client(inode); - - doutc(cl, "on file %p %llx~%zx %s\n", file, iocb->ki_pos, - iov_iter_count(to), - (file->f_flags & O_DIRECT) ? 
"O_DIRECT" : ""); - - return __ceph_sync_read(inode, &iocb->ki_pos, to, retry_op, NULL); -} - -struct ceph_aio_request { - struct kiocb *iocb; - size_t total_len; - bool write; - bool should_dirty; - int error; - struct list_head osd_reqs; - unsigned num_reqs; - atomic_t pending_reqs; - struct timespec64 mtime; - struct ceph_cap_flush *prealloc_cf; -}; - -struct ceph_aio_work { - struct work_struct work; - struct ceph_osd_request *req; -}; - -static void ceph_aio_retry_work(struct work_struct *work); - -static void ceph_aio_complete(struct inode *inode, - struct ceph_aio_request *aio_req) -{ - struct ceph_client *cl = ceph_inode_to_client(inode); - struct ceph_inode_info *ci = ceph_inode(inode); - int ret; - - if (!atomic_dec_and_test(&aio_req->pending_reqs)) - return; - - if (aio_req->iocb->ki_flags & IOCB_DIRECT) - inode_dio_end(inode); - - ret = aio_req->error; - if (!ret) - ret = aio_req->total_len; - - doutc(cl, "%p %llx.%llx rc %d\n", inode, ceph_vinop(inode), ret); - - if (ret >= 0 && aio_req->write) { - int dirty; - - loff_t endoff = aio_req->iocb->ki_pos + aio_req->total_len; - if (endoff > i_size_read(inode)) { - if (ceph_inode_set_size(inode, endoff)) - ceph_check_caps(ci, CHECK_CAPS_AUTHONLY); - } - - spin_lock(&ci->i_ceph_lock); - dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, - &aio_req->prealloc_cf); - spin_unlock(&ci->i_ceph_lock); - if (dirty) - __mark_inode_dirty(inode, dirty); - - } - - ceph_put_cap_refs(ci, (aio_req->write ? CEPH_CAP_FILE_WR : - CEPH_CAP_FILE_RD)); - - aio_req->iocb->ki_complete(aio_req->iocb, ret); - - ceph_free_cap_flush(aio_req->prealloc_cf); - kfree(aio_req); -} - -static void ceph_aio_complete_req(struct ceph_osd_request *req) -{ - int rc = req->r_result; - struct inode *inode = req->r_inode; - struct ceph_aio_request *aio_req = req->r_priv; - struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0); - struct ceph_osd_req_op *op = &req->r_ops[0]; - struct ceph_client_metric *metric = &ceph_sb_to_mdsc(inode->i_sb)->metric; - size_t len = osd_data->iter.count; - bool sparse = (op->op == CEPH_OSD_OP_SPARSE_READ); - struct ceph_client *cl = ceph_inode_to_client(inode); - - doutc(cl, "req %p inode %p %llx.%llx, rc %d bytes %zu\n", req, - inode, ceph_vinop(inode), rc, len); - - if (rc == -EOLDSNAPC) { - struct ceph_aio_work *aio_work; - BUG_ON(!aio_req->write); - - aio_work = kmalloc(sizeof(*aio_work), GFP_NOFS); - if (aio_work) { - INIT_WORK(&aio_work->work, ceph_aio_retry_work); - aio_work->req = req; - queue_work(ceph_inode_to_fs_client(inode)->inode_wq, - &aio_work->work); - return; - } - rc = -ENOMEM; - } else if (!aio_req->write) { - if (sparse && rc >= 0) - rc = ceph_sparse_ext_map_end(op); - if (rc == -ENOENT) - rc = 0; - if (rc >= 0 && len > rc) { - int zlen = len - rc; - - /* - * If read is satisfied by single OSD request, - * it can pass EOF. Otherwise read is within - * i_size. 
- */ - if (aio_req->num_reqs == 1) { - loff_t i_size = i_size_read(inode); - loff_t endoff = aio_req->iocb->ki_pos + rc; - if (endoff < i_size) - zlen = min_t(size_t, zlen, - i_size - endoff); - aio_req->total_len = rc + zlen; - } - - iov_iter_advance(&osd_data->iter, rc); - iov_iter_zero(zlen, &osd_data->iter); - } - } - - /* r_start_latency == 0 means the request was not submitted */ - if (req->r_start_latency) { - if (aio_req->write) - ceph_update_write_metrics(metric, req->r_start_latency, - req->r_end_latency, len, rc); - else - ceph_update_read_metrics(metric, req->r_start_latency, - req->r_end_latency, len, rc); - } - - if (aio_req->should_dirty) - ceph_dirty_pages(osd_data->dbuf); - ceph_osdc_put_request(req); - - if (rc < 0) - cmpxchg(&aio_req->error, 0, rc); - - ceph_aio_complete(inode, aio_req); - return; -} - -static void ceph_aio_retry_work(struct work_struct *work) -{ - struct ceph_aio_work *aio_work = - container_of(work, struct ceph_aio_work, work); - struct ceph_osd_request *orig_req = aio_work->req; - struct ceph_aio_request *aio_req = orig_req->r_priv; - struct inode *inode = orig_req->r_inode; - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_snap_context *snapc; - struct ceph_osd_request *req; - int ret; - - spin_lock(&ci->i_ceph_lock); - if (__ceph_have_pending_cap_snap(ci)) { - struct ceph_cap_snap *capsnap = - list_last_entry(&ci->i_cap_snaps, - struct ceph_cap_snap, - ci_item); - snapc = ceph_get_snap_context(capsnap->context); - } else { - BUG_ON(!ci->i_head_snapc); - snapc = ceph_get_snap_context(ci->i_head_snapc); - } - spin_unlock(&ci->i_ceph_lock); - - req = ceph_osdc_alloc_request(orig_req->r_osdc, snapc, 1, - false, GFP_NOFS); - if (!req) { - ret = -ENOMEM; - req = orig_req; - goto out; - } - - req->r_flags = /* CEPH_OSD_FLAG_ORDERSNAP | */ CEPH_OSD_FLAG_WRITE; - ceph_oloc_copy(&req->r_base_oloc, &orig_req->r_base_oloc); - ceph_oid_copy(&req->r_base_oid, &orig_req->r_base_oid); - - req->r_ops[0] = orig_req->r_ops[0]; - - req->r_mtime = aio_req->mtime; - req->r_data_offset = req->r_ops[0].extent.offset; - - ret = ceph_osdc_alloc_messages(req, GFP_NOFS); - if (ret) { - ceph_osdc_put_request(req); - req = orig_req; - goto out; - } - - ceph_osdc_put_request(orig_req); - - req->r_callback = ceph_aio_complete_req; - req->r_inode = inode; - req->r_priv = aio_req; - - ceph_osdc_start_request(req->r_osdc, req); -out: - if (ret < 0) { - req->r_result = ret; - ceph_aio_complete_req(req); - } - - ceph_put_snap_context(snapc); - kfree(aio_work); -} - -static ssize_t -ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter, - struct ceph_snap_context *snapc, - struct ceph_cap_flush **pcf) -{ - struct file *file = iocb->ki_filp; - struct inode *inode = file_inode(file); - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_client_metric *metric = &fsc->mdsc->metric; - struct ceph_vino vino; - struct ceph_osd_request *req; - struct ceph_aio_request *aio_req = NULL; - struct ceph_databuf *dbuf = NULL; - int flags; - int ret = 0; - struct timespec64 mtime = current_time(inode); - size_t count = iov_iter_count(iter); - loff_t pos = iocb->ki_pos; - bool write = iov_iter_rw(iter) == WRITE; - bool should_dirty = !write && user_backed_iter(iter); - bool sparse = ceph_test_mount_opt(fsc, SPARSEREAD); - - if (write && ceph_snap(file_inode(file)) != CEPH_NOSNAP) - return -EROFS; - - doutc(cl, "sync_direct_%s on file %p %lld~%u snapc %p seq %lld\n", - 
(write ? "write" : "read"), file, pos, (unsigned)count, - snapc, snapc ? snapc->seq : 0); - - if (write) { - int ret2; - - ceph_fscache_invalidate(inode, true); - - ret2 = invalidate_inode_pages2_range(inode->i_mapping, - pos >> PAGE_SHIFT, - (pos + count - 1) >> PAGE_SHIFT); - if (ret2 < 0) - doutc(cl, "invalidate_inode_pages2_range returned %d\n", - ret2); - - flags = /* CEPH_OSD_FLAG_ORDERSNAP | */ CEPH_OSD_FLAG_WRITE; - } else { - flags = CEPH_OSD_FLAG_READ; - } - - while (iov_iter_count(iter) > 0) { - u64 size = iov_iter_count(iter); - struct ceph_osd_req_op *op; - size_t len; - int readop = sparse ? CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ; - int extent_cnt; - - if (write) - size = min_t(u64, size, fsc->mount_options->wsize); - else - size = min_t(u64, size, fsc->mount_options->rsize); - - vino = ceph_vino(inode); - req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, - vino, pos, &size, 0, - 1, - write ? CEPH_OSD_OP_WRITE : readop, - flags, snapc, - ci->i_truncate_seq, - ci->i_truncate_size, - false); - if (IS_ERR(req)) { - ret = PTR_ERR(req); - break; - } - - op = &req->r_ops[0]; - if (!write && sparse) { - extent_cnt = __ceph_sparse_read_ext_count(inode, size); - ret = ceph_alloc_sparse_ext_map(op, extent_cnt); - if (ret) { - ceph_osdc_put_request(req); - break; - } - } - - dbuf = iter_get_bvecs_alloc(iter, size, write); - if (IS_ERR(dbuf)) { - ceph_osdc_put_request(req); - ret = PTR_ERR(dbuf); - break; - } - len = ceph_databuf_len(dbuf); - if (len != size) - osd_req_op_extent_update(req, 0, len); - - osd_req_op_extent_osd_databuf(req, 0, dbuf); - - /* - * To simplify error handling, allow AIO when IO within i_size - * or IO can be satisfied by single OSD request. - */ - if (pos == iocb->ki_pos && !is_sync_kiocb(iocb) && - (len == count || pos + count <= i_size_read(inode))) { - aio_req = kzalloc(sizeof(*aio_req), GFP_KERNEL); - if (aio_req) { - aio_req->iocb = iocb; - aio_req->write = write; - aio_req->should_dirty = should_dirty; - INIT_LIST_HEAD(&aio_req->osd_reqs); - if (write) { - aio_req->mtime = mtime; - swap(aio_req->prealloc_cf, *pcf); - } - } - /* ignore error */ - } - - if (write) { - /* - * throw out any page cache pages in this range. this - * may block. 
- */ - truncate_inode_pages_range(inode->i_mapping, pos, - PAGE_ALIGN(pos + len) - 1); - - req->r_mtime = mtime; - } - - if (aio_req) { - aio_req->total_len += len; - aio_req->num_reqs++; - atomic_inc(&aio_req->pending_reqs); - - req->r_callback = ceph_aio_complete_req; - req->r_inode = inode; - req->r_priv = aio_req; - list_add_tail(&req->r_private_item, &aio_req->osd_reqs); - - pos += len; - continue; - } - - ceph_osdc_start_request(req->r_osdc, req); - ret = ceph_osdc_wait_request(&fsc->client->osdc, req); - - if (write) - ceph_update_write_metrics(metric, req->r_start_latency, - req->r_end_latency, len, ret); - else - ceph_update_read_metrics(metric, req->r_start_latency, - req->r_end_latency, len, ret); - - size = i_size_read(inode); - if (!write) { - if (sparse && ret >= 0) - ret = ceph_sparse_ext_map_end(op); - else if (ret == -ENOENT) - ret = 0; - - if (ret >= 0 && ret < len && pos + ret < size) { - int zlen = min_t(size_t, len - ret, - size - pos - ret); - - iov_iter_advance(&dbuf->iter, ret); - iov_iter_zero(zlen, &dbuf->iter); - ret += zlen; - } - if (ret >= 0) - len = ret; - } - - ceph_osdc_put_request(req); - if (ret < 0) - break; - - pos += len; - if (!write && pos >= size) - break; - - if (write && pos > size) { - if (ceph_inode_set_size(inode, pos)) - ceph_check_caps(ceph_inode(inode), - CHECK_CAPS_AUTHONLY); - } - } - - if (aio_req) { - LIST_HEAD(osd_reqs); - - if (aio_req->num_reqs == 0) { - kfree(aio_req); - return ret; - } - - ceph_get_cap_refs(ci, write ? CEPH_CAP_FILE_WR : - CEPH_CAP_FILE_RD); - - list_splice(&aio_req->osd_reqs, &osd_reqs); - inode_dio_begin(inode); - while (!list_empty(&osd_reqs)) { - req = list_first_entry(&osd_reqs, - struct ceph_osd_request, - r_private_item); - list_del_init(&req->r_private_item); - if (ret >= 0) - ceph_osdc_start_request(req->r_osdc, req); - if (ret < 0) { - req->r_result = ret; - ceph_aio_complete_req(req); - } - } - return -EIOCBQUEUED; - } - - if (ret != -EOLDSNAPC && pos > iocb->ki_pos) { - ret = pos - iocb->ki_pos; - iocb->ki_pos = pos; - } - return ret; -} - -/* - * Synchronous write, straight from __user pointer or user pages. - * - * If write spans object boundary, just do multiple writes. (For a - * correct atomic write, we should e.g. take write locks on all - * objects, rollback on failure, etc.) 
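 * For example, with the default 4 MiB object size (and ignoring the
 * wsize cap), an 8 MiB write at offset 2 MiB becomes three
 * consecutive OSD writes of 2, 4 and 2 MiB, each clamped to one
 * object by ceph_calc_file_object_mapping(); a failure part-way
 * through can leave the earlier objects already updated.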
- */ -static ssize_t -ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos, - struct ceph_snap_context *snapc) -{ - struct file *file = iocb->ki_filp; - struct inode *inode = file_inode(file); - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_osd_client *osdc = &fsc->client->osdc; - struct ceph_osd_request *req; - struct page **pages; - u64 len; - int num_pages; - int written = 0; - int ret; - bool check_caps = false; - struct timespec64 mtime = current_time(inode); - size_t count = iov_iter_count(from); - - if (ceph_snap(file_inode(file)) != CEPH_NOSNAP) - return -EROFS; - - doutc(cl, "on file %p %lld~%u snapc %p seq %lld\n", file, pos, - (unsigned)count, snapc, snapc->seq); - - ret = filemap_write_and_wait_range(inode->i_mapping, - pos, pos + count - 1); - if (ret < 0) - return ret; - - ceph_fscache_invalidate(inode, false); - - while ((len = iov_iter_count(from)) > 0) { - size_t left; - int n; - u64 write_pos = pos; - u64 write_len = len; - u64 objnum, objoff; - u64 assert_ver = 0; - bool rmw; - bool first, last; - struct iov_iter saved_iter = *from; - size_t off, xlen; - - ceph_fscrypt_adjust_off_and_len(inode, &write_pos, &write_len); - - /* clamp the length to the end of first object */ - ceph_calc_file_object_mapping(&ci->i_layout, write_pos, - write_len, &objnum, &objoff, - &xlen); - write_len = xlen; - - /* adjust len downward if it goes beyond current object */ - if (pos + len > write_pos + write_len) - len = write_pos + write_len - pos; - - /* - * If we had to adjust the length or position to align with a - * crypto block, then we must do a read/modify/write cycle. We - * use a version assertion to redrive the thing if something - * changes in between. - */ - first = pos != write_pos; - last = (pos + len) != (write_pos + write_len); - rmw = first || last; - - doutc(cl, "ino %llx %lld~%llu adjusted %lld~%llu -- %srmw\n", - ci->i_vino.ino, pos, len, write_pos, write_len, - rmw ? "" : "no "); - - /* - * The data is emplaced into the page as it would be if it were - * in an array of pagecache pages. - */ - num_pages = calc_pages_for(write_pos, write_len); - pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL); - if (IS_ERR(pages)) { - ret = PTR_ERR(pages); - break; - } - - /* Do we need to preload the pages? */ - if (rmw) { - u64 first_pos = write_pos; - u64 last_pos = (write_pos + write_len) - CEPH_FSCRYPT_BLOCK_SIZE; - u64 read_len = CEPH_FSCRYPT_BLOCK_SIZE; - struct ceph_osd_req_op *op; - - /* We should only need to do this for encrypted inodes */ - WARN_ON_ONCE(!IS_ENCRYPTED(inode)); - - /* No need to do two reads if first and last blocks are same */ - if (first && last_pos == first_pos) - last = false; - - /* - * Allocate a read request for one or two extents, - * depending on how the request was aligned. - */ - req = ceph_osdc_new_request(osdc, &ci->i_layout, - ci->i_vino, first ? first_pos : last_pos, - &read_len, 0, (first && last) ? 2 : 1, - CEPH_OSD_OP_SPARSE_READ, CEPH_OSD_FLAG_READ, - NULL, ci->i_truncate_seq, - ci->i_truncate_size, false); - if (IS_ERR(req)) { - ceph_release_page_vector(pages, num_pages); - ret = PTR_ERR(req); - break; - } - - /* Something is misaligned! */ - if (read_len != CEPH_FSCRYPT_BLOCK_SIZE) { - ceph_osdc_put_request(req); - ceph_release_page_vector(pages, num_pages); - ret = -EIO; - break; - } - - /* Add extent for first block? 
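 * (With a CEPH_FSCRYPT_BLOCK_SIZE of 4096, e.g., an 8-byte write at
 *  pos 4100 expands to write_pos 4096, write_len 4096, making first
 *  and last refer to the same block; last was cleared above in that
 *  case, so a single sparse read preloads the one block before the
 *  copy-in.)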
*/ - op = &req->r_ops[0]; - - if (first) { - osd_req_op_extent_osd_data_pages(req, 0, pages, - CEPH_FSCRYPT_BLOCK_SIZE, - offset_in_page(first_pos), - false, false); - /* We only expect a single extent here */ - ret = __ceph_alloc_sparse_ext_map(op, 1); - if (ret) { - ceph_osdc_put_request(req); - ceph_release_page_vector(pages, num_pages); - break; - } - } - - /* Add extent for last block */ - if (last) { - /* Init the other extent if first extent has been used */ - if (first) { - op = &req->r_ops[1]; - osd_req_op_extent_init(req, 1, - CEPH_OSD_OP_SPARSE_READ, - last_pos, CEPH_FSCRYPT_BLOCK_SIZE, - ci->i_truncate_size, - ci->i_truncate_seq); - } - - ret = __ceph_alloc_sparse_ext_map(op, 1); - if (ret) { - ceph_osdc_put_request(req); - ceph_release_page_vector(pages, num_pages); - break; - } - - osd_req_op_extent_osd_data_pages(req, first ? 1 : 0, - &pages[num_pages - 1], - CEPH_FSCRYPT_BLOCK_SIZE, - offset_in_page(last_pos), - false, false); - } - - ceph_osdc_start_request(osdc, req); - ret = ceph_osdc_wait_request(osdc, req); - - /* FIXME: length field is wrong if there are 2 extents */ - ceph_update_read_metrics(&fsc->mdsc->metric, - req->r_start_latency, - req->r_end_latency, - read_len, ret); - - /* Ok if object is not already present */ - if (ret == -ENOENT) { - /* - * If there is no object, then we can't assert - * on its version. Set it to 0, and we'll use an - * exclusive create instead. - */ - ceph_osdc_put_request(req); - ret = 0; - - /* - * zero out the soon-to-be uncopied parts of the - * first and last pages. - */ - if (first) - zero_user_segment(pages[0], 0, - offset_in_page(first_pos)); - if (last) - zero_user_segment(pages[num_pages - 1], - offset_in_page(last_pos), - PAGE_SIZE); - } else { - if (ret < 0) { - ceph_osdc_put_request(req); - ceph_release_page_vector(pages, num_pages); - break; - } - - op = &req->r_ops[0]; - if (op->extent.sparse_ext_cnt == 0) { - if (first) - zero_user_segment(pages[0], 0, - offset_in_page(first_pos)); - else - zero_user_segment(pages[num_pages - 1], - offset_in_page(last_pos), - PAGE_SIZE); - } else if (op->extent.sparse_ext_cnt != 1 || - ceph_sparse_ext_map_end(op) != - CEPH_FSCRYPT_BLOCK_SIZE) { - ret = -EIO; - ceph_osdc_put_request(req); - ceph_release_page_vector(pages, num_pages); - break; - } - - if (first && last) { - op = &req->r_ops[1]; - if (op->extent.sparse_ext_cnt == 0) { - zero_user_segment(pages[num_pages - 1], - offset_in_page(last_pos), - PAGE_SIZE); - } else if (op->extent.sparse_ext_cnt != 1 || - ceph_sparse_ext_map_end(op) != - CEPH_FSCRYPT_BLOCK_SIZE) { - ret = -EIO; - ceph_osdc_put_request(req); - ceph_release_page_vector(pages, num_pages); - break; - } - } - - /* Grab assert version. It must be non-zero. 
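 * (The object version read back here is asserted on the later
 *  CEPH_OSD_OP_ASSERT_VER write: if another writer bumps the
 *  version in between, that write fails with -ERANGE/-EOVERFLOW
 *  and the whole read/modify/write cycle is redriven from
 *  saved_iter.)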
*/ - assert_ver = req->r_version; - WARN_ON_ONCE(ret > 0 && assert_ver == 0); - - ceph_osdc_put_request(req); - if (first) { - ret = ceph_fscrypt_decrypt_block_inplace(inode, - pages[0], CEPH_FSCRYPT_BLOCK_SIZE, - offset_in_page(first_pos), - first_pos >> CEPH_FSCRYPT_BLOCK_SHIFT); - if (ret < 0) { - ceph_release_page_vector(pages, num_pages); - break; - } - } - if (last) { - ret = ceph_fscrypt_decrypt_block_inplace(inode, - pages[num_pages - 1], - CEPH_FSCRYPT_BLOCK_SIZE, - offset_in_page(last_pos), - last_pos >> CEPH_FSCRYPT_BLOCK_SHIFT); - if (ret < 0) { - ceph_release_page_vector(pages, num_pages); - break; - } - } - } - } - - left = len; - off = offset_in_page(pos); - for (n = 0; n < num_pages; n++) { - size_t plen = min_t(size_t, left, PAGE_SIZE - off); - - /* copy the data */ - ret = copy_page_from_iter(pages[n], off, plen, from); - if (ret != plen) { - ret = -EFAULT; - break; - } - off = 0; - left -= ret; - } - if (ret < 0) { - doutc(cl, "write failed with %d\n", ret); - ceph_release_page_vector(pages, num_pages); - break; - } - - if (IS_ENCRYPTED(inode)) { - ret = ceph_fscrypt_encrypt_pages(inode, pages, - write_pos, write_len, - GFP_KERNEL); - if (ret < 0) { - doutc(cl, "encryption failed with %d\n", ret); - ceph_release_page_vector(pages, num_pages); - break; - } - } - - req = ceph_osdc_new_request(osdc, &ci->i_layout, - ci->i_vino, write_pos, &write_len, - rmw ? 1 : 0, rmw ? 2 : 1, - CEPH_OSD_OP_WRITE, - CEPH_OSD_FLAG_WRITE, - snapc, ci->i_truncate_seq, - ci->i_truncate_size, false); - if (IS_ERR(req)) { - ret = PTR_ERR(req); - ceph_release_page_vector(pages, num_pages); - break; - } - - doutc(cl, "write op %lld~%llu\n", write_pos, write_len); - osd_req_op_extent_osd_data_pages(req, rmw ? 1 : 0, pages, write_len, - offset_in_page(write_pos), false, - true); - req->r_inode = inode; - req->r_mtime = mtime; - - /* Set up the assertion */ - if (rmw) { - /* - * Set up the assertion. If we don't have a version - * number, then the object doesn't exist yet. Use an - * exclusive create instead of a version assertion in - * that case. - */ - if (assert_ver) { - osd_req_op_init(req, 0, CEPH_OSD_OP_ASSERT_VER, 0); - req->r_ops[0].assert_ver.ver = assert_ver; - } else { - osd_req_op_init(req, 0, CEPH_OSD_OP_CREATE, - CEPH_OSD_OP_FLAG_EXCL); - } - } - - ceph_osdc_start_request(osdc, req); - ret = ceph_osdc_wait_request(osdc, req); - - ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency, - req->r_end_latency, len, ret); - ceph_osdc_put_request(req); - if (ret != 0) { - doutc(cl, "osd write returned %d\n", ret); - /* Version changed! Must re-do the rmw cycle */ - if ((assert_ver && (ret == -ERANGE || ret == -EOVERFLOW)) || - (!assert_ver && ret == -EEXIST)) { - /* We should only ever see this on a rmw */ - WARN_ON_ONCE(!rmw); - - /* The version should never go backward */ - WARN_ON_ONCE(ret == -EOVERFLOW); - - *from = saved_iter; - - /* FIXME: limit number of times we loop? */ - continue; - } - ceph_set_error_write(ci); - break; - } - - ceph_clear_error_write(ci); - - /* - * We successfully wrote to a range of the file. Declare - * that region of the pagecache invalid. 
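 * (The data went straight to the OSDs, so any pagecache copies of
 *  this range are now stale; invalidating them forces subsequent
 *  reads to fetch the newly written data.)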
- */ - ret = invalidate_inode_pages2_range( - inode->i_mapping, - pos >> PAGE_SHIFT, - (pos + len - 1) >> PAGE_SHIFT); - if (ret < 0) { - doutc(cl, "invalidate_inode_pages2_range returned %d\n", - ret); - ret = 0; - } - pos += len; - written += len; - doutc(cl, "written %d\n", written); - if (pos > i_size_read(inode)) { - check_caps = ceph_inode_set_size(inode, pos); - if (check_caps) - ceph_check_caps(ceph_inode(inode), - CHECK_CAPS_AUTHONLY); - } - - } - - if (ret != -EOLDSNAPC && written > 0) { - ret = written; - iocb->ki_pos = pos; - } - doutc(cl, "returning %d\n", ret); - return ret; -} - -/* - * Wrap generic_file_aio_read with checks for cap bits on the inode. - * Atomically grab references, so that those bits are not released - * back to the MDS mid-read. - * - * Hmm, the sync read case isn't actually async... should it be? - */ -static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to) -{ - struct file *filp = iocb->ki_filp; - struct ceph_file_info *fi = filp->private_data; - size_t len = iov_iter_count(to); - struct inode *inode = file_inode(filp); - struct ceph_inode_info *ci = ceph_inode(inode); - bool direct_lock = iocb->ki_flags & IOCB_DIRECT; - struct ceph_client *cl = ceph_inode_to_client(inode); - ssize_t ret; - int want = 0, got = 0; - int retry_op = 0, read = 0; - -again: - doutc(cl, "%llu~%u trying to get caps on %p %llx.%llx\n", - iocb->ki_pos, (unsigned)len, inode, ceph_vinop(inode)); - - if (ceph_inode_is_shutdown(inode)) - return -ESTALE; - - if (direct_lock) - ceph_start_io_direct(inode); - else - ceph_start_io_read(inode); - - if (!(fi->flags & CEPH_F_SYNC) && !direct_lock) - want |= CEPH_CAP_FILE_CACHE; - if (fi->fmode & CEPH_FILE_MODE_LAZY) - want |= CEPH_CAP_FILE_LAZYIO; - - ret = ceph_get_caps(filp, CEPH_CAP_FILE_RD, want, -1, &got); - if (ret < 0) { - if (direct_lock) - ceph_end_io_direct(inode); - else - ceph_end_io_read(inode); - return ret; - } - - if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 || - (iocb->ki_flags & IOCB_DIRECT) || - (fi->flags & CEPH_F_SYNC)) { - - doutc(cl, "sync %p %llx.%llx %llu~%u got cap refs on %s\n", - inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len, - ceph_cap_string(got)); - - if (!ceph_has_inline_data(ci)) { - if (!retry_op && - (iocb->ki_flags & IOCB_DIRECT) && - !IS_ENCRYPTED(inode)) { - ret = ceph_direct_read_write(iocb, to, - NULL, NULL); - if (ret >= 0 && ret < len) - retry_op = CHECK_EOF; - } else { - ret = ceph_sync_read(iocb, to, &retry_op); - } - } else { - retry_op = READ_INLINE; - } - } else { - doutc(cl, "async %p %llx.%llx %llu~%u got cap refs on %s\n", - inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len, - ceph_cap_string(got)); - ret = generic_file_read_iter(iocb, to); - } - - doutc(cl, "%p %llx.%llx dropping cap refs on %s = %d\n", - inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret); - ceph_put_cap_refs(ci, got); - - if (direct_lock) - ceph_end_io_direct(inode); - else - ceph_end_io_read(inode); - - if (retry_op > HAVE_RETRIED && ret >= 0) { - int statret; - struct page *page = NULL; - loff_t i_size; - int mask = CEPH_STAT_CAP_SIZE; - if (retry_op == READ_INLINE) { - page = __page_cache_alloc(GFP_KERNEL); - if (!page) - return -ENOMEM; - - mask = CEPH_STAT_CAP_INLINE_DATA; - } - - statret = __ceph_do_getattr(inode, page, mask, !!page); - if (statret < 0) { - if (page) - __free_page(page); - if (statret == -ENODATA) { - BUG_ON(retry_op != READ_INLINE); - goto again; - } - return statret; - } - - i_size = i_size_read(inode); - if (retry_op == READ_INLINE) { - BUG_ON(ret > 
0 || read > 0); - if (iocb->ki_pos < i_size && - iocb->ki_pos < PAGE_SIZE) { - loff_t end = min_t(loff_t, i_size, - iocb->ki_pos + len); - end = min_t(loff_t, end, PAGE_SIZE); - if (statret < end) - zero_user_segment(page, statret, end); - ret = copy_page_to_iter(page, - iocb->ki_pos & ~PAGE_MASK, - end - iocb->ki_pos, to); - iocb->ki_pos += ret; - read += ret; - } - if (iocb->ki_pos < i_size && read < len) { - size_t zlen = min_t(size_t, len - read, - i_size - iocb->ki_pos); - ret = iov_iter_zero(zlen, to); - iocb->ki_pos += ret; - read += ret; - } - __free_pages(page, 0); - return read; - } - - /* hit EOF or hole? */ - if (retry_op == CHECK_EOF && iocb->ki_pos < i_size && - ret < len) { - doutc(cl, "may hit hole, ppos %lld < size %lld, reading more\n", - iocb->ki_pos, i_size); - - read += ret; - len -= ret; - retry_op = HAVE_RETRIED; - goto again; - } - } - - if (ret >= 0) - ret += read; - - return ret; -} -#endif // TODO: Remove after netfs conversion - /* * Wrap filemap_splice_read with checks for cap bits on the inode. * Atomically grab references, so that those bits are not released @@ -2298,203 +991,6 @@ static ssize_t ceph_splice_read(struct file *in, loff_t *ppos, return ret; } -#if 0 // TODO: Remove after netfs conversion -/* - * Take cap references to avoid releasing caps to MDS mid-write. - * - * If we are synchronous, and write with an old snap context, the OSD - * may return EOLDSNAPC. In that case, retry the write.. _after_ - * dropping our cap refs and allowing the pending snap to logically - * complete _before_ this write occurs. - * - * If we are near ENOSPC, write synchronously. - */ -static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from) -{ - struct file *file = iocb->ki_filp; - struct ceph_file_info *fi = file->private_data; - struct inode *inode = file_inode(file); - struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode); - struct ceph_client *cl = fsc->client; - struct ceph_osd_client *osdc = &fsc->client->osdc; - struct ceph_cap_flush *prealloc_cf; - ssize_t count, written = 0; - int err, want = 0, got; - bool direct_lock = false; - u32 map_flags; - u64 pool_flags; - loff_t pos; - loff_t limit = max(i_size_read(inode), fsc->max_file_size); - - if (ceph_inode_is_shutdown(inode)) - return -ESTALE; - - if (ceph_snap(inode) != CEPH_NOSNAP) - return -EROFS; - - prealloc_cf = ceph_alloc_cap_flush(); - if (!prealloc_cf) - return -ENOMEM; - - if ((iocb->ki_flags & (IOCB_DIRECT | IOCB_APPEND)) == IOCB_DIRECT) - direct_lock = true; - -retry_snap: - if (direct_lock) - ceph_start_io_direct(inode); - else - ceph_start_io_write(inode); - - if (iocb->ki_flags & IOCB_APPEND) { - err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false); - if (err < 0) - goto out; - } - - err = generic_write_checks(iocb, from); - if (err <= 0) - goto out; - - pos = iocb->ki_pos; - if (unlikely(pos >= limit)) { - err = -EFBIG; - goto out; - } else { - iov_iter_truncate(from, limit - pos); - } - - count = iov_iter_count(from); - if (ceph_quota_is_max_bytes_exceeded(inode, pos + count)) { - err = -EDQUOT; - goto out; - } - - down_read(&osdc->lock); - map_flags = osdc->osdmap->flags; - pool_flags = ceph_pg_pool_flags(osdc->osdmap, ci->i_layout.pool_id); - up_read(&osdc->lock); - if ((map_flags & CEPH_OSDMAP_FULL) || - (pool_flags & CEPH_POOL_FLAG_FULL)) { - err = -ENOSPC; - goto out; - } - - err = file_remove_privs(file); - if (err) - goto out; - - doutc(cl, "%p %llx.%llx %llu~%zd getting caps. 
i_size %llu\n", - inode, ceph_vinop(inode), pos, count, - i_size_read(inode)); - if (!(fi->flags & CEPH_F_SYNC) && !direct_lock) - want |= CEPH_CAP_FILE_BUFFER; - if (fi->fmode & CEPH_FILE_MODE_LAZY) - want |= CEPH_CAP_FILE_LAZYIO; - got = 0; - err = ceph_get_caps(file, CEPH_CAP_FILE_WR, want, pos + count, &got); - if (err < 0) - goto out; - - err = file_update_time(file); - if (err) - goto out_caps; - - inode_inc_iversion_raw(inode); - - doutc(cl, "%p %llx.%llx %llu~%zd got cap refs on %s\n", - inode, ceph_vinop(inode), pos, count, ceph_cap_string(got)); - - if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 || - (iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) || - (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) { - struct ceph_snap_context *snapc; - struct iov_iter data; - - spin_lock(&ci->i_ceph_lock); - if (__ceph_have_pending_cap_snap(ci)) { - struct ceph_cap_snap *capsnap = - list_last_entry(&ci->i_cap_snaps, - struct ceph_cap_snap, - ci_item); - snapc = ceph_get_snap_context(capsnap->context); - } else { - BUG_ON(!ci->i_head_snapc); - snapc = ceph_get_snap_context(ci->i_head_snapc); - } - spin_unlock(&ci->i_ceph_lock); - - /* we might need to revert back to that point */ - data = *from; - if ((iocb->ki_flags & IOCB_DIRECT) && !IS_ENCRYPTED(inode)) - written = ceph_direct_read_write(iocb, &data, snapc, - &prealloc_cf); - else - written = ceph_sync_write(iocb, &data, pos, snapc); - if (direct_lock) - ceph_end_io_direct(inode); - else - ceph_end_io_write(inode); - if (written > 0) - iov_iter_advance(from, written); - ceph_put_snap_context(snapc); - } else { - /* - * No need to acquire the i_truncate_mutex. Because - * the MDS revokes Fwb caps before sending truncate - * message to us. We can't get Fwb cap while there - * are pending vmtruncate. So write and vmtruncate - * can not run at the same time - */ - written = generic_perform_write(iocb, from); - ceph_end_io_write(inode); - } - - if (written >= 0) { - int dirty; - - spin_lock(&ci->i_ceph_lock); - dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR, - &prealloc_cf); - spin_unlock(&ci->i_ceph_lock); - if (dirty) - __mark_inode_dirty(inode, dirty); - if (ceph_quota_is_max_bytes_approaching(inode, iocb->ki_pos)) - ceph_check_caps(ci, CHECK_CAPS_FLUSH); - } - - doutc(cl, "%p %llx.%llx %llu~%u dropping cap refs on %s\n", - inode, ceph_vinop(inode), pos, (unsigned)count, - ceph_cap_string(got)); - ceph_put_cap_refs(ci, got); - - if (written == -EOLDSNAPC) { - doutc(cl, "%p %llx.%llx %llu~%u" "got EOLDSNAPC, retrying\n", - inode, ceph_vinop(inode), pos, (unsigned)count); - goto retry_snap; - } - - if (written >= 0) { - if ((map_flags & CEPH_OSDMAP_NEARFULL) || - (pool_flags & CEPH_POOL_FLAG_NEARFULL)) - iocb->ki_flags |= IOCB_DSYNC; - written = generic_write_sync(iocb, written); - } - - goto out_unlocked; -out_caps: - ceph_put_cap_refs(ci, got); -out: - if (direct_lock) - ceph_end_io_direct(inode); - else - ceph_end_io_write(inode); -out_unlocked: - ceph_free_cap_flush(prealloc_cf); - return written ? written : err; -} -#endif // TODO: Remove after netfs conversion - /* * llseek. be sure to verify file size on SEEK_END. */ diff --git a/fs/ceph/super.h b/fs/ceph/super.h index acd5c4821ded..97eddbf9dae9 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -470,19 +470,6 @@ struct ceph_inode_info { #endif }; -struct ceph_netfs_request_data { // TODO: Remove - int caps; - - /* - * Maximum size of a file readahead request. - * The fadvise could update the bdi's default ra_pages. 
- */ - unsigned int file_ra_pages; - - /* Set it if fadvise disables file readahead entirely */ - bool file_ra_disabled; -}; - struct ceph_io_request { struct netfs_io_request rreq; u64 rmw_assert_version; @@ -1260,9 +1247,6 @@ extern void __ceph_touch_fmode(struct ceph_inode_info *ci, struct ceph_mds_client *mdsc, int fmode); /* addr.c */ -#if 0 // TODO: Remove after netfs conversion -extern const struct netfs_request_ops ceph_netfs_ops; -#endif // TODO: Remove after netfs conversion bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio); extern int ceph_mmap(struct file *file, struct vm_area_struct *vma); extern int ceph_uninline_data(struct file *file); @@ -1293,11 +1277,6 @@ extern int ceph_renew_caps(struct inode *inode, int fmode); extern int ceph_open(struct inode *inode, struct file *file); extern int ceph_atomic_open(struct inode *dir, struct dentry *dentry, struct file *file, unsigned flags, umode_t mode); -#if 0 // TODO: Remove after netfs conversion -extern ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos, - struct iov_iter *to, int *retry_op, - u64 *last_objver); -#endif extern int ceph_release(struct inode *inode, struct file *filp); extern void ceph_fill_inline_data(struct inode *inode, struct page *locked_page, char *data, size_t len);