From patchwork Wed Feb  5 00:02:48 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Viacheslav Dubeyko
X-Patchwork-Id: 13960221
From: Viacheslav Dubeyko
To: ceph-devel@vger.kernel.org
Cc: idryomov@gmail.com, dhowells@redhat.com, linux-fsdevel@vger.kernel.org,
    pdonnell@redhat.com, amarkuze@redhat.com, Slava.Dubeyko@ibm.com,
    slava@dubeyko.com
Subject: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
Date: Tue, 4 Feb 2025 16:02:48 -0800
Message-ID: <20250205000249.123054-4-slava@dubeyko.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20250205000249.123054-1-slava@dubeyko.com>
References: <20250205000249.123054-1-slava@dubeyko.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
MIME-Version: 1.0

From: Viacheslav Dubeyko

The final responsibility of ceph_writepages_start() is to submit write
requests for the processed dirty folios/pages. The ceph_submit_write()
method gathers all of this logic in one place.

The generic/421 test fails to finish because of this issue:

Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.895403]       Not tainted 6.13.0-rc5+ #1
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897614] Call Trace:
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897620]  <TASK>
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897629]  __schedule+0x443/0x16b0
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897637]  schedule+0x2b/0x140
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897640]  io_schedule+0x4c/0x80
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897643]  folio_wait_bit_common+0x11b/0x310
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897646]  ? _raw_spin_unlock_irq+0xe/0x50
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897652]  ? __pfx_wake_page_function+0x10/0x10
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897655]  __folio_lock+0x17/0x30
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897658]  ceph_writepages_start+0xca9/0x1fb0
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897663]  ? fsnotify_remove_queued_event+0x2f/0x40
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897668]  do_writepages+0xd2/0x240
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897672]  __writeback_single_inode+0x44/0x350
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897675]  writeback_sb_inodes+0x25c/0x550
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897680]  wb_writeback+0x89/0x310
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897683]  ? finish_task_switch.isra.0+0x97/0x310
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897687]  wb_workfn+0xb5/0x410
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897689]  process_one_work+0x188/0x3d0
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897692]  worker_thread+0x2b5/0x3c0
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897694]  ? __pfx_worker_thread+0x10/0x10
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897696]  kthread+0xe1/0x120
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897699]  ? __pfx_kthread+0x10/0x10
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897701]  ret_from_fork+0x43/0x70
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897705]  ? __pfx_kthread+0x10/0x10
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897707]  ret_from_fork_asm+0x1a/0x30
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897711]  </TASK>

There are two problems here:

    if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
        rc = -EIO;
        goto release_folios;
    }

(1) ceph_kill_sb() doesn't wait for the flushing of all dirty
folios/pages to finish because of the racy nature of
mdsc->stopping_blockers. As a result, mdsc->stopping becomes
CEPH_MDSC_STOPPING_FLUSHED too early.

(2) If ceph_inc_osd_stopping_blocker(fsc->mdsc) fails to increment
mdsc->stopping_blockers, the already locked folios/pages are never
unlocked and the logic tries to lock the same page a second time.

This patch moves the submission logic into the new ceph_submit_write()
method and also solves the second issue.
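To make the fix for problem (2) concrete, here is a condensed sketch of
the new failure path (simplified, not the verbatim hunk; the complete
logic, including the folio batch cleanup, is in ceph_submit_write() in
the diff below):

    if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
        unsigned i;

        /* The write request cannot be submitted, so redirty and
         * unlock every page locked so far; otherwise, a following
         * writeback iteration blocks forever in __folio_lock()
         * on a folio that was left locked.
         */
        for (i = 0; i < ceph_wbc->locked_pages; i++) {
            struct page *page =
                ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);

            redirty_page_for_writepage(wbc, page);
            unlock_page(page);
        }

        ceph_osdc_put_request(req);
        return -EIO;
    }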
Signed-off-by: Viacheslav Dubeyko
---
 fs/ceph/addr.c | 461 +++++++++++++++++++++++++++----------------------
 1 file changed, 257 insertions(+), 204 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 739329846a07..02d20c000dc5 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1395,6 +1395,245 @@ int ceph_process_folio_batch(struct address_space *mapping,
 	return rc;
 }
 
+static inline
+void ceph_shift_unused_folios_left(struct folio_batch *fbatch)
+{
+	unsigned j, n = 0;
+
+	/* shift unused page to beginning of fbatch */
+	for (j = 0; j < folio_batch_count(fbatch); j++) {
+		if (!fbatch->folios[j])
+			continue;
+
+		if (n < j) {
+			fbatch->folios[n] = fbatch->folios[j];
+		}
+
+		n++;
+	}
+
+	fbatch->nr = n;
+}
+
+static
+int ceph_submit_write(struct address_space *mapping,
+		      struct writeback_control *wbc,
+		      struct ceph_writeback_ctl *ceph_wbc)
+{
+	struct inode *inode = mapping->host;
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+	struct ceph_client *cl = fsc->client;
+	struct ceph_vino vino = ceph_vino(inode);
+	struct ceph_osd_request *req = NULL;
+	struct page *page = NULL;
+	bool caching = ceph_is_cache_enabled(inode);
+	u64 offset;
+	u64 len;
+	unsigned i;
+
+new_request:
+	offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
+	len = ceph_wbc->wsize;
+
+	req = ceph_osdc_new_request(&fsc->client->osdc,
+				    &ci->i_layout, vino,
+				    offset, &len, 0, ceph_wbc->num_ops,
+				    CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
+				    ceph_wbc->snapc, ceph_wbc->truncate_seq,
+				    ceph_wbc->truncate_size, false);
+	if (IS_ERR(req)) {
+		req = ceph_osdc_new_request(&fsc->client->osdc,
+					    &ci->i_layout, vino,
+					    offset, &len, 0,
+					    min(ceph_wbc->num_ops,
+						CEPH_OSD_SLAB_OPS),
+					    CEPH_OSD_OP_WRITE,
+					    CEPH_OSD_FLAG_WRITE,
+					    ceph_wbc->snapc,
+					    ceph_wbc->truncate_seq,
+					    ceph_wbc->truncate_size,
+					    true);
+		BUG_ON(IS_ERR(req));
+	}
+
+	page = ceph_wbc->pages[ceph_wbc->locked_pages - 1];
+	BUG_ON(len < ceph_fscrypt_page_offset(page) + thp_size(page) - offset);
+
+	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
+		for (i = 0; i < folio_batch_count(&ceph_wbc->fbatch); i++) {
+			struct folio *folio = ceph_wbc->fbatch.folios[i];
+
+			if (!folio)
+				continue;
+
+			page = &folio->page;
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+		}
+
+		for (i = 0; i < ceph_wbc->locked_pages; i++) {
+			page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
+
+			if (!page)
+				continue;
+
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+		}
+
+		ceph_osdc_put_request(req);
+		return -EIO;
+	}
+
+	req->r_callback = writepages_finish;
+	req->r_inode = inode;
+
+	/* Format the osd request message and submit the write */
+	len = 0;
+	ceph_wbc->data_pages = ceph_wbc->pages;
+	ceph_wbc->op_idx = 0;
+	for (i = 0; i < ceph_wbc->locked_pages; i++) {
+		u64 cur_offset;
+
+		page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
+		cur_offset = page_offset(page);
+
+		/*
+		 * Discontinuity in page range? Ceph can handle that by just passing
+		 * multiple extents in the write op.
+		 */
+		if (offset + len != cur_offset) {
+			/* If it's full, stop here */
+			if (ceph_wbc->op_idx + 1 == req->r_num_ops)
+				break;
+
+			/* Kick off an fscache write with what we have so far. */
+			ceph_fscache_write_to_cache(inode, offset, len, caching);
+
+			/* Start a new extent */
+			osd_req_op_extent_dup_last(req, ceph_wbc->op_idx,
+						   cur_offset - offset);
+
+			doutc(cl, "got pages at %llu~%llu\n", offset, len);
+
+			osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx,
+							 ceph_wbc->data_pages,
+							 len, 0,
+							 ceph_wbc->from_pool,
+							 false);
+			osd_req_op_extent_update(req, ceph_wbc->op_idx, len);
+
+			len = 0;
+			offset = cur_offset;
+			ceph_wbc->data_pages = ceph_wbc->pages + i;
+			ceph_wbc->op_idx++;
+		}
+
+		set_page_writeback(page);
+
+		if (caching)
+			ceph_set_page_fscache(page);
+
+		len += thp_size(page);
+	}
+
+	ceph_fscache_write_to_cache(inode, offset, len, caching);
+
+	if (ceph_wbc->size_stable) {
+		len = min(len, ceph_wbc->i_size - offset);
+	} else if (i == ceph_wbc->locked_pages) {
+		/* writepages_finish() clears writeback pages
+		 * according to the data length, so make sure
+		 * data length covers all locked pages */
+		u64 min_len = len + 1 - thp_size(page);
+		len = get_writepages_data_length(inode,
+						 ceph_wbc->pages[i - 1],
+						 offset);
+		len = max(len, min_len);
+	}
+
+	if (IS_ENCRYPTED(inode))
+		len = round_up(len, CEPH_FSCRYPT_BLOCK_SIZE);
+
+	doutc(cl, "got pages at %llu~%llu\n", offset, len);
+
+	if (IS_ENCRYPTED(inode) &&
+	    ((offset | len) & ~CEPH_FSCRYPT_BLOCK_MASK)) {
+		pr_warn_client(cl,
+			"bad encrypted write offset=%lld len=%llu\n",
+			offset, len);
+	}
+
+	osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx,
+					 ceph_wbc->data_pages, len,
+					 0, ceph_wbc->from_pool, false);
+	osd_req_op_extent_update(req, ceph_wbc->op_idx, len);
+
+	BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
+
+	ceph_wbc->from_pool = false;
+	if (i < ceph_wbc->locked_pages) {
+		BUG_ON(ceph_wbc->num_ops <= req->r_num_ops);
+		ceph_wbc->num_ops -= req->r_num_ops;
+		ceph_wbc->locked_pages -= i;
+
+		/* allocate new pages array for next request */
+		ceph_wbc->data_pages = ceph_wbc->pages;
+		__ceph_allocate_page_array(ceph_wbc, ceph_wbc->locked_pages);
+		memcpy(ceph_wbc->pages, ceph_wbc->data_pages + i,
+		       ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages));
+		memset(ceph_wbc->data_pages + i, 0,
+		       ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages));
+	} else {
+		BUG_ON(ceph_wbc->num_ops != req->r_num_ops);
+		/* request message now owns the pages array */
+		ceph_wbc->pages = NULL;
+	}
+
+	req->r_mtime = inode_get_mtime(inode);
+	ceph_osdc_start_request(&fsc->client->osdc, req);
+	req = NULL;
+
+	wbc->nr_to_write -= i;
+	if (ceph_wbc->pages)
+		goto new_request;
+
+	return 0;
+}
+
+static
+void ceph_wait_until_current_writes_complete(struct address_space *mapping,
+					     struct writeback_control *wbc,
+					     struct ceph_writeback_ctl *ceph_wbc)
+{
+	struct page *page;
+	unsigned i, nr;
+
+	if (wbc->sync_mode != WB_SYNC_NONE &&
+	    ceph_wbc->start_index == 0 && /* all dirty pages were checked */
+	    !ceph_wbc->head_snapc) {
+		ceph_wbc->index = 0;
+
+		while ((ceph_wbc->index <= ceph_wbc->end) &&
+		       (nr = filemap_get_folios_tag(mapping,
+						    &ceph_wbc->index,
+						    (pgoff_t)-1,
+						    PAGECACHE_TAG_WRITEBACK,
+						    &ceph_wbc->fbatch))) {
+			for (i = 0; i < nr; i++) {
+				page = &ceph_wbc->fbatch.folios[i]->page;
+				if (page_snap_context(page) != ceph_wbc->snapc)
+					continue;
+				wait_on_page_writeback(page);
+			}
+
+			folio_batch_release(&ceph_wbc->fbatch);
+			cond_resched();
+		}
+	}
+}
+
 /*
  * initiate async writeback
  */
@@ -1402,17 +1641,12 @@ static int ceph_writepages_start(struct address_space *mapping,
 				 struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
 	struct ceph_client *cl = fsc->client;
-	struct ceph_vino vino = ceph_vino(inode);
 	struct ceph_writeback_ctl ceph_wbc;
-	struct ceph_osd_request *req = NULL;
 	int rc = 0;
-	bool caching = ceph_is_cache_enabled(inode);
 
-	if (wbc->sync_mode == WB_SYNC_NONE &&
-	    fsc->write_congested)
+	if (wbc->sync_mode == WB_SYNC_NONE && fsc->write_congested)
 		return 0;
 
 	doutc(cl, "%llx.%llx (mode=%s)\n", ceph_vinop(inode),
@@ -1439,9 +1673,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 	tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
 
 	while (!has_writeback_done(&ceph_wbc)) {
-		unsigned i;
-		struct page *page;
-
 		ceph_wbc.locked_pages = 0;
 		ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
 
@@ -1459,6 +1690,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 		if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages)
 			break;
 
+process_folio_batch:
 		rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
 		if (rc)
 			goto release_folios;
@@ -1466,187 +1698,30 @@ static int ceph_writepages_start(struct address_space *mapping,
 		/* did we get anything? */
 		if (!ceph_wbc.locked_pages)
 			goto release_folios;
-		if (i) {
-			unsigned j, n = 0;
-			/* shift unused page to beginning of fbatch */
-			for (j = 0; j < ceph_wbc.nr_folios; j++) {
-				if (!ceph_wbc.fbatch.folios[j])
-					continue;
-				if (n < j) {
-					ceph_wbc.fbatch.folios[n] =
-						ceph_wbc.fbatch.folios[j];
-				}
-				n++;
-			}
-			ceph_wbc.fbatch.nr = n;
-
-			if (ceph_wbc.nr_folios && i == ceph_wbc.nr_folios &&
+
+		if (ceph_wbc.processed_in_fbatch) {
+			ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
+
+			if (folio_batch_count(&ceph_wbc.fbatch) == 0 &&
 			    ceph_wbc.locked_pages < ceph_wbc.max_pages) {
 				doutc(cl, "reached end fbatch, trying for more\n");
-				folio_batch_release(&ceph_wbc.fbatch);
 				goto get_more_pages;
 			}
 		}
 
-new_request:
-		ceph_wbc.offset = ceph_fscrypt_page_offset(ceph_wbc.pages[0]);
-		ceph_wbc.len = ceph_wbc.wsize;
-
-		req = ceph_osdc_new_request(&fsc->client->osdc,
-					    &ci->i_layout, vino,
-					    ceph_wbc.offset, &ceph_wbc.len,
-					    0, ceph_wbc.num_ops,
-					    CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
-					    ceph_wbc.snapc, ceph_wbc.truncate_seq,
-					    ceph_wbc.truncate_size, false);
-		if (IS_ERR(req)) {
-			req = ceph_osdc_new_request(&fsc->client->osdc,
-						    &ci->i_layout, vino,
-						    ceph_wbc.offset, &ceph_wbc.len,
-						    0, min(ceph_wbc.num_ops,
-							   CEPH_OSD_SLAB_OPS),
-						    CEPH_OSD_OP_WRITE,
-						    CEPH_OSD_FLAG_WRITE,
-						    ceph_wbc.snapc,
-						    ceph_wbc.truncate_seq,
-						    ceph_wbc.truncate_size, true);
-			BUG_ON(IS_ERR(req));
-		}
-		BUG_ON(ceph_wbc.len <
-		       ceph_fscrypt_page_offset(ceph_wbc.pages[ceph_wbc.locked_pages - 1]) +
-		       thp_size(ceph_wbc.pages[ceph_wbc.locked_pages - 1]) -
-		       ceph_wbc.offset);
-
-		if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
-			rc = -EIO;
+		rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
+		if (rc)
 			goto release_folios;
-		}
-		req->r_callback = writepages_finish;
-		req->r_inode = inode;
-
-		/* Format the osd request message and submit the write */
-		ceph_wbc.len = 0;
-		ceph_wbc.data_pages = ceph_wbc.pages;
-		ceph_wbc.op_idx = 0;
-		for (i = 0; i < ceph_wbc.locked_pages; i++) {
-			struct page *page =
-				ceph_fscrypt_pagecache_page(ceph_wbc.pages[i]);
-
-			u64 cur_offset = page_offset(page);
-			/*
-			 * Discontinuity in page range? Ceph can handle that by just passing
-			 * multiple extents in the write op.
-			 */
-			if (ceph_wbc.offset + ceph_wbc.len != cur_offset) {
-				/* If it's full, stop here */
-				if (ceph_wbc.op_idx + 1 == req->r_num_ops)
-					break;
-
-				/* Kick off an fscache write with what we have so far. */
-				ceph_fscache_write_to_cache(inode, ceph_wbc.offset,
-							    ceph_wbc.len, caching);
-
-				/* Start a new extent */
-				osd_req_op_extent_dup_last(req, ceph_wbc.op_idx,
-							   cur_offset -
-							   ceph_wbc.offset);
-				doutc(cl, "got pages at %llu~%llu\n",
-				      ceph_wbc.offset,
-				      ceph_wbc.len);
-				osd_req_op_extent_osd_data_pages(req,
-								 ceph_wbc.op_idx,
-								 ceph_wbc.data_pages,
-								 ceph_wbc.len, 0,
-								 ceph_wbc.from_pool, false);
-				osd_req_op_extent_update(req, ceph_wbc.op_idx,
-							 ceph_wbc.len);
-
-				ceph_wbc.len = 0;
-				ceph_wbc.offset = cur_offset;
-				ceph_wbc.data_pages = ceph_wbc.pages + i;
-				ceph_wbc.op_idx++;
-			}
-
-			set_page_writeback(page);
-			if (caching)
-				ceph_set_page_fscache(page);
-			ceph_wbc.len += thp_size(page);
-		}
-		ceph_fscache_write_to_cache(inode, ceph_wbc.offset,
-					    ceph_wbc.len, caching);
-
-		if (ceph_wbc.size_stable) {
-			ceph_wbc.len = min(ceph_wbc.len,
-					   ceph_wbc.i_size - ceph_wbc.offset);
-		} else if (i == ceph_wbc.locked_pages) {
-			/* writepages_finish() clears writeback pages
-			 * according to the data length, so make sure
-			 * data length covers all locked pages */
-			u64 min_len = ceph_wbc.len + 1 - thp_size(page);
-			ceph_wbc.len =
-				get_writepages_data_length(inode,
-							   ceph_wbc.pages[i - 1],
-							   ceph_wbc.offset);
-			ceph_wbc.len = max(ceph_wbc.len, min_len);
-		}
-		if (IS_ENCRYPTED(inode)) {
-			ceph_wbc.len = round_up(ceph_wbc.len,
-						CEPH_FSCRYPT_BLOCK_SIZE);
-		}
-		doutc(cl, "got pages at %llu~%llu\n",
-		      ceph_wbc.offset, ceph_wbc.len);
+
+		ceph_wbc.locked_pages = 0;
+		ceph_wbc.strip_unit_end = 0;
 
-		if (IS_ENCRYPTED(inode) &&
-		    ((ceph_wbc.offset | ceph_wbc.len) & ~CEPH_FSCRYPT_BLOCK_MASK))
-			pr_warn_client(cl,
-				"bad encrypted write offset=%lld len=%llu\n",
-				ceph_wbc.offset, ceph_wbc.len);
-
-		osd_req_op_extent_osd_data_pages(req, ceph_wbc.op_idx,
-						 ceph_wbc.data_pages,
-						 ceph_wbc.len,
-						 0, ceph_wbc.from_pool, false);
-		osd_req_op_extent_update(req, ceph_wbc.op_idx, ceph_wbc.len);
-
-		BUG_ON(ceph_wbc.op_idx + 1 != req->r_num_ops);
-
-		ceph_wbc.from_pool = false;
-		if (i < ceph_wbc.locked_pages) {
-			BUG_ON(ceph_wbc.num_ops <= req->r_num_ops);
-			ceph_wbc.num_ops -= req->r_num_ops;
-			ceph_wbc.locked_pages -= i;
-
-			/* allocate new pages array for next request */
-			ceph_wbc.data_pages = ceph_wbc.pages;
-			ceph_wbc.pages = kmalloc_array(ceph_wbc.locked_pages,
-						       sizeof(*ceph_wbc.pages),
-						       GFP_NOFS);
-			if (!ceph_wbc.pages) {
-				ceph_wbc.from_pool = true;
-				ceph_wbc.pages =
-					mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS);
-				BUG_ON(!ceph_wbc.pages);
-			}
-			memcpy(ceph_wbc.pages, ceph_wbc.data_pages + i,
-			       ceph_wbc.locked_pages * sizeof(*ceph_wbc.pages));
-			memset(ceph_wbc.data_pages + i, 0,
-			       ceph_wbc.locked_pages * sizeof(*ceph_wbc.pages));
-		} else {
-			BUG_ON(ceph_wbc.num_ops != req->r_num_ops);
-			ceph_wbc.index = ceph_wbc.pages[i - 1]->index + 1;
-			/* request message now owns the pages array */
-			ceph_wbc.pages = NULL;
+		if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
+			ceph_wbc.nr_folios =
+				folio_batch_count(&ceph_wbc.fbatch);
+			goto process_folio_batch;
 		}
 
-		req->r_mtime = inode_get_mtime(inode);
-		ceph_osdc_start_request(&fsc->client->osdc, req);
-		req = NULL;
-
-		wbc->nr_to_write -= i;
-		if (ceph_wbc.pages)
-			goto new_request;
-
 		/*
 		 * We stop writing back only if we are not doing
 		 * integrity sync. In case of integrity sync we have to
@@ -1666,32 +1741,12 @@ static int ceph_writepages_start(struct address_space *mapping,
 	if (ceph_wbc.should_loop && !ceph_wbc.done) {
 		/* more to do; loop back to beginning of file */
 		doutc(cl, "looping back to beginning of file\n");
-		ceph_wbc.end = ceph_wbc.start_index - 1; /* OK even when start_index == 0 */
+		/* OK even when start_index == 0 */
+		ceph_wbc.end = ceph_wbc.start_index - 1;
 
 		/* to write dirty pages associated with next snapc,
 		 * we need to wait until current writes complete */
-		if (wbc->sync_mode != WB_SYNC_NONE &&
-		    ceph_wbc.start_index == 0 && /* all dirty pages were checked */
-		    !ceph_wbc.head_snapc) {
-			struct page *page;
-			unsigned i, nr;
-			ceph_wbc.index = 0;
-			while ((ceph_wbc.index <= ceph_wbc.end) &&
-			       (nr = filemap_get_folios_tag(mapping,
-							    &ceph_wbc.index,
-							    (pgoff_t)-1,
-							    PAGECACHE_TAG_WRITEBACK,
-							    &ceph_wbc.fbatch))) {
-				for (i = 0; i < nr; i++) {
-					page = &ceph_wbc.fbatch.folios[i]->page;
-					if (page_snap_context(page) != ceph_wbc.snapc)
-						continue;
-					wait_on_page_writeback(page);
-				}
-				folio_batch_release(&ceph_wbc.fbatch);
-				cond_resched();
-			}
-		}
+		ceph_wait_until_current_writes_complete(mapping, wbc, &ceph_wbc);
 
 		ceph_wbc.start_index = 0;
 		ceph_wbc.index = 0;
@@ -1702,15 +1757,13 @@ static int ceph_writepages_start(struct address_space *mapping,
 		mapping->writeback_index = ceph_wbc.index;
 
 out:
-	ceph_osdc_put_request(req);
 	ceph_put_snap_context(ceph_wbc.last_snapc);
 	doutc(cl, "%llx.%llx dend - startone, rc = %d\n", ceph_vinop(inode),
 	      rc);
+
 	return rc;
 }
 
-
-
 /*
  * See if a given @snapc is either writeable, or already written.
  */