Message ID | 20240911173801.4025422-2-shakeel.butt@linux.dev (mailing list archive) |
---|---|
State | New |
Series | mm: optimize shadow entries removal |
On Wed, Sep 11, 2024 at 10:38:00AM -0700, Shakeel Butt wrote:
> The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
> batch, it traverses the page cache tree and collects the entries (folio
> and shadow entries) in the struct folio_batch. For the shadow entries
> present in the folio_batch, it has to traverse the page cache tree for
> each individual entry to remove them. This patch optimizes this by
> removing them in a single tree traversal.
>
> On large machines in our production environment, which run workloads
> manipulating large amounts of data, we have observed that a large amount
> of CPU time is spent on truncation of very large files (100s of GiBs in
> size). More specifically, most of the time is spent on shadow entry
> cleanup, so optimizing the shadow entry cleanup, even a little bit, has
> a good impact.
>
> To evaluate the changes, we created a 200GiB file on a fuse fs and in a
> memcg. We created the shadow entries by triggering reclaim through
> memory.reclaim in that specific memcg and measured a simple truncation
> operation.
>
>  # time truncate -s 0 file
>
>               time (sec)
>  Without      5.164 +- 0.059
>  With-patch   4.21  +- 0.066 (18.47% decrease)
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Looks good to me. One thing that's a bit subtle is that the tree walk
assumes indices[] are ordered, such that indices[0] and indices[nr-1]
reliably denote the range of interest. AFAICS that's the case for the
current callers, but if not, that could be a painful bug to hunt down.

Assessing the lowest and highest index in that first batch iteration
seems a bit overkill, though. Maybe just a comment stating the
requirement?

Otherwise,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
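For illustration, the ordering requirement raised here could be expressed as a debug check along the following lines. This is only a sketch, not part of the posted patch: check_indices_sorted() is a hypothetical helper name, and it assumes the indices[]/nr naming from the patch and the VM_WARN_ON_ONCE() macro from <linux/mmdebug.h>.

	#include <linux/mmdebug.h>
	#include <linux/pagemap.h>

	/*
	 * Hypothetical sketch only: sanity-check the ordering that the
	 * single-pass shadow entry removal relies on. indices[] must be
	 * sorted so that indices[0] and indices[nr - 1] bound the XArray
	 * range that gets walked.
	 */
	static inline void check_indices_sorted(const pgoff_t *indices, int nr)
	{
		int i;

		for (i = 1; i < nr; i++)
			VM_WARN_ON_ONCE(indices[i] < indices[i - 1]);
	}

Whether such a per-batch check is worth its cost is exactly the trade-off questioned above; a comment stating the requirement may be enough.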
On Wed, Sep 11, 2024 at 05:08:24PM GMT, Johannes Weiner wrote:
> On Wed, Sep 11, 2024 at 10:38:00AM -0700, Shakeel Butt wrote:
> > The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
> > batch, it traverses the page cache tree and collects the entries (folio
> > and shadow entries) in the struct folio_batch. For the shadow entries
> > present in the folio_batch, it has to traverse the page cache tree for
> > each individual entry to remove them. This patch optimizes this by
> > removing them in a single tree traversal.
> >
> > On large machines in our production environment, which run workloads
> > manipulating large amounts of data, we have observed that a large
> > amount of CPU time is spent on truncation of very large files (100s
> > of GiBs in size). More specifically, most of the time is spent on
> > shadow entry cleanup, so optimizing the shadow entry cleanup, even a
> > little bit, has a good impact.
> >
> > To evaluate the changes, we created a 200GiB file on a fuse fs and in a
> > memcg. We created the shadow entries by triggering reclaim through
> > memory.reclaim in that specific memcg and measured a simple truncation
> > operation.
> >
> >  # time truncate -s 0 file
> >
> >               time (sec)
> >  Without      5.164 +- 0.059
> >  With-patch   4.21  +- 0.066 (18.47% decrease)
> >
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
>
> Looks good to me. One thing that's a bit subtle is that the tree walk
> assumes indices[] are ordered, such that indices[0] and indices[nr-1]
> reliably denote the range of interest. AFAICS that's the case for the
> current callers, but if not, that could be a painful bug to hunt down.

The current callers use find_get_entries() and find_lock_entries() to
fill up the indices array, which provides this guarantee.

> Assessing the lowest and highest index in that first batch iteration
> seems a bit overkill, though. Maybe just a comment stating the
> requirement?

I will add a comment in v2.

> Otherwise,
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for the review.
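A comment of the kind promised for v2 might read roughly as follows, placed above the new walk. The wording is hypothetical and not taken from any posted revision; the loop itself is the one from the patch below.

	/*
	 * indices[] is filled by find_get_entries() or find_lock_entries(),
	 * which return entries in ascending page index order. The single
	 * XArray walk below relies on that ordering: indices[0] and
	 * indices[nr-1] must bound the range of interest.
	 */
	xas_for_each(&xas, folio, indices[nr-1]) {
		if (xa_is_value(folio))
			xas_store(&xas, NULL);
	}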
diff --git a/mm/truncate.c b/mm/truncate.c
index 0668cd340a46..c7c19c816c2e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -72,50 +72,46 @@ static void clear_shadow_entries(struct address_space *mapping,
 static void truncate_folio_batch_exceptionals(struct address_space *mapping,
 				struct folio_batch *fbatch, pgoff_t *indices)
 {
+	XA_STATE(xas, &mapping->i_pages, indices[0]);
+	int nr = folio_batch_count(fbatch);
+	struct folio *folio;
 	int i, j;
-	bool dax;
 
 	/* Handled by shmem itself */
 	if (shmem_mapping(mapping))
 		return;
 
-	for (j = 0; j < folio_batch_count(fbatch); j++)
+	for (j = 0; j < nr; j++)
 		if (xa_is_value(fbatch->folios[j]))
 			break;
 
-	if (j == folio_batch_count(fbatch))
+	if (j == nr)
 		return;
 
-	dax = dax_mapping(mapping);
-	if (!dax) {
-		spin_lock(&mapping->host->i_lock);
-		xa_lock_irq(&mapping->i_pages);
+	if (dax_mapping(mapping)) {
+		for (i = j; i < nr; i++) {
+			if (xa_is_value(fbatch->folios[i]))
+				dax_delete_mapping_entry(mapping, indices[i]);
+		}
+		goto out;
 	}
 
-	for (i = j; i < folio_batch_count(fbatch); i++) {
-		struct folio *folio = fbatch->folios[i];
-		pgoff_t index = indices[i];
-
-		if (!xa_is_value(folio)) {
-			fbatch->folios[j++] = folio;
-			continue;
-		}
+	xas_set_update(&xas, workingset_update_node);
 
-		if (unlikely(dax)) {
-			dax_delete_mapping_entry(mapping, index);
-			continue;
-		}
+	spin_lock(&mapping->host->i_lock);
+	xas_lock_irq(&xas);
 
-		__clear_shadow_entry(mapping, index, folio);
+	xas_for_each(&xas, folio, indices[nr-1]) {
+		if (xa_is_value(folio))
+			xas_store(&xas, NULL);
 	}
 
-	if (!dax) {
-		xa_unlock_irq(&mapping->i_pages);
-		if (mapping_shrinkable(mapping))
-			inode_add_lru(mapping->host);
-		spin_unlock(&mapping->host->i_lock);
-	}
-	fbatch->nr = j;
+	xas_unlock_irq(&xas);
+	if (mapping_shrinkable(mapping))
+		inode_add_lru(mapping->host);
+	spin_unlock(&mapping->host->i_lock);
+out:
+	folio_batch_remove_exceptionals(fbatch);
 }
 
 /**
The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
batch, it traverses the page cache tree and collects the entries (folio
and shadow entries) in the struct folio_batch. For the shadow entries
present in the folio_batch, it has to traverse the page cache tree for
each individual entry to remove them. This patch optimizes this by
removing them in a single tree traversal.

On large machines in our production environment, which run workloads
manipulating large amounts of data, we have observed that a large amount
of CPU time is spent on truncation of very large files (100s of GiBs in
size). More specifically, most of the time is spent on shadow entry
cleanup, so optimizing the shadow entry cleanup, even a little bit, has
a good impact.

To evaluate the changes, we created a 200GiB file on a fuse fs and in a
memcg. We created the shadow entries by triggering reclaim through
memory.reclaim in that specific memcg and measured a simple truncation
operation.

 # time truncate -s 0 file

              time (sec)
 Without      5.164 +- 0.059
 With-patch   4.21  +- 0.066 (18.47% decrease)

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/truncate.c | 50 +++++++++++++++++++++++---------------------------
 1 file changed, 23 insertions(+), 27 deletions(-)
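To make the access-pattern change concrete, here is a simplified sketch contrasting per-entry removal with single-traversal removal, written against the XArray API used in the patch. The helper names shadow_remove_per_entry() and shadow_remove_single_walk() are invented for illustration, and locking, DAX handling and workingset node accounting are omitted; this is not the code as merged.

	#include <linux/fs.h>
	#include <linux/pagemap.h>
	#include <linux/xarray.h>

	/*
	 * Sketch of the old pattern: one XArray descent per shadow entry.
	 * Locking and workingset accounting are omitted.
	 */
	static void shadow_remove_per_entry(struct address_space *mapping,
					    pgoff_t *indices, int first, int nr)
	{
		int i;

		for (i = first; i < nr; i++) {
			XA_STATE(xas, &mapping->i_pages, indices[i]);

			/* Every entry pays for a fresh walk from the tree root. */
			if (xa_is_value(xas_load(&xas)))
				xas_store(&xas, NULL);
		}
	}

	/*
	 * Sketch of the new pattern: a single walk over the whole batch
	 * range, relying on indices[] being sorted so that indices[0] and
	 * indices[nr - 1] bound the range.
	 */
	static void shadow_remove_single_walk(struct address_space *mapping,
					      pgoff_t *indices, int nr)
	{
		XA_STATE(xas, &mapping->i_pages, indices[0]);
		struct folio *folio;

		xas_for_each(&xas, folio, indices[nr - 1]) {
			/* Skip real folios; only clear shadow (value) entries. */
			if (xa_is_value(folio))
				xas_store(&xas, NULL);
		}
	}

The second form keeps its position in the tree between entries, so shadow entries that share an XArray node are cleared without re-descending from the root; that is the saving the truncation benchmark above measures.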