Message ID | 1641488717-13865-1-git-send-email-quic_charante@quicinc.com (mailing list archive) |
---|---|
State | New |
Series | [v3,RESEND] mm: shmem: implement POSIX_FADV_[WILL|DONT]NEED for shmem |
On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy <quic_charante@quicinc.com> wrote:
>
> From: Charan Teja Reddy <charante@codeaurora.org>
>
> Currently fadvise(2) is supported only for the files that doesn't
> associated with noop_backing_dev_info thus for the files, like shmem,
> fadvise results into NOP. But then there is file_operations->fadvise()
> that lets the file systems to implement their own fadvise
> implementation. Use this support to implement some of the POSIX_FADV_XXX
> functionality for shmem files.
>
> [snip]
>
> +static int shmem_fadvise_willneed(struct address_space *mapping,
> +                pgoff_t start, pgoff_t long end)
> +{
> +        XA_STATE(xas, &mapping->i_pages, start);
> +        struct page *page;
> +
> +        rcu_read_lock();
> +        xas_for_each(&xas, page, end) {
> +                if (!xa_is_value(page))
> +                        continue;
> +                xas_pause(&xas);
> +                rcu_read_unlock();
> +
> +                page = shmem_read_mapping_page(mapping, xas.xa_index);
> +                if (!IS_ERR(page))
> +                        put_page(page);
> +
> +                rcu_read_lock();
> +                if (need_resched()) {
> +                        xas_pause(&xas);
> +                        cond_resched_rcu();
> +                }
> +        }
> +        rcu_read_unlock();
> +
> +        return 0;

I have a doubt on referencing xa_index after calling xas_pause().
xas_pause() walks xa_index forward, so will not be the value expected
for the current page.

Also, not necessary to re-call xas_pause() before cond_resched (it is
a no-op). Would be better to check need_resched() before
rcu_read_lock().

As this loop may call xas_pause() for most iterations, should consider
using xa_for_each() instead (I *think* - still getting up to speed
with XArray).

Mark
Thanks Mark for the review!!

On 1/7/2022 5:40 PM, Mark Hemment wrote:
> On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy
> <quic_charante@quicinc.com> wrote:
>>
>> From: Charan Teja Reddy <charante@codeaurora.org>
>>
>> Currently fadvise(2) is supported only for the files that doesn't
>> associated with noop_backing_dev_info thus for the files, like shmem,
>> fadvise results into NOP. But then there is file_operations->fadvise()
>> that lets the file systems to implement their own fadvise
>> implementation. Use this support to implement some of the POSIX_FADV_XXX
>> functionality for shmem files.
>>
>> [snip]
>
>> +static int shmem_fadvise_willneed(struct address_space *mapping,
>> +                pgoff_t start, pgoff_t long end)
>> +{
>> +        XA_STATE(xas, &mapping->i_pages, start);
>> +        struct page *page;
>> +
>> +        rcu_read_lock();
>> +        xas_for_each(&xas, page, end) {
>> +                if (!xa_is_value(page))
>> +                        continue;
>> +                xas_pause(&xas);
>> +                rcu_read_unlock();
>> +
>> +                page = shmem_read_mapping_page(mapping, xas.xa_index);
>> +                if (!IS_ERR(page))
>> +                        put_page(page);
>> +
>> +                rcu_read_lock();
>> +                if (need_resched()) {
>> +                        xas_pause(&xas);
>> +                        cond_resched_rcu();
>> +                }
>> +        }
>> +        rcu_read_unlock();
>> +
>> +        return 0;
>
> I have a doubt on referencing xa_index after calling xas_pause().
> xas_pause() walks xa_index forward, so will not be the value expected
> for the current page.

Agree here. I should have had a better test case to verify my changes.

> Also, not necessary to re-call xas_pause() before cond_resched (it is
> a no-op).

When CONFIG_DEBUG_ATOMIC_SLEEP is enabled, users may still need to call
xas_pause(), as we are dropping the RCU lock. No?

static inline void cond_resched_rcu(void)
{
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
        rcu_read_unlock();
        cond_resched();
        rcu_read_lock();
#endif
}

> Would be better to check need_resched() before
> rcu_read_lock().

Okay, I can directly use cond_resched() if it is called before
rcu_read_lock().

> As this loop may call xas_pause() for most iterations, should consider
> using xa_for_each() instead (I *think* - still getting up to speed
> with XArray).

Even the XArray documentation says: "If most entries found during a
walk require you to call xas_pause(), the xa_for_each() iterator may be
more appropriate." Since every value entry found in the xarray requires
me to call xas_pause(), I agree that xa_for_each() is the appropriate
call here. Will switch to it in the next spin. Waiting for further
review comments on this patch.

> Mark
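For reference, a minimal, untested sketch of the loop restructured per these
two review points; this is only an illustration, not the code that was posted
later. The page index is sampled before xas_pause() moves xa_index forward,
and the resched check happens while the RCU read lock is already dropped, so
a plain cond_resched() suffices:

```c
static int shmem_fadvise_willneed(struct address_space *mapping,
                                  pgoff_t start, pgoff_t end)
{
        XA_STATE(xas, &mapping->i_pages, start);
        struct page *page;
        pgoff_t index;

        rcu_read_lock();
        xas_for_each(&xas, page, end) {
                if (!xa_is_value(page))
                        continue;
                /* Sample the index before xas_pause() advances xa_index. */
                index = xas.xa_index;
                xas_pause(&xas);
                rcu_read_unlock();

                /* Swap entry: read the page back into the page cache. */
                page = shmem_read_mapping_page(mapping, index);
                if (!IS_ERR(page))
                        put_page(page);

                /* The RCU lock is not held here, so no special handling. */
                cond_resched();
                rcu_read_lock();
        }
        rcu_read_unlock();

        return 0;
}
```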
On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy <quic_charante@quicinc.com> wrote:
>
> From: Charan Teja Reddy <charante@codeaurora.org>
>
> Currently fadvise(2) is supported only for the files that doesn't
> associated with noop_backing_dev_info thus for the files, like shmem,
> fadvise results into NOP. But then there is file_operations->fadvise()
> that lets the file systems to implement their own fadvise
> implementation. Use this support to implement some of the POSIX_FADV_XXX
> functionality for shmem files.
...
> +static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
> +                loff_t end, struct list_head *list)
> +{
> +        XA_STATE(xas, &mapping->i_pages, start);
> +        struct page *page;
> +
> +        rcu_read_lock();
> +        xas_for_each(&xas, page, end) {
> +                if (xas_retry(&xas, page))
> +                        continue;
> +                if (xa_is_value(page))
> +                        continue;
> +                if (!get_page_unless_zero(page))
> +                        continue;
> +                if (isolate_lru_page(page))
> +                        continue;

Need to unwind the get_page on failure to isolate.

Should PageUnevictable() pages (SHM_LOCK) be skipped?
(That is, does SHM_LOCK override DONTNEED?)

...
> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
> +                loff_t end)
> +{
> +        int ret;
> +        struct page *page;
> +        LIST_HEAD(list);
> +        struct writeback_control wbc = {
> +                .sync_mode = WB_SYNC_NONE,
> +                .nr_to_write = LONG_MAX,
> +                .range_start = 0,
> +                .range_end = LLONG_MAX,
> +                .for_reclaim = 1,
> +        };
> +
> +        if (!shmem_mapping(mapping))
> +                return -EINVAL;
> +
> +        if (!total_swap_pages)
> +                return 0;
> +
> +        lru_add_drain();
> +        shmem_isolate_pages_range(mapping, start, end, &list);
> +
> +        while (!list_empty(&list)) {
> +                page = lru_to_page(&list);
> +                list_del(&page->lru);
> +                if (page_mapped(page))
> +                        goto keep;
> +                if (!trylock_page(page))
> +                        goto keep;
> +                if (unlikely(PageTransHuge(page))) {
> +                        if (split_huge_page_to_list(page, &list))
> +                                goto keep;
> +                }

I don't know the shmem code and the lifecycle of a shm-page, so
genuine questions;
When the try-lock succeeds, should there be a test for PageWriteback()
(page skipped if true)? Also, does page->mapping need to be tested
for NULL to prevent races with deletion from the page-cache?

...
> +
> +                clear_page_dirty_for_io(page);
> +                SetPageReclaim(page);
> +                ret = shmem_writepage(page, &wbc);
> +                if (ret || PageWriteback(page)) {
> +                        if (ret)
> +                                unlock_page(page);
> +                        goto keep;
> +                }
> +
> +                if (!PageWriteback(page))
> +                        ClearPageReclaim(page);
> +
> +                /*
> +                 * shmem_writepage() place the page in the swapcache.
> +                 * Delete the page from the swapcache and release the
> +                 * page.
> +                 */
> +                __mod_node_page_state(page_pgdat(page),
> +                                NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
> +                lock_page(page);
> +                delete_from_swap_cache(page);
> +                unlock_page(page);
> +                put_page(page);
> +                continue;
> +keep:
> +                putback_lru_page(page);
> +                __mod_node_page_state(page_pgdat(page),
> +                                NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
> +        }

The putback_lru_page() drops the last reference this code holds on
'page'. Is it safe to use 'page' after dropping this reference?

Cheers,
Mark
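To make the unwind and SHM_LOCK points concrete, a sketch of the isolation
loop with both addressed; an untested illustration only, assuming DONTNEED
should not override SHM_LOCK (as agreed later in the thread):

```c
static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
                loff_t end, struct list_head *list)
{
        XA_STATE(xas, &mapping->i_pages, start);
        struct page *page;

        rcu_read_lock();
        xas_for_each(&xas, page, end) {
                if (xas_retry(&xas, page))
                        continue;
                if (xa_is_value(page))
                        continue;
                if (!get_page_unless_zero(page))
                        continue;
                /*
                 * Skip SHM_LOCK'd (unevictable) pages, and unwind the
                 * reference taken above when isolation fails or is skipped.
                 */
                if (PageUnevictable(page) || isolate_lru_page(page)) {
                        put_page(page);
                        continue;
                }

                list_add(&page->lru, list);
                __mod_node_page_state(page_pgdat(page),
                                NR_ISOLATED_ANON + page_is_file_lru(page),
                                compound_nr(page));
                put_page(page); /* the LRU isolation now pins the page */
                if (need_resched()) {
                        xas_pause(&xas);
                        cond_resched_rcu();
                }
        }
        rcu_read_unlock();
}
```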
Thanks again Mark for the review comments!!

On 1/10/2022 6:06 PM, Mark Hemment wrote:
> On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy
> <quic_charante@quicinc.com> wrote:
>>
>> From: Charan Teja Reddy <charante@codeaurora.org>
>>
>> Currently fadvise(2) is supported only for the files that doesn't
>> associated with noop_backing_dev_info thus for the files, like shmem,
>> fadvise results into NOP. But then there is file_operations->fadvise()
>> that lets the file systems to implement their own fadvise
>> implementation. Use this support to implement some of the POSIX_FADV_XXX
>> functionality for shmem files.
> ...
>> +static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
>> +                loff_t end, struct list_head *list)
>> +{
>> +        XA_STATE(xas, &mapping->i_pages, start);
>> +        struct page *page;
>> +
>> +        rcu_read_lock();
>> +        xas_for_each(&xas, page, end) {
>> +                if (xas_retry(&xas, page))
>> +                        continue;
>> +                if (xa_is_value(page))
>> +                        continue;
>> +                if (!get_page_unless_zero(page))
>> +                        continue;
>> +                if (isolate_lru_page(page))
>> +                        continue;
>
> Need to unwind the get_page on failure to isolate.

Will be done.

> Should PageUnevictable() pages (SHM_LOCK) be skipped?
> (That is, does SHM_LOCK override DONTNEED?)

Should be skipped. Will be done.

> ...
>> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
>> +                loff_t end)
>> +{
>> +        int ret;
>> +        struct page *page;
>> +        LIST_HEAD(list);
>> +        struct writeback_control wbc = {
>> +                .sync_mode = WB_SYNC_NONE,
>> +                .nr_to_write = LONG_MAX,
>> +                .range_start = 0,
>> +                .range_end = LLONG_MAX,
>> +                .for_reclaim = 1,
>> +        };
>> +
>> +        if (!shmem_mapping(mapping))
>> +                return -EINVAL;
>> +
>> +        if (!total_swap_pages)
>> +                return 0;
>> +
>> +        lru_add_drain();
>> +        shmem_isolate_pages_range(mapping, start, end, &list);
>> +
>> +        while (!list_empty(&list)) {
>> +                page = lru_to_page(&list);
>> +                list_del(&page->lru);
>> +                if (page_mapped(page))
>> +                        goto keep;
>> +                if (!trylock_page(page))
>> +                        goto keep;
>> +                if (unlikely(PageTransHuge(page))) {
>> +                        if (split_huge_page_to_list(page, &list))
>> +                                goto keep;
>> +                }
>
> I don't know the shmem code and the lifecycle of a shm-page, so
> genuine questions;
> When the try-lock succeeds, should there be a test for PageWriteback()
> (page skipped if true)? Also, does page->mapping need to be tested
> for NULL to prevent races with deletion from the page-cache?

I failed to envisage these. I should have considered both conditions
here. BTW, I am wondering why we shouldn't use the reclaim_pages(page_list)
function here, with an extra set_page_dirty() on each isolated page? It
just calls shrink_page_list(), where all these conditions are properly
handled. What is your opinion here?

> ...
>> +
>> +                clear_page_dirty_for_io(page);
>> +                SetPageReclaim(page);
>> +                ret = shmem_writepage(page, &wbc);
>> +                if (ret || PageWriteback(page)) {
>> +                        if (ret)
>> +                                unlock_page(page);
>> +                        goto keep;
>> +                }
>> +
>> +                if (!PageWriteback(page))
>> +                        ClearPageReclaim(page);
>> +
>> +                /*
>> +                 * shmem_writepage() place the page in the swapcache.
>> +                 * Delete the page from the swapcache and release the
>> +                 * page.
>> +                 */
>> +                __mod_node_page_state(page_pgdat(page),
>> +                                NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
>> +                lock_page(page);
>> +                delete_from_swap_cache(page);
>> +                unlock_page(page);
>> +                put_page(page);
>> +                continue;
>> +keep:
>> +                putback_lru_page(page);
>> +                __mod_node_page_state(page_pgdat(page),
>> +                                NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
>> +        }
>
> The putback_lru_page() drops the last reference this code holds on
> 'page'. Is it safe to use 'page' after dropping this reference?

True. Will correct it in the next revision.

> Cheers,
> Mark
Hello Mark,

On 1/10/2022 3:51 PM, Charan Teja Kalla wrote:
>>> +static int shmem_fadvise_willneed(struct address_space *mapping,
>>> +                pgoff_t start, pgoff_t long end)
>>> +{
>>> +        XA_STATE(xas, &mapping->i_pages, start);
>>> +        struct page *page;
>>> +
>>> +        rcu_read_lock();
>>> +        xas_for_each(&xas, page, end) {
>>> +                if (!xa_is_value(page))
>>> +                        continue;
>>> +                xas_pause(&xas);
>>> +                rcu_read_unlock();
>>> +
>>> +                page = shmem_read_mapping_page(mapping, xas.xa_index);
>>> +                if (!IS_ERR(page))
>>> +                        put_page(page);
>>> +
>>> +                rcu_read_lock();
>>> +                if (need_resched()) {
>>> +                        xas_pause(&xas);
>>> +                        cond_resched_rcu();
>>> +                }
>>> +        }
>>> +        rcu_read_unlock();
>>> +
>>> +        return 0;
>> I have a doubt on referencing xa_index after calling xas_pause().
>> xas_pause() walks xa_index forward, so will not be the value expected
>> for the current page.
> Agree here. I should have had a better test case to verify my changes.
>
>> Also, not necessary to re-call xas_pause() before cond_resched (it is
>> a no-op).
> When CONFIG_DEBUG_ATOMIC_SLEEP is enabled, users may still need to call
> xas_pause(), as we are dropping the RCU lock. No?
>
> static inline void cond_resched_rcu(void)
> {
> #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
>         rcu_read_unlock();
>         cond_resched();
>         rcu_read_lock();
> #endif
> }
>
>> Would be better to check need_resched() before
>> rcu_read_lock().
> Okay, I can directly use cond_resched() if it is called before
> rcu_read_lock().
>
>> As this loop may call xas_pause() for most iterations, should consider
>> using xa_for_each() instead (I *think* - still getting up to speed
>> with XArray).
> Even the XArray documentation says: "If most entries found during a
> walk require you to call xas_pause(), the xa_for_each() iterator may be
> more appropriate." Since every value entry found in the xarray requires
> me to call xas_pause(), I agree that xa_for_each() is the appropriate
> call here. Will switch to it in the next spin. Waiting for further
> review comments on this patch.

I also found the below documentation:
"xa_for_each() will spin if it hits a retry entry; if you intend to see
retry entries, you should use the xas_for_each() iterator instead."

Since retry entries are expected, I should be using xas_for_each() with
the corrections you pointed out. Isn't it?
On Wed, 12 Jan 2022 at 08:22, Charan Teja Kalla <quic_charante@quicinc.com> wrote:
>
> Hello Mark,
>
> On 1/10/2022 3:51 PM, Charan Teja Kalla wrote:
> >>> +static int shmem_fadvise_willneed(struct address_space *mapping,
> >>> +                pgoff_t start, pgoff_t long end)
> >>> +{
> >>> +        XA_STATE(xas, &mapping->i_pages, start);
> >>> +        struct page *page;
> >>> +
> >>> +        rcu_read_lock();
> >>> +        xas_for_each(&xas, page, end) {
> >>> +                if (!xa_is_value(page))
> >>> +                        continue;
> >>> +                xas_pause(&xas);
> >>> +                rcu_read_unlock();
> >>> +
> >>> +                page = shmem_read_mapping_page(mapping, xas.xa_index);
> >>> +                if (!IS_ERR(page))
> >>> +                        put_page(page);
> >>> +
> >>> +                rcu_read_lock();
> >>> +                if (need_resched()) {
> >>> +                        xas_pause(&xas);
> >>> +                        cond_resched_rcu();
> >>> +                }
> >>> +        }
> >>> +        rcu_read_unlock();
> >>> +
> >>> +        return 0;
> >> I have a doubt on referencing xa_index after calling xas_pause().
> >> xas_pause() walks xa_index forward, so will not be the value expected
> >> for the current page.
> > Agree here. I should have had a better test case to verify my changes.
> >
> >> Also, not necessary to re-call xas_pause() before cond_resched (it is
> >> a no-op).
> > When CONFIG_DEBUG_ATOMIC_SLEEP is enabled, users may still need to call
> > xas_pause(), as we are dropping the RCU lock. No?
> >
> > static inline void cond_resched_rcu(void)
> > {
> > #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
> >         rcu_read_unlock();
> >         cond_resched();
> >         rcu_read_lock();
> > #endif
> > }
> >
> >> Would be better to check need_resched() before
> >> rcu_read_lock().
> > Okay, I can directly use cond_resched() if it is called before
> > rcu_read_lock().
> >
> >> As this loop may call xas_pause() for most iterations, should consider
> >> using xa_for_each() instead (I *think* - still getting up to speed
> >> with XArray).
> > Even the XArray documentation says: "If most entries found during a
> > walk require you to call xas_pause(), the xa_for_each() iterator may be
> > more appropriate." Since every value entry found in the xarray requires
> > me to call xas_pause(), I agree that xa_for_each() is the appropriate
> > call here. Will switch to it in the next spin. Waiting for further
> > review comments on this patch.
>
> I also found the below documentation:
> "xa_for_each() will spin if it hits a retry entry; if you intend to see
> retry entries, you should use the xas_for_each() iterator instead."
>
> Since retry entries are expected, I should be using xas_for_each() with
> the corrections you pointed out. Isn't it?

Ah, you've hit a barrier on my Xarray knowledge. The current shmem
code simply does a 'continue' on xas_retry(). Is this different from
Xarray looping internally for xas_retry()? I assume not, but cannot
give a definite answer (sorry).

Cheers,
Mark
On Mon, 10 Jan 2022 at 15:14, Charan Teja Kalla <quic_charante@quicinc.com> wrote:
>
> Thanks again Mark for the review comments!!
>
> On 1/10/2022 6:06 PM, Mark Hemment wrote:
> > On Thu, 6 Jan 2022 at 17:06, Charan Teja Reddy
> > <quic_charante@quicinc.com> wrote:
> >>
> >> From: Charan Teja Reddy <charante@codeaurora.org>
> >>
> >> Currently fadvise(2) is supported only for the files that doesn't
> >> associated with noop_backing_dev_info thus for the files, like shmem,
> >> fadvise results into NOP. But then there is file_operations->fadvise()
> >> that lets the file systems to implement their own fadvise
> >> implementation. Use this support to implement some of the POSIX_FADV_XXX
> >> functionality for shmem files.
> > ...
> >> +static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
> >> +                loff_t end, struct list_head *list)
> >> +{
> >> +        XA_STATE(xas, &mapping->i_pages, start);
> >> +        struct page *page;
> >> +
> >> +        rcu_read_lock();
> >> +        xas_for_each(&xas, page, end) {
> >> +                if (xas_retry(&xas, page))
> >> +                        continue;
> >> +                if (xa_is_value(page))
> >> +                        continue;
> >> +                if (!get_page_unless_zero(page))
> >> +                        continue;
> >> +                if (isolate_lru_page(page))
> >> +                        continue;
> >
> > Need to unwind the get_page on failure to isolate.
>
> Will be done.
>
> > Should PageUnevictable() pages (SHM_LOCK) be skipped?
> > (That is, does SHM_LOCK override DONTNEED?)
>
> Should be skipped. Will be done.
>
> > ...
> >> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
> >> +                loff_t end)
> >> +{
> >> +        int ret;
> >> +        struct page *page;
> >> +        LIST_HEAD(list);
> >> +        struct writeback_control wbc = {
> >> +                .sync_mode = WB_SYNC_NONE,
> >> +                .nr_to_write = LONG_MAX,
> >> +                .range_start = 0,
> >> +                .range_end = LLONG_MAX,
> >> +                .for_reclaim = 1,
> >> +        };
> >> +
> >> +        if (!shmem_mapping(mapping))
> >> +                return -EINVAL;
> >> +
> >> +        if (!total_swap_pages)
> >> +                return 0;
> >> +
> >> +        lru_add_drain();
> >> +        shmem_isolate_pages_range(mapping, start, end, &list);
> >> +
> >> +        while (!list_empty(&list)) {
> >> +                page = lru_to_page(&list);
> >> +                list_del(&page->lru);
> >> +                if (page_mapped(page))
> >> +                        goto keep;
> >> +                if (!trylock_page(page))
> >> +                        goto keep;
> >> +                if (unlikely(PageTransHuge(page))) {
> >> +                        if (split_huge_page_to_list(page, &list))
> >> +                                goto keep;
> >> +                }
> >
> > I don't know the shmem code and the lifecycle of a shm-page, so
> > genuine questions;
> > When the try-lock succeeds, should there be a test for PageWriteback()
> > (page skipped if true)? Also, does page->mapping need to be tested
> > for NULL to prevent races with deletion from the page-cache?
>
> I failed to envisage these. I should have considered both conditions
> here. BTW, I am wondering why we shouldn't use the reclaim_pages(page_list)
> function here, with an extra set_page_dirty() on each isolated page? It
> just calls shrink_page_list(), where all these conditions are properly
> handled. What is your opinion here?

Should be possible to use reclaim_pages() (I haven't looked closely).
It might actually be good to use this function, as it will do some
congestion throttling. Although it will always try to unmap pages
(note: your page_mapped() test is 'unstable' as done without the page
locked), so it might give behaviour you want to avoid.

Note: reclaim_pages() is already used for madvise(PAGEOUT). The shmem
code would need to prepare page(s) to help shrink_page_list() to make
progress (see madvise.c:madvise_cold_or_pageout_pte_range()).

Taking a step back; is fadvise(DONTNEED) really needed/wanted? Yes,
you gave a usecase (which I cut from this thread in my earlier reply),
but I'm not familiar with various shmem uses to know if this feature
is needed. Someone else will need to answer this.

Cheers,
Mark

>
> > ...
> >> +
> >> +                clear_page_dirty_for_io(page);
> >> +                SetPageReclaim(page);
> >> +                ret = shmem_writepage(page, &wbc);
> >> +                if (ret || PageWriteback(page)) {
> >> +                        if (ret)
> >> +                                unlock_page(page);
> >> +                        goto keep;
> >> +                }
> >> +
> >> +                if (!PageWriteback(page))
> >> +                        ClearPageReclaim(page);
> >> +
> >> +                /*
> >> +                 * shmem_writepage() place the page in the swapcache.
> >> +                 * Delete the page from the swapcache and release the
> >> +                 * page.
> >> +                 */
> >> +                __mod_node_page_state(page_pgdat(page),
> >> +                                NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
> >> +                lock_page(page);
> >> +                delete_from_swap_cache(page);
> >> +                unlock_page(page);
> >> +                put_page(page);
> >> +                continue;
> >> +keep:
> >> +                putback_lru_page(page);
> >> +                __mod_node_page_state(page_pgdat(page),
> >> +                                NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
> >> +        }
> >
> > The putback_lru_page() drops the last reference this code holds on
> > 'page'. Is it safe to use 'page' after dropping this reference?
>
> True. Will correct it in the next revision.
>
> > Cheers,
> > Mark
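For reference, a rough, untested sketch of the reclaim_pages() approach
floated above. It assumes shmem_isolate_pages_range() is changed to skip the
NR_ISOLATED accounting, since reclaim_pages(), as used for MADV_PAGEOUT,
operates on pages isolated without that accounting; whether its unconditional
unmapping behaviour is acceptable here is exactly the open question in this
thread:

```c
static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
                loff_t end)
{
        LIST_HEAD(list);
        struct page *page;

        if (!shmem_mapping(mapping))
                return -EINVAL;
        /* Without swap there is nowhere to write shmem pages. */
        if (!total_swap_pages)
                return 0;

        lru_add_drain();
        shmem_isolate_pages_range(mapping, start, end, &list);

        /*
         * Mark the pages dirty so reclaim writes them out to swap
         * instead of discarding them.
         */
        list_for_each_entry(page, &list, lru)
                set_page_dirty(page);

        /*
         * shrink_page_list() handles the writeback/mapping races
         * discussed above; reclaim_pages() puts back any page it
         * cannot reclaim.
         */
        reclaim_pages(&list);
        return 0;
}
```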
On Wed, Jan 12, 2022 at 01:51:55PM +0530, Charan Teja Kalla wrote:
> >>> +        rcu_read_lock();
> >>> +        xas_for_each(&xas, page, end) {
> >>> +                if (!xa_is_value(page))
> >>> +                        continue;
> >>> +                xas_pause(&xas);
> >>> +                rcu_read_unlock();
> >>> +
> >>> +                page = shmem_read_mapping_page(mapping, xas.xa_index);
> >>> +                if (!IS_ERR(page))
> >>> +                        put_page(page);
> >>> +
> >>> +                rcu_read_lock();
> >>> +                if (need_resched()) {
> >>> +                        xas_pause(&xas);
> >>> +                        cond_resched_rcu();
> >>> +                }
> >>> +        }
> >>> +        rcu_read_unlock();
> > Even the XArray documentation says: "If most entries found during a
> > walk require you to call xas_pause(), the xa_for_each() iterator may be
> > more appropriate."

Yes. This should obviously be an xa_for_each() loop.

> > Since every value entry found in the xarray requires me to call
> > xas_pause(), I agree that xa_for_each() is the appropriate call here.
> > Will switch to it in the next spin. Waiting for further review
> > comments on this patch.
>
> I also found the below documentation:
> "xa_for_each() will spin if it hits a retry entry; if you intend to see
> retry entries, you should use the xas_for_each() iterator instead."
>
> Since retry entries are expected, I should be using xas_for_each() with
> the corrections you pointed out. Isn't it?

No. You aren't handling retry entries at all; you clearly don't
expect to see them. Just let the XArray code handle them itself.
Thanks Matthew for the review.

On 1/12/2022 6:49 PM, Matthew Wilcox wrote:
> On Wed, Jan 12, 2022 at 01:51:55PM +0530, Charan Teja Kalla wrote:
>>>>> +        rcu_read_lock();
>>>>> +        xas_for_each(&xas, page, end) {
>>>>> +                if (!xa_is_value(page))
>>>>> +                        continue;
>>>>> +                xas_pause(&xas);
>>>>> +                rcu_read_unlock();
>>>>> +
>>>>> +                page = shmem_read_mapping_page(mapping, xas.xa_index);
>>>>> +                if (!IS_ERR(page))
>>>>> +                        put_page(page);
>>>>> +
>>>>> +                rcu_read_lock();
>>>>> +                if (need_resched()) {
>>>>> +                        xas_pause(&xas);
>>>>> +                        cond_resched_rcu();
>>>>> +                }
>>>>> +        }
>>>>> +        rcu_read_unlock();
>>> Even the XArray documentation says: "If most entries found during a
>>> walk require you to call xas_pause(), the xa_for_each() iterator may be
>>> more appropriate."
>
> Yes. This should obviously be an xa_for_each() loop.

ACK.

>>> Since every value entry found in the xarray requires me to call
>>> xas_pause(), I agree that xa_for_each() is the appropriate call here.
>>> Will switch to it in the next spin. Waiting for further review
>>> comments on this patch.
>>
>> I also found the below documentation:
>> "xa_for_each() will spin if it hits a retry entry; if you intend to see
>> retry entries, you should use the xas_for_each() iterator instead."
>>
>> Since retry entries are expected, I should be using xas_for_each() with
>> the corrections you pointed out. Isn't it?
>
> No. You aren't handling retry entries at all; you clearly don't
> expect to see them. Just let the XArray code handle them itself.

ACK.
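Putting both review outcomes together, a minimal, untested sketch of the
willneed walk as an xa_for_each()-style loop; xa_for_each_range() is assumed
available, and it takes and drops the RCU lock internally for each lookup and
skips retry entries itself, so no xas_pause() dance is needed:

```c
static int shmem_fadvise_willneed(struct address_space *mapping,
                                  pgoff_t start, pgoff_t end)
{
        unsigned long index;
        struct page *page;

        /* Each iteration finds the next present entry under its own
         * RCU lock, so sleeping between iterations is fine. */
        xa_for_each_range(&mapping->i_pages, index, page, start, end) {
                /* Only swap (value) entries need to be read back in. */
                if (!xa_is_value(page))
                        continue;
                page = shmem_read_mapping_page(mapping, index);
                if (!IS_ERR(page))
                        put_page(page);
                cond_resched();
        }

        return 0;
}
```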
Thanks Mark!!

On 1/12/2022 5:08 PM, Mark Hemment wrote:
>>>> +static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
>>>> +                loff_t end)
>>>> +{
>>>> +        int ret;
>>>> +        struct page *page;
>>>> +        LIST_HEAD(list);
>>>> +        struct writeback_control wbc = {
>>>> +                .sync_mode = WB_SYNC_NONE,
>>>> +                .nr_to_write = LONG_MAX,
>>>> +                .range_start = 0,
>>>> +                .range_end = LLONG_MAX,
>>>> +                .for_reclaim = 1,
>>>> +        };
>>>> +
>>>> +        if (!shmem_mapping(mapping))
>>>> +                return -EINVAL;
>>>> +
>>>> +        if (!total_swap_pages)
>>>> +                return 0;
>>>> +
>>>> +        lru_add_drain();
>>>> +        shmem_isolate_pages_range(mapping, start, end, &list);
>>>> +
>>>> +        while (!list_empty(&list)) {
>>>> +                page = lru_to_page(&list);
>>>> +                list_del(&page->lru);
>>>> +                if (page_mapped(page))
>>>> +                        goto keep;
>>>> +                if (!trylock_page(page))
>>>> +                        goto keep;
>>>> +                if (unlikely(PageTransHuge(page))) {
>>>> +                        if (split_huge_page_to_list(page, &list))
>>>> +                                goto keep;
>>>> +                }
>>> I don't know the shmem code and the lifecycle of a shm-page, so
>>> genuine questions;
>>> When the try-lock succeeds, should there be a test for PageWriteback()
>>> (page skipped if true)? Also, does page->mapping need to be tested
>>> for NULL to prevent races with deletion from the page-cache?
>> I failed to envisage these. I should have considered both conditions
>> here. BTW, I am wondering why we shouldn't use the reclaim_pages(page_list)
>> function here, with an extra set_page_dirty() on each isolated page? It
>> just calls shrink_page_list(), where all these conditions are properly
>> handled. What is your opinion here?
> Should be possible to use reclaim_pages() (I haven't looked closely).
> It might actually be good to use this function, as it will do some
> congestion throttling. Although it will always try to unmap pages
> (note: your page_mapped() test is 'unstable' as done without the
> page locked), so it might give behaviour you want to avoid.

page_mapped() can become true between isolating a page and asking for
its reclaim through reclaim_pages(), which can then unmap it there.
Thanks for pointing it out.

BTW, the posix_fadvise man page[1] doesn't mention any restrictions for
mapped pages. If so, am I even allowed to unmap the pages when
FADV_DONTNEED is called on the file (agreed that for mapped pages we can
rely on MADV_DONTNEED too)?

[1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html

> Note: reclaim_pages() is already used for madvise(PAGEOUT). The shmem
> code would need to prepare page(s) to help shrink_page_list() to make
> progress (see madvise.c:madvise_cold_or_pageout_pte_range()).
>
> Taking a step back; is fadvise(DONTNEED) really needed/wanted? Yes,
> you gave a usecase (which I cut from this thread in my earlier reply),
> but I'm not familiar with various shmem uses to know if this feature
> is needed. Someone else will need to answer this.

Actually, I need this for the use case mentioned. And regarding other
use cases: GEM buffers for display/graphics use shmem buffers, see
drivers/gpu/drm/i915/gem/i915_gem_shmem.c.
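For context, this is roughly how the advice would be driven from user space
once shmem honours it; a hypothetical example only. Note that with this patch
DONTNEED is a no-op without swap, and pages that are still mapped are
skipped, hence the munmap() before the hint:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const size_t len = 1 << 20;
        int fd = memfd_create("shmem-cache", 0);  /* memfd is shmem-backed */

        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p, 0xa5, len);   /* populate the shmem pages */
        munmap(p, len);         /* mapped pages are skipped, so unmap first */

        /* Hint: write these pages to swap and drop them from memory. */
        posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);

        /* Later, bring them back in ahead of the next access. */
        posix_fadvise(fd, 0, len, POSIX_FADV_WILLNEED);

        close(fd);
        return 0;
}
```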
Hello Matthew,

On 1/12/2022 7:05 PM, Charan Teja Kalla wrote:
>>>>> +        rcu_read_lock();
>>>>> +        xas_for_each(&xas, page, end) {
>>>>> +                if (!xa_is_value(page))
>>>>> +                        continue;
>>>>> +                xas_pause(&xas);
>>>>> +                rcu_read_unlock();
>>>>> +
>>>>> +                page = shmem_read_mapping_page(mapping, xas.xa_index);
>>>>> +                if (!IS_ERR(page))
>>>>> +                        put_page(page);
>>>>> +
>>>>> +                rcu_read_lock();
>>>>> +                if (need_resched()) {
>>>>> +                        xas_pause(&xas);
>>>>> +                        cond_resched_rcu();
>>>>> +                }
>>>>> +        }
>>>>> +        rcu_read_unlock();
>>> Even the XArray documentation says: "If most entries found during a
>>> walk require you to call xas_pause(), the xa_for_each() iterator may be
>>> more appropriate."
> Yes. This should obviously be an xa_for_each() loop.

In one of your patches[1] that uses an XArray iterator, most of the
entries found require a call to xas_pause(), yet it still ends up using
xas_for_each() rather than xa_for_each(). Should that code be changed
to use xa_for_each()? The documentation also says: "The xas_for_each()
iterator will expand into more inline code than xa_for_each()."

[1] https://patchwork.kernel.org/project/linux-mm/patch/20200819184850.24779-4-willy@infradead.org/
On Tue, Jan 18, 2022 at 05:05:40PM +0530, Charan Teja Kalla wrote:
> Hello Matthew,
>
> > Yes. This should obviously be an xa_for_each() loop.
>
> In one of your patches[1] that uses an XArray iterator, most of the
> entries found require a call to xas_pause(), yet it still ends up using
> xas_for_each() rather than xa_for_each(). Should that code be changed
> to use xa_for_each()? The documentation also says: "The xas_for_each()
> iterator will expand into more inline code than xa_for_each()."
>
> [1] https://patchwork.kernel.org/project/linux-mm/patch/20200819184850.24779-4-willy@infradead.org/

How do you know the distribution of swap and non-swap entries in that
region of that xarray?
diff --git a/mm/shmem.c b/mm/shmem.c
index 23c91a8..dd3ac2e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -39,6 +39,8 @@
 #include <linux/frontswap.h>
 #include <linux/fs_parser.h>
 #include <linux/swapfile.h>
+#include <linux/mm_inline.h>
+#include <linux/fadvise.h>
 
 static struct vfsmount *shm_mnt;
 
@@ -2275,6 +2277,175 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static void shmem_isolate_pages_range(struct address_space *mapping, loff_t start,
+		loff_t end, struct list_head *list)
+{
+	XA_STATE(xas, &mapping->i_pages, start);
+	struct page *page;
+
+	rcu_read_lock();
+	xas_for_each(&xas, page, end) {
+		if (xas_retry(&xas, page))
+			continue;
+		if (xa_is_value(page))
+			continue;
+		if (!get_page_unless_zero(page))
+			continue;
+		if (isolate_lru_page(page))
+			continue;
+
+		list_add(&page->lru, list);
+		__mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
+		put_page(page);
+		if (need_resched()) {
+			xas_pause(&xas);
+			cond_resched_rcu();
+		}
+	}
+	rcu_read_unlock();
+}
+
+static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
+		loff_t end)
+{
+	int ret;
+	struct page *page;
+	LIST_HEAD(list);
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = LONG_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.for_reclaim = 1,
+	};
+
+	if (!shmem_mapping(mapping))
+		return -EINVAL;
+
+	if (!total_swap_pages)
+		return 0;
+
+	lru_add_drain();
+	shmem_isolate_pages_range(mapping, start, end, &list);
+
+	while (!list_empty(&list)) {
+		page = lru_to_page(&list);
+		list_del(&page->lru);
+		if (page_mapped(page))
+			goto keep;
+		if (!trylock_page(page))
+			goto keep;
+		if (unlikely(PageTransHuge(page))) {
+			if (split_huge_page_to_list(page, &list))
+				goto keep;
+		}
+
+		clear_page_dirty_for_io(page);
+		SetPageReclaim(page);
+		ret = shmem_writepage(page, &wbc);
+		if (ret || PageWriteback(page)) {
+			if (ret)
+				unlock_page(page);
+			goto keep;
+		}
+
+		if (!PageWriteback(page))
+			ClearPageReclaim(page);
+
+		/*
+		 * shmem_writepage() place the page in the swapcache.
+		 * Delete the page from the swapcache and release the
+		 * page.
+		 */
+		__mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
+		lock_page(page);
+		delete_from_swap_cache(page);
+		unlock_page(page);
+		put_page(page);
+		continue;
+keep:
+		putback_lru_page(page);
+		__mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_lru(page), compound_nr(page));
+	}
+
+	return 0;
+}
+
+static int shmem_fadvise_willneed(struct address_space *mapping,
+		 pgoff_t start, pgoff_t long end)
+{
+	XA_STATE(xas, &mapping->i_pages, start);
+	struct page *page;
+
+	rcu_read_lock();
+	xas_for_each(&xas, page, end) {
+		if (!xa_is_value(page))
+			continue;
+		xas_pause(&xas);
+		rcu_read_unlock();
+
+		page = shmem_read_mapping_page(mapping, xas.xa_index);
+		if (!IS_ERR(page))
+			put_page(page);
+
+		rcu_read_lock();
+		if (need_resched()) {
+			xas_pause(&xas);
+			cond_resched_rcu();
+		}
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
+static int shmem_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
+{
+	loff_t endbyte;
+	pgoff_t start_index;
+	pgoff_t end_index;
+	struct address_space *mapping;
+	int ret = 0;
+
+	mapping = file->f_mapping;
+	if (!mapping || len < 0)
+		return -EINVAL;
+
+	endbyte = (u64)offset + (u64)len;
+	if (!len || endbyte < len)
+		endbyte = -1;
+	else
+		endbyte--;
+
+
+	start_index = offset >> PAGE_SHIFT;
+	end_index = endbyte >> PAGE_SHIFT;
+	switch (advice) {
+	case POSIX_FADV_DONTNEED:
+		ret = shmem_fadvise_dontneed(mapping, start_index, end_index);
+		break;
+	case POSIX_FADV_WILLNEED:
+		ret = shmem_fadvise_willneed(mapping, start_index, end_index);
+		break;
+	case POSIX_FADV_NORMAL:
+	case POSIX_FADV_RANDOM:
+	case POSIX_FADV_SEQUENTIAL:
+	case POSIX_FADV_NOREUSE:
+		/*
+		 * No bad return value, but ignore advice. May have to
+		 * implement in future.
+		 */
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ret;
+}
+
 static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
 				     umode_t mode, dev_t dev, unsigned long flags)
 {
@@ -3826,6 +3997,7 @@ static const struct file_operations shmem_file_operations = {
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
 #endif
+	.fadvise	= shmem_fadvise,
 };
 
 static const struct inode_operations shmem_inode_operations = {