| Message ID | 20240726094618.401593-4-21cnbao@gmail.com (mailing list archive) |
|---|---|
| State | New |
| Series | mm: support mTHP swap-in for zRAM-like swapfile |
On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -					vma, vmf->address, false);
> +		folio = alloc_swap_folio(vmf);
> 		page = &folio->page;

This is no longer correct. You need to set 'page' to the precise page
that is being faulted rather than the first page of the folio. It was
fine before because it always allocated a single-page folio, but now it
must use folio_page() or folio_file_page() (whichever has the correct
semantics for you).

Also you need to fix your test suite to notice this bug. I suggest
doing that first so that you know whether you've got the calculation
correct.
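A minimal sketch of the change Matthew is describing, assuming `entry` still holds the swap entry decoded from `vmf->orig_pte` at this point in do_swap_page(). This illustrates his suggestion only; it is not necessarily the approach the series ends up taking (the series instead remaps the whole folio at an aligned address, as discussed below):

```c
	/* skip swapcache */
	folio = alloc_swap_folio(vmf);
	if (folio)
		/*
		 * Pick the precise subpage being faulted rather than the folio
		 * head; for a naturally aligned large folio this indexes by
		 * swp_offset(entry) modulo the folio size.
		 */
		page = folio_file_page(folio, swp_offset(entry));
```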
On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -					vma, vmf->address, false);
> > +		folio = alloc_swap_folio(vmf);
> > 		page = &folio->page;
>
> This is no longer correct. You need to set 'page' to the precise page
> that is being faulted rather than the first page of the folio. It was
> fine before because it always allocated a single-page folio, but now it
> must use folio_page() or folio_file_page() (whichever has the correct
> semantics for you).
>
> Also you need to fix your test suite to notice this bug. I suggest
> doing that first so that you know whether you've got the calculation
> correct.

I don't understand why the code is designed in the way the page
is the first page of this folio. Otherwise, we need lots of changes
later while mapping the folio in ptes and rmap.

Thanks
Barry
Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
>
> On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -					vma, vmf->address, false);
> > +		folio = alloc_swap_folio(vmf);
> > 		page = &folio->page;
>
> This is no longer correct. You need to set 'page' to the precise page
> that is being faulted rather than the first page of the folio. It was
> fine before because it always allocated a single-page folio, but now it
> must use folio_page() or folio_file_page() (whichever has the correct
> semantics for you).
>
> Also you need to fix your test suite to notice this bug. I suggest
> doing that first so that you know whether you've got the calculation
> correct.
>

This is not a problem now: we support swapping large folios in as a whole,
so the head page is used here instead of the page that is being faulted.
You can also refer to the current code context; supporting large-folio
swap-in as a whole is not the same as the previous code, which only
supported swapping in a single small page.
On Mon, Jul 29, 2024 at 04:46:42PM +1200, Barry Song wrote:
> On Mon, Jul 29, 2024 at 4:41 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > -					vma, vmf->address, false);
> > > > +		folio = alloc_swap_folio(vmf);
> > > > 		page = &folio->page;
> > >
> > > This is no longer correct. You need to set 'page' to the precise page
> > > that is being faulted rather than the first page of the folio. It was
> > > fine before because it always allocated a single-page folio, but now it
> > > must use folio_page() or folio_file_page() (whichever has the correct
> > > semantics for you).
> > >
> > > Also you need to fix your test suite to notice this bug. I suggest
> > > doing that first so that you know whether you've got the calculation
> > > correct.
> >
> > I don't understand why the code is designed in the way the page
> > is the first page of this folio. Otherwise, we need lots of changes
> > later while mapping the folio in ptes and rmap.

What?

	folio = swap_cache_get_folio(entry, vma, vmf->address);
	if (folio)
		page = folio_file_page(folio, swp_offset(entry));

page is the precise page, not the first page of the folio.

> For both accessing large folios in the swapcache and allocating
> new large folios, the page points to the first page of the folio. we
> are mapping the whole folio not the specific page.

But what address are we mapping the whole folio at?

> for swapcache cases, you can find the same thing here,
>
> 	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> 		...
> 		entry = folio->swap;
> 		page = &folio->page;
> 	}

Yes, but you missed some important lines from your quote:

	page_idx = idx;
	address = folio_start;
	ptep = folio_ptep;
	nr_pages = nr;

We deliberately adjust the address so that, yes, we're mapping the entire
folio, but we're mapping it at an address that means that the page we
actually faulted on ends up at the address that we faulted on.
On Mon, Jul 29, 2024 at 02:36:38PM +0800, Chuanhua Han wrote:
> Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
> >
> > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > -					vma, vmf->address, false);
> > > +		folio = alloc_swap_folio(vmf);
> > > 		page = &folio->page;
> >
> > This is no longer correct. You need to set 'page' to the precise page
> > that is being faulted rather than the first page of the folio. It was
> > fine before because it always allocated a single-page folio, but now it
> > must use folio_page() or folio_file_page() (whichever has the correct
> > semantics for you).
> >
> > Also you need to fix your test suite to notice this bug. I suggest
> > doing that first so that you know whether you've got the calculation
> > correct.
> >
>
> This is no problem now, we support large folios swapin as a whole, so
> the head page is used here instead of the page that is being faulted.
> You can also refer to the current code context, now support large
> folios swapin as a whole, and previously only support small page
> swapin is not the same.

You have completely failed to understand the problem. Let's try it this
way:

We take a page fault at address 0x123456789000.
If part of a 16KiB folio, that's page 1 of the folio at 0x123456788000.
If you now map page 0 of the folio at 0x123456789000, you've
given the user the wrong page! That looks like data corruption.

The code in

	if (folio_test_large(folio) && folio_test_swapcache(folio)) {

as Barry pointed out will save you -- but what if those conditions fail?
What if the mmap has been mremap()ed and the folio now crosses a PMD
boundary? mk_pte() will now be called on the wrong page.
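To make the address arithmetic in Matthew's example explicit, here is a small standalone sketch using his hypothetical addresses and assuming 4KiB pages and a 16KiB (4-page) folio:

```c
/*
 * Worked example of the alignment adjustment: the whole folio is mapped
 * starting at folio_start so that the page that was actually faulted ends
 * up at the address that was actually faulted.
 */
#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	unsigned long address = 0x123456789000UL;	/* faulting address */
	unsigned long idx = 1;				/* folio_page_idx(folio, page) */
	unsigned long folio_start = address - idx * PAGE_SIZE;

	printf("map folio at %#lx; faulted page stays at %#lx\n",
	       folio_start, folio_start + idx * PAGE_SIZE);
	/* prints: map folio at 0x123456788000; faulted page stays at 0x123456789000 */
	return 0;
}
```

Mapping the folio head at the faulting address instead would shift every subpage by idx pages, which is the data-corruption scenario described above.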
On Tue, Jul 30, 2024 at 12:49 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jul 29, 2024 at 04:46:42PM +1200, Barry Song wrote:
> > On Mon, Jul 29, 2024 at 4:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > > -					vma, vmf->address, false);
> > > > > +		folio = alloc_swap_folio(vmf);
> > > > > 		page = &folio->page;
> > > >
> > > > This is no longer correct. You need to set 'page' to the precise page
> > > > that is being faulted rather than the first page of the folio. It was
> > > > fine before because it always allocated a single-page folio, but now it
> > > > must use folio_page() or folio_file_page() (whichever has the correct
> > > > semantics for you).
> > > >
> > > > Also you need to fix your test suite to notice this bug. I suggest
> > > > doing that first so that you know whether you've got the calculation
> > > > correct.
> > >
> > > I don't understand why the code is designed in the way the page
> > > is the first page of this folio. Otherwise, we need lots of changes
> > > later while mapping the folio in ptes and rmap.
>
> What?
>
> 	folio = swap_cache_get_folio(entry, vma, vmf->address);
> 	if (folio)
> 		page = folio_file_page(folio, swp_offset(entry));
>
> page is the precise page, not the first page of the folio.

This is the case where we may get a large folio in the swapcache but end up
mapping only one subpage, because the condition to map the whole folio is
not met. If we meet the condition, we are going to set page to the head
instead and map the whole mTHP:

	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
		int nr = folio_nr_pages(folio);
		unsigned long idx = folio_page_idx(folio, page);
		unsigned long folio_start = address - idx * PAGE_SIZE;
		unsigned long folio_end = folio_start + nr * PAGE_SIZE;
		pte_t *folio_ptep;
		pte_t folio_pte;

		if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
			goto check_folio;
		if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
			goto check_folio;

		folio_ptep = vmf->pte - idx;
		folio_pte = ptep_get(folio_ptep);
		if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
		    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
			goto check_folio;

		page_idx = idx;
		address = folio_start;
		ptep = folio_ptep;
		nr_pages = nr;
		entry = folio->swap;
		page = &folio->page;
	}

> > For both accessing large folios in the swapcache and allocating
> > new large folios, the page points to the first page of the folio. we
> > are mapping the whole folio not the specific page.
>
> But what address are we mapping the whole folio at?
>
> > for swapcache cases, you can find the same thing here,
> >
> > 	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > 		...
> > 		entry = folio->swap;
> > 		page = &folio->page;
> > 	}
>
> Yes, but you missed some important lines from your quote:
>
> 	page_idx = idx;
> 	address = folio_start;
> 	ptep = folio_ptep;
> 	nr_pages = nr;
>
> We deliberately adjust the address so that, yes, we're mapping the entire
> folio, but we're mapping it at an address that means that the page we
> actually faulted on ends up at the address that we faulted on.

For this zRAM case, it is a newly allocated large folio; only while all
conditions are met will we allocate and map the whole folio. You can check
can_swapin_thp() and thp_swap_suitable_orders().
static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
{
	struct swap_info_struct *si;
	unsigned long addr;
	swp_entry_t entry;
	pgoff_t offset;
	char has_cache;
	int idx, i;
	pte_t pte;

	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
	idx = (vmf->address - addr) / PAGE_SIZE;
	pte = ptep_get(ptep);

	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
		return false;
	entry = pte_to_swp_entry(pte);
	offset = swp_offset(entry);
	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
		return false;

	si = swp_swap_info(entry);
	has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
	for (i = 1; i < nr_pages; i++) {
		/*
		 * while allocating a large folio and doing swap_read_folio for the
		 * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
		 * doesn't have swapcache. We need to ensure all PTEs have no cache
		 * as well, otherwise, we might go to swap devices while the content
		 * is in swapcache
		 */
		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
			return false;
	}

	return true;
}

and

static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
	....
	entry = pte_to_swp_entry(vmf->orig_pte);
	/*
	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
	 * and suitable for swapping THP.
	 */
	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
			TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS,
			BIT(PMD_ORDER) - 1);
	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
	orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
	....
}

and

static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
		unsigned long addr, unsigned long orders)
{
	int order, nr;

	order = highest_order(orders);

	/*
	 * To swap-in a THP with nr pages, we require its first swap_offset
	 * is aligned with nr. This can filter out most invalid entries.
	 */
	while (orders) {
		nr = 1 << order;
		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
			break;
		order = next_order(&orders, order);
	}

	return orders;
}

An mTHP is swapped out at an aligned swap offset, and we only swap in an
aligned mTHP. If somehow an mTHP is mremap()ed to an unaligned address, we
won't swap it in as a large folio. For the swapcache case, we are still
checking unaligned mTHP, but for a newly allocated mTHP it is a different
story. There is no necessity at all to support unaligned mTHP, and there is
no possibility of supporting it unless something is marked in the swap
device to say there was an mTHP.

Thanks
Barry
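A small standalone sketch (user-space C, made-up fault address) of how can_swapin_thp() locates the folio-sized block around the faulting address: it aligns the fault down to the folio size and derives the index of the faulting page inside that block. The ALIGN_DOWN macro here is a local stand-in that matches the kernel's behaviour for power-of-two alignments:

```c
#include <stdio.h>

#define PAGE_SIZE 4096UL
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))	/* power-of-two 'a' only */

int main(void)
{
	unsigned long fault = 0x700000005000UL;	/* hypothetical fault address */
	int nr_pages = 4;			/* 16KiB folio */
	unsigned long addr = ALIGN_DOWN(fault, nr_pages * PAGE_SIZE);
	int idx = (fault - addr) / PAGE_SIZE;

	/*
	 * can_swapin_thp() then requires the swap PTE at (ptep - idx) to equal
	 * orig_pte moved back by idx, and all nr_pages PTEs to form one
	 * contiguous swap batch with consistent SWAP_HAS_CACHE state.
	 */
	printf("block start %#lx, faulting page is index %d of %d\n",
	       addr, idx, nr_pages);	/* 0x700000004000, index 1 of 4 */
	return 0;
}
```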
On Tue, Jul 30, 2024 at 12:55 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jul 29, 2024 at 02:36:38PM +0800, Chuanhua Han wrote:
> > Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
> > >
> > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > -					vma, vmf->address, false);
> > > > +		folio = alloc_swap_folio(vmf);
> > > > 		page = &folio->page;
> > >
> > > This is no longer correct. You need to set 'page' to the precise page
> > > that is being faulted rather than the first page of the folio. It was
> > > fine before because it always allocated a single-page folio, but now it
> > > must use folio_page() or folio_file_page() (whichever has the correct
> > > semantics for you).
> > >
> > > Also you need to fix your test suite to notice this bug. I suggest
> > > doing that first so that you know whether you've got the calculation
> > > correct.
> > >
> >
> > This is no problem now, we support large folios swapin as a whole, so
> > the head page is used here instead of the page that is being faulted.
> > You can also refer to the current code context, now support large
> > folios swapin as a whole, and previously only support small page
> > swapin is not the same.
>
> You have completely failed to understand the problem. Let's try it this
> way:
>
> We take a page fault at address 0x123456789000.
> If part of a 16KiB folio, that's page 1 of the folio at 0x123456788000.
> If you now map page 0 of the folio at 0x123456789000, you've
> given the user the wrong page! That looks like data corruption.
>
> The code in
> 	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> as Barry pointed out will save you -- but what if those conditions fail?
> What if the mmap has been mremap()ed and the folio now crosses a PMD
> boundary? mk_pte() will now be called on the wrong page.

Chuanhua understood everything correctly. I think you might have missed
that we have very strict checks both before allocating large folios and
before mapping them for this newly allocated mTHP swap-in case.

To allocate a large folio, we check all the alignment requirements: the PTEs
have an aligned swap offset and are all physically contiguous, which is how
an mTHP is swapped out. If an mTHP has been mremap()ed to be unaligned, we
won't swap it in as an mTHP, for two reasons:
1. we have no way to figure out what the start address of the previous mTHP
   was for the non-swapcache case;
2. mremap() to unaligned addresses is rare.

To map a large folio, we check that all the PTEs are still there by double
confirming that can_swapin_thp() is true. If the PTEs have changed, this is
a "goto out_nomap" case.

	/* allocated large folios for SWP_SYNCHRONOUS_IO */
	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
		unsigned long nr = folio_nr_pages(folio);
		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
		pte_t *folio_ptep = vmf->pte - idx;

		if (!can_swapin_thp(vmf, folio_ptep, nr))
			goto out_nomap;

		page_idx = idx;
		address = folio_start;
		ptep = folio_ptep;
		goto check_folio;
	}

Thanks
Barry
Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 20:55:
>
> On Mon, Jul 29, 2024 at 02:36:38PM +0800, Chuanhua Han wrote:
> > Matthew Wilcox <willy@infradead.org> wrote on Mon, Jul 29, 2024 at 11:51:
> > >
> > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > -					vma, vmf->address, false);
> > > > +		folio = alloc_swap_folio(vmf);
> > > > 		page = &folio->page;
> > >
> > > This is no longer correct. You need to set 'page' to the precise page
> > > that is being faulted rather than the first page of the folio. It was
> > > fine before because it always allocated a single-page folio, but now it
> > > must use folio_page() or folio_file_page() (whichever has the correct
> > > semantics for you).
> > >
> > > Also you need to fix your test suite to notice this bug. I suggest
> > > doing that first so that you know whether you've got the calculation
> > > correct.
> > >
> >
> > This is no problem now, we support large folios swapin as a whole, so
> > the head page is used here instead of the page that is being faulted.
> > You can also refer to the current code context, now support large
> > folios swapin as a whole, and previously only support small page
> > swapin is not the same.
>
> You have completely failed to understand the problem. Let's try it this
> way:
>
> We take a page fault at address 0x123456789000.
> If part of a 16KiB folio, that's page 1 of the folio at 0x123456788000.
> If you now map page 0 of the folio at 0x123456789000, you've
> given the user the wrong page! That looks like data corruption.

The user does not get the wrong data, because we are mapping the whole
folio: for a 16KiB folio, we map the full 16KiB through the page table.

>
> The code in
> 	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> as Barry pointed out will save you -- but what if those conditions fail?
> What if the mmap has been mremap()ed and the folio now crosses a PMD
> boundary? mk_pte() will now be called on the wrong page.

These special cases have been dealt with in our patch. For an mTHP large
folio, mk_pte() uses the head page to construct the PTE.
Hi Barry, kernel test robot noticed the following build warnings: url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20240726094618.401593-4-21cnbao%40gmail.com patch subject: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile config: i386-randconfig-141-20240727 (https://download.01.org/0day-ci/archive/20240727/202407270917.18F5rYPH-lkp@intel.com/config) compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Reported-by: Dan Carpenter <dan.carpenter@linaro.org> | Closes: https://lore.kernel.org/r/202407270917.18F5rYPH-lkp@intel.com/ smatch warnings: mm/memory.c:4467 do_swap_page() error: uninitialized symbol 'nr_pages'. vim +/nr_pages +4467 mm/memory.c 2b7403035459c7 Souptick Joarder 2018-08-23 4143 vm_fault_t do_swap_page(struct vm_fault *vmf) ^1da177e4c3f41 Linus Torvalds 2005-04-16 4144 { 82b0f8c39a3869 Jan Kara 2016-12-14 4145 struct vm_area_struct *vma = vmf->vma; d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4146) struct folio *swapcache, *folio = NULL; d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4147) struct page *page; 2799e77529c2a2 Miaohe Lin 2021-06-28 4148 struct swap_info_struct *si = NULL; 14f9135d547060 David Hildenbrand 2022-05-09 4149 rmap_t rmap_flags = RMAP_NONE; 13ddaf26be324a Kairui Song 2024-02-07 4150 bool need_clear_cache = false; 1493a1913e34b0 David Hildenbrand 2022-05-09 4151 bool exclusive = false; 65500d234e74fc Hugh Dickins 2005-10-29 4152 swp_entry_t entry; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4153 pte_t pte; 2b7403035459c7 Souptick Joarder 2018-08-23 4154 vm_fault_t ret = 0; aae466b0052e18 Joonsoo Kim 2020-08-11 4155 void *shadow = NULL; 508758960b8d89 Chuanhua Han 2024-05-29 4156 int nr_pages; 508758960b8d89 Chuanhua Han 2024-05-29 4157 unsigned long page_idx; 508758960b8d89 Chuanhua Han 2024-05-29 4158 unsigned long address; 508758960b8d89 Chuanhua Han 2024-05-29 4159 pte_t *ptep; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4160 2ca99358671ad3 Peter Xu 2021-11-05 4161 if (!pte_unmap_same(vmf)) 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4162 goto out; 65500d234e74fc Hugh Dickins 2005-10-29 4163 2994302bc8a171 Jan Kara 2016-12-14 4164 entry = pte_to_swp_entry(vmf->orig_pte); d1737fdbec7f90 Andi Kleen 2009-09-16 4165 if (unlikely(non_swap_entry(entry))) { 0697212a411c1d Christoph Lameter 2006-06-23 4166 if (is_migration_entry(entry)) { 82b0f8c39a3869 Jan Kara 2016-12-14 4167 migration_entry_wait(vma->vm_mm, vmf->pmd, 82b0f8c39a3869 Jan Kara 2016-12-14 4168 vmf->address); b756a3b5e7ead8 Alistair Popple 2021-06-30 4169 } else if (is_device_exclusive_entry(entry)) { b756a3b5e7ead8 Alistair Popple 2021-06-30 4170 vmf->page = pfn_swap_entry_to_page(entry); b756a3b5e7ead8 Alistair Popple 2021-06-30 4171 ret = remove_device_exclusive_entry(vmf); 5042db43cc26f5 Jérôme Glisse 2017-09-08 4172 } else if (is_device_private_entry(entry)) { 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4173 if (vmf->flags & FAULT_FLAG_VMA_LOCK) { 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4174 /* 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4175 * migrate_to_ram is not yet ready to operate 
1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4176 * under VMA lock. 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4177 */ 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4178 vma_end_read(vma); 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4179 ret = VM_FAULT_RETRY; 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4180 goto out; 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4181 } 1235ccd05b6dd6 Suren Baghdasaryan 2023-06-30 4182 af5cdaf82238fb Alistair Popple 2021-06-30 4183 vmf->page = pfn_swap_entry_to_page(entry); 16ce101db85db6 Alistair Popple 2022-09-28 4184 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 16ce101db85db6 Alistair Popple 2022-09-28 4185 vmf->address, &vmf->ptl); 3db82b9374ca92 Hugh Dickins 2023-06-08 4186 if (unlikely(!vmf->pte || c33c794828f212 Ryan Roberts 2023-06-12 4187 !pte_same(ptep_get(vmf->pte), c33c794828f212 Ryan Roberts 2023-06-12 4188 vmf->orig_pte))) 3b65f437d9e8dd Ryan Roberts 2023-06-02 4189 goto unlock; 16ce101db85db6 Alistair Popple 2022-09-28 4190 16ce101db85db6 Alistair Popple 2022-09-28 4191 /* 16ce101db85db6 Alistair Popple 2022-09-28 4192 * Get a page reference while we know the page can't be 16ce101db85db6 Alistair Popple 2022-09-28 4193 * freed. 16ce101db85db6 Alistair Popple 2022-09-28 4194 */ 16ce101db85db6 Alistair Popple 2022-09-28 4195 get_page(vmf->page); 16ce101db85db6 Alistair Popple 2022-09-28 4196 pte_unmap_unlock(vmf->pte, vmf->ptl); 4a955bed882e73 Alistair Popple 2022-11-14 4197 ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); 16ce101db85db6 Alistair Popple 2022-09-28 4198 put_page(vmf->page); d1737fdbec7f90 Andi Kleen 2009-09-16 4199 } else if (is_hwpoison_entry(entry)) { d1737fdbec7f90 Andi Kleen 2009-09-16 4200 ret = VM_FAULT_HWPOISON; 5c041f5d1f23d3 Peter Xu 2022-05-12 4201 } else if (is_pte_marker_entry(entry)) { 5c041f5d1f23d3 Peter Xu 2022-05-12 4202 ret = handle_pte_marker(vmf); d1737fdbec7f90 Andi Kleen 2009-09-16 4203 } else { 2994302bc8a171 Jan Kara 2016-12-14 4204 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL); d99be1a8ecf377 Hugh Dickins 2009-12-14 4205 ret = VM_FAULT_SIGBUS; d1737fdbec7f90 Andi Kleen 2009-09-16 4206 } 0697212a411c1d Christoph Lameter 2006-06-23 4207 goto out; 0697212a411c1d Christoph Lameter 2006-06-23 4208 } 0bcac06f27d752 Minchan Kim 2017-11-15 4209 2799e77529c2a2 Miaohe Lin 2021-06-28 4210 /* Prevent swapoff from happening to us. 
*/ 2799e77529c2a2 Miaohe Lin 2021-06-28 4211 si = get_swap_device(entry); 2799e77529c2a2 Miaohe Lin 2021-06-28 4212 if (unlikely(!si)) 2799e77529c2a2 Miaohe Lin 2021-06-28 4213 goto out; 0bcac06f27d752 Minchan Kim 2017-11-15 4214 5a423081b2465d Matthew Wilcox (Oracle 2022-09-02 4215) folio = swap_cache_get_folio(entry, vma, vmf->address); 5a423081b2465d Matthew Wilcox (Oracle 2022-09-02 4216) if (folio) 5a423081b2465d Matthew Wilcox (Oracle 2022-09-02 4217) page = folio_file_page(folio, swp_offset(entry)); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4218) swapcache = folio; f80207727aaca3 Minchan Kim 2018-01-18 4219 d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4220) if (!folio) { a449bf58e45abf Qian Cai 2020-08-14 4221 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && eb085574a7526c Huang Ying 2019-07-11 4222 __swap_count(entry) == 1) { 684d098daf0b3a Chuanhua Han 2024-07-26 4223 /* skip swapcache */ 684d098daf0b3a Chuanhua Han 2024-07-26 4224 folio = alloc_swap_folio(vmf); 684d098daf0b3a Chuanhua Han 2024-07-26 4225 page = &folio->page; 684d098daf0b3a Chuanhua Han 2024-07-26 4226 if (folio) { 684d098daf0b3a Chuanhua Han 2024-07-26 4227 __folio_set_locked(folio); 684d098daf0b3a Chuanhua Han 2024-07-26 4228 __folio_set_swapbacked(folio); 684d098daf0b3a Chuanhua Han 2024-07-26 4229 684d098daf0b3a Chuanhua Han 2024-07-26 4230 nr_pages = folio_nr_pages(folio); nr_pages is initialized here 684d098daf0b3a Chuanhua Han 2024-07-26 4231 if (folio_test_large(folio)) 684d098daf0b3a Chuanhua Han 2024-07-26 4232 entry.val = ALIGN_DOWN(entry.val, nr_pages); 13ddaf26be324a Kairui Song 2024-02-07 4233 /* 13ddaf26be324a Kairui Song 2024-02-07 4234 * Prevent parallel swapin from proceeding with 13ddaf26be324a Kairui Song 2024-02-07 4235 * the cache flag. Otherwise, another thread may 13ddaf26be324a Kairui Song 2024-02-07 4236 * finish swapin first, free the entry, and swapout 13ddaf26be324a Kairui Song 2024-02-07 4237 * reusing the same entry. It's undetectable as 13ddaf26be324a Kairui Song 2024-02-07 4238 * pte_same() returns true due to entry reuse. 
13ddaf26be324a Kairui Song 2024-02-07 4239 */ 684d098daf0b3a Chuanhua Han 2024-07-26 4240 if (swapcache_prepare_nr(entry, nr_pages)) { 13ddaf26be324a Kairui Song 2024-02-07 4241 /* Relax a bit to prevent rapid repeated page faults */ 13ddaf26be324a Kairui Song 2024-02-07 4242 schedule_timeout_uninterruptible(1); 684d098daf0b3a Chuanhua Han 2024-07-26 4243 goto out_page; 13ddaf26be324a Kairui Song 2024-02-07 4244 } 13ddaf26be324a Kairui Song 2024-02-07 4245 need_clear_cache = true; 13ddaf26be324a Kairui Song 2024-02-07 4246 6599591816f522 Matthew Wilcox (Oracle 2022-09-02 4247) if (mem_cgroup_swapin_charge_folio(folio, 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4248) vma->vm_mm, GFP_KERNEL, 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4249) entry)) { 545b1b077ca6b3 Michal Hocko 2020-06-25 4250 ret = VM_FAULT_OOM; 4c6355b25e8bb8 Johannes Weiner 2020-06-03 4251 goto out_page; 545b1b077ca6b3 Michal Hocko 2020-06-25 4252 } 684d098daf0b3a Chuanhua Han 2024-07-26 4253 mem_cgroup_swapin_uncharge_swap_nr(entry, nr_pages); 4c6355b25e8bb8 Johannes Weiner 2020-06-03 4254 aae466b0052e18 Joonsoo Kim 2020-08-11 4255 shadow = get_shadow_from_swap_cache(entry); aae466b0052e18 Joonsoo Kim 2020-08-11 4256 if (shadow) 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4257) workingset_refault(folio, shadow); 0076f029cb2906 Joonsoo Kim 2020-06-25 4258 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4259) folio_add_lru(folio); 0add0c77a9bd0c Shakeel Butt 2021-04-29 4260 c9bdf768dd9319 Matthew Wilcox (Oracle 2023-12-13 4261) /* To provide entry to swap_read_folio() */ 3d2c9087688777 David Hildenbrand 2023-08-21 4262 folio->swap = entry; b2d1f38b524121 Yosry Ahmed 2024-06-07 4263 swap_read_folio(folio, NULL); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4264) folio->private = NULL; 0bcac06f27d752 Minchan Kim 2017-11-15 4265 } aa8d22a11da933 Minchan Kim 2017-11-15 4266 } else { e9e9b7ecee4a13 Minchan Kim 2018-04-05 4267 page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, e9e9b7ecee4a13 Minchan Kim 2018-04-05 4268 vmf); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4269) if (page) 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4270) folio = page_folio(page); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4271) swapcache = folio; 0bcac06f27d752 Minchan Kim 2017-11-15 4272 } 0bcac06f27d752 Minchan Kim 2017-11-15 4273 d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4274) if (!folio) { ^1da177e4c3f41 Linus Torvalds 2005-04-16 4275 /* 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4276 * Back out if somebody else faulted in this pte 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4277 * while we released the pte lock. 
^1da177e4c3f41 Linus Torvalds 2005-04-16 4278 */ 82b0f8c39a3869 Jan Kara 2016-12-14 4279 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, 82b0f8c39a3869 Jan Kara 2016-12-14 4280 vmf->address, &vmf->ptl); c33c794828f212 Ryan Roberts 2023-06-12 4281 if (likely(vmf->pte && c33c794828f212 Ryan Roberts 2023-06-12 4282 pte_same(ptep_get(vmf->pte), vmf->orig_pte))) ^1da177e4c3f41 Linus Torvalds 2005-04-16 4283 ret = VM_FAULT_OOM; 65500d234e74fc Hugh Dickins 2005-10-29 4284 goto unlock; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4285 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4286 ^1da177e4c3f41 Linus Torvalds 2005-04-16 4287 /* Had to read the page from swap area: Major fault */ ^1da177e4c3f41 Linus Torvalds 2005-04-16 4288 ret = VM_FAULT_MAJOR; f8891e5e1f93a1 Christoph Lameter 2006-06-30 4289 count_vm_event(PGMAJFAULT); 2262185c5b287f Roman Gushchin 2017-07-06 4290 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); d1737fdbec7f90 Andi Kleen 2009-09-16 4291 } else if (PageHWPoison(page)) { 71f72525dfaaec Wu Fengguang 2009-12-16 4292 /* 71f72525dfaaec Wu Fengguang 2009-12-16 4293 * hwpoisoned dirty swapcache pages are kept for killing 71f72525dfaaec Wu Fengguang 2009-12-16 4294 * owner processes (which may be unknown at hwpoison time) 71f72525dfaaec Wu Fengguang 2009-12-16 4295 */ d1737fdbec7f90 Andi Kleen 2009-09-16 4296 ret = VM_FAULT_HWPOISON; 4779cb31c0ee3b Andi Kleen 2009-10-14 4297 goto out_release; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4298 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4299 fdc724d6aa44ef Suren Baghdasaryan 2023-06-30 4300 ret |= folio_lock_or_retry(folio, vmf); fdc724d6aa44ef Suren Baghdasaryan 2023-06-30 4301 if (ret & VM_FAULT_RETRY) d065bd810b6deb Michel Lespinasse 2010-10-26 4302 goto out_release; 073e587ec2cc37 KAMEZAWA Hiroyuki 2008-10-18 4303 84d60fdd3733fb David Hildenbrand 2022-03-24 4304 if (swapcache) { 4969c1192d15af Andrea Arcangeli 2010-09-09 4305 /* 3b344157c0c15b Matthew Wilcox (Oracle 2022-09-02 4306) * Make sure folio_free_swap() or swapoff did not release the 84d60fdd3733fb David Hildenbrand 2022-03-24 4307 * swapcache from under us. The page pin, and pte_same test 84d60fdd3733fb David Hildenbrand 2022-03-24 4308 * below, are not enough to exclude that. Even if it is still 84d60fdd3733fb David Hildenbrand 2022-03-24 4309 * swapcache, we need to check that the page's swap has not 84d60fdd3733fb David Hildenbrand 2022-03-24 4310 * changed. 4969c1192d15af Andrea Arcangeli 2010-09-09 4311 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4312) if (unlikely(!folio_test_swapcache(folio) || cfeed8ffe55b37 David Hildenbrand 2023-08-21 4313 page_swap_entry(page).val != entry.val)) 4969c1192d15af Andrea Arcangeli 2010-09-09 4314 goto out_page; 4969c1192d15af Andrea Arcangeli 2010-09-09 4315 84d60fdd3733fb David Hildenbrand 2022-03-24 4316 /* 84d60fdd3733fb David Hildenbrand 2022-03-24 4317 * KSM sometimes has to copy on read faults, for example, if 84d60fdd3733fb David Hildenbrand 2022-03-24 4318 * page->index of !PageKSM() pages would be nonlinear inside the 84d60fdd3733fb David Hildenbrand 2022-03-24 4319 * anon VMA -- PageKSM() is lost on actual swapout. 
84d60fdd3733fb David Hildenbrand 2022-03-24 4320 */ 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4321) folio = ksm_might_need_to_copy(folio, vma, vmf->address); 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4322) if (unlikely(!folio)) { 5ad6468801d28c Hugh Dickins 2009-12-14 4323 ret = VM_FAULT_OOM; 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4324) folio = swapcache; 4969c1192d15af Andrea Arcangeli 2010-09-09 4325 goto out_page; 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4326) } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) { 6b970599e807ea Kefeng Wang 2022-12-09 4327 ret = VM_FAULT_HWPOISON; 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4328) folio = swapcache; 6b970599e807ea Kefeng Wang 2022-12-09 4329 goto out_page; 4969c1192d15af Andrea Arcangeli 2010-09-09 4330 } 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4331) if (folio != swapcache) 96db66d9c8f3c1 Matthew Wilcox (Oracle 2023-12-11 4332) page = folio_page(folio, 0); c145e0b47c77eb David Hildenbrand 2022-03-24 4333 c145e0b47c77eb David Hildenbrand 2022-03-24 4334 /* c145e0b47c77eb David Hildenbrand 2022-03-24 4335 * If we want to map a page that's in the swapcache writable, we c145e0b47c77eb David Hildenbrand 2022-03-24 4336 * have to detect via the refcount if we're really the exclusive c145e0b47c77eb David Hildenbrand 2022-03-24 4337 * owner. Try removing the extra reference from the local LRU 1fec6890bf2247 Matthew Wilcox (Oracle 2023-06-21 4338) * caches if required. c145e0b47c77eb David Hildenbrand 2022-03-24 4339 */ d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4340) if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache && 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4341) !folio_test_ksm(folio) && !folio_test_lru(folio)) c145e0b47c77eb David Hildenbrand 2022-03-24 4342 lru_add_drain(); 84d60fdd3733fb David Hildenbrand 2022-03-24 4343 } 5ad6468801d28c Hugh Dickins 2009-12-14 4344 4231f8425833b1 Kefeng Wang 2023-03-02 4345 folio_throttle_swaprate(folio, GFP_KERNEL); 8a9f3ccd24741b Balbir Singh 2008-02-07 4346 ^1da177e4c3f41 Linus Torvalds 2005-04-16 4347 /* 8f4e2101fd7df9 Hugh Dickins 2005-10-29 4348 * Back out if somebody else already faulted in this pte. 
^1da177e4c3f41 Linus Torvalds 2005-04-16 4349 */ 82b0f8c39a3869 Jan Kara 2016-12-14 4350 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, 82b0f8c39a3869 Jan Kara 2016-12-14 4351 &vmf->ptl); c33c794828f212 Ryan Roberts 2023-06-12 4352 if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) b81074800b98ac Kirill Korotaev 2005-05-16 4353 goto out_nomap; b81074800b98ac Kirill Korotaev 2005-05-16 4354 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4355) if (unlikely(!folio_test_uptodate(folio))) { b81074800b98ac Kirill Korotaev 2005-05-16 4356 ret = VM_FAULT_SIGBUS; b81074800b98ac Kirill Korotaev 2005-05-16 4357 goto out_nomap; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4358 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4359 684d098daf0b3a Chuanhua Han 2024-07-26 4360 /* allocated large folios for SWP_SYNCHRONOUS_IO */ 684d098daf0b3a Chuanhua Han 2024-07-26 4361 if (folio_test_large(folio) && !folio_test_swapcache(folio)) { 684d098daf0b3a Chuanhua Han 2024-07-26 4362 unsigned long nr = folio_nr_pages(folio); 684d098daf0b3a Chuanhua Han 2024-07-26 4363 unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); 684d098daf0b3a Chuanhua Han 2024-07-26 4364 unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; 684d098daf0b3a Chuanhua Han 2024-07-26 4365 pte_t *folio_ptep = vmf->pte - idx; 684d098daf0b3a Chuanhua Han 2024-07-26 4366 684d098daf0b3a Chuanhua Han 2024-07-26 4367 if (!can_swapin_thp(vmf, folio_ptep, nr)) 684d098daf0b3a Chuanhua Han 2024-07-26 4368 goto out_nomap; 684d098daf0b3a Chuanhua Han 2024-07-26 4369 684d098daf0b3a Chuanhua Han 2024-07-26 4370 page_idx = idx; 684d098daf0b3a Chuanhua Han 2024-07-26 4371 address = folio_start; 684d098daf0b3a Chuanhua Han 2024-07-26 4372 ptep = folio_ptep; 684d098daf0b3a Chuanhua Han 2024-07-26 4373 goto check_folio; Let's say we hit this goto 684d098daf0b3a Chuanhua Han 2024-07-26 4374 } 684d098daf0b3a Chuanhua Han 2024-07-26 4375 508758960b8d89 Chuanhua Han 2024-05-29 4376 nr_pages = 1; 508758960b8d89 Chuanhua Han 2024-05-29 4377 page_idx = 0; 508758960b8d89 Chuanhua Han 2024-05-29 4378 address = vmf->address; 508758960b8d89 Chuanhua Han 2024-05-29 4379 ptep = vmf->pte; 508758960b8d89 Chuanhua Han 2024-05-29 4380 if (folio_test_large(folio) && folio_test_swapcache(folio)) { 508758960b8d89 Chuanhua Han 2024-05-29 4381 int nr = folio_nr_pages(folio); 508758960b8d89 Chuanhua Han 2024-05-29 4382 unsigned long idx = folio_page_idx(folio, page); 508758960b8d89 Chuanhua Han 2024-05-29 4383 unsigned long folio_start = address - idx * PAGE_SIZE; 508758960b8d89 Chuanhua Han 2024-05-29 4384 unsigned long folio_end = folio_start + nr * PAGE_SIZE; 508758960b8d89 Chuanhua Han 2024-05-29 4385 pte_t *folio_ptep; 508758960b8d89 Chuanhua Han 2024-05-29 4386 pte_t folio_pte; 508758960b8d89 Chuanhua Han 2024-05-29 4387 508758960b8d89 Chuanhua Han 2024-05-29 4388 if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start))) 508758960b8d89 Chuanhua Han 2024-05-29 4389 goto check_folio; 508758960b8d89 Chuanhua Han 2024-05-29 4390 if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end))) 508758960b8d89 Chuanhua Han 2024-05-29 4391 goto check_folio; 508758960b8d89 Chuanhua Han 2024-05-29 4392 508758960b8d89 Chuanhua Han 2024-05-29 4393 folio_ptep = vmf->pte - idx; 508758960b8d89 Chuanhua Han 2024-05-29 4394 folio_pte = ptep_get(folio_ptep); 508758960b8d89 Chuanhua Han 2024-05-29 4395 if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) || 508758960b8d89 Chuanhua Han 2024-05-29 4396 
swap_pte_batch(folio_ptep, nr, folio_pte) != nr) 508758960b8d89 Chuanhua Han 2024-05-29 4397 goto check_folio; 508758960b8d89 Chuanhua Han 2024-05-29 4398 508758960b8d89 Chuanhua Han 2024-05-29 4399 page_idx = idx; 508758960b8d89 Chuanhua Han 2024-05-29 4400 address = folio_start; 508758960b8d89 Chuanhua Han 2024-05-29 4401 ptep = folio_ptep; 508758960b8d89 Chuanhua Han 2024-05-29 4402 nr_pages = nr; 508758960b8d89 Chuanhua Han 2024-05-29 4403 entry = folio->swap; 508758960b8d89 Chuanhua Han 2024-05-29 4404 page = &folio->page; 508758960b8d89 Chuanhua Han 2024-05-29 4405 } 508758960b8d89 Chuanhua Han 2024-05-29 4406 508758960b8d89 Chuanhua Han 2024-05-29 4407 check_folio: 78fbe906cc900b David Hildenbrand 2022-05-09 4408 /* 78fbe906cc900b David Hildenbrand 2022-05-09 4409 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte 78fbe906cc900b David Hildenbrand 2022-05-09 4410 * must never point at an anonymous page in the swapcache that is 78fbe906cc900b David Hildenbrand 2022-05-09 4411 * PG_anon_exclusive. Sanity check that this holds and especially, that 78fbe906cc900b David Hildenbrand 2022-05-09 4412 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity 78fbe906cc900b David Hildenbrand 2022-05-09 4413 * check after taking the PT lock and making sure that nobody 78fbe906cc900b David Hildenbrand 2022-05-09 4414 * concurrently faulted in this page and set PG_anon_exclusive. 78fbe906cc900b David Hildenbrand 2022-05-09 4415 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4416) BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio)); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4417) BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page)); 78fbe906cc900b David Hildenbrand 2022-05-09 4418 1493a1913e34b0 David Hildenbrand 2022-05-09 4419 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4420 * Check under PT lock (to protect against concurrent fork() sharing 1493a1913e34b0 David Hildenbrand 2022-05-09 4421 * the swap entry concurrently) for certainly exclusive pages. 1493a1913e34b0 David Hildenbrand 2022-05-09 4422 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4423) if (!folio_test_ksm(folio)) { 1493a1913e34b0 David Hildenbrand 2022-05-09 4424 exclusive = pte_swp_exclusive(vmf->orig_pte); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4425) if (folio != swapcache) { 1493a1913e34b0 David Hildenbrand 2022-05-09 4426 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4427 * We have a fresh page that is not exposed to the 1493a1913e34b0 David Hildenbrand 2022-05-09 4428 * swapcache -> certainly exclusive. 1493a1913e34b0 David Hildenbrand 2022-05-09 4429 */ 1493a1913e34b0 David Hildenbrand 2022-05-09 4430 exclusive = true; 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4431) } else if (exclusive && folio_test_writeback(folio) && eacde32757c756 Miaohe Lin 2022-05-19 4432 data_race(si->flags & SWP_STABLE_WRITES)) { 1493a1913e34b0 David Hildenbrand 2022-05-09 4433 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4434 * This is tricky: not all swap backends support 1493a1913e34b0 David Hildenbrand 2022-05-09 4435 * concurrent page modifications while under writeback. 
1493a1913e34b0 David Hildenbrand 2022-05-09 4436 * 1493a1913e34b0 David Hildenbrand 2022-05-09 4437 * So if we stumble over such a page in the swapcache 1493a1913e34b0 David Hildenbrand 2022-05-09 4438 * we must not set the page exclusive, otherwise we can 1493a1913e34b0 David Hildenbrand 2022-05-09 4439 * map it writable without further checks and modify it 1493a1913e34b0 David Hildenbrand 2022-05-09 4440 * while still under writeback. 1493a1913e34b0 David Hildenbrand 2022-05-09 4441 * 1493a1913e34b0 David Hildenbrand 2022-05-09 4442 * For these problematic swap backends, simply drop the 1493a1913e34b0 David Hildenbrand 2022-05-09 4443 * exclusive marker: this is perfectly fine as we start 1493a1913e34b0 David Hildenbrand 2022-05-09 4444 * writeback only if we fully unmapped the page and 1493a1913e34b0 David Hildenbrand 2022-05-09 4445 * there are no unexpected references on the page after 1493a1913e34b0 David Hildenbrand 2022-05-09 4446 * unmapping succeeded. After fully unmapped, no 1493a1913e34b0 David Hildenbrand 2022-05-09 4447 * further GUP references (FOLL_GET and FOLL_PIN) can 1493a1913e34b0 David Hildenbrand 2022-05-09 4448 * appear, so dropping the exclusive marker and mapping 1493a1913e34b0 David Hildenbrand 2022-05-09 4449 * it only R/O is fine. 1493a1913e34b0 David Hildenbrand 2022-05-09 4450 */ 1493a1913e34b0 David Hildenbrand 2022-05-09 4451 exclusive = false; 1493a1913e34b0 David Hildenbrand 2022-05-09 4452 } 1493a1913e34b0 David Hildenbrand 2022-05-09 4453 } 1493a1913e34b0 David Hildenbrand 2022-05-09 4454 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4455 /* 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4456 * Some architectures may have to restore extra metadata to the page 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4457 * when reading from swap. This metadata may be indexed by swap entry 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4458 * so this must be called before swap_free(). 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4459 */ f238b8c33c6738 Barry Song 2024-03-23 4460 arch_swap_restore(folio_swap(entry, folio), folio); 6dca4ac6fc91fd Peter Collingbourne 2023-05-22 4461 8c7c6e34a1256a KAMEZAWA Hiroyuki 2009-01-07 4462 /* c145e0b47c77eb David Hildenbrand 2022-03-24 4463 * Remove the swap entry and conditionally try to free up the swapcache. c145e0b47c77eb David Hildenbrand 2022-03-24 4464 * We're already holding a reference on the page but haven't mapped it c145e0b47c77eb David Hildenbrand 2022-03-24 4465 * yet. 8c7c6e34a1256a KAMEZAWA Hiroyuki 2009-01-07 4466 */ 508758960b8d89 Chuanhua Han 2024-05-29 @4467 swap_free_nr(entry, nr_pages); ^^^^^^^^ Smatch warning. The code is a bit complicated so it could be a false positive. 
a160e5377b55bc Matthew Wilcox (Oracle 2022-09-02 4468) if (should_try_to_free_swap(folio, vma, vmf->flags)) a160e5377b55bc Matthew Wilcox (Oracle 2022-09-02 4469) folio_free_swap(folio); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4470 508758960b8d89 Chuanhua Han 2024-05-29 4471 add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); 508758960b8d89 Chuanhua Han 2024-05-29 4472 add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4473 pte = mk_pte(page, vma->vm_page_prot); c18160dba5ff63 Barry Song 2024-06-02 4474 if (pte_swp_soft_dirty(vmf->orig_pte)) c18160dba5ff63 Barry Song 2024-06-02 4475 pte = pte_mksoft_dirty(pte); c18160dba5ff63 Barry Song 2024-06-02 4476 if (pte_swp_uffd_wp(vmf->orig_pte)) c18160dba5ff63 Barry Song 2024-06-02 4477 pte = pte_mkuffd_wp(pte); c145e0b47c77eb David Hildenbrand 2022-03-24 4478 c145e0b47c77eb David Hildenbrand 2022-03-24 4479 /* 1493a1913e34b0 David Hildenbrand 2022-05-09 4480 * Same logic as in do_wp_page(); however, optimize for pages that are 1493a1913e34b0 David Hildenbrand 2022-05-09 4481 * certainly not shared either because we just allocated them without 1493a1913e34b0 David Hildenbrand 2022-05-09 4482 * exposing them to the swapcache or because the swap entry indicates 1493a1913e34b0 David Hildenbrand 2022-05-09 4483 * exclusivity. c145e0b47c77eb David Hildenbrand 2022-03-24 4484 */ 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4485) if (!folio_test_ksm(folio) && 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4486) (exclusive || folio_ref_count(folio) == 1)) { c18160dba5ff63 Barry Song 2024-06-02 4487 if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) && 20dfa5b7adc5a1 Barry Song 2024-06-08 4488 !pte_needs_soft_dirty_wp(vma, pte)) { c18160dba5ff63 Barry Song 2024-06-02 4489 pte = pte_mkwrite(pte, vma); 6c287605fd5646 David Hildenbrand 2022-05-09 4490 if (vmf->flags & FAULT_FLAG_WRITE) { c18160dba5ff63 Barry Song 2024-06-02 4491 pte = pte_mkdirty(pte); 82b0f8c39a3869 Jan Kara 2016-12-14 4492 vmf->flags &= ~FAULT_FLAG_WRITE; 6c287605fd5646 David Hildenbrand 2022-05-09 4493 } c18160dba5ff63 Barry Song 2024-06-02 4494 } 14f9135d547060 David Hildenbrand 2022-05-09 4495 rmap_flags |= RMAP_EXCLUSIVE; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4496 } 508758960b8d89 Chuanhua Han 2024-05-29 4497 folio_ref_add(folio, nr_pages - 1); 508758960b8d89 Chuanhua Han 2024-05-29 4498 flush_icache_pages(vma, page, nr_pages); 508758960b8d89 Chuanhua Han 2024-05-29 4499 vmf->orig_pte = pte_advance_pfn(pte, page_idx); 0bcac06f27d752 Minchan Kim 2017-11-15 4500 0bcac06f27d752 Minchan Kim 2017-11-15 4501 /* ksm created a completely new copy */ d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4502) if (unlikely(folio != swapcache && swapcache)) { 15bde4abab734c Barry Song 2024-06-18 4503 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE); 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4504) folio_add_lru_vma(folio, vma); 9ae2feacedde16 Barry Song 2024-06-18 4505 } else if (!folio_test_anon(folio)) { 9ae2feacedde16 Barry Song 2024-06-18 4506 /* 684d098daf0b3a Chuanhua Han 2024-07-26 4507 * We currently only expect small !anon folios which are either 684d098daf0b3a Chuanhua Han 2024-07-26 4508 * fully exclusive or fully shared, or new allocated large folios 684d098daf0b3a Chuanhua Han 2024-07-26 4509 * which are fully exclusive. If we ever get large folios within 684d098daf0b3a Chuanhua Han 2024-07-26 4510 * swapcache here, we have to be careful. 
9ae2feacedde16 Barry Song 2024-06-18 4511 */ 684d098daf0b3a Chuanhua Han 2024-07-26 4512 VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); 9ae2feacedde16 Barry Song 2024-06-18 4513 VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); 9ae2feacedde16 Barry Song 2024-06-18 4514 folio_add_new_anon_rmap(folio, vma, address, rmap_flags); 0bcac06f27d752 Minchan Kim 2017-11-15 4515 } else { 508758960b8d89 Chuanhua Han 2024-05-29 4516 folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address, b832a354d787bf David Hildenbrand 2023-12-20 4517 rmap_flags); 00501b531c4723 Johannes Weiner 2014-08-08 4518 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4519 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4520) VM_BUG_ON(!folio_test_anon(folio) || 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4521) (pte_write(pte) && !PageAnonExclusive(page))); 508758960b8d89 Chuanhua Han 2024-05-29 4522 set_ptes(vma->vm_mm, address, ptep, pte, nr_pages); 508758960b8d89 Chuanhua Han 2024-05-29 4523 arch_do_swap_page_nr(vma->vm_mm, vma, address, 508758960b8d89 Chuanhua Han 2024-05-29 4524 pte, pte, nr_pages); 1eba86c096e35e Pasha Tatashin 2022-01-14 4525 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4526) folio_unlock(folio); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4527) if (folio != swapcache && swapcache) { 4969c1192d15af Andrea Arcangeli 2010-09-09 4528 /* 4969c1192d15af Andrea Arcangeli 2010-09-09 4529 * Hold the lock to avoid the swap entry to be reused 4969c1192d15af Andrea Arcangeli 2010-09-09 4530 * until we take the PT lock for the pte_same() check 4969c1192d15af Andrea Arcangeli 2010-09-09 4531 * (to avoid false positives from pte_same). For 4969c1192d15af Andrea Arcangeli 2010-09-09 4532 * further safety release the lock after the swap_free 4969c1192d15af Andrea Arcangeli 2010-09-09 4533 * so that the swap count won't change under a 4969c1192d15af Andrea Arcangeli 2010-09-09 4534 * parallel locked swapcache. 
4969c1192d15af Andrea Arcangeli 2010-09-09 4535 */ d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4536) folio_unlock(swapcache); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4537) folio_put(swapcache); 4969c1192d15af Andrea Arcangeli 2010-09-09 4538 } c475a8ab625d56 Hugh Dickins 2005-06-21 4539 82b0f8c39a3869 Jan Kara 2016-12-14 4540 if (vmf->flags & FAULT_FLAG_WRITE) { 2994302bc8a171 Jan Kara 2016-12-14 4541 ret |= do_wp_page(vmf); 61469f1d51777f Hugh Dickins 2008-03-04 4542 if (ret & VM_FAULT_ERROR) 61469f1d51777f Hugh Dickins 2008-03-04 4543 ret &= VM_FAULT_ERROR; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4544 goto out; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4545 } ^1da177e4c3f41 Linus Torvalds 2005-04-16 4546 ^1da177e4c3f41 Linus Torvalds 2005-04-16 4547 /* No need to invalidate - it was non-present before */ 508758960b8d89 Chuanhua Han 2024-05-29 4548 update_mmu_cache_range(vmf, vma, address, ptep, nr_pages); 65500d234e74fc Hugh Dickins 2005-10-29 4549 unlock: 3db82b9374ca92 Hugh Dickins 2023-06-08 4550 if (vmf->pte) 82b0f8c39a3869 Jan Kara 2016-12-14 4551 pte_unmap_unlock(vmf->pte, vmf->ptl); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4552 out: 13ddaf26be324a Kairui Song 2024-02-07 4553 /* Clear the swap cache pin for direct swapin after PTL unlock */ 13ddaf26be324a Kairui Song 2024-02-07 4554 if (need_clear_cache) 684d098daf0b3a Chuanhua Han 2024-07-26 4555 swapcache_clear_nr(si, entry, nr_pages); 2799e77529c2a2 Miaohe Lin 2021-06-28 4556 if (si) 2799e77529c2a2 Miaohe Lin 2021-06-28 4557 put_swap_device(si); ^1da177e4c3f41 Linus Torvalds 2005-04-16 4558 return ret; b81074800b98ac Kirill Korotaev 2005-05-16 4559 out_nomap: 3db82b9374ca92 Hugh Dickins 2023-06-08 4560 if (vmf->pte) 82b0f8c39a3869 Jan Kara 2016-12-14 4561 pte_unmap_unlock(vmf->pte, vmf->ptl); bc43f75cd98158 Johannes Weiner 2009-04-30 4562 out_page: 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4563) folio_unlock(folio); 4779cb31c0ee3b Andi Kleen 2009-10-14 4564 out_release: 63ad4add382305 Matthew Wilcox (Oracle 2022-09-02 4565) folio_put(folio); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4566) if (folio != swapcache && swapcache) { d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4567) folio_unlock(swapcache); d4f9565ae598bd Matthew Wilcox (Oracle 2022-09-02 4568) folio_put(swapcache); 4969c1192d15af Andrea Arcangeli 2010-09-09 4569 } 13ddaf26be324a Kairui Song 2024-02-07 4570 if (need_clear_cache) 684d098daf0b3a Chuanhua Han 2024-07-26 4571 swapcache_clear_nr(si, entry, nr_pages); 2799e77529c2a2 Miaohe Lin 2021-06-28 4572 if (si) 2799e77529c2a2 Miaohe Lin 2021-06-28 4573 put_swap_device(si); 65500d234e74fc Hugh Dickins 2005-10-29 4574 return ret; ^1da177e4c3f41 Linus Torvalds 2005-04-16 4575 }
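One possible way to address the smatch warning, assuming Dan's path analysis and the structure of the quoted patch (this is only a sketch, not necessarily the fix the authors adopt, and the warning may well be a false positive as noted): make the `nr_pages` initialization explicit in the SWP_SYNCHRONOUS_IO large-folio branch before it jumps to `check_folio`, so every path reaching `swap_free_nr()` has a defined count.

```c
	/* allocated large folios for SWP_SYNCHRONOUS_IO */
	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
		unsigned long nr = folio_nr_pages(folio);
		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
		pte_t *folio_ptep = vmf->pte - idx;

		if (!can_swapin_thp(vmf, folio_ptep, nr))
			goto out_nomap;

		page_idx = idx;
		address = folio_start;
		ptep = folio_ptep;
		nr_pages = nr;	/* added: make the count explicit on this path too */
		goto check_folio;
	}
```

On this path the folio can only have come from alloc_swap_folio(), where nr_pages was already set, so the extra assignment should be a functional no-op that simply makes the initialization visible to static analysis.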
On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote:
> for this zRAM case, it is a new allocated large folio, only
> while all conditions are met, we will allocate and map
> the whole folio. you can check can_swapin_thp() and
> thp_swap_suitable_orders().

YOU ARE DOING THIS WRONGLY!

All of you anonymous memory people are utterly fixated on TLBs AND THIS
IS WRONG. Yes, TLB performance is important, particularly with crappy
ARM designs, which I know a lot of you are paid to work on. But you
seem to think this is the only consideration, and you're making bad
design choices as a result. It's overly complicated, and you're leaving
performance on the table.

Look back at the results Ryan showed in the early days of working on
large anonymous folios. Half of the performance win on his system came
from using larger TLBs. But the other half came from _reduced software
overhead_. The LRU lock is a huge problem, and using large folios cuts
the length of the LRU list, hence LRU lock hold time.

Your _own_ data on how hard it is to get hold of a large folio due to
fragmentation should be enough to convince you that the more large folios
in the system, the better the whole system runs. We should not decline to
allocate large folios just because they can't be mapped with a single TLB!
On Tue, Jul 30, 2024 at 3:13 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote:
> > for this zRAM case, it is a new allocated large folio, only
> > while all conditions are met, we will allocate and map
> > the whole folio. you can check can_swapin_thp() and
> > thp_swap_suitable_orders().
>
> YOU ARE DOING THIS WRONGLY!
>
> All of you anonymous memory people are utterly fixated on TLBs AND THIS
> IS WRONG. Yes, TLB performance is important, particularly with crappy
> ARM designs, which I know a lot of you are paid to work on. But you
> seem to think this is the only consideration, and you're making bad
> design choices as a result. It's overly complicated, and you're leaving
> performance on the table.
>
> Look back at the results Ryan showed in the early days of working on
> large anonymous folios. Half of the performance win on his system came
> from using larger TLBs. But the other half came from _reduced software
> overhead_. The LRU lock is a huge problem, and using large folios cuts
> the length of the LRU list, hence LRU lock hold time.
>
> Your _own_ data on how hard it is to get hold of a large folio due to
> fragmentation should be enough to convince you that the more large folios
> in the system, the better the whole system runs. We should not decline to
> allocate large folios just because they can't be mapped with a single TLB!

I am not convinced. For a newly allocated large folio, even alloc_anon_folio()
in do_anonymous_page() does exactly the same thing:

alloc_anon_folio()
{
	/*
	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
	 * for this vma. Then filter out the orders that can't be allocated over
	 * the faulting address and still be fully contained in the vma.
	 */
	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
}

You are not going to allocate an mTHP for an unaligned address for a new
page fault. Please point out where it is wrong.

Thanks
Barry
On Tue, Jul 30, 2024 at 8:03 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Jul 30, 2024 at 3:13 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote:
> > > for this zRAM case, it is a new allocated large folio, only
> > > while all conditions are met, we will allocate and map
> > > the whole folio. you can check can_swapin_thp() and
> > > thp_swap_suitable_orders().
> >
> > YOU ARE DOING THIS WRONGLY!
> >
> > All of you anonymous memory people are utterly fixated on TLBs AND THIS
> > IS WRONG. Yes, TLB performance is important, particularly with crappy
> > ARM designs, which I know a lot of you are paid to work on. But you
> > seem to think this is the only consideration, and you're making bad
> > design choices as a result. It's overly complicated, and you're leaving
> > performance on the table.
> >
> > Look back at the results Ryan showed in the early days of working on
> > large anonymous folios. Half of the performance win on his system came
> > from using larger TLBs. But the other half came from _reduced software
> > overhead_. The LRU lock is a huge problem, and using large folios cuts
> > the length of the LRU list, hence LRU lock hold time.
> >
> > Your _own_ data on how hard it is to get hold of a large folio due to
> > fragmentation should be enough to convince you that the more large folios
> > in the system, the better the whole system runs. We should not decline to
> > allocate large folios just because they can't be mapped with a single TLB!
>
> I am not convinced. for a new allocated large folio, even alloc_anon_folio()
> of do_anonymous_page() does the exactly same thing
>
> alloc_anon_folio()
> {
> 	/*
> 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> 	 * for this vma. Then filter out the orders that can't be allocated over
> 	 * the faulting address and still be fully contained in the vma.
> 	 */
> 	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> 			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
> 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> }
>
> you are not going to allocate a mTHP for an unaligned address for a new
> PF.
> Please point out where it is wrong.

Let's assume we have a folio with the virtual address range
0x500000000000 ~ 0x500000000000 + 64KB, and it is swapped out to
0x10000 ~ 0x10000 + 64KB.

The current code will swap it in as an mTHP if a page fault occurs at any
address within (0x500000000000 ~ 0x500000000000 + 64KB). In this case, the
mTHP enjoys both the decreased TLB pressure and the reduced overhead such as
the LRU lock, etc. So it sounds like we have lost nothing in this case.

But if the folio is mremap()ed to an unaligned address like
(0x600000000000 + 16KB ~ 0x600000000000 + 80KB) while its swap offset is
still (0x10000 ~ 0x10000 + 64KB), the current code won't swap it in as an
mTHP. Sounds like a loss?

If this is the performance problem you are trying to address, my point is
that it is not worth increasing the complexity for it at this stage, though
it might be doable. We once tracked hundreds of phones running apps randomly
for a couple of days, and we didn't encounter such a case. So this is pretty
much a corner case.

If your concern is more than this, for example, if you want to swap in large
folios even when the swap entries are completely not contiguous, this is a
different story.
I agree this is a potential optimization direction, but in that case you
still need to find an aligned boundary to handle the page fault, just as
do_anonymous_page() does; otherwise, you may end up with all kinds of
pointless intersections where PFs can cover the address ranges of other
PFs, which makes PTE checks such as pte_range_none() completely disordered:

static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
	....
	/*
	 * Find the highest order where the aligned range is completely
	 * pte_none(). Note that all remaining orders will be completely
	 * pte_none().
	 */
	order = highest_order(orders);
	while (orders) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		if (pte_range_none(pte + pte_index(addr), 1 << order))
			break;
		order = next_order(&orders, order);
	}
}

>
> Thanks
> Barry
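[Editor's note] To make the alignment argument above concrete, here is a
minimal, self-contained sketch in plain C (userspace-style, not kernel
code). The helper name swap_offset_suitable() and the example values (an
order-4, 64KB folio backed by a swap-slot run starting at offset 0x10) are
illustrative assumptions; the check itself mirrors the
"(addr >> PAGE_SHIFT) % nr == swp_offset % nr" condition from
thp_swap_suitable_orders() in the patch, and the ALIGN_DOWN step mirrors
alloc_anon_folio().

#include <stdio.h>
#include <stdbool.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define ALIGN_DOWN(x, a)	((x) & ~((unsigned long)(a) - 1))

/* Illustrative stand-in for the offset check in thp_swap_suitable_orders() */
static bool swap_offset_suitable(unsigned long addr, unsigned long swp_offset,
				 int order)
{
	unsigned long nr = 1UL << order;

	/*
	 * The faulting VA and the folio's first swap slot must land on the
	 * same sub-page index within an nr-page window; otherwise the nr
	 * contiguous swap slots cannot back an nr-page-aligned mapping.
	 */
	return (addr >> PAGE_SHIFT) % nr == swp_offset % nr;
}

int main(void)
{
	int order = 4;				/* 64KB folio = 16 pages */
	unsigned long swp_offset = 0x10;	/* aligned run of swap slots */

	/* Case 1: folio still mapped at its original, aligned VA */
	unsigned long aligned_va = 0x500000000000UL;
	/* Case 2: the same folio after mremap() to a VA offset by 16KB */
	unsigned long moved_va = 0x600000000000UL + 4 * PAGE_SIZE;

	printf("aligned VA:   swap-in as mTHP? %s\n",
	       swap_offset_suitable(aligned_va, swp_offset, order) ? "yes" : "no");
	printf("mremap-ed VA: swap-in as mTHP? %s\n",
	       swap_offset_suitable(moved_va, swp_offset, order) ? "yes" : "no");

	/* The fault address is always rounded down to an order-aligned start */
	unsigned long fault_addr = aligned_va + 5 * PAGE_SIZE;
	printf("folio start for a fault at +20KB: %#lx\n",
	       ALIGN_DOWN(fault_addr, PAGE_SIZE << order));
	return 0;
}

Under these assumptions the aligned case passes the check for order 4,
while the mremap-ed case falls back to lower orders (ultimately order 0),
which is exactly the trade-off being debated in this thread.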
On 29/07/2024 16:13, Matthew Wilcox wrote:
> On Tue, Jul 30, 2024 at 01:11:31AM +1200, Barry Song wrote:
>> for this zRAM case, it is a new allocated large folio, only
>> while all conditions are met, we will allocate and map
>> the whole folio. you can check can_swapin_thp() and
>> thp_swap_suitable_orders().
>
> YOU ARE DOING THIS WRONGLY!

I've only scanned the preceding thread, but I think you're talking about the
design decision to only allocate large folios that are naturally aligned in
virtual address space, and you're arguing to remove that restriction?

The main reason we gave ourselves that constraint for anon mTHP was that
allowing unaligned folios would create the possibility of wandering off the
end of the PTE table and add significant complexity to manage neighbouring
PTE tables and their respective PTLs. If the proposal is to start doing
this, then I don't agree with that approach.

>
> All of you anonymous memory people are utterly fixated on TLBs AND THIS
> IS WRONG.  Yes, TLB performance is important, particularly with crappy
> ARM designs, which I know a lot of you are paid to work on.  But you
> seem to think this is the only consideration, and you're making bad
> design choices as a result.  It's overly complicated, and you're leaving
> performance on the table.
>
> Look back at the results Ryan showed in the early days of working on
> large anonymous folios.  Half of the performance win on his system came
> from using larger TLBs.  But the other half came from _reduced software
> overhead_.

I would just point out that I think the results you are referring to are for
the kernel compilation workload, and yes, this is indeed what I observed. But
kernel compilation is a bit of an outlier since it does a huge amount of
fork/exec, so the kernel spends a lot of time fiddling with page tables and
faulting. The vast majority of the reduced SW overhead comes from
significantly reducing the number of faults, because we map more pages per
fault.

In my experience, though, most workloads that I've tested tend to set up
their memory at the start and keep it static forever more, which means those
workloads benefit mostly from the TLB side - there are very few existing SW
overheads to actually reduce.

> The LRU lock is a huge problem, and using large folios cuts
> the length of the LRU list, hence LRU lock hold time.

I'm sure this is true and you have lots more experience and data than me,
and it makes intuitive sense. But I've never personally seen this in any of
the workloads that I've benchmarked.

Thanks,
Ryan

>
> Your _own_ data on how hard it is to get hold of a large folio due to
> fragmentation should be enough to convince you that the more large folios
> in the system, the better the whole system runs.  We should not decline to
> allocate large folios just because they can't be mapped with a single TLB!
>
diff --git a/mm/memory.c b/mm/memory.c
index 833d2cad6eb2..14048e9285d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	return VM_FAULT_SIGBUS;
 }
 
+/*
+ * check a range of PTEs are completely swap entries with
+ * contiguous swap offsets and the same SWAP_HAS_CACHE.
+ * ptep must be first one in the range
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+	struct swap_info_struct *si;
+	unsigned long addr;
+	swp_entry_t entry;
+	pgoff_t offset;
+	char has_cache;
+	int idx, i;
+	pte_t pte;
+
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+	idx = (vmf->address - addr) / PAGE_SIZE;
+	pte = ptep_get(ptep);
+
+	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
+		return false;
+	entry = pte_to_swp_entry(pte);
+	offset = swp_offset(entry);
+	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
+		return false;
+
+	si = swp_swap_info(entry);
+	has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
+	for (i = 1; i < nr_pages; i++) {
+		/*
+		 * while allocating a large folio and doing swap_read_folio for the
+		 * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
+		 * doesn't have swapcache. We need to ensure all PTEs have no cache
+		 * as well, otherwise, we might go to swap devices while the content
+		 * is in swapcache
+		 */
+		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
+			return false;
+	}
+
+	return true;
+}
+
+static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
+		unsigned long addr, unsigned long orders)
+{
+	int order, nr;
+
+	order = highest_order(orders);
+
+	/*
+	 * To swap-in a THP with nr pages, we require its first swap_offset
+	 * is aligned with nr. This can filter out most invalid entries.
+	 */
+	while (orders) {
+		nr = 1 << order;
+		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+#else
+static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+	return false;
+}
+#endif
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long orders;
+	struct folio *folio;
+	unsigned long addr;
+	swp_entry_t entry;
+	spinlock_t *ptl;
+	pte_t *pte;
+	gfp_t gfp;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (unlikely(userfaultfd_armed(vma)))
+		goto fallback;
+
+	/*
+	 * A large swapped out folio could be partially or fully in zswap. We
+	 * lack handling for such cases, so fallback to swapping in order-0
+	 * folio.
+	 */
+	if (!zswap_never_enabled())
+		goto fallback;
+
+	entry = pte_to_swp_entry(vmf->orig_pte);
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * and suitable for swapping THP.
+	 */
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
+	if (unlikely(!pte))
+		goto fallback;
+
+	/*
+	 * For do_swap_page, find the highest order where the aligned range is
+	 * completely swap entries with contiguous swap offsets.
+	 */
+	order = highest_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	pte_unmap_unlock(pte, ptl);
+
+	/* Try allocating the highest of the remaining orders. */
+	gfp = vma_thp_gfp_mask(vma);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio)
+			return folio;
+		order = next_order(&orders, order);
+	}
+
+fallback:
+#endif
+	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/*
-			 * Prevent parallel swapin from proceeding with
-			 * the cache flag. Otherwise, another thread may
-			 * finish swapin first, free the entry, and swapout
-			 * reusing the same entry. It's undetectable as
-			 * pte_same() returns true due to entry reuse.
-			 */
-			if (swapcache_prepare(entry)) {
-				/* Relax a bit to prevent rapid repeated page faults */
-				schedule_timeout_uninterruptible(1);
-				goto out;
-			}
-			need_clear_cache = true;
-
 			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
+			folio = alloc_swap_folio(vmf);
 			page = &folio->page;
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
 
+				nr_pages = folio_nr_pages(folio);
+				if (folio_test_large(folio))
+					entry.val = ALIGN_DOWN(entry.val, nr_pages);
+				/*
+				 * Prevent parallel swapin from proceeding with
+				 * the cache flag. Otherwise, another thread may
+				 * finish swapin first, free the entry, and swapout
+				 * reusing the same entry. It's undetectable as
+				 * pte_same() returns true due to entry reuse.
+				 */
+				if (swapcache_prepare_nr(entry, nr_pages)) {
+					/* Relax a bit to prevent rapid repeated page faults */
+					schedule_timeout_uninterruptible(1);
+					goto out_page;
+				}
+				need_clear_cache = true;
+
 				if (mem_cgroup_swapin_charge_folio(folio,
 							vma->vm_mm, GFP_KERNEL,
 							entry)) {
 					ret = VM_FAULT_OOM;
 					goto out_page;
 				}
-				mem_cgroup_swapin_uncharge_swap(entry);
+				mem_cgroup_swapin_uncharge_swap_nr(entry, nr_pages);
 
 				shadow = get_shadow_from_swap_cache(entry);
 				if (shadow)
@@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_nomap;
 	}
 
+	/* allocated large folios for SWP_SYNCHRONOUS_IO */
+	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
+		unsigned long nr = folio_nr_pages(folio);
+		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
+		pte_t *folio_ptep = vmf->pte - idx;
+
+		if (!can_swapin_thp(vmf, folio_ptep, nr))
+			goto out_nomap;
+
+		page_idx = idx;
+		address = folio_start;
+		ptep = folio_ptep;
+		goto check_folio;
+	}
+
 	nr_pages = 1;
 	page_idx = 0;
 	address = vmf->address;
@@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_add_lru_vma(folio, vma);
 	} else if (!folio_test_anon(folio)) {
 		/*
-		 * We currently only expect small !anon folios, which are either
-		 * fully exclusive or fully shared. If we ever get large folios
-		 * here, we have to be careful.
+		 * We currently only expect small !anon folios which are either
+		 * fully exclusive or fully shared, or new allocated large folios
+		 * which are fully exclusive. If we ever get large folios within
+		 * swapcache here, we have to be careful.
 		 */
-		VM_WARN_ON_ONCE(folio_test_large(folio));
+		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
 		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
 	} else {
@@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 out:
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache)
-		swapcache_clear(si, entry);
+		swapcache_clear_nr(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			folio_put(swapcache);
 	}
 	if (need_clear_cache)
-		swapcache_clear(si, entry);
+		swapcache_clear_nr(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
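[Editor's note] For readers following the fault-address adjustment in the
SWP_SYNCHRONOUS_IO hunk above, here is a minimal userspace-style sketch in
plain C of the same arithmetic. The numbers (an order-4, 64KB folio and a
fault five pages into it) are illustrative assumptions; only the
folio_start / idx computation itself comes from the diff.

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define ALIGN_DOWN(x, a)	((x) & ~((unsigned long)(a) - 1))

int main(void)
{
	unsigned long nr = 16;		/* order-4 folio: 16 pages / 64KB */
	unsigned long fault_addr = 0x500000000000UL + 5 * PAGE_SIZE;

	/* Same arithmetic as the large-folio path added to do_swap_page() */
	unsigned long folio_start = ALIGN_DOWN(fault_addr, nr * PAGE_SIZE);
	unsigned long idx = (fault_addr - folio_start) / PAGE_SIZE;

	/*
	 * The whole folio is mapped starting at folio_start, so the page that
	 * was actually faulted ends up at folio_start + idx * PAGE_SIZE, i.e.
	 * exactly at fault_addr. The PTE pointer is rewound the same way
	 * (folio_ptep = vmf->pte - idx) before can_swapin_thp() is called.
	 */
	printf("fault address : %#lx\n", fault_addr);
	printf("folio start   : %#lx\n", folio_start);
	printf("page_idx      : %lu\n", idx);
	printf("faulted page  : %#lx\n", folio_start + idx * PAGE_SIZE);
	return 0;
}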