
[5.10/5.15,LTS] Question on mlock race between ksm and cow

Message ID 4263470f-77f8-47e2-be03-e1f8d790999e@huawei.com (mailing list archive)
State New
Series [5.10/5.15,LTS] Question on mlock race between ksm and cow

Commit Message

mawupeng Aug. 15, 2023, 7:07 a.m. UTC
Our syzbot instance reports a bad page state warning: the mlocked flag is
not cleared before the page is freed.

During try_to_merge_one_page() in ksm, kpage is mlocked again if the vma
has VM_LOCKED set; however, that flag has just been cleared by
wp_page_copy(). Since the mapcount of this kpage is already -1, nobody can
clear its mlocked flag before the page is freed, which leads to the bad
page report.

Since mlock was reworked in v5.18-rc1[1], the latest kernels do not have
this problem; the 5.10/5.15 LTS kernels do.

Here is the simplified calltrace (two CPUs racing):

try_to_merge_one_page                           wp_page_copy

try_to_merge_one_page
  // clear page mlocked during rmap removal
  replace_page
    page_remove_rmap
      if (unlikely(PageMlocked(page)))
        clear_page_mlock(compound_head(page));

  if ((vma->vm_flags & VM_LOCKED) && kpage && !err)
                                                lock_page(old_page);
                                                if (vma->vm_flags & VM_LOCKED)
                                                  if (PageMlocked(old_page))
                                                    munlock_vma_page(old_page);
    if (!PageMlocked(kpage))
      lock_page(kpage);
      mlock_vma_page(kpage);
      unlock_page(kpage);
-------------------------------------------------
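
For reference, the wp_page_copy() side of this race is the old_page cleanup
at the end of that function; in the 5.10/5.15 trees it looks roughly like
the following sketch (paraphrased rather than quoted verbatim from
mm/memory.c). Note that page_remove_rmap(old_page, false) has already run
earlier in wp_page_copy(), and it only calls clear_page_mlock() once the
last mapping is gone:

	if (old_page) {
		/* don't let another task keep the mlocked page */
		if (page_copied && (vma->vm_flags & VM_LOCKED)) {
			lock_page(old_page);	/* LRU manipulation */
			if (PageMlocked(old_page))
				munlock_vma_page(old_page);
			unlock_page(old_page);
		}
		...
		put_page(old_page);
	}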

This problem can be easily reproduced with the following modifications:
1. enable the following config options:
  a) CONFIG_DEBUG_VM
  b) CONFIG_KSM
  c) CONFIG_MEMORY_FAILURE

2. add delay in try_to_merge_one_page

3. run syzbot with the following program (a rough C equivalent is sketched after the listing):

madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc000, 0xc)
mlockall(0x1)
mlockall(0x5)
madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc04c, 0x65)

madvise(&(0x7f0000ff5000/0x4000)=nil, 0x4000, 0xc)
mlockall(0x1)
mlockall(0xa5)
mlockall(0x0)
munlock(&(0x7f0000ff7000/0x4000)=nil, 0x4000)
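
A rough single-process C approximation of the syz program above, as an
illustrative sketch only: the numeric decodings assume the usual uapi
values, 0xc = MADV_MERGEABLE, 0x1 = MCL_CURRENT, 0x5 = MCL_CURRENT|MCL_ONFAULT,
0x65 = MADV_SOFT_OFFLINE. The mmap()/memset() setup is added here so ksmd
has something to merge, and the nonsensical fuzzer calls such as
mlockall(0xa5)/mlockall(0x0) are dropped since they just fail with EINVAL.
Hitting the race still needs the mdelay() patch from step 2, ksm enabled via
/sys/kernel/mm/ksm/run, and CAP_SYS_ADMIN:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101			/* 0x65 */
#endif
#ifndef MCL_ONFAULT
#define MCL_ONFAULT 4
#endif

int main(void)
{
	size_t len = 0xc000;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 0x55, len);			/* give KSM identical pages */

	madvise(p, len, MADV_MERGEABLE);	/* madvise(..., 0xc) */
	mlockall(MCL_CURRENT);			/* mlockall(0x1) */
	mlockall(MCL_CURRENT | MCL_ONFAULT);	/* mlockall(0x5) */

	/* CAP_SYS_ADMIN + CONFIG_MEMORY_FAILURE: soft offline may unmap
	 * the page even though it is VM_LOCKED, opening the window. */
	madvise(p, len, MADV_SOFT_OFFLINE);	/* madvise(..., 0x65) */

	munlock(p + 0x4000, 0x4000);		/* munlock(..., 0x4000) */
	return 0;
}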

-------------------------------------------------
The detailed bug report is as follows:

BUG: Bad page state in process rs:main Q:Reg  pfn:11406a
page:fffff7b004501a80 refcount:0 mapcount:0 mapping:0000000000000000 index:0x20ff4 pfn:0x11406a
flags: 0x30000000028000e(referenced|uptodate|dirty|swapbacked|mlocked|node=0|zone=3)
raw: 030000000028000e fffff7b00456aec8 fffff7b011439908 0000000000000000
Soft offlining pfn 0x455e8f at process virtual address 0x20ff6000
raw: 0000000000020ff4 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
Modules linked in:
CPU: 1 PID: 239 Comm: rs:main Q:Reg Not tainted 5.15.126+ #580
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Call Trace:
 <TASK>
 dump_stack_lvl+0x33/0x46
 bad_page+0x9e/0xe0
 free_pcp_prepare+0x14b/0x1f0
 free_unref_page_list+0x7c/0x210
 release_pages+0x2fe/0x3c0
 __pagevec_lru_add+0x21a/0x360
 lru_cache_add+0x80/0xe0
 add_to_page_cache_lru+0x71/0xd0
 pagecache_get_page+0x245/0x460
 grab_cache_page_write_begin+0x1a/0x40
 ext4_da_write_begin+0xb7/0x280
 generic_perform_write+0xb4/0x1e0
 ext4_buffered_write_iter+0x9c/0x140
 ext4_file_write_iter+0x5b/0x840
 ? do_futex+0x1af/0xb60
 ? check_preempt_curr+0x21/0x60
 ? ttwu_do_wakeup.isra.140+0xd/0xf0
 new_sync_write+0x117/0x1b0
 vfs_write+0x1ff/0x260
 ksys_write+0xa0/0xe0
 do_syscall_64+0x37/0x90
 entry_SYSCALL_64_after_hwframe+0x67/0xd1
RIP: 0033:0x7fb815cef32f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 29 fd ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 5c fd ff ff 48
RSP: 002b:00007fb814b2b860 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fb808004f20 RCX: 00007fb815cef32f
RDX: 000000000000006e RSI: 00007fb808004f20 RDI: 0000000000000007
RBP: 00007fb808004c40 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 00007fb808009550
R13: 000000000000006e R14: 0000000000000000 R15: 0000000000000000
 </TASK>

[1]: https://lore.kernel.org/linux-mm/e7fbbdca-6590-7e45-3efd-279fba7f8376@suse.cz/T/#m0cb6e42b2a5ad634e1ec16e59f0f98f2e9382460

Comments

mawupeng Aug. 16, 2023, 3:52 a.m. UTC | #1
page_remove_rmap() in wp_page_copy() only clears the mlocked flag when the
page's last mapping goes away (its mapcount drops to -1), as can be seen
below:

wp_page_copy
  page_remove_rmap
    if (!atomic_add_negative(-1, &page->_mapcount))
      goto out;
    clear_page_mlock(page);	// clear mlocked flag

In our testing, checking this mapcount before mlocking the kpage closes
the race:

diff --git a/mm/ksm.c b/mm/ksm.c
index 62feb478a367..347f4c0339c2 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1295,7 +1295,8 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
                if (!PageMlocked(kpage)) {
                        unlock_page(page);
                        lock_page(kpage);
-                       mlock_vma_page(kpage);
+                       if (page_mapcount(kpage) > 0)
+                               mlock_vma_page(kpage);
                        page = kpage;           /* for final unlock */
                }
        }
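
For clarity: in these trees page->_mapcount is biased by -1, so
page_mapcount() of a page with no mappings left returns 0, and the "> 0"
check above means "kpage is still mapped somewhere". Roughly, for the
non-compound case (paraphrased from include/linux/mm.h, compound handling
omitted):

static inline int page_mapcount(struct page *page)
{
	/* _mapcount starts at -1; +1 yields the number of mappings */
	return atomic_read(&page->_mapcount) + 1;
}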


On 2023/8/15 15:07, mawupeng wrote:
> Our syzbot reports a warning on bad page state. The mlocked flag is not
> cleared during page free.
> 
> During try_to_merge_one_page in ksm, kpage will be remlocked if vma
> contains flag VM_LOCKED, however this flag is just cleared in wp_page_copy.
> Since the mapcount of this kpage is -1, no one can remove its mlocked flag
> before free, this lead to the bad page report.
> 
> Since mlock changes a lot in v5.18-rc1[1], the latest linux do not have
> this problem. The 5.10/5.15 LTS do have this issue.
> 
> Here is the simplified calltrace:
> try_to_merge_one_page				wp_page_copy
> 
> try_to_merge_one_page
> // clear page mlocked during rmap removal
> replace_page
> page_remove_rmap
> if (unlikely(PageMlocked(page)))
> clear_page_mlock(compound_head(page));
> 
> if ((vma->vm_flags & VM_LOCKED)
> 									lock_page(old_page);
> 									if (vma->vm_flags & VM_LOCKED)
> 									if (PageMlocked(old_page))
> 									munlock_vma_page(old_page);
> if (!PageMlocked(kpage))
> lock_page(kpage);
> mlock_vma_page(kpage);
> unlock_page(kpage);
> -------------------------------------------------
> 
> This problem can be easily reproduced with the following modifies:
> 1. enable the following CONFIG
>   a) CONFIG_DEBUG_VM
>   b) CONFIG_KSM
>   c) CONFIG_MEMORY_FAIALURE
> 
> 2. add delay in try_to_merge_one_page
> diff --git a/mm/ksm.c b/mm/ksm.c
> index a5716fdec1aa..f9ee2ec615ac 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1248,8 +1248,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>  
>         if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
>                 munlock_vma_page(page);
> +               mdelay(10);
>                 if (!PageMlocked(kpage)) {
>                         unlock_page(page);
> +                       mdelay(100);
>                         lock_page(kpage);
>                         mlock_vma_page(kpage);
>                         page = kpage;           /* for final unlock */
> 
> 3. run syzbot with the following content:
> 
> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc000, 0xc)
> mlockall(0x1)
> mlockall(0x5)
> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc04c, 0x65)
> 
> madvise(&(0x7f0000ff5000/0x4000)=nil, 0x4000, 0xc)
> mlockall(0x1)
> mlockall(0xa5)
> mlockall(0x0)
> munlock(&(0x7f0000ff7000/0x4000)=nil, 0x4000)
> 
> -------------------------------------------------
> The detail bug report can be seen as follow:
> 
> [ bad page report snipped; identical to the report in the original message above ]
> 
> [1]: https://lore.kernel.org/linux-mm/e7fbbdca-6590-7e45-3efd-279fba7f8376@suse.cz/T/#m0cb6e42b2a5ad634e1ec16e59f0f98f2e9382460
mawupeng Sept. 18, 2023, 1:07 a.m. UTC | #2
Kindly ping.

On 2023/8/16 11:52, mawupeng wrote:
> Since page_remove_rmap in wp_page_copy only clear this mlocked page iff
> page's mapcount is -1. which can be seen as follow.
> 
> wp_page_copy
>   page_remove_rmap
>     if (!atomic_add_negative(-1, &page->_mapcount))
>       goto out;
>     clear_page_mlock(page);	// clear mlocked flag
> 
> During out test, we can test this mapcount before mlock the kpage, this
> can close this race.
> 
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 62feb478a367..347f4c0339c2 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1295,7 +1295,8 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>                 if (!PageMlocked(kpage)) {
>                         unlock_page(page);
>                         lock_page(kpage);
> -                       mlock_vma_page(kpage);
> +                       if (page_mapcount(kpage) > 0)
> +                               mlock_vma_page(kpage);
>                         page = kpage;           /* for final unlock */
>                 }
>         }
> 
> 
> On 2023/8/15 15:07, mawupeng wrote:
>> Our syzbot reports a warning on bad page state. The mlocked flag is not
>> cleared during page free.
>>
>> During try_to_merge_one_page in ksm, kpage will be remlocked if vma
>> contains flag VM_LOCKED, however this flag is just cleared in wp_page_copy.
>> Since the mapcount of this kpage is -1, no one can remove its mlocked flag
>> before free, this lead to the bad page report.
>>
>> Since mlock changes a lot in v5.18-rc1[1], the latest linux do not have
>> this problem. The 5.10/5.15 LTS do have this issue.
>>
>> Here is the simplified calltrace:
>> try_to_merge_one_page				wp_page_copy
>>
>> try_to_merge_one_page
>> // clear page mlocked during rmap removal
>> replace_page
>> page_remove_rmap
>> if (unlikely(PageMlocked(page)))
>> clear_page_mlock(compound_head(page));
>>
>> if ((vma->vm_flags & VM_LOCKED)
>> 									lock_page(old_page);
>> 									if (vma->vm_flags & VM_LOCKED)
>> 									if (PageMlocked(old_page))
>> 									munlock_vma_page(old_page);
>> if (!PageMlocked(kpage))
>> lock_page(kpage);
>> mlock_vma_page(kpage);
>> unlock_page(kpage);
>> -------------------------------------------------
>>
>> This problem can be easily reproduced with the following modifies:
>> 1. enable the following CONFIG
>>   a) CONFIG_DEBUG_VM
>>   b) CONFIG_KSM
>>   c) CONFIG_MEMORY_FAIALURE
>>
>> 2. add delay in try_to_merge_one_page
>> diff --git a/mm/ksm.c b/mm/ksm.c
>> index a5716fdec1aa..f9ee2ec615ac 100644
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -1248,8 +1248,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>>  
>>         if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
>>                 munlock_vma_page(page);
>> +               mdelay(10);
>>                 if (!PageMlocked(kpage)) {
>>                         unlock_page(page);
>> +                       mdelay(100);
>>                         lock_page(kpage);
>>                         mlock_vma_page(kpage);
>>                         page = kpage;           /* for final unlock */
>>
>> 3. run syzbot with the following content:
>>
>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc000, 0xc)
>> mlockall(0x1)
>> mlockall(0x5)
>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc04c, 0x65)
>>
>> madvise(&(0x7f0000ff5000/0x4000)=nil, 0x4000, 0xc)
>> mlockall(0x1)
>> mlockall(0xa5)
>> mlockall(0x0)
>> munlock(&(0x7f0000ff7000/0x4000)=nil, 0x4000)
>>
>> -------------------------------------------------
>> The detail bug report can be seen as follow:
>>
>> [ bad page report snipped; see the original message above ]
>>
>> [1]: https://lore.kernel.org/linux-mm/e7fbbdca-6590-7e45-3efd-279fba7f8376@suse.cz/T/#m0cb6e42b2a5ad634e1ec16e59f0f98f2e9382460
mawupeng Oct. 18, 2023, 3:16 a.m. UTC | #3
This problem can be easily reproduced with syzkaller on the 5.10/5.15 LTS kernels.

Kindly ping.

On 2023/9/18 9:07, mawupeng wrote:
> Kindly ping.
> 
> On 2023/8/16 11:52, mawupeng wrote:
>> Since page_remove_rmap in wp_page_copy only clear this mlocked page iff
>> page's mapcount is -1. which can be seen as follow.
>>
>> wp_page_copy
>>   page_remove_rmap
>>     if (!atomic_add_negative(-1, &page->_mapcount))
>>       goto out;
>>     clear_page_mlock(page);	// clear mlocked flag
>>
>> During out test, we can test this mapcount before mlock the kpage, this
>> can close this race.
>>
>> diff --git a/mm/ksm.c b/mm/ksm.c
>> index 62feb478a367..347f4c0339c2 100644
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -1295,7 +1295,8 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>>                 if (!PageMlocked(kpage)) {
>>                         unlock_page(page);
>>                         lock_page(kpage);
>> -                       mlock_vma_page(kpage);
>> +                       if (page_mapcount(kpage) > 0)
>> +                               mlock_vma_page(kpage);
>>                         page = kpage;           /* for final unlock */
>>                 }
>>         }
>>
>>
>> On 2023/8/15 15:07, mawupeng wrote:
>>> Our syzbot reports a warning on bad page state. The mlocked flag is not
>>> cleared during page free.
>>>
>>> During try_to_merge_one_page in ksm, kpage will be remlocked if vma
>>> contains flag VM_LOCKED, however this flag is just cleared in wp_page_copy.
>>> Since the mapcount of this kpage is -1, no one can remove its mlocked flag
>>> before free, this lead to the bad page report.
>>>
>>> Since mlock changes a lot in v5.18-rc1[1], the latest linux do not have
>>> this problem. The 5.10/5.15 LTS do have this issue.
>>>
>>> Here is the simplified calltrace:
>>> try_to_merge_one_page				wp_page_copy
>>>
>>> try_to_merge_one_page
>>> // clear page mlocked during rmap removal
>>> replace_page
>>> page_remove_rmap
>>> if (unlikely(PageMlocked(page)))
>>> clear_page_mlock(compound_head(page));
>>>
>>> if ((vma->vm_flags & VM_LOCKED)
>>> 									lock_page(old_page);
>>> 									if (vma->vm_flags & VM_LOCKED)
>>> 									if (PageMlocked(old_page))
>>> 									munlock_vma_page(old_page);
>>> if (!PageMlocked(kpage))
>>> lock_page(kpage);
>>> mlock_vma_page(kpage);
>>> unlock_page(kpage);
>>> -------------------------------------------------
>>>
>>> This problem can be easily reproduced with the following modifies:
>>> 1. enable the following CONFIG
>>>   a) CONFIG_DEBUG_VM
>>>   b) CONFIG_KSM
>>>   c) CONFIG_MEMORY_FAIALURE
>>>
>>> 2. add delay in try_to_merge_one_page
>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>> index a5716fdec1aa..f9ee2ec615ac 100644
>>> --- a/mm/ksm.c
>>> +++ b/mm/ksm.c
>>> @@ -1248,8 +1248,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>>>  
>>>         if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
>>>                 munlock_vma_page(page);
>>> +               mdelay(10);
>>>                 if (!PageMlocked(kpage)) {
>>>                         unlock_page(page);
>>> +                       mdelay(100);
>>>                         lock_page(kpage);
>>>                         mlock_vma_page(kpage);
>>>                         page = kpage;           /* for final unlock */
>>>
>>> 3. run syzbot with the following content:
>>>
>>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc000, 0xc)
>>> mlockall(0x1)
>>> mlockall(0x5)
>>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc04c, 0x65)
>>>
>>> madvise(&(0x7f0000ff5000/0x4000)=nil, 0x4000, 0xc)
>>> mlockall(0x1)
>>> mlockall(0xa5)
>>> mlockall(0x0)
>>> munlock(&(0x7f0000ff7000/0x4000)=nil, 0x4000)
>>>
>>> -------------------------------------------------
>>> The detail bug report can be seen as follow:
>>>
>>> [ bad page report snipped ]
>>>
>>> [1]: https://lore.kernel.org/linux-mm/e7fbbdca-6590-7e45-3efd-279fba7f8376@suse.cz/T/#m0cb6e42b2a5ad634e1ec16e59f0f98f2e9382460
Hugh Dickins Oct. 20, 2023, 3:57 a.m. UTC | #4
On Wed, 18 Oct 2023, mawupeng wrote:

> This problem can be easily reproduced with syz in LTS 5.10/5.15.
> 
> Kindly ping.

Sorry, but it takes me a long time to find a long enough slot
to sink back deep enough into understanding these things.

> 
> On 2023/9/18 9:07, mawupeng wrote:
> > Kindly ping.
> > 
> > On 2023/8/16 11:52, mawupeng wrote:
> >> Since page_remove_rmap in wp_page_copy only clear this mlocked page iff
> >> page's mapcount is -1. which can be seen as follow.
> >>
> >> wp_page_copy
> >>   page_remove_rmap
> >>     if (!atomic_add_negative(-1, &page->_mapcount))
> >>       goto out;
> >>     clear_page_mlock(page);	// clear mlocked flag
> >>
> >> During out test, we can test this mapcount before mlock the kpage, this
> >> can close this race.
> >>
> >> diff --git a/mm/ksm.c b/mm/ksm.c
> >> index 62feb478a367..347f4c0339c2 100644
> >> --- a/mm/ksm.c
> >> +++ b/mm/ksm.c
> >> @@ -1295,7 +1295,8 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
> >>                 if (!PageMlocked(kpage)) {
> >>                         unlock_page(page);
> >>                         lock_page(kpage);
> >> -                       mlock_vma_page(kpage);
> >> +                       if (page_mapcount(kpage) > 0)
> >> +                               mlock_vma_page(kpage);

Yes, that's safe enough; though if we thought for longer, we might
devise other such races which this would not be enough for.

But I'm not going to spend longer on it, this is good enough, if you
find that it enables you to get much further with your syzbottery.

If you prepare a proper Signed-off patch to send GregKH & stable,
Cc me and I shall probably be able to support it with a grudging Ack.

> >>                         page = kpage;           /* for final unlock */
> >>                 }
> >>         }
> >>
> >>
> >> On 2023/8/15 15:07, mawupeng wrote:
> >>> Our syzbot reports a warning on bad page state. The mlocked flag is not
> >>> cleared during page free.
> >>>
> >>> During try_to_merge_one_page in ksm, kpage will be remlocked if vma
> >>> contains flag VM_LOCKED, however this flag is just cleared in wp_page_copy.
> >>> Since the mapcount of this kpage is -1, no one can remove its mlocked flag
> >>> before free, this lead to the bad page report.
> >>>
> >>> Since mlock changes a lot in v5.18-rc1[1], the latest linux do not have
> >>> this problem. The 5.10/5.15 LTS do have this issue.

I was glad to hear that it happens to have been fixed by 5.18 changes.
I was going to complain that you're asking me for a fix to old kernels,
just because I fixed it in new kernels.  But that would be unfair: you'll
have noticed that I was guilty of this KSM PageMlocked code too (and one
of the pleasures of those mlock changes was being able to delete this).

> >>>
> >>> Here is the simplified calltrace:
> >>> try_to_merge_one_page				wp_page_copy
> >>>
> >>> try_to_merge_one_page
> >>> // clear page mlocked during rmap removal
> >>> replace_page
> >>> page_remove_rmap
> >>> if (unlikely(PageMlocked(page)))
> >>> clear_page_mlock(compound_head(page));
> >>>
> >>> if ((vma->vm_flags & VM_LOCKED)
> >>> 									lock_page(old_page);
> >>> 									if (vma->vm_flags & VM_LOCKED)
> >>> 									if (PageMlocked(old_page))
> >>> 									munlock_vma_page(old_page);
> >>> if (!PageMlocked(kpage))
> >>> lock_page(kpage);
> >>> mlock_vma_page(kpage);
> >>> unlock_page(kpage);
> >>> -------------------------------------------------
> >>>
> >>> This problem can be easily reproduced with the following modifies:
> >>> 1. enable the following CONFIG
> >>>   a) CONFIG_DEBUG_VM
> >>>   b) CONFIG_KSM
> >>>   c) CONFIG_MEMORY_FAIALURE

CONFIG_MEMORY_FAILURE

> >>>
> >>> 2. add delay in try_to_merge_one_page
> >>> diff --git a/mm/ksm.c b/mm/ksm.c
> >>> index a5716fdec1aa..f9ee2ec615ac 100644
> >>> --- a/mm/ksm.c
> >>> +++ b/mm/ksm.c
> >>> @@ -1248,8 +1248,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
> >>>  
> >>>         if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
> >>>                 munlock_vma_page(page);
> >>> +               mdelay(10);
> >>>                 if (!PageMlocked(kpage)) {
> >>>                         unlock_page(page);
> >>> +                       mdelay(100);
> >>>                         lock_page(kpage);
> >>>                         mlock_vma_page(kpage);
> >>>                         page = kpage;           /* for final unlock */
> >>>
> >>> 3. run syzbot with the following content:
> >>>
> >>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc000, 0xc)
> >>> mlockall(0x1)
> >>> mlockall(0x5)
> >>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc04c, 0x65)

0x65: MADV_SOFT_OFFLINE /* soft offline page for testing */

Whereby CAP_SYS_ADMIN allows syzbot to take advantage of memory-failure
testing backdoors to inject unexpected memory failures.  Which gives it
the right to TTU_IGNORE_MLOCK unmap pages which would ordinarily be kept
mapped by VM_LOCKED.

I said "grudging Ack" above, because I don't enjoy the time spent on
working around such issues.  There's lots of other damage that can be
done with CAP_SYS_ADMIN, and with memory failures on kernel memory.
But you've devised a patch which takes you further along... so okay.

Please mention the CAP_SYS_ADMIN MADV_SOFT_OFFLINE error injection
dependence in your commitlog - thanks.

Hugh

> >>>
> >>> madvise(&(0x7f0000ff5000/0x4000)=nil, 0x4000, 0xc)
> >>> mlockall(0x1)
> >>> mlockall(0xa5)
> >>> mlockall(0x0)
> >>> munlock(&(0x7f0000ff7000/0x4000)=nil, 0x4000)
> >>>
> >>> -------------------------------------------------
> >>> The detail bug report can be seen as follow:
> >>>
> >>> [ bad page report snipped ]
> >>>
> >>> [1]: https://lore.kernel.org/linux-mm/e7fbbdca-6590-7e45-3efd-279fba7f8376@suse.cz/T/#m0cb6e42b2a5ad634e1ec16e59f0f98f2e9382460
mawupeng Oct. 21, 2023, 6:59 a.m. UTC | #5
On 2023/10/20 11:57, Hugh Dickins wrote:
> On Wed, 18 Oct 2023, mawupeng wrote:
> 
>> This problem can be easily reproduced with syz in LTS 5.10/5.15.
>>
>> Kindly ping.
> 
> Sorry, but it takes me a long time to find a long enough slot
> to sink back deep enough into understanding these things.
> 
>>
>> On 2023/9/18 9:07, mawupeng wrote:
>>> Kindly ping.
>>>
>>> On 2023/8/16 11:52, mawupeng wrote:
>>>> Since page_remove_rmap in wp_page_copy only clear this mlocked page iff
>>>> page's mapcount is -1. which can be seen as follow.
>>>>
>>>> wp_page_copy
>>>>   page_remove_rmap
>>>>     if (!atomic_add_negative(-1, &page->_mapcount))
>>>>       goto out;
>>>>     clear_page_mlock(page);	// clear mlocked flag
>>>>
>>>> During out test, we can test this mapcount before mlock the kpage, this
>>>> can close this race.
>>>>
>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>> index 62feb478a367..347f4c0339c2 100644
>>>> --- a/mm/ksm.c
>>>> +++ b/mm/ksm.c
>>>> @@ -1295,7 +1295,8 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>>>>                 if (!PageMlocked(kpage)) {
>>>>                         unlock_page(page);
>>>>                         lock_page(kpage);
>>>> -                       mlock_vma_page(kpage);
>>>> +                       if (page_mapcount(kpage) > 0)
>>>> +                               mlock_vma_page(kpage);
> 
> Yes, that's safe enough; though if we thought for longer, we might
> devise other such races which this would not be enough for.
> 
> But I'm not going to spend longer on it, this is good enough, if you
> find that it enables you to get much further with your syzbottery.
> 
> If you prepare a proper Signed-off patch to send GregKH & stable,
> Cc me and I shall probably be able to support it with a grudging Ack.
> 
>>>>                         page = kpage;           /* for final unlock */
>>>>                 }
>>>>         }
>>>>
>>>>
>>>> On 2023/8/15 15:07, mawupeng wrote:
>>>>> Our syzbot reports a warning on bad page state. The mlocked flag is not
>>>>> cleared during page free.
>>>>>
>>>>> During try_to_merge_one_page in ksm, kpage will be remlocked if vma
>>>>> contains flag VM_LOCKED, however this flag is just cleared in wp_page_copy.
>>>>> Since the mapcount of this kpage is -1, no one can remove its mlocked flag
>>>>> before free, this lead to the bad page report.
>>>>>
>>>>> Since mlock changes a lot in v5.18-rc1[1], the latest linux do not have
>>>>> this problem. The 5.10/5.15 LTS do have this issue.
> 
> I was glad to hear that it happens to have been fixed by 5.18 changes.
> I was going to complain that you're asking me for a fix to old kernels,
> just because I fixed it in new kernels.  But that would be unfair: you'll
> have noticed that I was guilty of this KSM PageMlocked code too (and one
> of the pleasures of those mlock changes was being able to delete this).
> 
>>>>>
>>>>> Here is the simplified calltrace:
>>>>> try_to_merge_one_page				wp_page_copy
>>>>>
>>>>> try_to_merge_one_page
>>>>> // clear page mlocked during rmap removal
>>>>> replace_page
>>>>> page_remove_rmap
>>>>> if (unlikely(PageMlocked(page)))
>>>>> clear_page_mlock(compound_head(page));
>>>>>
>>>>> if ((vma->vm_flags & VM_LOCKED)
>>>>> 									lock_page(old_page);
>>>>> 									if (vma->vm_flags & VM_LOCKED)
>>>>> 									if (PageMlocked(old_page))
>>>>> 									munlock_vma_page(old_page);
>>>>> if (!PageMlocked(kpage))
>>>>> lock_page(kpage);
>>>>> mlock_vma_page(kpage);
>>>>> unlock_page(kpage);
>>>>> -------------------------------------------------
>>>>>
>>>>> This problem can be easily reproduced with the following modifies:
>>>>> 1. enable the following CONFIG
>>>>>   a) CONFIG_DEBUG_VM
>>>>>   b) CONFIG_KSM
>>>>>   c) CONFIG_MEMORY_FAIALURE
> 
> CONFIG_MEMORY_FAILURE
> 
>>>>>
>>>>> 2. add delay in try_to_merge_one_page
>>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>>> index a5716fdec1aa..f9ee2ec615ac 100644
>>>>> --- a/mm/ksm.c
>>>>> +++ b/mm/ksm.c
>>>>> @@ -1248,8 +1248,10 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>>>>>  
>>>>>         if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
>>>>>                 munlock_vma_page(page);
>>>>> +               mdelay(10);
>>>>>                 if (!PageMlocked(kpage)) {
>>>>>                         unlock_page(page);
>>>>> +                       mdelay(100);
>>>>>                         lock_page(kpage);
>>>>>                         mlock_vma_page(kpage);
>>>>>                         page = kpage;           /* for final unlock */
>>>>>
>>>>> 3. run syzbot with the following content:
>>>>>
>>>>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc000, 0xc)
>>>>> mlockall(0x1)
>>>>> mlockall(0x5)
>>>>> madvise(&(0x7f0000ff3000/0xc000)=nil, 0xc04c, 0x65)
> 
> 0x65: MADV_SOFT_OFFLINE /* soft offline page for testing */
> 
> Whereby CAP_SYS_ADMIN allows syzbot to take advantage of memory-failure
> testing backdoors to inject unexpected memory failures.  Which gives it
> the right to TTU_IGNORE_MLOCK unmap pages which would ordinarily be kept
> mapped by VM_LOCKED.
> 
> I said "grudging Ack" above, because I don't enjoy the time spent on
> working around such issues.  There's lots of other damage that can be
> done with CAP_SYS_ADMIN, and with memory failures on kernel memory.
> But you've devised a patch which takes you further along... so okay.
> 
> Please mention the CAP_SYS_ADMIN MADV_SOFT_OFFLINE error injection
> dependence in your commitlog - thanks.
> 
> Hugh
> 
>>>>>
>>>>> madvise(&(0x7f0000ff5000/0x4000)=nil, 0x4000, 0xc)
>>>>> mlockall(0x1)
>>>>> mlockall(0xa5)
>>>>> mlockall(0x0)
>>>>> munlock(&(0x7f0000ff7000/0x4000)=nil, 0x4000)
>>>>>
>>>>> -------------------------------------------------
>>>>> The detail bug report can be seen as follow:
>>>>>
>>>>> [ bad page report snipped ]
>>>>>
>>>>> [1]: https://lore.kernel.org/linux-mm/e7fbbdca-6590-7e45-3efd-279fba7f8376@suse.cz/T/#m0cb6e42b2a5ad634e1ec16e59f0f98f2e9382460
> 

Hi Hugh.

Thanks for your kind reply.

I will Cc you when a proper patch is sent.

Wupeng.

Patch

diff --git a/mm/ksm.c b/mm/ksm.c
index a5716fdec1aa..f9ee2ec615ac 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1248,8 +1248,10 @@  static int try_to_merge_one_page(struct vm_area_struct *vma,
 
        if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
                munlock_vma_page(page);
+               mdelay(10);
                if (!PageMlocked(kpage)) {
                        unlock_page(page);
+                       mdelay(100);
                        lock_page(kpage);
                        mlock_vma_page(kpage);
                        page = kpage;           /* for final unlock */