[v12,09/31] mm: VMA sequence count

Series: Speculative page faults
  [v12,00/31] Speculative page faults
  [v12,01/31] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
  [v12,02/31] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  [v12,03/31] powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  [v12,04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  [v12,05/31] mm: prepare for FAULT_FLAG_SPECULATIVE
  [v12,06/31] mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
  [v12,07/31] mm: make pte_unmap_same compatible with SPF
  [v12,08/31] mm: introduce INIT_VMA()
  [v12,09/31] mm: VMA sequence count
  [v12,10/31] mm: protect VMA modifications using VMA sequence count
  [v12,11/31] mm: protect mremap() against SPF hanlder
  [v12,12/31] mm: protect SPF handler against anon_vma changes
  [v12,13/31] mm: cache some VMA fields in the vm_fault structure
  [v12,14/31] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
  [v12,15/31] mm: introduce __lru_cache_add_active_or_unevictable
  [v12,16/31] mm: introduce __vm_normal_page()
  [v12,17/31] mm: introduce __page_add_new_anon_rmap()
  [v12,18/31] mm: protect against PTE changes done by dup_mmap()
  [v12,19/31] mm: protect the RB tree with a sequence lock
  [v12,20/31] mm: introduce vma reference counter
  [v12,21/31] mm: Introduce find_vma_rcu()
  [v12,22/31] mm: provide speculative fault infrastructure
  [v12,23/31] mm: don't do swap readahead during speculative page fault
  [v12,24/31] mm: adding speculative page fault failure trace events
  [v12,25/31] perf: add a speculative page fault sw event
  [v12,26/31] perf tools: add support for the SPF perf event
  [v12,27/31] mm: add speculative page fault vmstats
  [v12,28/31] x86/mm: add speculative pagefault handling
  [v12,29/31] powerpc/mm: add speculative page fault
  [v12,30/31] arm64/mm: add speculative page fault
  [v12,31/31] mm: Add a speculative page fault switch in sysctl

Message ID

20190416134522.17540-10-ldufour@linux.ibm.com (mailing list archive)

State

New, archived

Headers

Received-SPF: pass (google.com: domain of ldufour@linux.ibm.com designates
 148.163.156.1 as permitted sender) client-ip=148.163.156.1;
From: Laurent Dufour <ldufour@linux.ibm.com>
To: akpm@linux-foundation.org, mhocko@kernel.org, peterz@infradead.org,
        kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net,
        jack@suse.cz, Matthew Wilcox <willy@infradead.org>,
        aneesh.kumar@linux.ibm.com, benh@kernel.crashing.org,
        mpe@ellerman.id.au, paulus@samba.org,
        Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
        hpa@zytor.com, Will Deacon <will.deacon@arm.com>,
        Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
        sergey.senozhatsky.work@gmail.com,
        Andrea Arcangeli <aarcange@redhat.com>,
        Alexei Starovoitov <alexei.starovoitov@gmail.com>,
 kemi.wang@intel.com,
        Daniel Jordan <daniel.m.jordan@oracle.com>,
        David Rientjes <rientjes@google.com>,
        Jerome Glisse <jglisse@redhat.com>,
        Ganesh Mahendran <opensource.ganesh@gmail.com>,
        Minchan Kim <minchan@kernel.org>,
        Punit Agrawal <punitagrawal@gmail.com>,
        vinayak menon <vinayakm.list@gmail.com>,
        Yang Shi <yang.shi@linux.alibaba.com>,
        zhong jiang <zhongjiang@huawei.com>,
        Haiyan Song <haiyanx.song@intel.com>,
        Balbir Singh <bsingharora@gmail.com>, sj38.park@gmail.com,
        Michel Lespinasse <walken@google.com>,
        Mike Rapoport <rppt@linux.ibm.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 haren@linux.vnet.ibm.com,
        npiggin@gmail.com, paulmck@linux.vnet.ibm.com,
        Tim Chen <tim.c.chen@linux.intel.com>, linuxppc-dev@lists.ozlabs.org,
        x86@kernel.org
Subject: [PATCH v12 09/31] mm: VMA sequence count
Date: Tue, 16 Apr 2019 15:45:00 +0200
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>
References: <20190416134522.17540-1-ldufour@linux.ibm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Message-Id: <20190416134522.17540-10-ldufour@linux.ibm.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Laurent Dufour April 16, 2019, 1:45 p.m. UTC

From: Peter Zijlstra <peterz@infradead.org>

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test whether a VMA has changed.

The calls to vm_write_begin/end() in unmap_page_range() are used to
detect when a VMA is being unmapped, and thus that a new page fault
should not be satisfied for this VMA. If the seqcount hasn't changed by
the time the page table is locked, we are safe to satisfy the page
fault.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.

The VMA's sequence counter is also used to detect changes to various VMA
fields used during page fault handling, such as:
 - vm_start, vm_end
 - vm_pgoff
 - vm_flags, vm_page_prot
 - anon_vma
 - vm_policy

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Port to 4.12 kernel]
[Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
[Introduce vm_write_* inline functions depending on
 CONFIG_SPECULATIVE_PAGE_FAULT]
[Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
 using vm_raw_write* functions]
[Fix a lock dependency warning in mmap_region() when entering the error
 path]
[Move sequence initialisation to INIT_VMA()]
[Review the patch description about unmap_page_range()]
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
 include/linux/mm.h       | 44 ++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |  3 +++
 mm/memory.c              |  2 ++
 mm/mmap.c                | 30 +++++++++++++++++++++++++++
 4 files changed, 79 insertions(+)
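
The check described in the commit message can be sketched in userspace as
follows. This is only an illustration, assuming C11 atomics as a stand-in
for the kernel's seqcount_t; every toy_* name is hypothetical and is not
part of the patch. The reader-side helpers the series actually uses
(vma_has_changed(), handle_speculative_fault()) are introduced in later
patches and are not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, cut-down VMA: a counter plus two of the fields listed above. */
struct toy_vma {
	atomic_uint vm_sequence;	/* stand-in for seqcount_t */
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Plays the role of INIT_VMA(): the counter starts even (no writer). */
static void toy_vma_init(struct toy_vma *vma, unsigned long start,
			 unsigned long end)
{
	atomic_init(&vma->vm_sequence, 0);
	vma->vm_start = start;
	vma->vm_end = end;
}

/* Writer side: the counter is odd while a modification is in progress. */
static void toy_write_begin(struct toy_vma *vma)
{
	atomic_fetch_add(&vma->vm_sequence, 1);
}

static void toy_write_end(struct toy_vma *vma)
{
	unsigned int old = atomic_fetch_add(&vma->vm_sequence, 1);

	assert(old & 1);	/* must pair with a preceding toy_write_begin() */
}

/* Reader side: snapshot the counter; refuse to start if a writer is active. */
static bool toy_read_begin(struct toy_vma *vma, unsigned int *seq)
{
	*seq = atomic_load(&vma->vm_sequence);
	return (*seq & 1) == 0;
}

/* Reader side: true if any write section started since the snapshot. */
static bool toy_vma_has_changed(struct toy_vma *vma, unsigned int seq)
{
	return atomic_load(&vma->vm_sequence) != seq;
}
```

A reader would call toy_read_begin() before looking at the VMA fields and
toy_vma_has_changed() once the page table is locked; any intervening
toy_write_begin()/toy_write_end() pair makes the second check fail. The
sketch is exercised single-threaded only; a concurrent version would also
need properly ordered accesses to the data fields themselves.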

Comments

Jerome Glisse April 18, 2019, 10:48 p.m. UTC | #1

On Tue, Apr 16, 2019 at 03:45:00PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> counts such that we can easily test if a VMA is changed.
> 
> The calls to vm_write_begin/end() in unmap_page_range() are
> used to detect when a VMA is being unmap and thus that new page fault
> should not be satisfied for this VMA. If the seqcount hasn't changed when
> the page table are locked, this means we are safe to satisfy the page
> fault.
> 
> The flip side is that we cannot distinguish between a vma_adjust() and
> the unmap_page_range() -- where with the former we could have
> re-checked the vma bounds against the address.
> 
> The VMA's sequence counter is also used to detect change to various VMA's
> fields used during the page fault handling, such as:
>  - vm_start, vm_end
>  - vm_pgoff
>  - vm_flags, vm_page_prot
>  - vm_policy

^ Are all of the above under the mmap write lock?

>  - anon_vma

^ This is either under mmap write lock or under page table lock

So my question is: do we need the complexity of seqcount_t for this?

It seems that using a regular int as a counter, and also relying on
vm_flags when the vma is unmapped, should do the trick.

vma_delete(struct vm_area_struct *vma)
{
    ...
    /*
     * Make sure the vma is marked as invalid, i.e. neither readable nor
     * writable, so that a speculative fault backs off. A racing
     * speculative fault will see either the flags as 0 or the new seqcount.
     */
    vma->vm_flags = 0;
    smp_wmb();
    vma->seqcount++;
    ...
}

Then:
speculative_fault_begin(struct vm_area_struct *vma,
                        struct spec_vmf *spvmf)
{
    ...
    spvmf->seqcount = vma->seqcount;
    smp_rmb();
    spvmf->vm_flags = vma->vm_flags;
    if (!spvmf->vm_flags) {
        // Back off, the vma is dying ...
        ...
    }
}

bool speculative_fault_commit(struct vm_area_struct *vma,
                              struct spec_vmf *spvmf)
{
    ...
    seqcount = vma->seqcount;
    smp_rmb();
    vm_flags = vma->vm_flags;

    if (spvmf->vm_flags != vm_flags || seqcount != spvmf->seqcount) {
        // Something did change for the vma
        return false;
    }
    return true;
}

This would also avoid the lockdep issue described below. But maybe what
I propose is stupid and I will see that after reviewing things further.


Cheers,
Jérôme


> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> [Port to 4.12 kernel]
> [Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
> [Introduce vm_write_* inline function depending on
>  CONFIG_SPECULATIVE_PAGE_FAULT]
> [Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
>  using vm_raw_write* functions]
> [Fix a lock dependency warning in mmap_region() when entering the error
>  path]
> [move sequence initialisation INIT_VMA()]
> [Review the patch description about unmap_page_range()]
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> ---
>  include/linux/mm.h       | 44 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm_types.h |  3 +++
>  mm/memory.c              |  2 ++
>  mm/mmap.c                | 30 +++++++++++++++++++++++++++
>  4 files changed, 79 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2ceb1d2869a6..906b9e06f18e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1410,6 +1410,9 @@ struct zap_details {
>  static inline void INIT_VMA(struct vm_area_struct *vma)
>  {
>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	seqcount_init(&vma->vm_sequence);
> +#endif
>  }
>  
>  struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> @@ -1534,6 +1537,47 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
>  	unmap_mapping_range(mapping, holebegin, holelen, 0);
>  }
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> +	write_seqcount_begin(&vma->vm_sequence);
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> +					 int subclass)
> +{
> +	write_seqcount_begin_nested(&vma->vm_sequence, subclass);
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> +	write_seqcount_end(&vma->vm_sequence);
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> +	raw_write_seqcount_begin(&vma->vm_sequence);
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> +	raw_write_seqcount_end(&vma->vm_sequence);
> +}
> +#else
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> +					 int subclass)
> +{
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
>  extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
>  		void *buf, int len, unsigned int gup_flags);
>  extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fd7d38ee2e33..e78f72eb2576 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -337,6 +337,9 @@ struct vm_area_struct {
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	seqcount_t vm_sequence;
> +#endif
>  } __randomize_layout;
>  
>  struct core_thread {
> diff --git a/mm/memory.c b/mm/memory.c
> index d5bebca47d98..423fa8ea0569 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1256,6 +1256,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>  	unsigned long next;
>  
>  	BUG_ON(addr >= end);
> +	vm_write_begin(vma);
>  	tlb_start_vma(tlb, vma);
>  	pgd = pgd_offset(vma->vm_mm, addr);
>  	do {
> @@ -1265,6 +1266,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>  		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
>  	} while (pgd++, addr = next, addr != end);
>  	tlb_end_vma(tlb, vma);
> +	vm_write_end(vma);
>  }
>  
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5ad3a3228d76..a4e4d52a5148 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -726,6 +726,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  	long adjust_next = 0;
>  	int remove_next = 0;
>  
> +	/*
> +	 * Why use vm_raw_write*() here? To avoid a lockdep warning.
> +	 *
> +	 * Lockdep is complaining about a theoretical lock dependency, involving
> +	 * 3 locks:
> +	 *   mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim
> +	 *
> +	 * Here are the major paths leading to this dependency:
> +	 *  1. __vma_adjust() mmap_sem  -> vm_sequence -> i_mmap_rwsem
> +	 *  2. move_vmap() mmap_sem -> vm_sequence -> fs_reclaim
> +	 *  3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem
> +	 *  4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence
> +	 *
> +	 * So there is no way to solve this easily, especially because in
> +	 * unmap_mapping_range() the i_mmap_rwsem is grabbed while the impacted
> +	 * VMAs are not yet known.
> +	 * However, the way the vm_seq is used guarantees that we will
> +	 * never block on it since we just check for its value and never wait
> +	 * for it to move, see vma_has_changed() and handle_speculative_fault().
> +	 */
> +	vm_raw_write_begin(vma);
> +	if (next)
> +		vm_raw_write_begin(next);
> +
>  	if (next && !insert) {
>  		struct vm_area_struct *exporter = NULL, *importer = NULL;
>  
> @@ -950,6 +974,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  			 * "vma->vm_next" gap must be updated.
>  			 */
>  			next = vma->vm_next;
> +			if (next)
> +				vm_raw_write_begin(next);
>  		} else {
>  			/*
>  			 * For the scope of the comment "next" and
> @@ -996,6 +1022,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  	if (insert && file)
>  		uprobe_mmap(insert);
>  
> +	if (next && next != vma)
> +		vm_raw_write_end(next);
> +	vm_raw_write_end(vma);
> +
>  	validate_mm(mm);
>  
>  	return 0;
> -- 
> 2.21.0
>

Laurent Dufour April 19, 2019, 3:45 p.m. UTC | #2

Hi Jerome,

Thanks a lot for reviewing this series.

On 19/04/2019 at 00:48, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:00PM +0200, Laurent Dufour wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
>> counts such that we can easily test if a VMA is changed.
>>
>> The calls to vm_write_begin/end() in unmap_page_range() are
>> used to detect when a VMA is being unmap and thus that new page fault
>> should not be satisfied for this VMA. If the seqcount hasn't changed when
>> the page table are locked, this means we are safe to satisfy the page
>> fault.
>>
>> The flip side is that we cannot distinguish between a vma_adjust() and
>> the unmap_page_range() -- where with the former we could have
>> re-checked the vma bounds against the address.
>>
>> The VMA's sequence counter is also used to detect change to various VMA's
>> fields used during the page fault handling, such as:
>>   - vm_start, vm_end
>>   - vm_pgoff
>>   - vm_flags, vm_page_prot
>>   - vm_policy
> 
> ^ All above are under mmap write lock ?

Yes, changes are still made under the protection of the mmap_sem.

> 
>>   - anon_vma
> 
> ^ This is either under mmap write lock or under page table lock
> 
> So my question is do we need the complexity of seqcount_t for this ?

The sequence counter is used to detect write operations done while a
reader (the SPF handler) is running.

The implementation is quite simple (here without the lockdep checks):

static inline void raw_write_seqcount_begin(seqcount_t *s)
{
	s->sequence++;
	smp_wmb();
}

I can't see why this is too complex here; could you elaborate?

> 
> It seems that using regular int as counter and also relying on vm_flags
> when vma is unmap should do the trick.

vm_flags is not enough, I guess, as some operations do not impact
vm_flags at all (resizing, for instance).
Am I missing something?
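
To make the resizing case concrete, here is a small single-threaded
sketch. It assumes a hypothetical, cut-down VMA (struct mini_vma) with a
plain counter; none of the mini_* names exist in the kernel or in this
series. It only shows that a resize leaves vm_flags untouched, so a
flags comparison alone would not notice it, whereas the counter does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down VMA for illustration only. */
struct mini_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
	unsigned int seq;	/* plain counter, bumped around every change */
};

/* What a speculative reader would remember before doing its work. */
struct mini_snapshot {
	unsigned long vm_flags;
	unsigned int seq;
};

static struct mini_snapshot mini_snapshot_take(const struct mini_vma *vma)
{
	struct mini_snapshot snap = {
		.vm_flags = vma->vm_flags,
		.seq = vma->seq,
	};

	return snap;
}

/* A resize changes the bounds but leaves vm_flags alone. */
static void mini_resize(struct mini_vma *vma, unsigned long new_end)
{
	assert(new_end > vma->vm_start);
	vma->seq++;		/* begin: counter becomes odd */
	vma->vm_end = new_end;
	vma->seq++;		/* end: counter is even again */
}

/* Check based on vm_flags only. */
static bool mini_flags_changed(const struct mini_vma *vma,
			       const struct mini_snapshot *snap)
{
	return vma->vm_flags != snap->vm_flags;
}

/* Check based on the counter. */
static bool mini_seq_changed(const struct mini_vma *vma,
			     const struct mini_snapshot *snap)
{
	return vma->seq != snap->seq;
}
```

After a snapshot followed by mini_resize(), mini_flags_changed() still
reports no change while mini_seq_changed() reports one.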

> 
> vma_delete(struct vm_area_struct *vma)
> {
>      ...
>      /*
>       * Make sure the vma is mark as invalid ie neither read nor write
>       * so that speculative fault back off. A racing speculative fault
>       * will either see the flags as 0 or the new seqcount.
>       */
>      vma->vm_flags = 0;
>      smp_wmb();
>      vma->seqcount++;
>      ...
> }

Well, I don't think we can safely clear vm_flags this way when the VMA is
unmapped; I think it is used later, when the cleanup is done.

Later in this series, VMA deletion is handled when the VMA is unlinked
from the RB tree. That is checked using the vm_rb field's value, and
managed using RCU.

> Then:
> speculative_fault_begin(struct vm_area_struct *vma,
>                          struct spec_vmf *spvmf)
> {
>      ...
>      spvmf->seqcount = vma->seqcount;
>      smp_rmb();
>      spvmf->vm_flags = vma->vm_flags;
>      if (!spvmf->vm_flags) {
>          // Back off the vma is dying ...
>          ...
>      }
> }
> 
> bool speculative_fault_commit(struct vm_area_struct *vma,
>                                struct spec_vmf *spvmf)
> {
>      ...
>      seqcount = vma->seqcount;
>      smp_rmb();
>      vm_flags = vma->vm_flags;
> 
>      if (spvmf->vm_flags != vm_flags || seqcount != spvmf->seqcount) {
>          // Something did change for the vma
>          return false;
>      }
>      return true;
> }
> 
> This would also avoid the lockdep issue described below. But maybe what
> i propose is stupid and i will see it after further reviewing thing.

It's true that lockdep is quite annoying here. But it is still worth
keeping in the loop to catch two consecutive write_seqcount_begin() calls
being made in the same context (which would lead to an even sequence
counter value while a write operation is in progress). So I think it is
still a good thing to have lockdep available here.
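
The even-counter problem can be shown with a few lines of plain C. This
is a sketch under stated assumptions: demo_seqcount and the demo_*
helpers are hypothetical stand-ins that only model the increment, with
no barriers and no lockdep annotation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in that models only the counter increment. */
struct demo_seqcount {
	unsigned int sequence;
};

static void demo_write_begin(struct demo_seqcount *s)
{
	s->sequence++;
}

static void demo_write_end(struct demo_seqcount *s)
{
	s->sequence++;
}

/* A reader treats an even value as "no writer in progress". */
static bool demo_looks_idle(const struct demo_seqcount *s)
{
	return (s->sequence & 1) == 0;
}

/* Walks through the nested-begin case and returns what a reader sees. */
static bool demo_nested_begin_looks_idle(void)
{
	struct demo_seqcount s = { 0 };

	demo_write_begin(&s);		/* 1: odd, the writer is visible */
	assert(!demo_looks_idle(&s));
	demo_write_begin(&s);		/* 2: even, although still writing */
	return demo_looks_idle(&s);
}
```

demo_nested_begin_looks_idle() returns true: after the second begin the
counter is even again, so a reader would wrongly conclude that no write
is in progress.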




Jerome Glisse April 22, 2019, 3:51 p.m. UTC | #3

On Fri, Apr 19, 2019 at 05:45:57PM +0200, Laurent Dufour wrote:
> Hi Jerome,
> 
> Thanks a lot for reviewing this series.
> 
> On 19/04/2019 at 00:48, Jerome Glisse wrote:
> > On Tue, Apr 16, 2019 at 03:45:00PM +0200, Laurent Dufour wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > > 
> > > Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> > > counts such that we can easily test if a VMA is changed.
> > > 
> > > The calls to vm_write_begin/end() in unmap_page_range() are
> > > used to detect when a VMA is being unmap and thus that new page fault
> > > should not be satisfied for this VMA. If the seqcount hasn't changed when
> > > the page table are locked, this means we are safe to satisfy the page
> > > fault.
> > > 
> > > The flip side is that we cannot distinguish between a vma_adjust() and
> > > the unmap_page_range() -- where with the former we could have
> > > re-checked the vma bounds against the address.
> > > 
> > > The VMA's sequence counter is also used to detect change to various VMA's
> > > fields used during the page fault handling, such as:
> > >   - vm_start, vm_end
> > >   - vm_pgoff
> > >   - vm_flags, vm_page_prot
> > >   - vm_policy
> > 
> > ^ All above are under mmap write lock ?
> 
> Yes, changes are still made under the protection of the mmap_sem.
> 
> > 
> > >   - anon_vma
> > 
> > ^ This is either under mmap write lock or under page table lock
> > 
> > So my question is do we need the complexity of seqcount_t for this ?
> 
> The sequence counter is used to detect write operation done while readers
> (SPF handler) is running.
> 
> The implementation is quite simple (here without the lockdep checks):
> 
> static inline void raw_write_seqcount_begin(seqcount_t *s)
> {
> 	s->sequence++;
> 	smp_wmb();
> }
> 
> I can't see why this is too complex here, would you elaborate on this ?
> 
> > 
> > It seems that using regular int as counter and also relying on vm_flags
> > when vma is unmap should do the trick.
> 
> vm_flags is not enough I guess an some operation are not impacting the
> vm_flags at all (resizing for instance).
> Am I missing something ?
> 
> > 
> > vma_delete(struct vm_area_struct *vma)
> > {
> >      ...
> >      /*
> >       * Make sure the vma is mark as invalid ie neither read nor write
> >       * so that speculative fault back off. A racing speculative fault
> >       * will either see the flags as 0 or the new seqcount.
> >       */
> >      vma->vm_flags = 0;
> >      smp_wmb();
> >      vma->seqcount++;
> >      ...
> > }
> 
> Well I don't think we can safely clear the vm_flags this way when the VMA is
> unmap, I think it is used later when cleaning is doen.
> 
> Later in this series, the VMA deletion is managed when the VMA is unlinked
> from the RB Tree. That is checked using the vm_rb field's value, and managed
> using RCU.
> 
> > Then:
> > speculative_fault_begin(struct vm_area_struct *vma,
> >                          struct spec_vmf *spvmf)
> > {
> >      ...
> >      spvmf->seqcount = vma->seqcount;
> >      smp_rmb();
> >      spvmf->vm_flags = vma->vm_flags;
> >      if (!spvmf->vm_flags) {
> >          // Back off the vma is dying ...
> >          ...
> >      }
> > }
> > 
> > bool speculative_fault_commit(struct vm_area_struct *vma,
> >                                struct spec_vmf *spvmf)
> > {
> >      ...
> >      seqcount = vma->seqcount;
> >      smp_rmb();
> >      vm_flags = vma->vm_flags;
> > 
> >      if (spvmf->vm_flags != vm_flags || seqcount != spvmf->seqcount) {
> >          // Something did change for the vma
> >          return false;
> >      }
> >      return true;
> > }
> > 
> > This would also avoid the lockdep issue described below. But maybe what
> > i propose is stupid and i will see it after further reviewing thing.
> 
> That's true that the lockdep is quite annoying here. But it is still
> interesting to keep in the loop to avoid 2 subsequent write_seqcount_begin()
> call being made in the same context (which would lead to an even sequence
> counter value while write operation is in progress). So I think this is
> still a good thing to have lockdep available here.

OK, so I had to read everything, and I should have done so before asking
all of the above. It does look good in fact; what worried me in this patch
is all the lockdep avoidance, as that is usually a red flag.

But after thinking long and hard, I do not see how to easily solve that
one, as unmap_page_range() is in so many different paths... So what is done
in this patch is the sanest thing. Sorry for the noise.

So for this patch:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2ceb1d2869a6..906b9e06f18e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1410,6 +1410,9 @@  struct zap_details {
 static inline void INIT_VMA(struct vm_area_struct *vma)
 {
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	seqcount_init(&vma->vm_sequence);
+#endif
 }
 
 struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1534,6 +1537,47 @@  static inline void unmap_shared_mapping_range(struct address_space *mapping,
 	unmap_mapping_range(mapping, holebegin, holelen, 0);
 }
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+	write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+					 int subclass)
+{
+	write_seqcount_begin_nested(&vma->vm_sequence, subclass);
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+	write_seqcount_end(&vma->vm_sequence);
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+	raw_write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+	raw_write_seqcount_end(&vma->vm_sequence);
+}
+#else
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+					 int subclass)
+{
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
 		void *buf, int len, unsigned int gup_flags);
 extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd7d38ee2e33..e78f72eb2576 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -337,6 +337,9 @@  struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	seqcount_t vm_sequence;
+#endif
 } __randomize_layout;
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index d5bebca47d98..423fa8ea0569 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1256,6 +1256,7 @@  void unmap_page_range(struct mmu_gather *tlb,
 	unsigned long next;
 
 	BUG_ON(addr >= end);
+	vm_write_begin(vma);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
 	do {
@@ -1265,6 +1266,7 @@  void unmap_page_range(struct mmu_gather *tlb,
 		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
+	vm_write_end(vma);
 }
 
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 5ad3a3228d76..a4e4d52a5148 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -726,6 +726,30 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	/*
+	 * Why use vm_raw_write*() here? To avoid a lockdep warning.
+	 *
+	 * Lockdep is complaining about a theoretical lock dependency involving
+	 * 3 locks:
+	 *   mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim
+	 *
+	 * Here are the major paths leading to this dependency:
+	 *  1. __vma_adjust() mmap_sem  -> vm_sequence -> i_mmap_rwsem
+	 *  2. move_vmap() mmap_sem -> vm_sequence -> fs_reclaim
+	 *  3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem
+	 *  4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence
+	 *
+	 * There is no easy way to solve this, especially because in
+	 * unmap_mapping_range() the i_mmap_rwsem is grabbed while the impacted
+	 * VMAs are not yet known.
+	 * However, the way vm_sequence is used guarantees that we will
+	 * never block on it, since we just check its value and never wait
+	 * for it to move; see vma_has_changed() and handle_speculative_fault().
+	 */
+	vm_raw_write_begin(vma);
+	if (next)
+		vm_raw_write_begin(next);
+
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
@@ -950,6 +974,8 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "vma->vm_next" gap must be updated.
 			 */
 			next = vma->vm_next;
+			if (next)
+				vm_raw_write_begin(next);
 		} else {
 			/*
 			 * For the scope of the comment "next" and
@@ -996,6 +1022,10 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (insert && file)
 		uprobe_mmap(insert);
 
+	if (next && next != vma)
+		vm_raw_write_end(next);
+	vm_raw_write_end(vma);
+
 	validate_mm(mm);
 
 	return 0;
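
The code comment in __vma_adjust() says the sequence count is only checked
and never waited on. A minimal user-space sketch of that kind of reader might
look as follows. The names toy_vma, toy_snapshot() and toy_has_changed() are
hypothetical and chosen for illustration; this is not the series'
vma_has_changed() or handle_speculative_fault(), and it uses C11 atomics
instead of the kernel's seqcount primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields a speculative reader cares about. */
struct toy_vma {
	atomic_uint sequence;	/* plays the role of vma->vm_sequence */
	atomic_ulong vm_flags;
};

/*
 * Take a snapshot of the VMA. If a writer is active (odd counter), give up
 * immediately instead of spinning until the counter moves.
 */
static bool toy_snapshot(struct toy_vma *vma, unsigned int *seq,
			 unsigned long *flags)
{
	*seq = atomic_load(&vma->sequence);
	if (*seq & 1)
		return false;
	*flags = atomic_load(&vma->vm_flags);
	return true;
}

/* Re-check later: any counter movement invalidates the snapshot. */
static bool toy_has_changed(struct toy_vma *vma, unsigned int seq)
{
	return atomic_load(&vma->sequence) != seq;
}
```

Because the reader either bails out or compares values, it never blocks on
the counter. A caller that gets false from toy_snapshot(), or true from
toy_has_changed(), would simply fall back to a slower, fully locked path.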
