diff mbox series

[v2,2/2] mm: adds NOSIGBUS extension to mmap()

Message ID 1622792602-40459-3-git-send-email-mlin@kernel.org (mailing list archive)
State New
Headers show
Series mm: support NOSIGBUS on fault of mmap | expand

Commit Message

Ming Lin June 4, 2021, 7:43 a.m. UTC
Adds new flag MAP_NOSIGBUS of mmap() to specify the behavior of
"don't SIGBUS on fault". Right now, this flag is only allowed
for private mapping.

For MAP_NOSIGBUS mapping, map in the zero page on read fault
or fill a freshly allocated page with zeroes on write fault.

Signed-off-by: Ming Lin <mlin@kernel.org>
---
 arch/parisc/include/uapi/asm/mman.h          |  1 +
 include/linux/mm.h                           |  2 ++
 include/linux/mman.h                         |  1 +
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/memory.c                                  | 11 +++++++++++
 mm/mmap.c                                    |  4 ++++
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 7 files changed, 21 insertions(+)

Comments

Kirill A. Shutemov June 4, 2021, 3:24 p.m. UTC | #1
On Fri, Jun 04, 2021 at 12:43:22AM -0700, Ming Lin wrote:
> Adds new flag MAP_NOSIGBUS of mmap() to specify the behavior of
> "don't SIGBUS on fault". Right now, this flag is only allowed
> for private mapping.

That's not what your use case asks for.

SIGBUS can be generated for a number of reasons, not only on fault beyond
end-of-file. vmf_error() would convert any errno, except ENOMEM to
VM_FAULT_SIGBUS.

Do you want to ignore -EIO or -ENOSPC? I don't think so.

> For MAP_NOSIGBUS mapping, map in the zero page on read fault
> or fill a freshly allocated page with zeroes on write fault.

I don't like the resulting semantics: if you had a read fault beyond EOF
and got zero page, you will still see zero page even if the file grows.
Yes, it's allowed by POSIX for MAP_PRIVATE to get out-of-sync with the
file, but it's not what users used to.

It might be enough for the use case, but I would rather avoid one-user
features.
Ming Lin June 4, 2021, 4:22 p.m. UTC | #2
On Fri, Jun 04, 2021 at 06:24:07PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 04, 2021 at 12:43:22AM -0700, Ming Lin wrote:
> > Adds new flag MAP_NOSIGBUS of mmap() to specify the behavior of
> > "don't SIGBUS on fault". Right now, this flag is only allowed
> > for private mapping.
> 
> That's not what your use case asks for.

Simon explained the use case here: https://bit.ly/3wR85Lc

FYI, I copied here too.

------begin-------------------------------------------------------------------
Regarding the requirements for Wayland:

- The baseline requirement is being able to avoid SIGBUS for read-only mappings
  of shm files.
- Wayland clients can expand their shm files. However the compositor doesn't
  need to immediately access the new expanded region. The client will tell the
  compositor what the new shm file size is, and the compositor will re-map it.
- Ideally, MAP_NOSIGBUS would work on PROT_WRITE + MAP_SHARED mappings (of
  course, the no-SIGBUS behavior would be restricted to that mapping). The
  use-case is writing back to client buffers e.g. for screen capture. From the
  earlier discussions it seems like this would be complicated to implement.
  This means we'll need to come up with a new libwayland API to allow
  compositors to opt-in to the read-only mappings. This is sub-optimal but
  seems doable.
- Ideally, MAP_SIGBUS wouldn't be restricted to shm. There are use-cases for
  using it on ordinary files too, e.g. for sharing ICC profiles. But from the
  earlier replies it seems very unlikely that this will become possible, and
  making it work only on shm files would already be fantastic.
------end-------------------------------------------------------------------

> 
> SIGBUS can be generated for a number of reasons, not only on fault beyond
> end-of-file. vmf_error() would convert any errno, except ENOMEM to
> VM_FAULT_SIGBUS.
> 
> Do you want to ignore -EIO or -ENOSPC? I don't think so.
> 
> > For MAP_NOSIGBUS mapping, map in the zero page on read fault
> > or fill a freshly allocated page with zeroes on write fault.
> 
> I don't like the resulting semantics: if you had a read fault beyond EOF
> and got zero page, you will still see zero page even if the file grows.
> Yes, it's allowed by POSIX for MAP_PRIVATE to get out-of-sync with the
> file, but it's not what users used to.

Actually old version did support file grows.
https://github.com/minggr/linux/commit/77f3722b94ff33cafe0a72c1bf1b8fa374adb29f

We can support this if there is real use case.

> 
> It might be enough for the use case, but I would rather avoid one-user
> features.
Vivek Goyal June 28, 2021, 2:27 p.m. UTC | #3
On Fri, Jun 04, 2021 at 12:43:22AM -0700, Ming Lin wrote:
> Adds new flag MAP_NOSIGBUS of mmap() to specify the behavior of
> "don't SIGBUS on fault". Right now, this flag is only allowed
> for private mapping.
> 
> For MAP_NOSIGBUS mapping, map in the zero page on read fault
> or fill a freshly allocated page with zeroes on write fault.

I am wondering if this could be of limited use for me if MAP_NOSIGBUS
were to be supported for shared mappings as well.

When virtiofs is run with dax enabled, then it is possible that if
a file is shared between two guests, then one guest truncates the
file and second guest tries to do load/store operation. Given current
kvm architecture, there is no mechanism to propagate SIGBUS to guest
process, instead KVM retries page fault infinitely and guest cpu/process
hangs.

Ideally we want this error to propagate all the way back into the
guest and to the guest process but that solution is not in place yet.

https://lore.kernel.org/kvm/20200406190951.GA19259@redhat.com/

In the absense of a proper solution, one could think of mapping
shared file on host with MAP_NOSIGBUS, and hopefully that means
kvm will be able to resolve fault to a zero filled page and guest
will not hang. But this means that data sharing between two processes
is now broken. Writes by process A will not be visible to process B
in another once this situation happens, IIUC.

So if we were to MAP_NOSIGBUS, guest will not hang but failures resulting
from ftruncate will be silent and will be noticed sometime later. I guess
not exactly a very pleasant scenario...

Thanks
Vivek



> 
> Signed-off-by: Ming Lin <mlin@kernel.org>
> ---
>  arch/parisc/include/uapi/asm/mman.h          |  1 +
>  include/linux/mm.h                           |  2 ++
>  include/linux/mman.h                         |  1 +
>  include/uapi/asm-generic/mman-common.h       |  1 +
>  mm/memory.c                                  | 11 +++++++++++
>  mm/mmap.c                                    |  4 ++++
>  tools/include/uapi/asm-generic/mman-common.h |  1 +
>  7 files changed, 21 insertions(+)
> 
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index ab78cba..eecf9af 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -25,6 +25,7 @@
>  #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
>  #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
>  #define MAP_FIXED_NOREPLACE 0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
> +#define MAP_NOSIGBUS	0x200000	/* do not SIGBUS on fault */
>  #define MAP_UNINITIALIZED 0		/* uninitialized anonymous mmap */
>  
>  #define MS_SYNC		1		/* synchronous memory sync */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9e86ca1..100d122 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -373,6 +373,8 @@ int __add_to_page_cache_locked(struct page *page, struct address_space *mapping,
>  # define VM_UFFD_MINOR		VM_NONE
>  #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
>  
> +#define VM_NOSIGBUS		VM_FLAGS_BIT(38)	/* Do not SIGBUS on fault */
> +
>  /* Bits set in the VMA until the stack is in its final location */
>  #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
>  
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index b2cbae9..c966b08 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -154,6 +154,7 @@ static inline bool arch_validate_flags(unsigned long flags)
>  	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
>  	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>  	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
> +	       _calc_vm_trans(flags, MAP_NOSIGBUS,   VM_NOSIGBUS  ) |
>  	       arch_calc_vm_flag_bits(flags);
>  }
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d..a2a5333 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -29,6 +29,7 @@
>  #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
>  #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
>  #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
> +#define MAP_NOSIGBUS		0x200000	/* do not SIGBUS on fault */
>  
>  #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
>  					 * uninitialized */
> diff --git a/mm/memory.c b/mm/memory.c
> index 8d5e583..6b5a897 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3676,6 +3676,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	}
>  
>  	ret = vma->vm_ops->fault(vmf);
> +	if (unlikely(ret & VM_FAULT_SIGBUS) && (vma->vm_flags & VM_NOSIGBUS)) {
> +		/*
> +		 * For MAP_NOSIGBUS mapping, map in the zero page on read fault
> +		 * or fill a freshly allocated page with zeroes on write fault
> +		 */
> +		ret = do_anonymous_page(vmf);
> +		if (!ret)
> +			ret = VM_FAULT_NOPAGE;
> +		return ret;
> +	}
> +
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>  			    VM_FAULT_DONE_COW)))
>  		return ret;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8bed547..d5c9fb5 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1419,6 +1419,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	if (!len)
>  		return -EINVAL;
>  
> +	/* Restrict MAP_NOSIGBUS to MAP_PRIVATE mapping */
> +	if ((flags & MAP_NOSIGBUS) && !(flags & MAP_PRIVATE))
> +		return -EINVAL;
> +
>  	/*
>  	 * Does the application expect PROT_READ to imply PROT_EXEC?
>  	 *
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index f94f65d..a2a5333 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -29,6 +29,7 @@
>  #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
>  #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
>  #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
> +#define MAP_NOSIGBUS		0x200000	/* do not SIGBUS on fault */
>  
>  #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
>  					 * uninitialized */
> -- 
> 1.8.3.1
>
Ming Lin June 30, 2021, 4:37 p.m. UTC | #4
O Mon, Jun 28, 2021 at 10:27:23AM -0400, Vivek Goyal wrote:
> On Fri, Jun 04, 2021 at 12:43:22AM -0700, Ming Lin wrote:
> > Adds new flag MAP_NOSIGBUS of mmap() to specify the behavior of
> > "don't SIGBUS on fault". Right now, this flag is only allowed
> > for private mapping.
> > 
> > For MAP_NOSIGBUS mapping, map in the zero page on read fault
> > or fill a freshly allocated page with zeroes on write fault.
> 
> I am wondering if this could be of limited use for me if MAP_NOSIGBUS
> were to be supported for shared mappings as well.

V1 did support shared mapping.
https://lkml.org/lkml/2021/6/1/1078

And V0 even supported unmapping the zero page for later write.
https://github.com/minggr/linux/commit/77f3722b94ff33cafe0a72c1bf1b8fa374adb29f

We may support shared mapping if there is a real use case.
As Hugh mentioned:
> And by restricting to MAP_PRIVATE, you would allow for adding a
> proper MAP_SHARED implementation later, if it's thought useful
> (that being the implementation which can subsequently unmap a
> zero page to let new page cache be mapped).

See https://lkml.org/lkml/2021/6/1/1258

Ming

> 
> When virtiofs is run with dax enabled, then it is possible that if
> a file is shared between two guests, then one guest truncates the
> file and second guest tries to do load/store operation. Given current
> kvm architecture, there is no mechanism to propagate SIGBUS to guest
> process, instead KVM retries page fault infinitely and guest cpu/process
> hangs.
> 
> Ideally we want this error to propagate all the way back into the
> guest and to the guest process but that solution is not in place yet.
> 
> https://lore.kernel.org/kvm/20200406190951.GA19259@redhat.com/
> 
> In the absense of a proper solution, one could think of mapping
> shared file on host with MAP_NOSIGBUS, and hopefully that means
> kvm will be able to resolve fault to a zero filled page and guest
> will not hang. But this means that data sharing between two processes
> is now broken. Writes by process A will not be visible to process B
> in another once this situation happens, IIUC.
> 
> So if we were to MAP_NOSIGBUS, guest will not hang but failures resulting
> from ftruncate will be silent and will be noticed sometime later. I guess
> not exactly a very pleasant scenario...
> 
> Thanks
> Vivek
diff mbox series

Patch

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index ab78cba..eecf9af 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -25,6 +25,7 @@ 
 #define MAP_STACK	0x40000		/* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB	0x80000		/* create a huge page mapping */
 #define MAP_FIXED_NOREPLACE 0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_NOSIGBUS	0x200000	/* do not SIGBUS on fault */
 #define MAP_UNINITIALIZED 0		/* uninitialized anonymous mmap */
 
 #define MS_SYNC		1		/* synchronous memory sync */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9e86ca1..100d122 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -373,6 +373,8 @@  int __add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 # define VM_UFFD_MINOR		VM_NONE
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
 
+#define VM_NOSIGBUS		VM_FLAGS_BIT(38)	/* Do not SIGBUS on fault */
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
 
diff --git a/include/linux/mman.h b/include/linux/mman.h
index b2cbae9..c966b08 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -154,6 +154,7 @@  static inline bool arch_validate_flags(unsigned long flags)
 	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
 	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
 	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
+	       _calc_vm_trans(flags, MAP_NOSIGBUS,   VM_NOSIGBUS  ) |
 	       arch_calc_vm_flag_bits(flags);
 }
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d..a2a5333 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@ 
 #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
 #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_NOSIGBUS		0x200000	/* do not SIGBUS on fault */
 
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/mm/memory.c b/mm/memory.c
index 8d5e583..6b5a897 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3676,6 +3676,17 @@  static vm_fault_t __do_fault(struct vm_fault *vmf)
 	}
 
 	ret = vma->vm_ops->fault(vmf);
+	if (unlikely(ret & VM_FAULT_SIGBUS) && (vma->vm_flags & VM_NOSIGBUS)) {
+		/*
+		 * For MAP_NOSIGBUS mapping, map in the zero page on read fault
+		 * or fill a freshly allocated page with zeroes on write fault
+		 */
+		ret = do_anonymous_page(vmf);
+		if (!ret)
+			ret = VM_FAULT_NOPAGE;
+		return ret;
+	}
+
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
 		return ret;
diff --git a/mm/mmap.c b/mm/mmap.c
index 8bed547..d5c9fb5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1419,6 +1419,10 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (!len)
 		return -EINVAL;
 
+	/* Restrict MAP_NOSIGBUS to MAP_PRIVATE mapping */
+	if ((flags & MAP_NOSIGBUS) && !(flags & MAP_PRIVATE))
+		return -EINVAL;
+
 	/*
 	 * Does the application expect PROT_READ to imply PROT_EXEC?
 	 *
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index f94f65d..a2a5333 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@ 
 #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
 #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_NOSIGBUS		0x200000	/* do not SIGBUS on fault */
 
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */