[3/6] shm: add memfd_create() syscall

Message ID 1395256011-2423-4-git-send-email-dh.herrmann@gmail.com (mailing list archive)
State New, archived

Commit Message

David Herrmann March 19, 2014, 7:06 p.m. UTC
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
that you can pass to mmap(). It explicitly allows sealing and
avoids any connection to user-visible mount-points. Thus, it is not
subject to quotas on mounted file-systems and can be used like
malloc()'ed memory, except that it comes with a file-descriptor.

memfd_create() does not create a front-FD, but instead returns the raw
shmem file, so calls like ftruncate() can be used. Calls like fstat()
also return proper information and mark the file as a regular file. Sealing
is explicitly supported on memfds.

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
subject to quotas and the like.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h         |  1 +
 include/uapi/linux/memfd.h       |  9 ++++++
 kernel/sys_ni.c                  |  1 +
 mm/shmem.c                       | 67 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 80 insertions(+)
 create mode 100644 include/uapi/linux/memfd.h
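
For illustration, a minimal userspace sketch of the behaviour the commit
message describes: create a memfd, size it with ftruncate(), check it with
fstat() and mmap() it. It assumes the two-argument form
memfd_create(name, flags) that was eventually merged upstream, invoked
through syscall(2), not the three-argument form with a size proposed in
this patch. memfd_roundtrip() is a hypothetical helper name, and the
MFD_CLOEXEC fallback value is the one from the header added by this patch.

```c
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001	/* value from include/uapi/linux/memfd.h */
#endif

/* Hypothetical helper: returns 0 on success or a negative errno value. */
int memfd_roundtrip(size_t size)
{
	struct stat st;
	char *p;
	int fd, r = 0;

	/* assumption: upstream (name, flags) signature, no size argument */
	fd = (int)syscall(SYS_memfd_create, "example", MFD_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* the fd is the raw shmem file, so ftruncate() sets its size */
	if (ftruncate(fd, (off_t)size) < 0) {
		r = -errno;
		goto out;
	}

	/* fstat() should report a regular file of the requested size */
	if (fstat(fd, &st) < 0) {
		r = -errno;
		goto out;
	}
	if (!S_ISREG(st.st_mode) || st.st_size != (off_t)size) {
		r = -EIO;
		goto out;
	}

	/* map it shared, like any other file, and touch the memory */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		r = -errno;
		goto out;
	}
	memcpy(p, "hello", 6);
	munmap(p, size);
out:
	close(fd);
	return r;
}
```

Calling memfd_roundtrip(4096) returns 0 on a kernel that provides the
syscall, and a negative errno value (for example -ENOSYS) otherwise.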

Comments

Cyrill Gorcunov March 20, 2014, 8:47 a.m. UTC | #1
On Wed, Mar 19, 2014 at 08:06:48PM +0100, David Herrmann wrote:
> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It explicitly allows sealing and
> avoids any connection to user-visible mount-points. Thus, it's not
> subject to quotas on mounted file-systems, but can be used like
> malloc()'ed memory, but with a file-descriptor to it.
> 
> memfd_create() does not create a front-FD, but instead returns the raw
> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
> will return proper information and mark the file as regular file. Sealing
> is explicitly supported on memfds.
> 
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and alike.

Unless I'm missing something obvious, this looks similar to the
/proc/pid/map_files feature. Pavel?
Pavel Emelyanov March 20, 2014, 9:01 a.m. UTC | #2
On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
> On Wed, Mar 19, 2014 at 08:06:48PM +0100, David Herrmann wrote:
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It explicitly allows sealing and
>> avoids any connection to user-visible mount-points. Thus, it's not
>> subject to quotas on mounted file-systems, but can be used like
>> malloc()'ed memory, but with a file-descriptor to it.
>>
>> memfd_create() does not create a front-FD, but instead returns the raw
>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>> will return proper information and mark the file as regular file. Sealing
>> is explicitly supported on memfds.
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and alike.
> 
> If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files
> feature, Pavel?

Thanks, Cyrill.

It is, but map_files works "in the opposite direction" :) In the memfd
case one first gets an FD, then mmap()s it; in the /proc/pid/map_files case one
first mmap()s a region, then opens it via /proc/self/map_files.

But I don't know whether this matters.

Thanks,
Pavel
David Herrmann March 20, 2014, 11:29 a.m. UTC | #3
Hi

On Thu, Mar 20, 2014 at 10:01 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
> On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
>> If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files
>> feature, Pavel?
>
> It is, but the map_files will work "in the opposite direction" :) In the memfd
> case one first gets an FD, then mmap()s it; in the /proc/pid/map_files case one
> should first mmap() a region, then open it via /proc/self/map_files.
>
> But I don't know whether this matters.

Yes, you can replace memfd_create() so far with:
  p = mmap(NULL, size, ..., MAP_ANON | MAP_SHARED, -1, 0);
  sprintf(path, "/proc/self/map_files/%lx-%lx", (unsigned long)p, (unsigned long)p + size);
  fd = open(path, O_RDWR);

However, map_files is only enabled with CONFIG_CHECKPOINT_RESTORE, the
/proc/pid/map_files/ directory is root-only (at least I get EPERM if
non-root), it doesn't provide the "name" argument which is very handy
for debugging, it doesn't explicitly support sealing (it requires
MAP_ANON to be backed by shmem) and it's a very weird API for
something this simple.
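
A minimal sketch of the path construction used in the snippet above, with
the pointer-to-integer casts that %lx needs spelled out. map_files_path()
is a hypothetical helper name; it only formats the path and does not open
it, since opening depends on CONFIG_CHECKPOINT_RESTORE and on the
permissions mentioned above.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: formats the map_files entry for the mapping
 * [start, start + size). Returns the number of characters written
 * (excluding the NUL), or -1 if the buffer is too small.
 */
int map_files_path(char *buf, size_t buflen, const void *start, size_t size)
{
	unsigned long lo = (unsigned long)start;
	unsigned long hi = lo + size;
	int n = snprintf(buf, buflen, "/proc/self/map_files/%lx-%lx", lo, hi);

	return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}
```

The resulting string is what would be passed to open(path, O_RDWR) after
the MAP_ANON | MAP_SHARED mapping has been created.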

Thanks
David
Pavel Emelyanov March 20, 2014, 11:50 a.m. UTC | #4
On 03/20/2014 03:29 PM, David Herrmann wrote:
> Hi
> 
> On Thu, Mar 20, 2014 at 10:01 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>> On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
>>> If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files
>>> feature, Pavel?
>>
>> It is, but the map_files will work "in the opposite direction" :) In the memfd
>> case one first gets an FD, then mmap()s it; in the /proc/pid/map_files case one
>> should first mmap() a region, then open it via /proc/self/map_files.
>>
>> But I don't know whether this matters.
> 
> Yes, you can replace memfd_create() so far with:
>   p = mmap(NULL, size, ..., MAP_ANON | MAP_SHARED, -1, 0);
>   sprintf(path, "/proc/self/map_files/%lx-%lx", p, p + size);
>   fd = open(path, O_RDWR);
> 
> However, map_files is only enabled with CONFIG_CHECKPOINT_RESTORE, the
> /proc/pid/map_files/ directory is root-only (at least I get EPERM if
> non-root),

Yes. But this is something we'd also like to have fixed :) Having two
parties that want the same thing makes it easier for the patch to get accepted.

> it doesn't provide the "name" argument which is very handy
> for debugging,

What if we make mmap's shmem_zero_setup() generate a meaningful name,
would it solve the debugging issue?

> it doesn't explicitly support sealing (it requires MAP_ANON to be backed 
> by shmem)

Can you elaborate on this? The fd generated by sys_memfd() will be
shmem-backed, and so will the file opened via the map_files link for the
MAP_ANON | MAP_SHARED mapping. So what would prevent it from supporting
sealing?

> and it's a very weird API for something this simple.

:)

Thanks,
Pavel
John Stultz March 20, 2014, 7:22 p.m. UTC | #5
On 03/19/2014 12:06 PM, David Herrmann wrote:
> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It explicitly allows sealing and
> avoids any connection to user-visible mount-points. Thus, it's not
> subject to quotas on mounted file-systems, but can be used like
> malloc()'ed memory, but with a file-descriptor to it.
>
> memfd_create() does not create a front-FD, but instead returns the raw
> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
> will return proper information and mark the file as regular file. Sealing
> is explicitly supported on memfds.
>
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and alike.

This syscall would also be useful to Android, since it would satisfy the
requirement for providing atomically unlinked tmpfs fds that ashmem
provides (although upstreamed solutions to ashmem's other
functionalities are still needed).

My only comment is that I think memfd_* is sort of a new namespace.
Since this is providing shmem files, it seems it might be better named
something like shmfd_create() or my earlier suggestion of shmget_fd(). 
Otherwise, when talking about functionality like sealing, which is only
available on shmfs, we'll have to say "shmfs/tmpfs/memfd" or risk
confusing folks who might not initially grasp that it's all the same
underneath.

thanks
-john
Konstantin Khlebnikov April 2, 2014, 1:38 p.m. UTC | #6
On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It explicitly allows sealing and
> avoids any connection to user-visible mount-points. Thus, it's not
> subject to quotas on mounted file-systems, but can be used like
> malloc()'ed memory, but with a file-descriptor to it.
>
> memfd_create() does not create a front-FD, but instead returns the raw
> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
> will return proper information and mark the file as regular file. Sealing
> is explicitly supported on memfds.
>
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and alike.

Instead of adding a new syscall, we could extend the existing openat() a
little bit more:

openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)
David Herrmann April 2, 2014, 2:18 p.m. UTC | #7
Hi

On Wed, Apr 2, 2014 at 3:38 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It explicitly allows sealing and
>> avoids any connection to user-visible mount-points. Thus, it's not
>> subject to quotas on mounted file-systems, but can be used like
>> malloc()'ed memory, but with a file-descriptor to it.
>>
>> memfd_create() does not create a front-FD, but instead returns the raw
>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>> will return proper information and mark the file as regular file. Sealing
>> is explicitly supported on memfds.
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and alike.
>
> Instead of adding new syscall we can extend existing openat() a little
> bit more:
>
> openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)

O_TMPFILE requires an existing directory as "name". So you have to use:
  open("/run/", O_TMPFILE | O_RDWR, 0666)
instead of
  open("/run/new_file", O_TMPFILE | O_RDWR, 0666)

We _really_ want to set a name for the inode, though. Otherwise,
debug-info via /proc/pid/fd/ is useless.
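
To illustrate why the name matters for debugging, here is a minimal sketch
that creates a memfd and reads back its /proc/self/fd/ link target. It
assumes the two-argument memfd_create(name, flags) form that was merged
upstream, called through syscall(2), and a mounted /proc.
memfd_proc_name() is a hypothetical helper name.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001	/* value from include/uapi/linux/memfd.h */
#endif

/*
 * Hypothetical helper: creates a memfd called `name` and copies the
 * /proc/self/fd/<fd> symlink target into `buf`.
 * Returns 0 on success or a negative errno value.
 */
int memfd_proc_name(const char *name, char *buf, size_t buflen)
{
	char path[64];
	ssize_t n;
	int fd, err;

	if (buflen == 0)
		return -EINVAL;

	/* assumption: upstream (name, flags) signature */
	fd = (int)syscall(SYS_memfd_create, name, MFD_CLOEXEC);
	if (fd < 0)
		return -errno;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	n = readlink(path, buf, buflen - 1);
	err = errno;
	close(fd);
	if (n < 0)
		return -err;

	buf[n] = '\0';	/* readlink() does not terminate the string */
	return 0;
}
```

With the "memfd:" prefix that the patch adds to the inode name, the link
target for a memfd named "demo" contains "memfd:demo", which is what makes
the descriptor identifiable in /proc/pid/fd/.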

Furthermore, Linus requested to allow sealing only on files that
_explicitly_ allow sealing. So v2 of this series will have
MFD_ALLOW_SEALING as memfd_create() flag. I don't think we can do this
with linkat() (or is that meant to be implicit for the new AT_FDSHM?).
Last but not least, you now need a separate syscall to set the
file-size.

I could live with most of these issues, except for the name-thing. Ideas?

Thanks
David
Konstantin Khlebnikov April 2, 2014, 2:52 p.m. UTC | #8
On Wed, Apr 2, 2014 at 6:18 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
> Hi
>
> On Wed, Apr 2, 2014 at 3:38 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
>>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>>> that you can pass to mmap(). It explicitly allows sealing and
>>> avoids any connection to user-visible mount-points. Thus, it's not
>>> subject to quotas on mounted file-systems, but can be used like
>>> malloc()'ed memory, but with a file-descriptor to it.
>>>
>>> memfd_create() does not create a front-FD, but instead returns the raw
>>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>>> will return proper information and mark the file as regular file. Sealing
>>> is explicitly supported on memfds.
>>>
>>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>>> subject to quotas and alike.
>>
>> Instead of adding new syscall we can extend existing openat() a little
>> bit more:
>>
>> openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)
>
> O_TMPFILE requires an existing directory as "name". So you have to use:
>   open("/run/", O_TMPFILE | O_RDWR, 0666)
> instead of
>   open("/run/new_file", O_TMPFILE | O_RDWR, 0666)
>
> We _really_ want to set a name for the inode, though. Otherwise,
> debug-info via /proc/pid/fd/ is useless.
>
> Furthermore, Linus requested to allow sealing only on files that
> _explicitly_ allow sealing. So v2 of this series will have
> MFD_ALLOW_SEALING as memfd_create() flag. I don't think we can do this
> with linkat() (or is that meant to be implicit for the new AT_FDSHM?).
> Last but not least, you now need a separate syscall to set the
> file-size.
>
> I could live with most of these issues, except for the name-thing. Ideas?

Hmm, why can't the AT_FDSHM + O_TMPFILE pair have different naming behavior?
Actually, the O_TMPFILE flag is optional here. AT_FDSHM is enough, but
O_TMPFILE allows moving the branching out of the common fast paths and
hiding it inside do_tmpfile.

BTW you can set some extended attribute via fsetxattr and distinguish
files in proc by its value.

Or you could add an fcntl() for changing the 'name' of tmpfiles. In
combination with AT_FDSHM this would give a complete solution without
changing the O_TMPFILE naming scheme. But one syscall turns into three. )

--
Andy Lutomirski April 10, 2014, 7:07 p.m. UTC | #9
On 04/02/2014 06:38 AM, Konstantin Khlebnikov wrote:
> On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It explicitly allows sealing and
>> avoids any connection to user-visible mount-points. Thus, it's not
>> subject to quotas on mounted file-systems, but can be used like
>> malloc()'ed memory, but with a file-descriptor to it.
>>
>> memfd_create() does not create a front-FD, but instead returns the raw
>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>> will return proper information and mark the file as regular file. Sealing
>> is explicitly supported on memfds.
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and alike.
> 
> Instead of adding new syscall we can extend existing openat() a little
> bit more:
> 
> openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)

Please don't.  O_TMPFILE is a messy enough API, and the last thing we
need to do is to extend it.  If we want a fancy API for creating new
inodes with no corresponding dentry, let's create one.

Otherwise, let's just stick with a special-purpose API for these shm files.

--Andy

Patch

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 96bc506..c943b8a 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -359,3 +359,4 @@ 
 350	i386	finit_module		sys_finit_module
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
+353	i386	memfd_create		sys_memfd_create
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..e9d56a8 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@ 
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	memfd_create		sys_memfd_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a747a77..124b838 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -791,6 +791,7 @@  asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_memfd_create(const char __user *uname_ptr, u64 size, u64 flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
new file mode 100644
index 0000000..d74cc89
--- /dev/null
+++ b/include/uapi/linux/memfd.h
@@ -0,0 +1,9 @@ 
+#ifndef _UAPI_LINUX_MEMFD_H
+#define _UAPI_LINUX_MEMFD_H
+
+#include <linux/types.h>
+
+/* flags for memfd_create(2) */
+#define MFD_CLOEXEC		0x0001
+
+#endif /* _UAPI_LINUX_MEMFD_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052..53e05af 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -193,6 +193,7 @@  cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_memfd_create);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/shmem.c b/mm/shmem.c
index 44d7f3b..48feb42 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,7 +66,9 @@  static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/syscalls.h>
 #include <linux/fcntl.h>
+#include <uapi/linux/memfd.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -3039,6 +3041,71 @@  out4:
 	return error;
 }
 
+/* maximum length of memfd names */
+#define MFD_MAX_NAMELEN 256
+
+SYSCALL_DEFINE3(memfd_create,
+		const char __user *, uname,
+		u64, size,
+		u64, flags)
+{
+	struct file *shm;
+	char *name;
+	int fd, r;
+	long len;
+
+	if (flags & ~(u64)MFD_CLOEXEC)
+		return -EINVAL;
+	if ((u64)(loff_t)size != size || (loff_t)size < 0)
+		return -EINVAL;
+
+	/* length includes terminating zero */
+	len = strnlen_user(uname, MFD_MAX_NAMELEN);
+	if (len <= 0)
+		return -EFAULT;
+	else if (len > MFD_MAX_NAMELEN)
+		return -EINVAL;
+
+	name = kmalloc(len + 6, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	strcpy(name, "memfd:");
+	if (copy_from_user(&name[6], uname, len)) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	/* terminating-zero may have changed after strnlen_user() returned */
+	if (name[len + 6 - 1]) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
+	if (fd < 0) {
+		r = fd;
+		goto err_name;
+	}
+
+	shm = shmem_file_setup(name, size, 0);
+	if (IS_ERR(shm)) {
+		r = PTR_ERR(shm);
+		goto err_fd;
+	}
+	shm->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+
+	fd_install(fd, shm);
+	kfree(name);
+	return fd;
+
+err_fd:
+	put_unused_fd(fd);
+err_name:
+	kfree(name);
+	return r;
+}
+
 #else /* !CONFIG_SHMEM */
 
 /*
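
The name handling in the syscall above can be modelled in userspace as
follows. This is only a sketch: build_memfd_name() is a hypothetical
helper, malloc() and strnlen() stand in for kmalloc() and strnlen_user(),
and the re-check of the terminating zero is omitted because the input
cannot change underneath a plain userspace caller.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MFD_MAX_NAMELEN	256		/* from the patch; includes the NUL */
#define MFD_PREFIX	"memfd:"
#define MFD_PREFIX_LEN	(sizeof(MFD_PREFIX) - 1)

/*
 * Hypothetical userspace model: stores a malloc()ed "memfd:<uname>"
 * string in *out. Returns 0 on success or a negative errno value.
 */
int build_memfd_name(const char *uname, char **out)
{
	size_t len;
	char *name;

	/* like strnlen_user(): the length counts the terminating zero */
	len = strnlen(uname, MFD_MAX_NAMELEN) + 1;
	if (len > MFD_MAX_NAMELEN)
		return -EINVAL;

	name = malloc(MFD_PREFIX_LEN + len);
	if (!name)
		return -ENOMEM;

	memcpy(name, MFD_PREFIX, MFD_PREFIX_LEN);
	memcpy(name + MFD_PREFIX_LEN, uname, len);	/* copies the NUL too */
	*out = name;
	return 0;
}
```

Under this model, names of up to 255 bytes are accepted and anything
longer yields -EINVAL, matching the len > MFD_MAX_NAMELEN check in the
patch.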