[v3,09/10] virtiofsd: Optionally fill lo_inode.fhandle

Message ID	20210730150134.216126-10-mreitz@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=HQuH=MW=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 1B17E60F5C From: Max Reitz <mreitz@redhat.com> To: qemu-devel@nongnu.org, virtio-fs@redhat.com Subject: [PATCH v3 09/10] virtiofsd: Optionally fill lo_inode.fhandle Date: Fri, 30 Jul 2021 17:01:33 +0200 Message-Id: <20210730150134.216126-10-mreitz@redhat.com> In-Reply-To: <20210730150134.216126-1-mreitz@redhat.com> References: <20210730150134.216126-1-mreitz@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="US-ASCII" Received-SPF: pass client-ip=216.205.24.124; envelope-from=mreitz@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -34 X-Spam_score: -3.5 X-Spam_bar: --- X-Spam_report: (-3.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.717, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Cc: Stefan Hajnoczi <stefanha@redhat.com>, "Dr . David Alan Gilbert" <dgilbert@redhat.com>, Vivek Goyal <vgoyal@redhat.com>, Max Reitz <mreitz@redhat.com> Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	virtiofsd: Allow using file handles instead of O_PATH FDs \| expand [v3,00/10] virtiofsd: Allow using file handles instead of O_PATH FDs [v3,01/10] virtiofsd: Limit setxattr()'s creds-dropped region [v3,02/10] virtiofsd: Add TempFd structure [v3,03/10] virtiofsd: Use lo_inode_open() instead of openat() [v3,04/10] virtiofsd: Add lo_inode_fd() helper [v3,05/10] virtiofsd: Let lo_fd() return a TempFd [v3,06/10] virtiofsd: Let lo_inode_open() return a TempFd [v3,07/10] virtiofsd: Add lo_inode.fhandle [v3,08/10] virtiofsd: Add inodes_by_handle hash table [v3,09/10] virtiofsd: Optionally fill lo_inode.fhandle [v3,10/10] virtiofsd: Add lazy lo_do_find()

Max Reitz July 30, 2021, 3:01 p.m. UTC

When the inode_file_handles option is set, try to generate a file handle
for new inodes instead of opening an O_PATH FD.

Being able to open these again will require CAP_DAC_READ_SEARCH, so the
description text tells the user they will also need to specify
-o modcaps=+dac_read_search.

Generating a file handle returns the mount ID it is valid for.  Opening
it will require an FD instead.  We have mount_fds to map an ID to an FD.
get_file_handle() fills the hash map by opening the file we have
generated a handle for.  To verify that the resulting FD indeed
represents the handle's mount ID, we use statx().  Therefore, using file
handles requires statx() support.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 tools/virtiofsd/helper.c              |   3 +
 tools/virtiofsd/passthrough_ll.c      | 194 ++++++++++++++++++++++++--
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 3 files changed, 190 insertions(+), 8 deletions(-)

Vivek Goyal Aug. 9, 2021, 6:41 p.m. UTC | #1

On Fri, Jul 30, 2021 at 05:01:33PM +0200, Max Reitz wrote:
> When the inode_file_handles option is set, try to generate a file handle
> for new inodes instead of opening an O_PATH FD.
> 
> Being able to open these again will require CAP_DAC_READ_SEARCH, so the
> description text tells the user they will also need to specify
> -o modcaps=+dac_read_search.
> 
> Generating a file handle returns the mount ID it is valid for.  Opening
> it will require an FD instead.  We have mount_fds to map an ID to an FD.
> get_file_handle() fills the hash map by opening the file we have
> generated a handle for.  To verify that the resulting FD indeed
> represents the handle's mount ID, we use statx().  Therefore, using file
> handles requires statx() support.

So opening the file and storing that fd in mount_fds table might be
a potential problem with inotify work Ioannis is doing.

So say a file foo.txt was opened O_RDONLY and fd stored in mount_fs. Now
say user unlinks foo.txt. If notifications are enabled, final notification
will not be generated till this mount_fds fd is closed.

Now question is when will this fd be closed? If it closed at some
later point and then notification is generated, that will break
notificaitons.

In fact even O_PATH fd is delaying notifications due to same reason.
But its not too bad as we close O_PATH fd pretty quickly after
unlinking. And we were hoping that file handle support will get rid
of this problem because we will not keep O_PATH fd open.

But, IIUC, mount_fds stuff will make it even worse. I did not see 
the code which removes this fd from mount_fds. So I am not sure what's
the life time of this fd.

Thanks
Vivek

> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  tools/virtiofsd/helper.c              |   3 +
>  tools/virtiofsd/passthrough_ll.c      | 194 ++++++++++++++++++++++++--
>  tools/virtiofsd/passthrough_seccomp.c |   1 +
>  3 files changed, 190 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index a8295d975a..aa63a21d43 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -187,6 +187,9 @@ void fuse_cmdline_help(void)
>             "                               default: no_allow_direct_io\n"
>             "    -o announce_submounts      Announce sub-mount points to the guest\n"
>             "    -o posix_acl/no_posix_acl  Enable/Disable posix_acl. (default: disabled)\n"
> +           "    -o inode_file_handles      Use file handles to reference inodes\n"
> +           "                               instead of O_PATH file descriptors\n"
> +           "                               (requires -o modcaps=+dac_read_search)\n"
>             );
>  }
>  
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index f9d8b2f134..ac95961d12 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -194,6 +194,7 @@ struct lo_data {
>      /* If set, virtiofsd is responsible for setting umask during creation */
>      bool change_umask;
>      int user_posix_acl, posix_acl;
> +    int inode_file_handles;
>  };
>  
>  /**
> @@ -250,6 +251,10 @@ static const struct fuse_opt lo_opts[] = {
>      { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
>      { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
>      { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
> +    { "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 },
> +    { "no_inode_file_handles",
> +      offsetof(struct lo_data, inode_file_handles),
> +      0 },
>      FUSE_OPT_END
>  };
>  static bool use_syslog = false;
> @@ -321,6 +326,135 @@ static int temp_fd_steal(TempFd *temp_fd)
>      }
>  }
>  
> +/**
> + * Generate a file handle for the given dirfd/name combination.
> + *
> + * If mount_fds does not yet contain an entry for the handle's mount
> + * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds
> + * as the FD for that mount ID.  (That is the file that we have
> + * generated a handle for, so it should be representative for the
> + * mount ID.  However, to be sure (and to rule out races), we use
> + * statx() to verify that our assumption is correct.)
> + */
> +static struct lo_fhandle *get_file_handle(struct lo_data *lo,
> +                                          int dirfd, const char *name)
> +{
> +    /* We need statx() to verify the mount ID */
> +#if defined(CONFIG_STATX) && defined(STATX_MNT_ID)
> +    struct lo_fhandle *fh;
> +    int ret;
> +
> +    if (!lo->use_statx || !lo->inode_file_handles) {
> +        return NULL;
> +    }
> +
> +    fh = g_new0(struct lo_fhandle, 1);
> +
> +    fh->handle.handle_bytes = sizeof(fh->padding) - sizeof(fh->handle);
> +    ret = name_to_handle_at(dirfd, name, &fh->handle, &fh->mount_id,
> +                            AT_EMPTY_PATH);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +
> +    if (pthread_rwlock_rdlock(&mount_fds_lock)) {
> +        goto fail;
> +    }
> +    if (!g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) {
> +        g_auto(TempFd) path_fd = TEMP_FD_INIT;
> +        struct statx stx;
> +        char procname[64];
> +        int fd;
> +
> +        pthread_rwlock_unlock(&mount_fds_lock);
> +
> +        /*
> +         * Before opening an O_RDONLY fd, check whether dirfd/name is a regular
> +         * file or directory, because we must not open anything else with
> +         * anything but O_PATH.
> +         * (And we use that occasion to verify that the file has the mount ID we
> +         * need.)
> +         */
> +        if (name[0]) {
> +            path_fd.fd = openat(dirfd, name, O_PATH);
> +            if (path_fd.fd < 0) {
> +                goto fail;
> +            }
> +            path_fd.owned = true;
> +        } else {
> +            path_fd.fd = dirfd;
> +            path_fd.owned = false;
> +        }
> +
> +        ret = statx(path_fd.fd, "", AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW,
> +                    STATX_TYPE | STATX_MNT_ID, &stx);
> +        if (ret < 0) {
> +            if (errno == ENOSYS) {
> +                lo->use_statx = false;
> +                fuse_log(FUSE_LOG_WARNING,
> +                         "statx() does not work: Will not be able to use file "
> +                         "handles for inodes\n");
> +            }
> +            goto fail;
> +        }
> +        if (!(stx.stx_mask & STATX_MNT_ID) || stx.stx_mnt_id != fh->mount_id) {
> +            /*
> +             * One reason for stx_mnt_id != mount_id could be that dirfd/name
> +             * is a directory, and some other filesystem was mounted there
> +             * between us generating the file handle and then opening the FD.
> +             * (Other kinds of races might be possible, too.)
> +             * Failing this function is not fatal, though, because our caller
> +             * (lo_do_lookup()) will just fall back to opening an O_PATH FD to
> +             * store in lo_inode.fd instead of storing a file handle in
> +             * lo_inode.fhandle.  So we do not need to try too hard to get an
> +             * FD for fh->mount_id so this function could succeed.
> +             */
> +            goto fail;
> +        }
> +        if (!(stx.stx_mask & STATX_TYPE) ||
> +            !(S_ISREG(stx.stx_mode) || S_ISDIR(stx.stx_mode)))
> +        {
> +            /*
> +             * We must not open special files with anything but O_PATH, so we
> +             * cannot use this file for mount_fds.
> +             * Just return a failure in such a case and let the lo_inode have
> +             * an O_PATH fd instead of a file handle.
> +             */
> +            goto fail;
> +        }
> +
> +        /* Now that we know this fd is safe to open, do it */
> +        snprintf(procname, sizeof(procname), "%i", path_fd.fd);
> +        fd = openat(lo->proc_self_fd, procname, O_RDONLY);
> +        if (fd < 0) {
> +            goto fail;
> +        }
> +
> +        if (pthread_rwlock_wrlock(&mount_fds_lock)) {
> +            goto fail;
> +        }
> +
> +        /* Check again, might have changed */
> +        if (g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) {
> +            close(fd);
> +        } else {
> +            g_hash_table_insert(mount_fds,
> +                                GINT_TO_POINTER(fh->mount_id),
> +                                GINT_TO_POINTER(fd));
> +        }
> +    }
> +    pthread_rwlock_unlock(&mount_fds_lock);
> +
> +    return fh;
> +
> +fail:
> +    free(fh);
> +    return NULL;
> +#else /* defined(CONFIG_STATX) && defined(STATX_MNT_ID) */
> +    return NULL;
> +#endif
> +}
> +
>  /**
>   * Open the given file handle with the given flags.
>   *
> @@ -1165,6 +1299,11 @@ static int do_statx(struct lo_data *lo, int dirfd, const char *pathname,
>              return -1;
>          }
>          lo->use_statx = false;
> +        if (lo->inode_file_handles) {
> +            fuse_log(FUSE_LOG_WARNING,
> +                     "statx() does not work: Will not be able to use file "
> +                     "handles for inodes\n");
> +        }
>          /* fallback */
>      }
>  #endif
> @@ -1194,6 +1333,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>      struct lo_data *lo = lo_data(req);
>      struct lo_inode *inode = NULL;
>      struct lo_inode *dir = lo_inode(req, parent);
> +    struct lo_fhandle *fh;
>  
>      if (inodep) {
>          *inodep = NULL; /* in case there is an error */
> @@ -1223,13 +1363,21 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>          goto out;
>      }
>  
> -    newfd = openat(dir_fd.fd, name, O_PATH | O_NOFOLLOW);
> -    if (newfd == -1) {
> -        goto out_err;
> +    fh = get_file_handle(lo, dir_fd.fd, name);
> +    if (!fh) {
> +        newfd = openat(dir_fd.fd, name, O_PATH | O_NOFOLLOW);
> +        if (newfd == -1) {
> +            goto out_err;
> +        }
>      }
>  
> -    res = do_statx(lo, newfd, "", &e->attr, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW,
> -                   &mnt_id);
> +    if (newfd >= 0) {
> +        res = do_statx(lo, newfd, "", &e->attr,
> +                       AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, &mnt_id);
> +    } else {
> +        res = do_statx(lo, dir_fd.fd, name, &e->attr,
> +                       AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, &mnt_id);
> +    }
>      if (res == -1) {
>          goto out_err;
>      }
> @@ -1239,9 +1387,19 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>          e->attr_flags |= FUSE_ATTR_SUBMOUNT;
>      }
>  
> -    inode = lo_find(lo, NULL, &e->attr, mnt_id);
> +    /*
> +     * Note that fh is always NULL if lo->inode_file_handles is false,
> +     * and so we will never do a lookup by file handle here, and
> +     * lo->inodes_by_handle will always remain empty.  We only need
> +     * this map when we do not have an O_PATH fd open for every
> +     * lo_inode, though, so if inode_file_handles is false, we do not
> +     * need that map anyway.
> +     */
> +    inode = lo_find(lo, fh, &e->attr, mnt_id);
>      if (inode) {
> -        close(newfd);
> +        if (newfd != -1) {
> +            close(newfd);
> +        }
>      } else {
>          inode = calloc(1, sizeof(struct lo_inode));
>          if (!inode) {
> @@ -1259,6 +1417,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>  
>          inode->nlookup = 1;
>          inode->fd = newfd;
> +        inode->fhandle = fh;
>          inode->key.ino = e->attr.st_ino;
>          inode->key.dev = e->attr.st_dev;
>          inode->key.mnt_id = mnt_id;
> @@ -1270,6 +1429,9 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>          pthread_mutex_lock(&lo->mutex);
>          inode->fuse_ino = lo_add_inode_mapping(req, inode);
>          g_hash_table_insert(lo->inodes_by_ids, &inode->key, inode);
> +        if (inode->fhandle) {
> +            g_hash_table_insert(lo->inodes_by_handle, inode->fhandle, inode);
> +        }
>          pthread_mutex_unlock(&lo->mutex);
>      }
>      e->ino = inode->fuse_ino;
> @@ -1615,6 +1777,7 @@ static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent,
>      int res;
>      uint64_t mnt_id;
>      struct stat attr;
> +    struct lo_fhandle *fh;
>      struct lo_data *lo = lo_data(req);
>      struct lo_inode *dir = lo_inode(req, parent);
>      struct lo_inode *inode = NULL;
> @@ -1628,12 +1791,16 @@ static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent,
>          goto out;
>      }
>  
> +    fh = get_file_handle(lo, dir_fd.fd, name);
> +    /* Ignore errors, this is just an optional key for the lookup */
> +
>      res = do_statx(lo, dir_fd.fd, name, &attr, AT_SYMLINK_NOFOLLOW, &mnt_id);
>      if (res == -1) {
>          goto out;
>      }
>  
> -    inode = lo_find(lo, NULL, &attr, mnt_id);
> +    inode = lo_find(lo, fh, &attr, mnt_id);
> +    g_free(fh);
>  
>  out:
>      lo_inode_put(lo, &dir);
> @@ -1801,6 +1968,9 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
>      if (!inode->nlookup) {
>          lo_map_remove(&lo->ino_map, inode->fuse_ino);
>          g_hash_table_remove(lo->inodes_by_ids, &inode->key);
> +        if (inode->fhandle) {
> +            g_hash_table_remove(lo->inodes_by_handle, inode->fhandle);
> +        }
>          if (lo->posix_lock) {
>              if (g_hash_table_size(inode->posix_locks)) {
>                  fuse_log(FUSE_LOG_WARNING, "Hash table is not empty\n");
> @@ -4362,6 +4532,14 @@ int main(int argc, char *argv[])
>  
>      lo.use_statx = true;
>  
> +#if !defined(CONFIG_STATX) || !defined(STATX_MNT_ID)
> +    if (lo.inode_file_handles) {
> +        fuse_log(FUSE_LOG_WARNING,
> +                 "No statx() or mount ID support: Will not be able to use file "
> +                 "handles for inodes\n");
> +    }
> +#endif
> +
>      se = fuse_session_new(&args, &lo_oper, sizeof(lo_oper), &lo);
>      if (se == NULL) {
>          goto err_out1;
> diff --git a/tools/virtiofsd/passthrough_seccomp.c b/tools/virtiofsd/passthrough_seccomp.c
> index af04c638cb..ab4dc07e3f 100644
> --- a/tools/virtiofsd/passthrough_seccomp.c
> +++ b/tools/virtiofsd/passthrough_seccomp.c
> @@ -73,6 +73,7 @@ static const int syscall_allowlist[] = {
>      SCMP_SYS(mprotect),
>      SCMP_SYS(mremap),
>      SCMP_SYS(munmap),
> +    SCMP_SYS(name_to_handle_at),
>      SCMP_SYS(newfstatat),
>      SCMP_SYS(statx),
>      SCMP_SYS(open),
> -- 
> 2.31.1
>

Hanna Czenczek Aug. 10, 2021, 8:32 a.m. UTC | #2

On 09.08.21 20:41, Vivek Goyal wrote:
> On Fri, Jul 30, 2021 at 05:01:33PM +0200, Max Reitz wrote:
>> When the inode_file_handles option is set, try to generate a file handle
>> for new inodes instead of opening an O_PATH FD.
>>
>> Being able to open these again will require CAP_DAC_READ_SEARCH, so the
>> description text tells the user they will also need to specify
>> -o modcaps=+dac_read_search.
>>
>> Generating a file handle returns the mount ID it is valid for.  Opening
>> it will require an FD instead.  We have mount_fds to map an ID to an FD.
>> get_file_handle() fills the hash map by opening the file we have
>> generated a handle for.  To verify that the resulting FD indeed
>> represents the handle's mount ID, we use statx().  Therefore, using file
>> handles requires statx() support.
> So opening the file and storing that fd in mount_fds table might be
> a potential problem with inotify work Ioannis is doing.
>
> So say a file foo.txt was opened O_RDONLY and fd stored in mount_fs. Now
> say user unlinks foo.txt. If notifications are enabled, final notification
> will not be generated till this mount_fds fd is closed.
>
> Now question is when will this fd be closed? If it closed at some
> later point and then notification is generated, that will break
> notificaitons.

Currently, it is never closed.

> In fact even O_PATH fd is delaying notifications due to same reason.
> But its not too bad as we close O_PATH fd pretty quickly after
> unlinking. And we were hoping that file handle support will get rid
> of this problem because we will not keep O_PATH fd open.
>
> But, IIUC, mount_fds stuff will make it even worse. I did not see
> the code which removes this fd from mount_fds. So I am not sure what's
> the life time of this fd.

The lifetime is forever.  If we wanted to remove it at some point, we’d 
need to track how many file handles we have open for the given mount fd 
and then remove it from the table once the count reaches 0, so it would 
still be delayed.

I think in practice the first thing that is looked up from some mount 
will probably be the root directory, which cannot be deleted before 
everything else on the mount is gone, so that would work.  We track how 
many handles are there, if the whole mount were to be deleted, I hope 
all lo_inodes are evicted, the count goes to 0, and we can drop the 
mount fd.

I think we can make the assumption that the mount fd is the root 
directory certain by, well, looking into mountinfo...  That would result 
in us always opening the root node of the filesystem, so that first the 
whole filesystem needs to disappear before it can be deleted (and our 
mount fd closed) – which should work, I guess?

It’s a bit tricky because our sandboxing prevents easy access to 
mountinfo, but if that’s the only way...

Hanna

Vivek Goyal Aug. 10, 2021, 3:23 p.m. UTC | #3

On Tue, Aug 10, 2021 at 10:32:55AM +0200, Hanna Reitz wrote:
> On 09.08.21 20:41, Vivek Goyal wrote:
> > On Fri, Jul 30, 2021 at 05:01:33PM +0200, Max Reitz wrote:
> > > When the inode_file_handles option is set, try to generate a file handle
> > > for new inodes instead of opening an O_PATH FD.
> > > 
> > > Being able to open these again will require CAP_DAC_READ_SEARCH, so the
> > > description text tells the user they will also need to specify
> > > -o modcaps=+dac_read_search.
> > > 
> > > Generating a file handle returns the mount ID it is valid for.  Opening
> > > it will require an FD instead.  We have mount_fds to map an ID to an FD.
> > > get_file_handle() fills the hash map by opening the file we have
> > > generated a handle for.  To verify that the resulting FD indeed
> > > represents the handle's mount ID, we use statx().  Therefore, using file
> > > handles requires statx() support.
> > So opening the file and storing that fd in mount_fds table might be
> > a potential problem with inotify work Ioannis is doing.
> > 
> > So say a file foo.txt was opened O_RDONLY and fd stored in mount_fs. Now
> > say user unlinks foo.txt. If notifications are enabled, final notification
> > will not be generated till this mount_fds fd is closed.
> > 
> > Now question is when will this fd be closed? If it closed at some
> > later point and then notification is generated, that will break
> > notificaitons.
> 
> Currently, it is never closed.
> 
> > In fact even O_PATH fd is delaying notifications due to same reason.
> > But its not too bad as we close O_PATH fd pretty quickly after
> > unlinking. And we were hoping that file handle support will get rid
> > of this problem because we will not keep O_PATH fd open.
> > 
> > But, IIUC, mount_fds stuff will make it even worse. I did not see
> > the code which removes this fd from mount_fds. So I am not sure what's
> > the life time of this fd.
> 
> The lifetime is forever.  If we wanted to remove it at some point, we’d need
> to track how many file handles we have open for the given mount fd and then
> remove it from the table once the count reaches 0, so it would still be
> delayed.
> 
> I think in practice the first thing that is looked up from some mount will
> probably be the root directory, which cannot be deleted before everything
> else on the mount is gone, so that would work.  We track how many handles
> are there, if the whole mount were to be deleted, I hope all lo_inodes are
> evicted, the count goes to 0, and we can drop the mount fd.

Keeping a reference count on mount_fd object make sense. So we probably
maintain this hash table and lookup using mount_id (as you are already
doing). All subsequent inodes from same filesystem will use same
object. Once all inodes have been flushed out, then mount_fd object
should go away as well (allowing for unmount on host).

> 
> I think we can make the assumption that the mount fd is the root directory
> certain by, well, looking into mountinfo...  That would result in us always
> opening the root node of the filesystem, so that first the whole filesystem
> needs to disappear before it can be deleted (and our mount fd closed) –
> which should work, I guess?

This seems more reasonable. And I think that's what man page seems to 
suggest.

       The  mount_id  argument  returns an identifier for the filesystem mount
       that corresponds to pathname.  This corresponds to the first  field  in
       one  of  the  records in /proc/self/mountinfo.  Opening the pathname in
       the fifth field of that record yields a file descriptor for  the  mount
       point;  that  file  descriptor  can  be  used  in  a subsequent call to
       open_by_handle_at().

Fifth field seems to be the mount point. man proc says.

              (5)  mount  point:  the  pathname of the mount point relative to
                   the process's root directory.

So opening mount point and saving as mount_fd (if it is not already
in hash table) and then take a per inode reference count on mount_fd
object looks like will solve the life time issue of mount_fd as
well as the issue of temporary failures arising because we can't
open a device special file.

> 
> It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
> but if that’s the only way...

yes. We already have lo->proc_self_fd. Maybe we need to keep
/proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
that any mount table changes will still be visible despite the fact
I have fd open (and don't have to open new fd to notice new mount/unmount
changes).

Vivek

Hanna Czenczek Aug. 10, 2021, 3:26 p.m. UTC | #4

On 10.08.21 17:23, Vivek Goyal wrote:
> On Tue, Aug 10, 2021 at 10:32:55AM +0200, Hanna Reitz wrote:
>> On 09.08.21 20:41, Vivek Goyal wrote:
>>> On Fri, Jul 30, 2021 at 05:01:33PM +0200, Max Reitz wrote:
>>>> When the inode_file_handles option is set, try to generate a file handle
>>>> for new inodes instead of opening an O_PATH FD.
>>>>
>>>> Being able to open these again will require CAP_DAC_READ_SEARCH, so the
>>>> description text tells the user they will also need to specify
>>>> -o modcaps=+dac_read_search.
>>>>
>>>> Generating a file handle returns the mount ID it is valid for.  Opening
>>>> it will require an FD instead.  We have mount_fds to map an ID to an FD.
>>>> get_file_handle() fills the hash map by opening the file we have
>>>> generated a handle for.  To verify that the resulting FD indeed
>>>> represents the handle's mount ID, we use statx().  Therefore, using file
>>>> handles requires statx() support.
>>> So opening the file and storing that fd in mount_fds table might be
>>> a potential problem with inotify work Ioannis is doing.
>>>
>>> So say a file foo.txt was opened O_RDONLY and fd stored in mount_fs. Now
>>> say user unlinks foo.txt. If notifications are enabled, final notification
>>> will not be generated till this mount_fds fd is closed.
>>>
>>> Now question is when will this fd be closed? If it closed at some
>>> later point and then notification is generated, that will break
>>> notificaitons.
>> Currently, it is never closed.
>>
>>> In fact even O_PATH fd is delaying notifications due to same reason.
>>> But its not too bad as we close O_PATH fd pretty quickly after
>>> unlinking. And we were hoping that file handle support will get rid
>>> of this problem because we will not keep O_PATH fd open.
>>>
>>> But, IIUC, mount_fds stuff will make it even worse. I did not see
>>> the code which removes this fd from mount_fds. So I am not sure what's
>>> the life time of this fd.
>> The lifetime is forever.  If we wanted to remove it at some point, we’d need
>> to track how many file handles we have open for the given mount fd and then
>> remove it from the table once the count reaches 0, so it would still be
>> delayed.
>>
>> I think in practice the first thing that is looked up from some mount will
>> probably be the root directory, which cannot be deleted before everything
>> else on the mount is gone, so that would work.  We track how many handles
>> are there, if the whole mount were to be deleted, I hope all lo_inodes are
>> evicted, the count goes to 0, and we can drop the mount fd.
> Keeping a reference count on mount_fd object make sense. So we probably
> maintain this hash table and lookup using mount_id (as you are already
> doing). All subsequent inodes from same filesystem will use same
> object. Once all inodes have been flushed out, then mount_fd object
> should go away as well (allowing for unmount on host).
>
>> I think we can make the assumption that the mount fd is the root directory
>> certain by, well, looking into mountinfo...  That would result in us always
>> opening the root node of the filesystem, so that first the whole filesystem
>> needs to disappear before it can be deleted (and our mount fd closed) –
>> which should work, I guess?
> This seems more reasonable. And I think that's what man page seems to
> suggest.
>
>         The  mount_id  argument  returns an identifier for the filesystem mount
>         that corresponds to pathname.  This corresponds to the first  field  in
>         one  of  the  records in /proc/self/mountinfo.  Opening the pathname in
>         the fifth field of that record yields a file descriptor for  the  mount
>         point;  that  file  descriptor  can  be  used  in  a subsequent call to
>         open_by_handle_at().
>
> Fifth field seems to be the mount point. man proc says.
>
>                (5)  mount  point:  the  pathname of the mount point relative to
>                     the process's root directory.
>
> So opening mount point and saving as mount_fd (if it is not already
> in hash table) and then take a per inode reference count on mount_fd
> object looks like will solve the life time issue of mount_fd as
> well as the issue of temporary failures arising because we can't
> open a device special file.

Well, we’ve had this discussion before, and it’s possible that a 
filesystem has a device file as its mount point.

But given the inotify complications, there’s really a good reason we 
should use mountinfo.

>> It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
>> but if that’s the only way...
> yes. We already have lo->proc_self_fd. Maybe we need to keep
> /proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
> that any mount table changes will still be visible despite the fact
> I have fd open (and don't have to open new fd to notice new mount/unmount
> changes).

Well, yes, that was my idea.  Unfortunately, I wasn’t quite successful 
yet; when I tried keeping the fd open, reading from it would just return 
0 bytes.  Perhaps that’s because we bind-mount /proc/self/fd to /proc so 
that nothing else in /proc is visible. Perhaps we need to bind-mount 
/proc/self/mountinfo into /proc/self/fd before that...

I’ll just have to try.

Hanna

Vivek Goyal Aug. 10, 2021, 3:57 p.m. UTC | #5

On Tue, Aug 10, 2021 at 05:26:15PM +0200, Hanna Reitz wrote:
> On 10.08.21 17:23, Vivek Goyal wrote:
> > On Tue, Aug 10, 2021 at 10:32:55AM +0200, Hanna Reitz wrote:
> > > On 09.08.21 20:41, Vivek Goyal wrote:
> > > > On Fri, Jul 30, 2021 at 05:01:33PM +0200, Max Reitz wrote:
> > > > > When the inode_file_handles option is set, try to generate a file handle
> > > > > for new inodes instead of opening an O_PATH FD.
> > > > > 
> > > > > Being able to open these again will require CAP_DAC_READ_SEARCH, so the
> > > > > description text tells the user they will also need to specify
> > > > > -o modcaps=+dac_read_search.
> > > > > 
> > > > > Generating a file handle returns the mount ID it is valid for.  Opening
> > > > > it will require an FD instead.  We have mount_fds to map an ID to an FD.
> > > > > get_file_handle() fills the hash map by opening the file we have
> > > > > generated a handle for.  To verify that the resulting FD indeed
> > > > > represents the handle's mount ID, we use statx().  Therefore, using file
> > > > > handles requires statx() support.
> > > > So opening the file and storing that fd in mount_fds table might be
> > > > a potential problem with inotify work Ioannis is doing.
> > > > 
> > > > So say a file foo.txt was opened O_RDONLY and fd stored in mount_fs. Now
> > > > say user unlinks foo.txt. If notifications are enabled, final notification
> > > > will not be generated till this mount_fds fd is closed.
> > > > 
> > > > Now question is when will this fd be closed? If it closed at some
> > > > later point and then notification is generated, that will break
> > > > notificaitons.
> > > Currently, it is never closed.
> > > 
> > > > In fact even O_PATH fd is delaying notifications due to same reason.
> > > > But its not too bad as we close O_PATH fd pretty quickly after
> > > > unlinking. And we were hoping that file handle support will get rid
> > > > of this problem because we will not keep O_PATH fd open.
> > > > 
> > > > But, IIUC, mount_fds stuff will make it even worse. I did not see
> > > > the code which removes this fd from mount_fds. So I am not sure what's
> > > > the life time of this fd.
> > > The lifetime is forever.  If we wanted to remove it at some point, we’d need
> > > to track how many file handles we have open for the given mount fd and then
> > > remove it from the table once the count reaches 0, so it would still be
> > > delayed.
> > > 
> > > I think in practice the first thing that is looked up from some mount will
> > > probably be the root directory, which cannot be deleted before everything
> > > else on the mount is gone, so that would work.  We track how many handles
> > > are there, if the whole mount were to be deleted, I hope all lo_inodes are
> > > evicted, the count goes to 0, and we can drop the mount fd.
> > Keeping a reference count on mount_fd object make sense. So we probably
> > maintain this hash table and lookup using mount_id (as you are already
> > doing). All subsequent inodes from same filesystem will use same
> > object. Once all inodes have been flushed out, then mount_fd object
> > should go away as well (allowing for unmount on host).
> > 
> > > I think we can make the assumption that the mount fd is the root directory
> > > certain by, well, looking into mountinfo...  That would result in us always
> > > opening the root node of the filesystem, so that first the whole filesystem
> > > needs to disappear before it can be deleted (and our mount fd closed) –
> > > which should work, I guess?
> > This seems more reasonable. And I think that's what man page seems to
> > suggest.
> > 
> >         The  mount_id  argument  returns an identifier for the filesystem mount
> >         that corresponds to pathname.  This corresponds to the first  field  in
> >         one  of  the  records in /proc/self/mountinfo.  Opening the pathname in
> >         the fifth field of that record yields a file descriptor for  the  mount
> >         point;  that  file  descriptor  can  be  used  in  a subsequent call to
> >         open_by_handle_at().
> > 
> > Fifth field seems to be the mount point. man proc says.
> > 
> >                (5)  mount  point:  the  pathname of the mount point relative to
> >                     the process's root directory.
> > 
> > So opening mount point and saving as mount_fd (if it is not already
> > in hash table) and then take a per inode reference count on mount_fd
> > object looks like will solve the life time issue of mount_fd as
> > well as the issue of temporary failures arising because we can't
> > open a device special file.
> 
> Well, we’ve had this discussion before, and it’s possible that a filesystem
> has a device file as its mount point.

Yes. I think you did modified fuse to do some special trickery. Not sure
where should that be fixed. 

If filesystem is faking, then it can fake a device node as regular
file and fool us into opening it as well?

> 
> But given the inotify complications, there’s really a good reason we should
> use mountinfo.
> 
> > > It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
> > > but if that’s the only way...
> > yes. We already have lo->proc_self_fd. Maybe we need to keep
> > /proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
> > that any mount table changes will still be visible despite the fact
> > I have fd open (and don't have to open new fd to notice new mount/unmount
> > changes).
> 
> Well, yes, that was my idea.  Unfortunately, I wasn’t quite successful yet;
> when I tried keeping the fd open, reading from it would just return 0
> bytes.  Perhaps that’s because we bind-mount /proc/self/fd to /proc so that
> nothing else in /proc is visible. Perhaps we need to bind-mount
> /proc/self/mountinfo into /proc/self/fd before that...

Or perhaps open /proc/self/mountinfo and save fd in lo->proc_mountinfo
before /proc/self/fd is bind mounted on /proc?

Vivek

Hanna Czenczek Aug. 11, 2021, 6:41 a.m. UTC | #6

On 10.08.21 17:57, Vivek Goyal wrote:
> On Tue, Aug 10, 2021 at 05:26:15PM +0200, Hanna Reitz wrote:
>> On 10.08.21 17:23, Vivek Goyal wrote:
>>> On Tue, Aug 10, 2021 at 10:32:55AM +0200, Hanna Reitz wrote:
>>>> On 09.08.21 20:41, Vivek Goyal wrote:
>>>>> On Fri, Jul 30, 2021 at 05:01:33PM +0200, Max Reitz wrote:
>>>>>> When the inode_file_handles option is set, try to generate a file handle
>>>>>> for new inodes instead of opening an O_PATH FD.
>>>>>>
>>>>>> Being able to open these again will require CAP_DAC_READ_SEARCH, so the
>>>>>> description text tells the user they will also need to specify
>>>>>> -o modcaps=+dac_read_search.
>>>>>>
>>>>>> Generating a file handle returns the mount ID it is valid for.  Opening
>>>>>> it will require an FD instead.  We have mount_fds to map an ID to an FD.
>>>>>> get_file_handle() fills the hash map by opening the file we have
>>>>>> generated a handle for.  To verify that the resulting FD indeed
>>>>>> represents the handle's mount ID, we use statx().  Therefore, using file
>>>>>> handles requires statx() support.
>>>>> So opening the file and storing that fd in mount_fds table might be
>>>>> a potential problem with inotify work Ioannis is doing.
>>>>>
>>>>> So say a file foo.txt was opened O_RDONLY and fd stored in mount_fs. Now
>>>>> say user unlinks foo.txt. If notifications are enabled, final notification
>>>>> will not be generated till this mount_fds fd is closed.
>>>>>
>>>>> Now question is when will this fd be closed? If it closed at some
>>>>> later point and then notification is generated, that will break
>>>>> notificaitons.
>>>> Currently, it is never closed.
>>>>
>>>>> In fact even O_PATH fd is delaying notifications due to same reason.
>>>>> But its not too bad as we close O_PATH fd pretty quickly after
>>>>> unlinking. And we were hoping that file handle support will get rid
>>>>> of this problem because we will not keep O_PATH fd open.
>>>>>
>>>>> But, IIUC, mount_fds stuff will make it even worse. I did not see
>>>>> the code which removes this fd from mount_fds. So I am not sure what's
>>>>> the life time of this fd.
>>>> The lifetime is forever.  If we wanted to remove it at some point, we’d need
>>>> to track how many file handles we have open for the given mount fd and then
>>>> remove it from the table once the count reaches 0, so it would still be
>>>> delayed.
>>>>
>>>> I think in practice the first thing that is looked up from some mount will
>>>> probably be the root directory, which cannot be deleted before everything
>>>> else on the mount is gone, so that would work.  We track how many handles
>>>> are there, if the whole mount were to be deleted, I hope all lo_inodes are
>>>> evicted, the count goes to 0, and we can drop the mount fd.
>>> Keeping a reference count on mount_fd object make sense. So we probably
>>> maintain this hash table and lookup using mount_id (as you are already
>>> doing). All subsequent inodes from same filesystem will use same
>>> object. Once all inodes have been flushed out, then mount_fd object
>>> should go away as well (allowing for unmount on host).
>>>
>>>> I think we can make the assumption that the mount fd is the root directory
>>>> certain by, well, looking into mountinfo...  That would result in us always
>>>> opening the root node of the filesystem, so that first the whole filesystem
>>>> needs to disappear before it can be deleted (and our mount fd closed) –
>>>> which should work, I guess?
>>> This seems more reasonable. And I think that's what man page seems to
>>> suggest.
>>>
>>>          The  mount_id  argument  returns an identifier for the filesystem mount
>>>          that corresponds to pathname.  This corresponds to the first  field  in
>>>          one  of  the  records in /proc/self/mountinfo.  Opening the pathname in
>>>          the fifth field of that record yields a file descriptor for  the  mount
>>>          point;  that  file  descriptor  can  be  used  in  a subsequent call to
>>>          open_by_handle_at().
>>>
>>> Fifth field seems to be the mount point. man proc says.
>>>
>>>                 (5)  mount  point:  the  pathname of the mount point relative to
>>>                      the process's root directory.
>>>
>>> So opening mount point and saving as mount_fd (if it is not already
>>> in hash table) and then take a per inode reference count on mount_fd
>>> object looks like will solve the life time issue of mount_fd as
>>> well as the issue of temporary failures arising because we can't
>>> open a device special file.
>> Well, we’ve had this discussion before, and it’s possible that a filesystem
>> has a device file as its mount point.
> Yes. I think you did modified fuse to do some special trickery. Not sure
> where should that be fixed.

I used fuse, but I’m sure a non-fuse filesystem can do the same.  (I 
mean, fuse effectively is a non-fuse filesystem, too.)

I don’t think it needs to be fixed, it just means we need to continue to 
stat the mount point to verify it’s a regular file or directory.

> If filesystem is faking, then it can fake a device node as regular
> file and fool us into opening it as well?

Well, of course opening any file can have side effects, on any filesystem.

>> But given the inotify complications, there’s really a good reason we should
>> use mountinfo.
>>
>>>> It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
>>>> but if that’s the only way...
>>> yes. We already have lo->proc_self_fd. Maybe we need to keep
>>> /proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
>>> that any mount table changes will still be visible despite the fact
>>> I have fd open (and don't have to open new fd to notice new mount/unmount
>>> changes).
>> Well, yes, that was my idea.  Unfortunately, I wasn’t quite successful yet;
>> when I tried keeping the fd open, reading from it would just return 0
>> bytes.  Perhaps that’s because we bind-mount /proc/self/fd to /proc so that
>> nothing else in /proc is visible. Perhaps we need to bind-mount
>> /proc/self/mountinfo into /proc/self/fd before that...
> Or perhaps open /proc/self/mountinfo and save fd in lo->proc_mountinfo
> before /proc/self/fd is bind mounted on /proc?

Yes, I tried that, and then reading would just return 0 bytes.

Hanna

Vivek Goyal Aug. 16, 2021, 7:44 p.m. UTC | #7

On Wed, Aug 11, 2021 at 08:41:18AM +0200, Hanna Reitz wrote:

[..]
> > > But given the inotify complications, there’s really a good reason we should
> > > use mountinfo.
> > > 
> > > > > It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
> > > > > but if that’s the only way...
> > > > yes. We already have lo->proc_self_fd. Maybe we need to keep
> > > > /proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
> > > > that any mount table changes will still be visible despite the fact
> > > > I have fd open (and don't have to open new fd to notice new mount/unmount
> > > > changes).
> > > Well, yes, that was my idea.  Unfortunately, I wasn’t quite successful yet;
> > > when I tried keeping the fd open, reading from it would just return 0
> > > bytes.  Perhaps that’s because we bind-mount /proc/self/fd to /proc so that
> > > nothing else in /proc is visible. Perhaps we need to bind-mount
> > > /proc/self/mountinfo into /proc/self/fd before that...
> > Or perhaps open /proc/self/mountinfo and save fd in lo->proc_mountinfo
> > before /proc/self/fd is bind mounted on /proc?
> 
> Yes, I tried that, and then reading would just return 0 bytes.

Hi Hanna,

I tried this simple patch and I can read /proc/self/mountinfo before
bind mounting /proc/self/fd and after bind mounting /proc/self/fd. Am
I missing something.

Vivek

---
 tools/virtiofsd/passthrough_ll.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

Index: rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c
===================================================================
--- rhvgoyal-qemu.orig/tools/virtiofsd/passthrough_ll.c	2021-08-16 15:29:27.712223551 -0400
+++ rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c	2021-08-16 15:41:29.500032032 -0400
@@ -172,6 +172,7 @@ struct lo_data {
 
     /* An O_PATH file descriptor to /proc/self/fd/ */
     int proc_self_fd;
+    int proc_mountinfo;
     int user_killpriv_v2, killpriv_v2;
     /* If set, virtiofsd is responsible for setting umask during creation */
     bool change_umask;
@@ -3409,6 +3410,9 @@ static void setup_wait_parent_capabiliti
 static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
 {
     pid_t child;
+    int fd;
+    char buf[128];
+    ssize_t count;
 
     /*
      * Create a new pid namespace for *child* processes.  We'll have to
@@ -3472,6 +3476,24 @@ static void setup_namespaces(struct lo_d
         exit(1);
     }
 
+    fd = open("/proc/self/mountinfo", O_RDONLY);
+    if (fd == -1) {
+        fuse_log(FUSE_LOG_ERR, "open(/proc/self/mountinfo, O_RDONLY): %m\n");
+        exit(1);
+    }
+
+    lo->proc_mountinfo = fd;
+
+    count = read(lo->proc_mountinfo, buf, 127);
+    if (count == -1) {
+        fuse_log(FUSE_LOG_ERR, "read(/proc/self/mountinfo): %m\n");
+        exit(1);
+    }
+
+    fuse_log(FUSE_LOG_INFO, "read(%d) bytes\n", count);
+    buf[count] = '\0';
+    fuse_log(FUSE_LOG_INFO, "%s\n", buf);
+
     /*
      * We only need /proc/self/fd. Prevent ".." from accessing parent
      * directories of /proc/self/fd by bind-mounting it over /proc. Since / was
@@ -3489,6 +3511,16 @@ static void setup_namespaces(struct lo_d
         fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");
         exit(1);
     }
+
+    count = read(lo->proc_mountinfo, buf, 127);
+    if (count == -1) {
+        fuse_log(FUSE_LOG_ERR, "read(/proc/self/mountinfo): %m\n");
+        exit(1);
+    }
+
+    fuse_log(FUSE_LOG_INFO, "read(%d) bytes\n", count);
+    buf[count] = '\0';
+    fuse_log(FUSE_LOG_INFO, "%s\n", buf);
 }
 
 /*

Hanna Czenczek Aug. 17, 2021, 8:27 a.m. UTC | #8

On 16.08.21 21:44, Vivek Goyal wrote:
> On Wed, Aug 11, 2021 at 08:41:18AM +0200, Hanna Reitz wrote:
>
> [..]
>>>> But given the inotify complications, there’s really a good reason we should
>>>> use mountinfo.
>>>>
>>>>>> It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
>>>>>> but if that’s the only way...
>>>>> yes. We already have lo->proc_self_fd. Maybe we need to keep
>>>>> /proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
>>>>> that any mount table changes will still be visible despite the fact
>>>>> I have fd open (and don't have to open new fd to notice new mount/unmount
>>>>> changes).
>>>> Well, yes, that was my idea.  Unfortunately, I wasn’t quite successful yet;
>>>> when I tried keeping the fd open, reading from it would just return 0
>>>> bytes.  Perhaps that’s because we bind-mount /proc/self/fd to /proc so that
>>>> nothing else in /proc is visible. Perhaps we need to bind-mount
>>>> /proc/self/mountinfo into /proc/self/fd before that...
>>> Or perhaps open /proc/self/mountinfo and save fd in lo->proc_mountinfo
>>> before /proc/self/fd is bind mounted on /proc?
>> Yes, I tried that, and then reading would just return 0 bytes.
> Hi Hanna,
>
> I tried this simple patch and I can read /proc/self/mountinfo before
> bind mounting /proc/self/fd and after bind mounting /proc/self/fd. Am
> I missing something.

Yes, but I tried reading it in the main loop (where we’d actually need 
it).  It looks like the umount2(".", MNT_DETACH) in setup_mounts() 
breaks it.

Hanna

Vivek Goyal Aug. 17, 2021, 7:45 p.m. UTC | #9

On Tue, Aug 17, 2021 at 10:27:16AM +0200, Hanna Reitz wrote:
> On 16.08.21 21:44, Vivek Goyal wrote:
> > On Wed, Aug 11, 2021 at 08:41:18AM +0200, Hanna Reitz wrote:
> > 
> > [..]
> > > > > But given the inotify complications, there’s really a good reason we should
> > > > > use mountinfo.
> > > > > 
> > > > > > > It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
> > > > > > > but if that’s the only way...
> > > > > > yes. We already have lo->proc_self_fd. Maybe we need to keep
> > > > > > /proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
> > > > > > that any mount table changes will still be visible despite the fact
> > > > > > I have fd open (and don't have to open new fd to notice new mount/unmount
> > > > > > changes).
> > > > > Well, yes, that was my idea.  Unfortunately, I wasn’t quite successful yet;
> > > > > when I tried keeping the fd open, reading from it would just return 0
> > > > > bytes.  Perhaps that’s because we bind-mount /proc/self/fd to /proc so that
> > > > > nothing else in /proc is visible. Perhaps we need to bind-mount
> > > > > /proc/self/mountinfo into /proc/self/fd before that...
> > > > Or perhaps open /proc/self/mountinfo and save fd in lo->proc_mountinfo
> > > > before /proc/self/fd is bind mounted on /proc?
> > > Yes, I tried that, and then reading would just return 0 bytes.
> > Hi Hanna,
> > 
> > I tried this simple patch and I can read /proc/self/mountinfo before
> > bind mounting /proc/self/fd and after bind mounting /proc/self/fd. Am
> > I missing something.
> 
> Yes, but I tried reading it in the main loop (where we’d actually need it). 
> It looks like the umount2(".", MNT_DETACH) in setup_mounts() breaks it.

Good point. I modified my code and notice too that after umoutn2() it
always reads 0 bytes. I can understand that all the other mount points
could go away but new rootfs mount point of virtiofsd should still be
visible, IIUC. I don't understand why.

Anyway, I tried re-opening /proc/self/mountinfo file after umount2(".",
MNT_DETACH), and that seems to work and it shows root mount point. I 
created a bind mount and it shows that too.

So looks like quick fix can be that we re-open /proc/self/mountinfo. But
that means we can't bind /proc/self/fd on /proc/. We could bind mount
/proc/self on /proc. Not sure is it safe enough.

Here is the debug patch I tried.


---
 tools/virtiofsd/passthrough_ll.c |  101 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 96 insertions(+), 5 deletions(-)

Index: rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c
===================================================================
--- rhvgoyal-qemu.orig/tools/virtiofsd/passthrough_ll.c	2021-08-16 15:29:27.712223551 -0400
+++ rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c	2021-08-17 15:40:20.456811218 -0400
@@ -172,6 +172,8 @@ struct lo_data {
 
     /* An O_PATH file descriptor to /proc/self/fd/ */
     int proc_self_fd;
+    int proc_mountinfo;
+    int proc_self;
     int user_killpriv_v2, killpriv_v2;
     /* If set, virtiofsd is responsible for setting umask during creation */
     bool change_umask;
@@ -3403,12 +3405,56 @@ static void setup_wait_parent_capabiliti
     capng_apply(CAPNG_SELECT_BOTH);
 }
 
+static void read_mountinfo(struct lo_data *lo)
+{
+    char buf[4096];
+    ssize_t count, total_read = 0;
+    int ret;
+
+    ret = lseek(lo->proc_mountinfo, 0, SEEK_SET);
+    if (ret == -1) {
+            fuse_log(FUSE_LOG_ERR, "lseek(): %m\n");
+            exit(1);
+    }
+
+    do {
+        count = read(lo->proc_mountinfo, buf, 4095);
+        if (count == -1) {
+            fuse_log(FUSE_LOG_ERR, "read(/proc/self/mountinfo): %m\n");
+            exit(1);
+        }
+
+        //fuse_log(FUSE_LOG_INFO, "read(%d) bytes\n", count);
+        buf[count] = '\0';
+        fuse_log(FUSE_LOG_INFO, "%s", buf);
+        total_read += count;
+    } while(count);
+
+    fuse_log(FUSE_LOG_INFO, "read(%d) bytes\n", total_read);
+}
+
+static void reopen_mountinfo(struct lo_data *lo)
+{
+    int fd;
+
+    close(lo->proc_mountinfo);
+
+    fd = openat(lo->proc_self, "mountinfo", O_RDONLY);
+    if (fd == -1) {
+        fuse_log(FUSE_LOG_ERR, "open(/proc/self/mountinfo, O_RDONLY): %m\n");
+        exit(1);
+    }
+
+    lo->proc_mountinfo = fd;
+}
+
 /*
  * Move to a new mount, net, and pid namespaces to isolate this process.
  */
 static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
 {
     pid_t child;
+    int fd;
 
     /*
      * Create a new pid namespace for *child* processes.  We'll have to
@@ -3472,21 +3518,35 @@ static void setup_namespaces(struct lo_d
         exit(1);
     }
 
+    fd = open("/proc/self/mountinfo", O_RDONLY);
+    if (fd == -1) {
+        fuse_log(FUSE_LOG_ERR, "open(/proc/self/mountinfo, O_RDONLY): %m\n");
+        exit(1);
+    }
+
+    lo->proc_mountinfo = fd;
+
     /*
      * We only need /proc/self/fd. Prevent ".." from accessing parent
      * directories of /proc/self/fd by bind-mounting it over /proc. Since / was
      * previously remounted with MS_REC | MS_SLAVE this mount change only
      * affects our process.
      */
-    if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {
+    if (mount("/proc/self/", "/proc", NULL, MS_BIND, NULL) < 0) {
         fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");
         exit(1);
     }
 
     /* Get the /proc (actually /proc/self/fd, see above) file descriptor */
-    lo->proc_self_fd = open("/proc", O_PATH);
+    lo->proc_self_fd = open("/proc/fd", O_PATH);
     if (lo->proc_self_fd == -1) {
-        fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");
+        fuse_log(FUSE_LOG_ERR, "open(/proc/fd, O_PATH): %m\n");
+        exit(1);
+    }
+
+    lo->proc_self = open("/proc/", O_PATH);
+    if (lo->proc_self == -1) {
+        fuse_log(FUSE_LOG_ERR, "open(/proc/self, O_PATH): %m\n");
         exit(1);
     }
 }
@@ -3524,7 +3584,7 @@ static void cleanup_capng(void)
  * Make the source directory our root so symlinks cannot escape and no other
  * files are accessible.  Assumes unshare(CLONE_NEWNS) was already called.
  */
-static void setup_mounts(const char *source)
+static void setup_mounts(const char *source, struct lo_data *lo)
 {
     int oldroot;
     int newroot;
@@ -3552,26 +3612,43 @@ static void setup_mounts(const char *sou
         exit(1);
     }
 
+    fuse_log(FUSE_LOG_INFO, "mountinfo before pivot_root()\n");
+    read_mountinfo(lo);
+
     if (syscall(__NR_pivot_root, ".", ".") < 0) {
         fuse_log(FUSE_LOG_ERR, "pivot_root(., .): %m\n");
         exit(1);
     }
 
+    fuse_log(FUSE_LOG_INFO, "mountinfo after pivot_root()\n");
+    read_mountinfo(lo);
+
     if (fchdir(oldroot) < 0) {
         fuse_log(FUSE_LOG_ERR, "fchdir(oldroot): %m\n");
         exit(1);
     }
 
+    fuse_log(FUSE_LOG_INFO, "mountinfo after fchdir()\n");
+    read_mountinfo(lo);
+
     if (mount("", ".", "", MS_SLAVE | MS_REC, NULL) < 0) {
         fuse_log(FUSE_LOG_ERR, "mount(., MS_SLAVE | MS_REC): %m\n");
         exit(1);
     }
 
+    fuse_log(FUSE_LOG_INFO, "mountinfo before umount2(., MNT_DETACH): %m\n");
+    reopen_mountinfo(lo);
+    read_mountinfo(lo);
+
     if (umount2(".", MNT_DETACH) < 0) {
         fuse_log(FUSE_LOG_ERR, "umount2(., MNT_DETACH): %m\n");
         exit(1);
     }
 
+    fuse_log(FUSE_LOG_INFO, "mountinfo after umount2(., MNT_DETACH): %m\n");
+    reopen_mountinfo(lo);
+    read_mountinfo(lo);
+
     if (fchdir(newroot) < 0) {
         fuse_log(FUSE_LOG_ERR, "fchdir(newroot): %m\n");
         exit(1);
@@ -3711,6 +3788,19 @@ static void setup_chroot(struct lo_data
     }
 }
 
+static void create_mount(struct lo_data *lo)
+{
+    const char *source="foo", *dest="bar";
+
+    if (mount(source, dest, NULL, MS_BIND | MS_REC, NULL) < 0) {
+        fuse_log(FUSE_LOG_ERR, "mount(%s, %s, MS_BIND): %m\n", source, source);
+        exit(1);
+    }
+
+    fuse_log(FUSE_LOG_INFO, "mountinfo after mounting foo\n");
+    read_mountinfo(lo);
+}
+
 /*
  * Lock down this process to prevent access to other processes or files outside
  * source directory.  This reduces the impact of arbitrary code execution bugs.
@@ -3720,7 +3810,8 @@ static void setup_sandbox(struct lo_data
 {
     if (lo->sandbox == SANDBOX_NAMESPACE) {
         setup_namespaces(lo, se);
-        setup_mounts(lo->source);
+        setup_mounts(lo->source, lo);
+        create_mount(lo);
     } else {
         setup_chroot(lo);
     }

Vivek Goyal Aug. 18, 2021, 12:14 a.m. UTC | #10

On Tue, Aug 17, 2021 at 03:45:19PM -0400, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 10:27:16AM +0200, Hanna Reitz wrote:
> > On 16.08.21 21:44, Vivek Goyal wrote:
> > > On Wed, Aug 11, 2021 at 08:41:18AM +0200, Hanna Reitz wrote:
> > > 
> > > [..]
> > > > > > But given the inotify complications, there’s really a good reason we should
> > > > > > use mountinfo.
> > > > > > 
> > > > > > > > It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
> > > > > > > > but if that’s the only way...
> > > > > > > yes. We already have lo->proc_self_fd. Maybe we need to keep
> > > > > > > /proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
> > > > > > > that any mount table changes will still be visible despite the fact
> > > > > > > I have fd open (and don't have to open new fd to notice new mount/unmount
> > > > > > > changes).
> > > > > > Well, yes, that was my idea.  Unfortunately, I wasn’t quite successful yet;
> > > > > > when I tried keeping the fd open, reading from it would just return 0
> > > > > > bytes.  Perhaps that’s because we bind-mount /proc/self/fd to /proc so that
> > > > > > nothing else in /proc is visible. Perhaps we need to bind-mount
> > > > > > /proc/self/mountinfo into /proc/self/fd before that...
> > > > > Or perhaps open /proc/self/mountinfo and save fd in lo->proc_mountinfo
> > > > > before /proc/self/fd is bind mounted on /proc?
> > > > Yes, I tried that, and then reading would just return 0 bytes.
> > > Hi Hanna,
> > > 
> > > I tried this simple patch and I can read /proc/self/mountinfo before
> > > bind mounting /proc/self/fd and after bind mounting /proc/self/fd. Am
> > > I missing something.
> > 
> > Yes, but I tried reading it in the main loop (where we’d actually need it). 
> > It looks like the umount2(".", MNT_DETACH) in setup_mounts() breaks it.
> 
> Good point. I modified my code and notice too that after umoutn2() it
> always reads 0 bytes. I can understand that all the other mount points
> could go away but new rootfs mount point of virtiofsd should still be
> visible, IIUC. I don't understand why.
> 
> Anyway, I tried re-opening /proc/self/mountinfo file after umount2(".",
> MNT_DETACH), and that seems to work and it shows root mount point. I 
> created a bind mount and it shows that too.
> 
> So looks like quick fix can be that we re-open /proc/self/mountinfo. But
> that means we can't bind /proc/self/fd on /proc/. We could bind mount
> /proc/self on /proc. Not sure is it safe enough.

Or may be I can do this.

- Open O_PATH fd for /proc/self
  proc_self = open("/proc/self");
- Bind mount /proc/self/fd on /proc
- pivot_root() and umount() stuff
- Openat(proc_self, "mountinfo")
- close(proc_self)

If this works, then we don't have the security issue and we managed
to open mountinfo after pivot_root() and umount(). Will give it a
try and see if it works tomorrow.

Vivek

Vivek Goyal Aug. 18, 2021, 1:32 p.m. UTC | #11

On Tue, Aug 17, 2021 at 08:14:46PM -0400, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 03:45:19PM -0400, Vivek Goyal wrote:
> > On Tue, Aug 17, 2021 at 10:27:16AM +0200, Hanna Reitz wrote:
> > > On 16.08.21 21:44, Vivek Goyal wrote:
> > > > On Wed, Aug 11, 2021 at 08:41:18AM +0200, Hanna Reitz wrote:
> > > > 
> > > > [..]
> > > > > > > But given the inotify complications, there’s really a good reason we should
> > > > > > > use mountinfo.
> > > > > > > 
> > > > > > > > > It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
> > > > > > > > > but if that’s the only way...
> > > > > > > > yes. We already have lo->proc_self_fd. Maybe we need to keep
> > > > > > > > /proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
> > > > > > > > that any mount table changes will still be visible despite the fact
> > > > > > > > I have fd open (and don't have to open new fd to notice new mount/unmount
> > > > > > > > changes).
> > > > > > > Well, yes, that was my idea.  Unfortunately, I wasn’t quite successful yet;
> > > > > > > when I tried keeping the fd open, reading from it would just return 0
> > > > > > > bytes.  Perhaps that’s because we bind-mount /proc/self/fd to /proc so that
> > > > > > > nothing else in /proc is visible. Perhaps we need to bind-mount
> > > > > > > /proc/self/mountinfo into /proc/self/fd before that...
> > > > > > Or perhaps open /proc/self/mountinfo and save fd in lo->proc_mountinfo
> > > > > > before /proc/self/fd is bind mounted on /proc?
> > > > > Yes, I tried that, and then reading would just return 0 bytes.
> > > > Hi Hanna,
> > > > 
> > > > I tried this simple patch and I can read /proc/self/mountinfo before
> > > > bind mounting /proc/self/fd and after bind mounting /proc/self/fd. Am
> > > > I missing something.
> > > 
> > > Yes, but I tried reading it in the main loop (where we’d actually need it). 
> > > It looks like the umount2(".", MNT_DETACH) in setup_mounts() breaks it.
> > 
> > Good point. I modified my code and notice too that after umoutn2() it
> > always reads 0 bytes. I can understand that all the other mount points
> > could go away but new rootfs mount point of virtiofsd should still be
> > visible, IIUC. I don't understand why.
> > 
> > Anyway, I tried re-opening /proc/self/mountinfo file after umount2(".",
> > MNT_DETACH), and that seems to work and it shows root mount point. I 
> > created a bind mount and it shows that too.
> > 
> > So looks like quick fix can be that we re-open /proc/self/mountinfo. But
> > that means we can't bind /proc/self/fd on /proc/. We could bind mount
> > /proc/self on /proc. Not sure is it safe enough.
> 
> Or may be I can do this.
> 
> - Open O_PATH fd for /proc/self
>   proc_self = open("/proc/self");
> - Bind mount /proc/self/fd on /proc
> - pivot_root() and umount() stuff
> - Openat(proc_self, "mountinfo")
> - close(proc_self)
> 
> If this works, then we don't have the security issue and we managed
> to open mountinfo after pivot_root() and umount(). Will give it a
> try and see if it works tomorrow.

Hi Hanna,

This seems to work for me. I think key is to open mountinfo after
pivot_root() and then it works. If it is opened before pivot_root()
then it does not work. Not sure why.

Thanks
Vivek


---
 tools/virtiofsd/passthrough_ll.c |   61 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 59 insertions(+), 2 deletions(-)

Index: rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c
===================================================================
--- rhvgoyal-qemu.orig/tools/virtiofsd/passthrough_ll.c	2021-08-16 15:29:27.712223551 -0400
+++ rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c	2021-08-18 09:29:34.653891067 -0400
@@ -172,6 +172,8 @@ struct lo_data {
 
     /* An O_PATH file descriptor to /proc/self/fd/ */
     int proc_self_fd;
+    int proc_mountinfo;
+    int proc_self;
     int user_killpriv_v2, killpriv_v2;
     /* If set, virtiofsd is responsible for setting umask during creation */
     bool change_umask;
@@ -3403,6 +3405,47 @@ static void setup_wait_parent_capabiliti
     capng_apply(CAPNG_SELECT_BOTH);
 }
 
+static void read_mountinfo(struct lo_data *lo)
+{
+    char buf[4096];
+    ssize_t count, total_read = 0;
+    int ret;
+
+    ret = lseek(lo->proc_mountinfo, 0, SEEK_SET);
+    if (ret == -1) {
+            fuse_log(FUSE_LOG_ERR, "lseek(): %m\n");
+            exit(1);
+    }
+
+    do {
+        count = read(lo->proc_mountinfo, buf, 4095);
+        if (count == -1) {
+            fuse_log(FUSE_LOG_ERR, "read(/proc/self/mountinfo): %m\n");
+            exit(1);
+        }
+
+        //fuse_log(FUSE_LOG_INFO, "read(%d) bytes\n", count);
+        buf[count] = '\0';
+        fuse_log(FUSE_LOG_INFO, "%s", buf);
+        total_read += count;
+    } while(count);
+
+    fuse_log(FUSE_LOG_INFO, "read(%d) bytes\n", total_read);
+}
+
+static void open_mountinfo(struct lo_data *lo)
+{
+    int fd;
+
+    fd = openat(lo->proc_self, "mountinfo", O_RDONLY);
+    if (fd == -1) {
+        fuse_log(FUSE_LOG_ERR, "open(/proc/self/mountinfo, O_RDONLY): %m\n");
+        exit(1);
+    }
+
+    lo->proc_mountinfo = fd;
+}
+
 /*
  * Move to a new mount, net, and pid namespaces to isolate this process.
  */
@@ -3472,6 +3515,12 @@ static void setup_namespaces(struct lo_d
         exit(1);
     }
 
+    lo->proc_self = open("/proc/self", O_PATH);
+    if (lo->proc_self == -1) {
+        fuse_log(FUSE_LOG_ERR, "open(/proc/self, O_PATH): %m\n");
+        exit(1);
+    }
+
     /*
      * We only need /proc/self/fd. Prevent ".." from accessing parent
      * directories of /proc/self/fd by bind-mounting it over /proc. Since / was
@@ -3524,7 +3573,7 @@ static void cleanup_capng(void)
  * Make the source directory our root so symlinks cannot escape and no other
  * files are accessible.  Assumes unshare(CLONE_NEWNS) was already called.
  */
-static void setup_mounts(const char *source)
+static void setup_mounts(const char *source, struct lo_data *lo)
 {
     int oldroot;
     int newroot;
@@ -3557,6 +3606,8 @@ static void setup_mounts(const char *sou
         exit(1);
     }
 
+    open_mountinfo(lo);
+
     if (fchdir(oldroot) < 0) {
         fuse_log(FUSE_LOG_ERR, "fchdir(oldroot): %m\n");
         exit(1);
@@ -3567,11 +3618,17 @@ static void setup_mounts(const char *sou
         exit(1);
     }
 
+    fuse_log(FUSE_LOG_INFO, "mountinfo before umount2(., MNT_DETACH)\n");
+    read_mountinfo(lo);
+
     if (umount2(".", MNT_DETACH) < 0) {
         fuse_log(FUSE_LOG_ERR, "umount2(., MNT_DETACH): %m\n");
         exit(1);
     }
 
+    fuse_log(FUSE_LOG_INFO, "mountinfo after umount2(., MNT_DETACH):\n");
+    read_mountinfo(lo);
+
     if (fchdir(newroot) < 0) {
         fuse_log(FUSE_LOG_ERR, "fchdir(newroot): %m\n");
         exit(1);
@@ -3720,7 +3777,7 @@ static void setup_sandbox(struct lo_data
 {
     if (lo->sandbox == SANDBOX_NAMESPACE) {
         setup_namespaces(lo, se);
-        setup_mounts(lo->source);
+        setup_mounts(lo->source, lo);
     } else {
         setup_chroot(lo);
     }

Hanna Czenczek Aug. 18, 2021, 1:48 p.m. UTC | #12

On 18.08.21 15:32, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 08:14:46PM -0400, Vivek Goyal wrote:
>> On Tue, Aug 17, 2021 at 03:45:19PM -0400, Vivek Goyal wrote:
>>> On Tue, Aug 17, 2021 at 10:27:16AM +0200, Hanna Reitz wrote:
>>>> On 16.08.21 21:44, Vivek Goyal wrote:
>>>>> On Wed, Aug 11, 2021 at 08:41:18AM +0200, Hanna Reitz wrote:
>>>>>
>>>>> [..]
>>>>>>>> But given the inotify complications, there’s really a good reason we should
>>>>>>>> use mountinfo.
>>>>>>>>
>>>>>>>>>> It’s a bit tricky because our sandboxing prevents easy access to mountinfo,
>>>>>>>>>> but if that’s the only way...
>>>>>>>>> yes. We already have lo->proc_self_fd. Maybe we need to keep
>>>>>>>>> /proc/self/mountinfo open in lo->proc_self_mountinfo. I am assuming
>>>>>>>>> that any mount table changes will still be visible despite the fact
>>>>>>>>> I have fd open (and don't have to open new fd to notice new mount/unmount
>>>>>>>>> changes).
>>>>>>>> Well, yes, that was my idea.  Unfortunately, I wasn’t quite successful yet;
>>>>>>>> when I tried keeping the fd open, reading from it would just return 0
>>>>>>>> bytes.  Perhaps that’s because we bind-mount /proc/self/fd to /proc so that
>>>>>>>> nothing else in /proc is visible. Perhaps we need to bind-mount
>>>>>>>> /proc/self/mountinfo into /proc/self/fd before that...
>>>>>>> Or perhaps open /proc/self/mountinfo and save fd in lo->proc_mountinfo
>>>>>>> before /proc/self/fd is bind mounted on /proc?
>>>>>> Yes, I tried that, and then reading would just return 0 bytes.
>>>>> Hi Hanna,
>>>>>
>>>>> I tried this simple patch and I can read /proc/self/mountinfo before
>>>>> bind mounting /proc/self/fd and after bind mounting /proc/self/fd. Am
>>>>> I missing something.
>>>> Yes, but I tried reading it in the main loop (where we’d actually need it).
>>>> It looks like the umount2(".", MNT_DETACH) in setup_mounts() breaks it.
>>> Good point. I modified my code and notice too that after umoutn2() it
>>> always reads 0 bytes. I can understand that all the other mount points
>>> could go away but new rootfs mount point of virtiofsd should still be
>>> visible, IIUC. I don't understand why.
>>>
>>> Anyway, I tried re-opening /proc/self/mountinfo file after umount2(".",
>>> MNT_DETACH), and that seems to work and it shows root mount point. I
>>> created a bind mount and it shows that too.
>>>
>>> So looks like quick fix can be that we re-open /proc/self/mountinfo. But
>>> that means we can't bind /proc/self/fd on /proc/. We could bind mount
>>> /proc/self on /proc. Not sure is it safe enough.
>> Or may be I can do this.
>>
>> - Open O_PATH fd for /proc/self
>>    proc_self = open("/proc/self");
>> - Bind mount /proc/self/fd on /proc
>> - pivot_root() and umount() stuff
>> - Openat(proc_self, "mountinfo")
>> - close(proc_self)
>>
>> If this works, then we don't have the security issue and we managed
>> to open mountinfo after pivot_root() and umount(). Will give it a
>> try and see if it works tomorrow.
> Hi Hanna,
>
> This seems to work for me. I think key is to open mountinfo after
> pivot_root() and then it works. If it is opened before pivot_root()
> then it does not work. Not sure why.

Great, your code looks good to me.  I was afraid this was going to be 
really complicated, but that doesn’t look too bad.

Thanks!

Hanna

Dr. David Alan Gilbert Aug. 19, 2021, 4:38 p.m. UTC | #13

* Max Reitz (mreitz@redhat.com) wrote:
> When the inode_file_handles option is set, try to generate a file handle
> for new inodes instead of opening an O_PATH FD.
> 
> Being able to open these again will require CAP_DAC_READ_SEARCH, so the
> description text tells the user they will also need to specify
> -o modcaps=+dac_read_search.
> 
> Generating a file handle returns the mount ID it is valid for.  Opening
> it will require an FD instead.  We have mount_fds to map an ID to an FD.
> get_file_handle() fills the hash map by opening the file we have
> generated a handle for.  To verify that the resulting FD indeed
> represents the handle's mount ID, we use statx().  Therefore, using file
> handles requires statx() support.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  tools/virtiofsd/helper.c              |   3 +
>  tools/virtiofsd/passthrough_ll.c      | 194 ++++++++++++++++++++++++--
>  tools/virtiofsd/passthrough_seccomp.c |   1 +
>  3 files changed, 190 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index a8295d975a..aa63a21d43 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -187,6 +187,9 @@ void fuse_cmdline_help(void)
>             "                               default: no_allow_direct_io\n"
>             "    -o announce_submounts      Announce sub-mount points to the guest\n"
>             "    -o posix_acl/no_posix_acl  Enable/Disable posix_acl. (default: disabled)\n"
> +           "    -o inode_file_handles      Use file handles to reference inodes\n"
> +           "                               instead of O_PATH file descriptors\n"
> +           "                               (requires -o modcaps=+dac_read_search)\n"

I think you should probably add that automatically for the user; we do
similar for seccomp/syslog (see syscall_allowlist_syslog); just do it
before the while (modcaps) {   line so whatever the user specifies
sticks.

Dave

>             );
>  }
>  
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index f9d8b2f134..ac95961d12 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -194,6 +194,7 @@ struct lo_data {
>      /* If set, virtiofsd is responsible for setting umask during creation */
>      bool change_umask;
>      int user_posix_acl, posix_acl;
> +    int inode_file_handles;
>  };
>  
>  /**
> @@ -250,6 +251,10 @@ static const struct fuse_opt lo_opts[] = {
>      { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
>      { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
>      { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
> +    { "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 },
> +    { "no_inode_file_handles",
> +      offsetof(struct lo_data, inode_file_handles),
> +      0 },
>      FUSE_OPT_END
>  };
>  static bool use_syslog = false;
> @@ -321,6 +326,135 @@ static int temp_fd_steal(TempFd *temp_fd)
>      }
>  }
>  
> +/**
> + * Generate a file handle for the given dirfd/name combination.
> + *
> + * If mount_fds does not yet contain an entry for the handle's mount
> + * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds
> + * as the FD for that mount ID.  (That is the file that we have
> + * generated a handle for, so it should be representative for the
> + * mount ID.  However, to be sure (and to rule out races), we use
> + * statx() to verify that our assumption is correct.)
> + */
> +static struct lo_fhandle *get_file_handle(struct lo_data *lo,
> +                                          int dirfd, const char *name)
> +{
> +    /* We need statx() to verify the mount ID */
> +#if defined(CONFIG_STATX) && defined(STATX_MNT_ID)
> +    struct lo_fhandle *fh;
> +    int ret;
> +
> +    if (!lo->use_statx || !lo->inode_file_handles) {
> +        return NULL;
> +    }
> +
> +    fh = g_new0(struct lo_fhandle, 1);
> +
> +    fh->handle.handle_bytes = sizeof(fh->padding) - sizeof(fh->handle);
> +    ret = name_to_handle_at(dirfd, name, &fh->handle, &fh->mount_id,
> +                            AT_EMPTY_PATH);
> +    if (ret < 0) {
> +        goto fail;
> +    }
> +
> +    if (pthread_rwlock_rdlock(&mount_fds_lock)) {
> +        goto fail;
> +    }
> +    if (!g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) {
> +        g_auto(TempFd) path_fd = TEMP_FD_INIT;
> +        struct statx stx;
> +        char procname[64];
> +        int fd;
> +
> +        pthread_rwlock_unlock(&mount_fds_lock);
> +
> +        /*
> +         * Before opening an O_RDONLY fd, check whether dirfd/name is a regular
> +         * file or directory, because we must not open anything else with
> +         * anything but O_PATH.
> +         * (And we use that occasion to verify that the file has the mount ID we
> +         * need.)
> +         */
> +        if (name[0]) {
> +            path_fd.fd = openat(dirfd, name, O_PATH);
> +            if (path_fd.fd < 0) {
> +                goto fail;
> +            }
> +            path_fd.owned = true;
> +        } else {
> +            path_fd.fd = dirfd;
> +            path_fd.owned = false;
> +        }
> +
> +        ret = statx(path_fd.fd, "", AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW,
> +                    STATX_TYPE | STATX_MNT_ID, &stx);
> +        if (ret < 0) {
> +            if (errno == ENOSYS) {
> +                lo->use_statx = false;
> +                fuse_log(FUSE_LOG_WARNING,
> +                         "statx() does not work: Will not be able to use file "
> +                         "handles for inodes\n");
> +            }
> +            goto fail;
> +        }
> +        if (!(stx.stx_mask & STATX_MNT_ID) || stx.stx_mnt_id != fh->mount_id) {
> +            /*
> +             * One reason for stx_mnt_id != mount_id could be that dirfd/name
> +             * is a directory, and some other filesystem was mounted there
> +             * between us generating the file handle and then opening the FD.
> +             * (Other kinds of races might be possible, too.)
> +             * Failing this function is not fatal, though, because our caller
> +             * (lo_do_lookup()) will just fall back to opening an O_PATH FD to
> +             * store in lo_inode.fd instead of storing a file handle in
> +             * lo_inode.fhandle.  So we do not need to try too hard to get an
> +             * FD for fh->mount_id so this function could succeed.
> +             */
> +            goto fail;
> +        }
> +        if (!(stx.stx_mask & STATX_TYPE) ||
> +            !(S_ISREG(stx.stx_mode) || S_ISDIR(stx.stx_mode)))
> +        {
> +            /*
> +             * We must not open special files with anything but O_PATH, so we
> +             * cannot use this file for mount_fds.
> +             * Just return a failure in such a case and let the lo_inode have
> +             * an O_PATH fd instead of a file handle.
> +             */
> +            goto fail;
> +        }
> +
> +        /* Now that we know this fd is safe to open, do it */
> +        snprintf(procname, sizeof(procname), "%i", path_fd.fd);
> +        fd = openat(lo->proc_self_fd, procname, O_RDONLY);
> +        if (fd < 0) {
> +            goto fail;
> +        }
> +
> +        if (pthread_rwlock_wrlock(&mount_fds_lock)) {
> +            goto fail;
> +        }
> +
> +        /* Check again, might have changed */
> +        if (g_hash_table_contains(mount_fds, GINT_TO_POINTER(fh->mount_id))) {
> +            close(fd);
> +        } else {
> +            g_hash_table_insert(mount_fds,
> +                                GINT_TO_POINTER(fh->mount_id),
> +                                GINT_TO_POINTER(fd));
> +        }
> +    }
> +    pthread_rwlock_unlock(&mount_fds_lock);
> +
> +    return fh;
> +
> +fail:
> +    free(fh);
> +    return NULL;
> +#else /* defined(CONFIG_STATX) && defined(STATX_MNT_ID) */
> +    return NULL;
> +#endif
> +}
> +
>  /**
>   * Open the given file handle with the given flags.
>   *
> @@ -1165,6 +1299,11 @@ static int do_statx(struct lo_data *lo, int dirfd, const char *pathname,
>              return -1;
>          }
>          lo->use_statx = false;
> +        if (lo->inode_file_handles) {
> +            fuse_log(FUSE_LOG_WARNING,
> +                     "statx() does not work: Will not be able to use file "
> +                     "handles for inodes\n");
> +        }
>          /* fallback */
>      }
>  #endif
> @@ -1194,6 +1333,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>      struct lo_data *lo = lo_data(req);
>      struct lo_inode *inode = NULL;
>      struct lo_inode *dir = lo_inode(req, parent);
> +    struct lo_fhandle *fh;
>  
>      if (inodep) {
>          *inodep = NULL; /* in case there is an error */
> @@ -1223,13 +1363,21 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>          goto out;
>      }
>  
> -    newfd = openat(dir_fd.fd, name, O_PATH | O_NOFOLLOW);
> -    if (newfd == -1) {
> -        goto out_err;
> +    fh = get_file_handle(lo, dir_fd.fd, name);
> +    if (!fh) {
> +        newfd = openat(dir_fd.fd, name, O_PATH | O_NOFOLLOW);
> +        if (newfd == -1) {
> +            goto out_err;
> +        }
>      }
>  
> -    res = do_statx(lo, newfd, "", &e->attr, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW,
> -                   &mnt_id);
> +    if (newfd >= 0) {
> +        res = do_statx(lo, newfd, "", &e->attr,
> +                       AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, &mnt_id);
> +    } else {
> +        res = do_statx(lo, dir_fd.fd, name, &e->attr,
> +                       AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, &mnt_id);
> +    }
>      if (res == -1) {
>          goto out_err;
>      }
> @@ -1239,9 +1387,19 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>          e->attr_flags |= FUSE_ATTR_SUBMOUNT;
>      }
>  
> -    inode = lo_find(lo, NULL, &e->attr, mnt_id);
> +    /*
> +     * Note that fh is always NULL if lo->inode_file_handles is false,
> +     * and so we will never do a lookup by file handle here, and
> +     * lo->inodes_by_handle will always remain empty.  We only need
> +     * this map when we do not have an O_PATH fd open for every
> +     * lo_inode, though, so if inode_file_handles is false, we do not
> +     * need that map anyway.
> +     */
> +    inode = lo_find(lo, fh, &e->attr, mnt_id);
>      if (inode) {
> -        close(newfd);
> +        if (newfd != -1) {
> +            close(newfd);
> +        }
>      } else {
>          inode = calloc(1, sizeof(struct lo_inode));
>          if (!inode) {
> @@ -1259,6 +1417,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>  
>          inode->nlookup = 1;
>          inode->fd = newfd;
> +        inode->fhandle = fh;
>          inode->key.ino = e->attr.st_ino;
>          inode->key.dev = e->attr.st_dev;
>          inode->key.mnt_id = mnt_id;
> @@ -1270,6 +1429,9 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
>          pthread_mutex_lock(&lo->mutex);
>          inode->fuse_ino = lo_add_inode_mapping(req, inode);
>          g_hash_table_insert(lo->inodes_by_ids, &inode->key, inode);
> +        if (inode->fhandle) {
> +            g_hash_table_insert(lo->inodes_by_handle, inode->fhandle, inode);
> +        }
>          pthread_mutex_unlock(&lo->mutex);
>      }
>      e->ino = inode->fuse_ino;
> @@ -1615,6 +1777,7 @@ static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent,
>      int res;
>      uint64_t mnt_id;
>      struct stat attr;
> +    struct lo_fhandle *fh;
>      struct lo_data *lo = lo_data(req);
>      struct lo_inode *dir = lo_inode(req, parent);
>      struct lo_inode *inode = NULL;
> @@ -1628,12 +1791,16 @@ static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent,
>          goto out;
>      }
>  
> +    fh = get_file_handle(lo, dir_fd.fd, name);
> +    /* Ignore errors, this is just an optional key for the lookup */
> +
>      res = do_statx(lo, dir_fd.fd, name, &attr, AT_SYMLINK_NOFOLLOW, &mnt_id);
>      if (res == -1) {
>          goto out;
>      }
>  
> -    inode = lo_find(lo, NULL, &attr, mnt_id);
> +    inode = lo_find(lo, fh, &attr, mnt_id);
> +    g_free(fh);
>  
>  out:
>      lo_inode_put(lo, &dir);
> @@ -1801,6 +1968,9 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
>      if (!inode->nlookup) {
>          lo_map_remove(&lo->ino_map, inode->fuse_ino);
>          g_hash_table_remove(lo->inodes_by_ids, &inode->key);
> +        if (inode->fhandle) {
> +            g_hash_table_remove(lo->inodes_by_handle, inode->fhandle);
> +        }
>          if (lo->posix_lock) {
>              if (g_hash_table_size(inode->posix_locks)) {
>                  fuse_log(FUSE_LOG_WARNING, "Hash table is not empty\n");
> @@ -4362,6 +4532,14 @@ int main(int argc, char *argv[])
>  
>      lo.use_statx = true;
>  
> +#if !defined(CONFIG_STATX) || !defined(STATX_MNT_ID)
> +    if (lo.inode_file_handles) {
> +        fuse_log(FUSE_LOG_WARNING,
> +                 "No statx() or mount ID support: Will not be able to use file "
> +                 "handles for inodes\n");
> +    }
> +#endif
> +
>      se = fuse_session_new(&args, &lo_oper, sizeof(lo_oper), &lo);
>      if (se == NULL) {
>          goto err_out1;
> diff --git a/tools/virtiofsd/passthrough_seccomp.c b/tools/virtiofsd/passthrough_seccomp.c
> index af04c638cb..ab4dc07e3f 100644
> --- a/tools/virtiofsd/passthrough_seccomp.c
> +++ b/tools/virtiofsd/passthrough_seccomp.c
> @@ -73,6 +73,7 @@ static const int syscall_allowlist[] = {
>      SCMP_SYS(mprotect),
>      SCMP_SYS(mremap),
>      SCMP_SYS(munmap),
> +    SCMP_SYS(name_to_handle_at),
>      SCMP_SYS(newfstatat),
>      SCMP_SYS(statx),
>      SCMP_SYS(open),
> -- 
> 2.31.1
>

[v3,09/10] virtiofsd: Optionally fill lo_inode.fhandle

Commit Message

Comments

Patch