
[bpf-next,v1,1/8] io_uring: Implement eBPF iterator for registered buffers

Message ID 20211116054237.100814-2-memxor@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Series Introduce BPF iterators for io_uring and epoll

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 12480 this patch: 12480
netdev/cc_maintainers warning 7 maintainers not CCed: kafai@fb.com joe@cilium.io yhs@fb.com songliubraving@fb.com john.fastabend@gmail.com netdev@vger.kernel.org kpsingh@kernel.org
netdev/build_clang success Errors and warnings before: 2106 this patch: 2106
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 11646 this patch: 11646
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next fail VM_Test

Commit Message

Kumar Kartikeya Dwivedi Nov. 16, 2021, 5:42 a.m. UTC
This change adds an eBPF iterator for buffers registered in an io_uring
ctx. It gives access to the ctx, the index of the registered buffer, and
a pointer to the io_mapped_ubuf itself. This allows the iterator to save
information about buffers added to an io_uring instance that isn't easy
to export through the fdinfo interface (such as the exact struct pages
composing a registered buffer).

The primary use case this enables is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in
the seq_start functions, as the lock order is the opposite of what it
would be when an io_uring operation reads the same file. We take
seq_file->lock, then ctx->uring_lock, while io_uring would first take
ctx->uring_lock and then seq_file->lock for the same ctx.

This can lead to a deadlock scenario described below:

      CPU 0				CPU 1

      vfs_read
      mutex_lock(&seq_file->lock)	io_read
					mutex_lock(&ctx->uring_lock)
      mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
					mutex_lock(&seq_file->lock)

The trylock also protects the case where io_uring tries to read from
iterator attached to itself (same ctx), where the order of locks would
be:
 io_uring_enter
  mutex_lock(&ctx->uring_lock) <-----------.
  io_read				    \
   seq_read				     \
    mutex_lock(&seq_file->lock)		     /
    mutex_lock(&ctx->uring_lock) # deadlock-`

In both these cases (recursive read and contended uring_lock), -EDEADLK
is returned to userspace.

In the future, this iterator will be extended to directly support
iteration of the bvec flexible array member, so that when there is no
corresponding VMA mapping the registered buffer (e.g. if the VMA is
destroyed after pinning the pages), we can reconstruct the registration
on restore by dumping the page contents and then replaying them into a
temporary mapping used for registration. All of this is out of scope for
the current series, but builds upon this iterator.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |   3 +
 tools/include/uapi/linux/bpf.h |   3 +
 4 files changed, 187 insertions(+)
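
For illustration, a minimal BPF program consuming this iterator could
look like the sketch below. It is not part of the series; it assumes a
vmlinux.h generated from a kernel carrying this patch, and the field
names of the 5.16-era struct io_mapped_ubuf (ubuf, ubuf_end, nr_bvecs).

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iter/io_uring_buf")
int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct io_mapped_ubuf *ubuf = ctx->ubuf;

	/* ubuf is NULL on the final (stop) invocation */
	if (!ubuf)
		return 0;

	BPF_SEQ_PRINTF(seq, "index=%lu addr=0x%llx len=%llu nr_bvecs=%u\n",
		       ctx->index, ubuf->ubuf, ubuf->ubuf_end - ubuf->ubuf,
		       ubuf->nr_bvecs);
	return 0;
}

Attaching follows the usual bpf_iter flow: pass the io_uring fd via
bpf_iter_link_info.io_uring.io_uring_fd when creating the iter link,
then read(2) the resulting iterator fd.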

Comments

Yonghong Song Nov. 18, 2021, 5:21 p.m. UTC | #1
On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> This change adds eBPF iterator for buffers registered in io_uring ctx.
> It gives access to the ctx, the index of the registered buffer, and a
> pointer to the io_uring_ubuf itself. This allows the iterator to save
> info related to buffers added to an io_uring instance, that isn't easy
> to export using the fdinfo interface (like exact struct page composing
> the registered buffer).
> 
> The primary usecase this is enabling is checkpoint/restore support.
> 
> Note that we need to use mutex_trylock when the file is read from, in
> seq_start functions, as the order of lock taken is opposite of what it
> would be when io_uring operation reads the same file.  We take
> seq_file->lock, then ctx->uring_lock, while io_uring would first take
> ctx->uring_lock and then seq_file->lock for the same ctx.
> 
> This can lead to a deadlock scenario described below:
> 
>        CPU 0				CPU 1
> 
>        vfs_read
>        mutex_lock(&seq_file->lock)	io_read
> 					mutex_lock(&ctx->uring_lock)
>        mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> 					mutex_lock(&seq_file->lock)

It is not clear which mutex_lock switched to mutex_trylock.
From the example below, it looks like &ctx->uring_lock. But if this is
the case, we could have a deadlock, right?

> 
> The trylock also protects the case where io_uring tries to read from
> iterator attached to itself (same ctx), where the order of locks would
> be:
>   io_uring_enter
>    mutex_lock(&ctx->uring_lock) <-----------.
>    io_read				    \
>     seq_read				     \
>      mutex_lock(&seq_file->lock)		     /
>      mutex_lock(&ctx->uring_lock) # deadlock-`
> 
> In both these cases (recursive read and contended uring_lock), -EDEADLK
> is returned to userspace.
> 
> In the future, this iterator will be extended to directly support
> iteration of bvec Flexible Array Member, so that when there is no
> corresponding VMA that maps to the registered buffer (e.g. if VMA is
> destroyed after pinning pages), we are able to reconstruct the
> registration on restore by dumping the page contents and then replaying
> them into a temporary mapping used for registration later. All this is
> out of scope for the current series however, but builds upon this
> iterator.
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Pavel Begunkov <asml.silence@gmail.com>
> Cc: io-uring@vger.kernel.org
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>   fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
>   include/linux/bpf.h            |   2 +
>   include/uapi/linux/bpf.h       |   3 +
>   tools/include/uapi/linux/bpf.h |   3 +
>   4 files changed, 187 insertions(+)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index b07196b4511c..46a110989155 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -81,6 +81,7 @@
>   #include <linux/tracehook.h>
>   #include <linux/audit.h>
>   #include <linux/security.h>
> +#include <linux/btf_ids.h>
>   
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/io_uring.h>
> @@ -11125,3 +11126,181 @@ static int __init io_uring_init(void)
>   	return 0;
>   };
>   __initcall(io_uring_init);
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +
> +BTF_ID_LIST(btf_io_uring_ids)
> +BTF_ID(struct, io_ring_ctx)
> +BTF_ID(struct, io_mapped_ubuf)
> +
> +struct bpf_io_uring_seq_info {
> +	struct io_ring_ctx *ctx;
> +	unsigned long index;
> +};
> +
> +static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
> +{
> +	struct bpf_io_uring_seq_info *info = priv_data;
> +	struct io_ring_ctx *ctx = aux->ctx;
> +
> +	info->ctx = ctx;
> +	return 0;
> +}
> +
> +static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
> +				    union bpf_iter_link_info *linfo,
> +				    struct bpf_iter_aux_info *aux)
> +{
> +	struct io_ring_ctx *ctx;
> +	struct fd f;
> +	int ret;
> +
> +	f = fdget(linfo->io_uring.io_uring_fd);
> +	if (unlikely(!f.file))
> +		return -EBADF;
> +
> +	ret = -EOPNOTSUPP;
> +	if (unlikely(f.file->f_op != &io_uring_fops))
> +		goto out_fput;
> +
> +	ret = -ENXIO;
> +	ctx = f.file->private_data;
> +	if (unlikely(!percpu_ref_tryget(&ctx->refs)))
> +		goto out_fput;
> +
> +	ret = 0;
> +	aux->ctx = ctx;
> +
> +out_fput:
> +	fdput(f);
> +	return ret;
> +}
> +
> +static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
> +{
> +	percpu_ref_put(&aux->ctx->refs);
> +}
> +
> +/* io_uring iterator for registered buffers */
> +
> +struct bpf_iter__io_uring_buf {
> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> +	__bpf_md_ptr(struct io_ring_ctx *, ctx);
> +	__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
> +	unsigned long index;
> +};

I would suggest changing "unsigned long index" to either u32 or u64.
This structure is also the bpf program context, and in the bpf program
context "index" will be u64, so on a 32-bit system we could potentially
have issues.
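
For illustration, the suggested layout would be something like:

struct bpf_iter__io_uring_buf {
	__bpf_md_ptr(struct bpf_iter_meta *, meta);
	__bpf_md_ptr(struct io_ring_ctx *, ctx);
	__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
	u64 index;	/* fixed width, matches what the BPF program sees */
};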

> +
> +static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
> +{
> +	if (info->index < info->ctx->nr_user_bufs)
> +		return info->ctx->user_bufs[info->index++];
> +	return NULL;
> +}
> +
> +static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +	struct io_mapped_ubuf *ubuf;
> +
> +	/* Indicate to userspace that the uring lock is contended */
> +	if (!mutex_trylock(&info->ctx->uring_lock))
> +		return ERR_PTR(-EDEADLK);
> +
> +	ubuf = __bpf_io_uring_buf_seq_get_next(info);
> +	if (!ubuf)
> +		return NULL;
> +
> +	if (*pos == 0)
> +		++*pos;
> +	return ubuf;
> +}
> +
> +static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +
> +	++*pos;
> +	return __bpf_io_uring_buf_seq_get_next(info);
> +}
> +
> +DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
> +		     struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
> +		     unsigned long index)

Again, change "unsigned long" to "u32" or "u64".

> +
> +static int __bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v, bool in_stop)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +	struct bpf_iter__io_uring_buf ctx;
> +	struct bpf_iter_meta meta;
> +	struct bpf_prog *prog;
> +
> +	meta.seq = seq;
> +	prog = bpf_iter_get_info(&meta, in_stop);
> +	if (!prog)
> +		return 0;
> +
> +	ctx.meta = &meta;
> +	ctx.ctx = info->ctx;
> +	ctx.ubuf = v;
> +	ctx.index = info->index ? info->index - !in_stop : 0;
> +
> +	return bpf_iter_run_prog(prog, &ctx);
> +}
> +
> +static int bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v)
> +{
> +	return __bpf_io_uring_buf_seq_show(seq, v, false);
> +}
> +
> +static void bpf_io_uring_buf_seq_stop(struct seq_file *seq, void *v)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +
> +	/* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */
> +	if (IS_ERR(v))
> +		return;
> +	if (!v)
> +		__bpf_io_uring_buf_seq_show(seq, v, true);
> +	else if (info->index) /* restart from index */
> +		info->index--;
> +	mutex_unlock(&info->ctx->uring_lock);
> +}
> +
> +static const struct seq_operations bpf_io_uring_buf_seq_ops = {
> +	.start = bpf_io_uring_buf_seq_start,
> +	.next  = bpf_io_uring_buf_seq_next,
> +	.stop  = bpf_io_uring_buf_seq_stop,
> +	.show  = bpf_io_uring_buf_seq_show,
> +};
> +
> +static const struct bpf_iter_seq_info bpf_io_uring_buf_seq_info = {
> +	.seq_ops          = &bpf_io_uring_buf_seq_ops,
> +	.init_seq_private = bpf_io_uring_init_seq,
> +	.fini_seq_private = NULL,
> +	.seq_priv_size    = sizeof(struct bpf_io_uring_seq_info),
> +};
> +
> +static struct bpf_iter_reg io_uring_buf_reg_info = {
> +	.target            = "io_uring_buf",
> +	.feature	   = BPF_ITER_RESCHED,
> +	.attach_target     = bpf_io_uring_iter_attach,
> +	.detach_target     = bpf_io_uring_iter_detach,

Since you have this extra `io_uring_fd` for the iterator, you may want
to implement show_fdinfo and fill_link_info callback functions here.
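
For example, mirroring what the map iterator does in
bpf_iter_map_show_fdinfo(), a show_fdinfo callback for this target could
look roughly like the sketch below (not part of the patch; what exactly
is worth printing is an open question):

static void bpf_io_uring_iter_show_fdinfo(const struct bpf_iter_aux_info *aux,
					  struct seq_file *seq)
{
	/* Sketch: print something identifying about the attached ctx,
	 * e.g. how many buffers are currently registered.
	 */
	seq_printf(seq, "io_uring_registered_buffers:\t%u\n",
		   aux->ctx->nr_user_bufs);
}

It would be wired up through the .show_fdinfo member of
io_uring_buf_reg_info; fill_link_info would additionally need a new
field in the uapi struct bpf_link_info.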

> +	.ctx_arg_info_size = 2,
> +	.ctx_arg_info = {
> +		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
> +		  PTR_TO_BTF_ID },
> +		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
> +		  PTR_TO_BTF_ID_OR_NULL },
> +	},
> +	.seq_info	   = &bpf_io_uring_buf_seq_info,
> +};
> +
> +static int __init io_uring_iter_init(void)
> +{
> +	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
> +	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
> +	return bpf_iter_reg_target(&io_uring_buf_reg_info);
> +}
> +late_initcall(io_uring_iter_init);
> +
> +#endif
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 56098c866704..ddb9d4520a3f 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>   	extern int bpf_iter_ ## target(args);			\
>   	int __init bpf_iter_ ## target(args) { return 0; }
>   
> +struct io_ring_ctx;
>   struct bpf_iter_aux_info {
>   	struct bpf_map *map;
> +	struct io_ring_ctx *ctx;
>   };

Can we use a union here? Note that in bpf_iter_link_info below, in
uapi/linux/bpf.h, map_fd and io_uring_fd are also in a union.
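
I.e., something along these lines (sketch):

struct bpf_iter_aux_info {
	union {
		struct bpf_map *map;
		struct io_ring_ctx *ctx;
	};
};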

>   
>   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 6297eafdc40f..3323defa99a1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32	map_fd;
>   	} map;
> +	struct {
> +		__u32   io_uring_fd;
> +	} io_uring;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 6297eafdc40f..3323defa99a1 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32	map_fd;
>   	} map;
> +	struct {
> +		__u32   io_uring_fd;
> +	} io_uring;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
>
Kumar Kartikeya Dwivedi Nov. 18, 2021, 6:28 p.m. UTC | #2
On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:
>
>
> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> > This change adds eBPF iterator for buffers registered in io_uring ctx.
> > It gives access to the ctx, the index of the registered buffer, and a
> > pointer to the io_uring_ubuf itself. This allows the iterator to save
> > info related to buffers added to an io_uring instance, that isn't easy
> > to export using the fdinfo interface (like exact struct page composing
> > the registered buffer).
> >
> > The primary usecase this is enabling is checkpoint/restore support.
> >
> > Note that we need to use mutex_trylock when the file is read from, in
> > seq_start functions, as the order of lock taken is opposite of what it
> > would be when io_uring operation reads the same file.  We take
> > seq_file->lock, then ctx->uring_lock, while io_uring would first take
> > ctx->uring_lock and then seq_file->lock for the same ctx.
> >
> > This can lead to a deadlock scenario described below:
> >
> >        CPU 0				CPU 1
> >
> >        vfs_read
> >        mutex_lock(&seq_file->lock)	io_read
> > 					mutex_lock(&ctx->uring_lock)
> >        mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> > 					mutex_lock(&seq_file->lock)
>
> It is not clear which mutex_lock switched to mutex_trylock.

The one in vfs_read.

> From below example, it looks like &ctx->uring_lock. But if this is
> the case, we could have deadlock, right?
>

Sorry, I will make the commit message clearer in the next version.

The sequence on CPU 0 is for a normal read(2) on the iterator.
For CPU 1, it is an io_uring instance trying to do the same on an
iterator attached to itself.

So CPU 0 does

sys_read
vfs_read
 bpf_seq_read
 mutex_lock(&seq_file->lock)    # A
  io_uring_buf_seq_start
  mutex_lock(&ctx->uring_lock)  # B

and CPU 1 does

io_uring_enter
mutex_lock(&ctx->uring_lock)    # B
 io_read
  bpf_seq_read
  mutex_lock(&seq_file->lock)   # A
  ...

Since the lock order is the opposite, it can deadlock. So I switched the
mutex_lock in io_uring_buf_seq_start to a trylock, so it returns an
error in this case; the read path then releases seq_file->lock and CPU 1
can make progress.

The second problem without the trylock is described below (for the same
case of io_uring reading from an iterator attached to itself).

Let me know if I missed something.

> >
> > The trylock also protects the case where io_uring tries to read from
> > iterator attached to itself (same ctx), where the order of locks would
> > be:
> >   io_uring_enter
> >    mutex_lock(&ctx->uring_lock) <-----------.
> >    io_read				    \
> >     seq_read				     \
> >      mutex_lock(&seq_file->lock)		     /
> >      mutex_lock(&ctx->uring_lock) # deadlock-`
> >
> > In both these cases (recursive read and contended uring_lock), -EDEADLK
> > is returned to userspace.
> >
> > In the future, this iterator will be extended to directly support
> > iteration of bvec Flexible Array Member, so that when there is no
> > corresponding VMA that maps to the registered buffer (e.g. if VMA is
> > destroyed after pinning pages), we are able to reconstruct the
> > registration on restore by dumping the page contents and then replaying
> > them into a temporary mapping used for registration later. All this is
> > out of scope for the current series however, but builds upon this
> > iterator.
> >
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: Pavel Begunkov <asml.silence@gmail.com>
> > Cc: io-uring@vger.kernel.org
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >   fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
> >   include/linux/bpf.h            |   2 +
> >   include/uapi/linux/bpf.h       |   3 +
> >   tools/include/uapi/linux/bpf.h |   3 +
> >   4 files changed, 187 insertions(+)
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index b07196b4511c..46a110989155 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -81,6 +81,7 @@
> >   #include <linux/tracehook.h>
> >   #include <linux/audit.h>
> >   #include <linux/security.h>
> > +#include <linux/btf_ids.h>
> >   #define CREATE_TRACE_POINTS
> >   #include <trace/events/io_uring.h>
> > @@ -11125,3 +11126,181 @@ static int __init io_uring_init(void)
> >   	return 0;
> >   };
> >   __initcall(io_uring_init);
> > +
> > +#ifdef CONFIG_BPF_SYSCALL
> > +
> > +BTF_ID_LIST(btf_io_uring_ids)
> > +BTF_ID(struct, io_ring_ctx)
> > +BTF_ID(struct, io_mapped_ubuf)
> > +
> > +struct bpf_io_uring_seq_info {
> > +	struct io_ring_ctx *ctx;
> > +	unsigned long index;
> > +};
> > +
> > +static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
> > +{
> > +	struct bpf_io_uring_seq_info *info = priv_data;
> > +	struct io_ring_ctx *ctx = aux->ctx;
> > +
> > +	info->ctx = ctx;
> > +	return 0;
> > +}
> > +
> > +static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
> > +				    union bpf_iter_link_info *linfo,
> > +				    struct bpf_iter_aux_info *aux)
> > +{
> > +	struct io_ring_ctx *ctx;
> > +	struct fd f;
> > +	int ret;
> > +
> > +	f = fdget(linfo->io_uring.io_uring_fd);
> > +	if (unlikely(!f.file))
> > +		return -EBADF;
> > +
> > +	ret = -EOPNOTSUPP;
> > +	if (unlikely(f.file->f_op != &io_uring_fops))
> > +		goto out_fput;
> > +
> > +	ret = -ENXIO;
> > +	ctx = f.file->private_data;
> > +	if (unlikely(!percpu_ref_tryget(&ctx->refs)))
> > +		goto out_fput;
> > +
> > +	ret = 0;
> > +	aux->ctx = ctx;
> > +
> > +out_fput:
> > +	fdput(f);
> > +	return ret;
> > +}
> > +
> > +static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
> > +{
> > +	percpu_ref_put(&aux->ctx->refs);
> > +}
> > +
> > +/* io_uring iterator for registered buffers */
> > +
> > +struct bpf_iter__io_uring_buf {
> > +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> > +	__bpf_md_ptr(struct io_ring_ctx *, ctx);
> > +	__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
> > +	unsigned long index;
> > +};
>
> I would suggest to change "unsigned long index" to either u32 or u64.
> This structure is also the bpf program context and in bpf program context,
> "index" will be u64. Then on 32bit system, we potentially
> could have issues.
>

Ack, will do.

> > +
> > +static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
> > +{
> > +	if (info->index < info->ctx->nr_user_bufs)
> > +		return info->ctx->user_bufs[info->index++];
> > +	return NULL;
> > +}
> > +
> > +static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +	struct bpf_io_uring_seq_info *info = seq->private;
> > +	struct io_mapped_ubuf *ubuf;
> > +
> > +	/* Indicate to userspace that the uring lock is contended */
> > +	if (!mutex_trylock(&info->ctx->uring_lock))
> > +		return ERR_PTR(-EDEADLK);
> > +
> > +	ubuf = __bpf_io_uring_buf_seq_get_next(info);
> > +	if (!ubuf)
> > +		return NULL;
> > +
> > +	if (*pos == 0)
> > +		++*pos;
> > +	return ubuf;
> > +}
> > +
> > +static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > +	struct bpf_io_uring_seq_info *info = seq->private;
> > +
> > +	++*pos;
> > +	return __bpf_io_uring_buf_seq_get_next(info);
> > +}
> > +
> > +DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
> > +		     struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
> > +		     unsigned long index)
>
> Again, change "unsigned long" to "u32" or "u64".
>

Ack.

> > [...]
> > +static struct bpf_iter_reg io_uring_buf_reg_info = {
> > +	.target            = "io_uring_buf",
> > +	.feature	   = BPF_ITER_RESCHED,
> > +	.attach_target     = bpf_io_uring_iter_attach,
> > +	.detach_target     = bpf_io_uring_iter_detach,
>
> Since you have this extra `io_uring_fd` for the iterator, you may want
> to implement show_fdinfo and fill_link_info callback functions here.
>

Ack, but some questions:

What should it have? E.g. it is easy to go from map_id to the map fd if
one wants access to the map attached to the iterator, but I am not sure
how one can obtain more information about the target fd from the
io_uring or epoll iterators.

Should I delegate to their show_fdinfo and dump using that?

The type/target is already available in link_info; I am not sure what
other useful information can be added there that would allow obtaining
the io_uring/epoll fd.

> > +	.ctx_arg_info_size = 2,
> > +	.ctx_arg_info = {
> > +		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
> > +		  PTR_TO_BTF_ID },
> > +		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
> > +		  PTR_TO_BTF_ID_OR_NULL },
> > +	},
> > +	.seq_info	   = &bpf_io_uring_buf_seq_info,
> > +};
> > +
> > +static int __init io_uring_iter_init(void)
> > +{
> > +	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
> > +	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
> > +	return bpf_iter_reg_target(&io_uring_buf_reg_info);
> > +}
> > +late_initcall(io_uring_iter_init);
> > +
> > +#endif
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 56098c866704..ddb9d4520a3f 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >   	extern int bpf_iter_ ## target(args);			\
> >   	int __init bpf_iter_ ## target(args) { return 0; }
> > +struct io_ring_ctx;
> >   struct bpf_iter_aux_info {
> >   	struct bpf_map *map;
> > +	struct io_ring_ctx *ctx;
> >   };
>
> Can we use union here? Note that below bpf_iter_link_info in
> uapi/linux/bpf.h, map_fd and io_uring_fd is also an union.
>

So the reason I didn't use a union was the link->aux.map check in
bpf_iter.c::__get_seq_info. Even if we switch to a union in
bpf_iter_aux_info, that code needs some way to determine whether the
link is for a map iterator, so maybe a string comparison there? Leaving
map out of the union felt cleaner; also, I move both io_ring_ctx and the
eventpoll file into a union in a later patch.
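
For reference, the check in question looks roughly like this
(paraphrasing kernel/bpf/bpf_iter.c, not verbatim):

static const struct bpf_iter_seq_info *
__get_seq_info(struct bpf_iter_link *link)
{
	const struct bpf_iter_seq_info *seq_info;

	if (link->aux.map) {
		/* map element iterators take their seq_ops from map->ops */
		seq_info = link->aux.map->ops->iter_seq_info;
		if (seq_info)
			return seq_info;
	}

	return link->tinfo->reg_info->seq_info;
}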

> >   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 6297eafdc40f..3323defa99a1 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -91,6 +91,9 @@ union bpf_iter_link_info {
> >   	struct {
> >   		__u32	map_fd;
> >   	} map;
> > +	struct {
> > +		__u32   io_uring_fd;
> > +	} io_uring;
> >   };
> >   /* BPF syscall commands, see bpf(2) man-page for more details. */
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 6297eafdc40f..3323defa99a1 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -91,6 +91,9 @@ union bpf_iter_link_info {
> >   	struct {
> >   		__u32	map_fd;
> >   	} map;
> > +	struct {
> > +		__u32   io_uring_fd;
> > +	} io_uring;
> >   };
> >   /* BPF syscall commands, see bpf(2) man-page for more details. */
> >

--
Kartikeya
Yonghong Song Nov. 18, 2021, 7:13 p.m. UTC | #3
On 11/18/21 10:28 AM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:
>>
>>
>> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
>>> This change adds eBPF iterator for buffers registered in io_uring ctx.
>>> It gives access to the ctx, the index of the registered buffer, and a
>>> pointer to the io_uring_ubuf itself. This allows the iterator to save
>>> info related to buffers added to an io_uring instance, that isn't easy
>>> to export using the fdinfo interface (like exact struct page composing
>>> the registered buffer).
>>>
>>> The primary usecase this is enabling is checkpoint/restore support.
>>>
>>> Note that we need to use mutex_trylock when the file is read from, in
>>> seq_start functions, as the order of lock taken is opposite of what it
>>> would be when io_uring operation reads the same file.  We take
>>> seq_file->lock, then ctx->uring_lock, while io_uring would first take
>>> ctx->uring_lock and then seq_file->lock for the same ctx.
>>>
>>> This can lead to a deadlock scenario described below:
>>>
>>>         CPU 0				CPU 1
>>>
>>>         vfs_read
>>>         mutex_lock(&seq_file->lock)	io_read
>>> 					mutex_lock(&ctx->uring_lock)
>>>         mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
>>> 					mutex_lock(&seq_file->lock)
>>
>> It is not clear which mutex_lock switched to mutex_trylock.
> 
> The one in vfs_read.
> 
>>  From below example, it looks like &ctx->uring_lock. But if this is
>> the case, we could have deadlock, right?
>>
> 
> Sorry, I will make the commit message clearer in the next version.
> 
> The sequence on CPU 0 is for normal read(2) on iterator.
> For CPU 1, it is an io_uring instance trying to do same on iterator attached to
> itself.
> 
> So CPU 0 does
> 
> sys_read
> vfs_read
>   bpf_seq_read
>   mutex_lock(&seq_file->lock)    # A
>    io_uring_buf_seq_start
>    mutex_lock(&ctx->uring_lock)  # B
> 
> and CPU 1 does
> 
> io_uring_enter
> mutex_lock(&ctx->uring_lock)    # B
>   io_read
>    bpf_seq_read
>    mutex_lock(&seq_file->lock)   # A
>    ...
> 
> Since the order of locks is opposite, it can deadlock. So I switched the
> mutex_lock in io_uring_buf_seq_start to trylock, so it can return an error for
> this case, then it will release seq_file->lock and CPU 1 will make progress.
> 
> The second problem without use of trylock is described below (for same case of
> io_uring reading from iterator attached to itself).
> 
> Let me know if I missed something.

Thanks for the explanation. The above diagram is much better.

> 
>>>
>>> The trylock also protects the case where io_uring tries to read from
>>> iterator attached to itself (same ctx), where the order of locks would
>>> be:
>>>    io_uring_enter
>>>     mutex_lock(&ctx->uring_lock) <-----------.
>>>     io_read				    \
>>>      seq_read				     \
>>>       mutex_lock(&seq_file->lock)		     /
>>>       mutex_lock(&ctx->uring_lock) # deadlock-`
>>>
>>> In both these cases (recursive read and contended uring_lock), -EDEADLK
>>> is returned to userspace.
>>>
>>> In the future, this iterator will be extended to directly support
>>> iteration of bvec Flexible Array Member, so that when there is no
>>> corresponding VMA that maps to the registered buffer (e.g. if VMA is
>>> destroyed after pinning pages), we are able to reconstruct the
>>> registration on restore by dumping the page contents and then replaying
>>> them into a temporary mapping used for registration later. All this is
>>> out of scope for the current series however, but builds upon this
>>> iterator.
>>>
>>> Cc: Jens Axboe <axboe@kernel.dk>
>>> Cc: Pavel Begunkov <asml.silence@gmail.com>
>>> Cc: io-uring@vger.kernel.org
>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>> ---
>>>    fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
>>>    include/linux/bpf.h            |   2 +
>>>    include/uapi/linux/bpf.h       |   3 +
>>>    tools/include/uapi/linux/bpf.h |   3 +
>>>    4 files changed, 187 insertions(+)
>>>
[...]
> 
>>> [...]
>>> +static struct bpf_iter_reg io_uring_buf_reg_info = {
>>> +	.target            = "io_uring_buf",
>>> +	.feature	   = BPF_ITER_RESCHED,
>>> +	.attach_target     = bpf_io_uring_iter_attach,
>>> +	.detach_target     = bpf_io_uring_iter_detach,
>>
>> Since you have this extra `io_uring_fd` for the iterator, you may want
>> to implement show_fdinfo and fill_link_info callback functions here.
>>
> 
> Ack, but some questions:
> 
> What should it have? e.g. it easy to go from map_id to map fd if one wants
> access to the map attached to the iterator, but not sure how one can obtain more
> information about target fd from io_uring or epoll iterators.

Just to be clear, I am talking about the uapi struct bpf_link_info.
I agree that the fd is not really useful. So I guess it is up to you
whether you want to show the fd to the user or not. We can always
add it later if needed.

> 
> Should I delegate to their show_fdinfo and dump using that?
> 
> The type/target is already available in link_info, not sure what other useful
> information can be added there, which allows obtaining the io_uring/epoll fd.
> 
>>> +	.ctx_arg_info_size = 2,
>>> +	.ctx_arg_info = {
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
>>> +		  PTR_TO_BTF_ID },
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
>>> +		  PTR_TO_BTF_ID_OR_NULL },
>>> +	},
>>> +	.seq_info	   = &bpf_io_uring_buf_seq_info,
>>> +};
>>> +
>>> +static int __init io_uring_iter_init(void)
>>> +{
>>> +	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
>>> +	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
>>> +	return bpf_iter_reg_target(&io_uring_buf_reg_info);
>>> +}
>>> +late_initcall(io_uring_iter_init);
>>> +
>>> +#endif
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 56098c866704..ddb9d4520a3f 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>>>    	extern int bpf_iter_ ## target(args);			\
>>>    	int __init bpf_iter_ ## target(args) { return 0; }
>>> +struct io_ring_ctx;
>>>    struct bpf_iter_aux_info {
>>>    	struct bpf_map *map;
>>> +	struct io_ring_ctx *ctx;
>>>    };
>>
>> Can we use union here? Note that below bpf_iter_link_info in
>> uapi/linux/bpf.h, map_fd and io_uring_fd is also an union.
>>
> 
> So the reason I didn't use a union was the link->aux.map check in
> bpf_iter.c::__get_seq_info. Even if we switch to union bpf_iter_aux_info, it
> needs some way to determine whether link is for map type, so maybe a string
> comparison there? Leaving it out of union felt cleaner, also I move both
> io_ring_ctx and eventpoll file into a union in later patch.

I see. The seq_ops for the map element iterator is different from the
others; it comes not from the main iter registration but from map_ops.

I think your change is okay. But maybe add a comment in struct
bpf_iter_aux_info to explain why map is different from the others.

> 
>>>    typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>    	struct {
>>>    		__u32	map_fd;
>>>    	} map;
>>> +	struct {
>>> +		__u32   io_uring_fd;
>>> +	} io_uring;
>>>    };
>>>    /* BPF syscall commands, see bpf(2) man-page for more details. */
>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/tools/include/uapi/linux/bpf.h
>>> +++ b/tools/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>    	struct {
>>>    		__u32	map_fd;
>>>    	} map;
>>> +	struct {
>>> +		__u32   io_uring_fd;
>>> +	} io_uring;
>>>    };
>>>    /* BPF syscall commands, see bpf(2) man-page for more details. */
>>>
> 
> --
> Kartikeya
>
Alexei Starovoitov Nov. 18, 2021, 10:02 p.m. UTC | #4
On Tue, Nov 16, 2021 at 11:12:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> This change adds eBPF iterator for buffers registered in io_uring ctx.
> It gives access to the ctx, the index of the registered buffer, and a
> pointer to the io_uring_ubuf itself. This allows the iterator to save
> info related to buffers added to an io_uring instance, that isn't easy
> to export using the fdinfo interface (like exact struct page composing
> the registered buffer).
> 
> The primary usecase this is enabling is checkpoint/restore support.
> 
> Note that we need to use mutex_trylock when the file is read from, in
> seq_start functions, as the order of lock taken is opposite of what it
> would be when io_uring operation reads the same file.  We take
> seq_file->lock, then ctx->uring_lock, while io_uring would first take
> ctx->uring_lock and then seq_file->lock for the same ctx.
> 
> This can lead to a deadlock scenario described below:
> 
>       CPU 0				CPU 1
> 
>       vfs_read
>       mutex_lock(&seq_file->lock)	io_read
> 					mutex_lock(&ctx->uring_lock)
>       mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> 					mutex_lock(&seq_file->lock)
> 
> The trylock also protects the case where io_uring tries to read from
> iterator attached to itself (same ctx), where the order of locks would
> be:
>  io_uring_enter
>   mutex_lock(&ctx->uring_lock) <-----------.
>   io_read				    \
>    seq_read				     \
>     mutex_lock(&seq_file->lock)		     /
>     mutex_lock(&ctx->uring_lock) # deadlock-`
> 
> In both these cases (recursive read and contended uring_lock), -EDEADLK
> is returned to userspace.
> 
> In the future, this iterator will be extended to directly support
> iteration of bvec Flexible Array Member, so that when there is no
> corresponding VMA that maps to the registered buffer (e.g. if VMA is
> destroyed after pinning pages), we are able to reconstruct the
> registration on restore by dumping the page contents and then replaying
> them into a temporary mapping used for registration later. All this is
> out of scope for the current series however, but builds upon this
> iterator.

From the BPF infra perspective these new iterators fit very well, and
I don't see any issues maintaining this interface while the kernel keeps
changing, but this commit log and the shallowness of the selftests
make me question the feasibility of this approach, in particular with
io_uring. Is it even possible to scan all the internal bits of io_uring
and reconstruct them later? The bpf iter is only the read part. Don't
you need the write part for CRIU? Even for reads only... io_uring has
complex inner state, like bpf itself, which cannot realistically be
CRIU-ed. I don't think we can merge this in pieces. We need to wait
until there is a full working CRIU framework that uses these new
iterators.
Kumar Kartikeya Dwivedi Nov. 19, 2021, 4:15 a.m. UTC | #5
On Fri, Nov 19, 2021 at 03:32:26AM IST, Alexei Starovoitov wrote:
> On Tue, Nov 16, 2021 at 11:12:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> > This change adds eBPF iterator for buffers registered in io_uring ctx.
> > It gives access to the ctx, the index of the registered buffer, and a
> > pointer to the io_uring_ubuf itself. This allows the iterator to save
> > info related to buffers added to an io_uring instance, that isn't easy
> > to export using the fdinfo interface (like exact struct page composing
> > the registered buffer).
> >
> > The primary usecase this is enabling is checkpoint/restore support.
> >
> > Note that we need to use mutex_trylock when the file is read from, in
> > seq_start functions, as the order of lock taken is opposite of what it
> > would be when io_uring operation reads the same file.  We take
> > seq_file->lock, then ctx->uring_lock, while io_uring would first take
> > ctx->uring_lock and then seq_file->lock for the same ctx.
> >
> > This can lead to a deadlock scenario described below:
> >
> >       CPU 0				CPU 1
> >
> >       vfs_read
> >       mutex_lock(&seq_file->lock)	io_read
> > 					mutex_lock(&ctx->uring_lock)
> >       mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> > 					mutex_lock(&seq_file->lock)
> >
> > The trylock also protects the case where io_uring tries to read from
> > iterator attached to itself (same ctx), where the order of locks would
> > be:
> >  io_uring_enter
> >   mutex_lock(&ctx->uring_lock) <-----------.
> >   io_read				    \
> >    seq_read				     \
> >     mutex_lock(&seq_file->lock)		     /
> >     mutex_lock(&ctx->uring_lock) # deadlock-`
> >
> > In both these cases (recursive read and contended uring_lock), -EDEADLK
> > is returned to userspace.
> >
> > In the future, this iterator will be extended to directly support
> > iteration of bvec Flexible Array Member, so that when there is no
> > corresponding VMA that maps to the registered buffer (e.g. if VMA is
> > destroyed after pinning pages), we are able to reconstruct the
> > registration on restore by dumping the page contents and then replaying
> > them into a temporary mapping used for registration later. All this is
> > out of scope for the current series however, but builds upon this
> > iterator.
>
> From BPF infra perspective these new iterators fit very well and
> I don't see any issues maintaining this interface while kernel keeps
> changing, but this commit log and shallowness of the selftests
> makes me question feasibility of this approach in particular with io_uring.
> Is it even possible to scan all internal bits of io_uring and reconstruct
> it later? The bpf iter is only the read part. Don't you need the write part
> for CRIU ? Even for reads only... io_uring has complex inner state.

Yes, the inner state is complex and often entangled with other task
attributes, like credentials, mm, file descriptors, other io_uring fds
(wq_fd, etc.), but for now these iterators are a good starting point to
implement the majority of the missing features in CRIU userspace. These
iterators (including the task* ones) and procfs allow us to collect
enough state to correlate various resources and form relationships
(e.g. which fd or buffer of which task(s) is registered, which io_uring
was used for wq_fd registration, or which eventfd was registered, etc.).
Thanks to access to io_ring_ctx, and the iter being a tracing prog, we
can do the usual pointer access to read some of that data.

> Like bpf itself which cannot be realistically CRIU-ed.
> I don't think we can merge this in pieces. We need to wait until there is
> full working CRIU framework that uses these new iterators.

I don't intend to add a 'write' framework. The usual process with CRIU
is to gather enough state to reconstruct the kernel resource by
repeating steps similar to what it might have performed during its
lifetime, e.g. approximating a trace of execution that leads to the same
task state and then repeating that on restore.

While a 'write' framework with first-class kernel support for
checkpoint/restore would actually make CRIU's life a lot simpler, there
has usually been a lot of pushback against interfaces like that, so the
approach has been to add features that can extract the relevant
information out of the kernel and do the rest of the work in userspace.

E.g. if we find that fd 1 and 2 are registered in the io_uring, and a
buffer at 0xabcd with len 128, then we first restore fd 1 and 2 (which
can be anything) at restore time; after this restore is complete, we
continue restoration of the io_uring by executing the file registration
step for both [0], and, depending on whether a mapping for 0xabcd
existed in the task mm, restore the buffer once the mm of the task has
been restored, or otherwise map a temporary buffer, replay the data,
register it, and destroy the mapping. This would be similar to what
might have happened in the task itself. The correct order of events for
the restore will be managed by CRIU itself.
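
A rough user-space sketch of that replay step (liburing; the fds, buffer
address and length are placeholders for values recovered at dump time,
and error handling is minimal):

#include <liburing.h>
#include <sys/uio.h>

/* Sketch: replay file and buffer registration into a freshly restored
 * ring. Real CRIU code would order this after fd-table and mm restore.
 */
static int replay_io_uring_registrations(struct io_uring *ring,
					 int *fds, unsigned nr_fds,
					 void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	int ret;

	ret = io_uring_register_files(ring, fds, nr_fds);
	if (ret < 0)
		return ret;

	/* If the original VMA no longer exists, buf would be a temporary
	 * mapping pre-filled with the dumped page contents.
	 */
	return io_uring_register_buffers(ring, &iov, 1);
}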

It doesn't have to be exactly the same steps, just enough for the task
to not notice a difference and not break its assumptions, and to lead to
the same end result of task and resource state.

Even the task of determining which fdtable fd 1 and 2 belong to is
non-trivial and inefficient right now. You can look at [0] for a similar
case that was solved for epoll, which works, but we can do better (BPF
didn't exist at that time, so it was the best we could do back then).

Also, this work is part of GSoC. There is already code that is waiting
for this to fill in the missing pieces [0]. If you want me to add a
sample/selftest that demonstrates/tests how this can be used to
reconstruct a task's io_uring, I can certainly do that. We've already
spent a few months contemplating a few approaches and this turned out to
be the best/most powerful. At one point I had to scrap some of my
earlier patches completely because they couldn't work with
descriptorless io_uring. Iterators seem like the best solution so far
that can adapt gracefully to feature additions in something under
development as heavy as io_uring.

  [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
  [1]: https://github.com/checkpoint-restore/criu/pull/1597

--
Kartikeya
Kumar Kartikeya Dwivedi Nov. 19, 2021, 4:44 a.m. UTC | #6
On Fri, Nov 19, 2021 at 09:45:23AM IST, Kumar Kartikeya Dwivedi wrote:
> [...]
> E.g. if we find that fd 1 and 2 are registered in the io_uring, and buffer at
> 0xabcd with len 128, then we first restore fd 1 and 2 (which can be anything)
> at restore time, then after this restore is complete, continue restoration of
> io_uring by executing file registration step for both [0], and depending on if

This was meant to link to https://criu.org/Fdinfo_engine.

> [...]

--
Kartikeya
Alexei Starovoitov Nov. 19, 2021, 4:56 a.m. UTC | #7
On Fri, Nov 19, 2021 at 09:45:23AM +0530, Kumar Kartikeya Dwivedi wrote:
> 
> Also, this work is part of GSoC. There is already code that is waiting for this
> to fill in the missing pieces [0]. If you want me to add a sample/selftest that
> demonstrates/tests how this can be used to reconstruct a task's io_uring, I can
> certainly do that. We've already spent a few months contemplating on a few
> approaches and this turned out to be the best/most powerful. At one point I had
> to scrap some my earlier patches completely because they couldn't work with
> descriptorless io_uring. Iterator seem like the best solution so far that can
> adapt gracefully to feature additions in something seeing as heavy development
> as io_uring.
> 
>   [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
>   [1]: https://github.com/checkpoint-restore/criu/pull/1597

Is that the main PR? 1095 changed files? Is it stale or something?
Is there a way to view the actual logic that exercises these bpf iterators?
Kumar Kartikeya Dwivedi Nov. 19, 2021, 5:16 a.m. UTC | #8
On Fri, Nov 19, 2021 at 10:26:59AM IST, Alexei Starovoitov wrote:
> On Fri, Nov 19, 2021 at 09:45:23AM +0530, Kumar Kartikeya Dwivedi wrote:
> >
> > Also, this work is part of GSoC. There is already code that is waiting for this
> > to fill in the missing pieces [0]. If you want me to add a sample/selftest that
> > demonstrates/tests how this can be used to reconstruct a task's io_uring, I can
> > certainly do that. We've already spent a few months contemplating on a few
> > approaches and this turned out to be the best/most powerful. At one point I had
> > to scrap some my earlier patches completely because they couldn't work with
> > descriptorless io_uring. Iterator seem like the best solution so far that can
> > adapt gracefully to feature additions in something seeing as heavy development
> > as io_uring.
> >
> >   [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
> >   [1]: https://github.com/checkpoint-restore/criu/pull/1597
>
> Is that the main PR? 1095 changed files? Is it stale or something?
> Is there a way to view the actual logic that exercises these bpf iterators?

No, there is no code exercising the BPF iterators in that PR yet (since
it wouldn't build/run in CI). There is some code I have locally that
uses these to collect the necessary information; I can post that either
as a sample or selftest in the next version, or separately on GH for you
to take a look.

I still rebased it so that you can see the rest of the actual code.

--
Kartikeya
Alexei Starovoitov Nov. 19, 2021, 5:24 a.m. UTC | #9
On Thu, Nov 18, 2021 at 9:17 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, Nov 19, 2021 at 10:26:59AM IST, Alexei Starovoitov wrote:
> > On Fri, Nov 19, 2021 at 09:45:23AM +0530, Kumar Kartikeya Dwivedi wrote:
> > >
> > > Also, this work is part of GSoC. There is already code that is waiting for this
> > > to fill in the missing pieces [0]. If you want me to add a sample/selftest that
> > > demonstrates/tests how this can be used to reconstruct a task's io_uring, I can
> > > certainly do that. We've already spent a few months contemplating on a few
> > > approaches and this turned out to be the best/most powerful. At one point I had
> > > to scrap some my earlier patches completely because they couldn't work with
> > > descriptorless io_uring. Iterator seem like the best solution so far that can
> > > adapt gracefully to feature additions in something seeing as heavy development
> > > as io_uring.
> > >
> > >   [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
> > >   [1]: https://github.com/checkpoint-restore/criu/pull/1597
> >
> > Is that the main PR? 1095 changed files? Is it stale or something?
> > Is there a way to view the actual logic that exercises these bpf iterators?
>
> No, there is no code exercising BPF iterator in that PR yet (since it wouldn't
> build/run in CI). There's some code I have locally that uses these to collect
> the necessary information, I can post that, either as a sample or selftest in
> the next version, or separately on GH for you to take a look.
>
> I still rebased it so that you can see the rest of the actual code.

I would like to see a working end-to-end solution.

Also I'd like to hear what Jens and Pavel have to say about the
applicability of CRIU to io_uring in general.
Kumar Kartikeya Dwivedi Nov. 19, 2021, 6:12 a.m. UTC | #10
On Fri, Nov 19, 2021 at 10:54:21AM IST, Alexei Starovoitov wrote:
> On Thu, Nov 18, 2021 at 9:17 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Fri, Nov 19, 2021 at 10:26:59AM IST, Alexei Starovoitov wrote:
> > > On Fri, Nov 19, 2021 at 09:45:23AM +0530, Kumar Kartikeya Dwivedi wrote:
> > > >
> > > > Also, this work is part of GSoC. There is already code that is waiting for this
> > > > to fill in the missing pieces [0]. If you want me to add a sample/selftest that
> > > > demonstrates/tests how this can be used to reconstruct a task's io_uring, I can
> > > > certainly do that. We've already spent a few months contemplating on a few
> > > > approaches and this turned out to be the best/most powerful. At one point I had
> > > > to scrap some my earlier patches completely because they couldn't work with
> > > > descriptorless io_uring. Iterator seem like the best solution so far that can
> > > > adapt gracefully to feature additions in something seeing as heavy development
> > > > as io_uring.
> > > >
> > > >   [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
> > > >   [1]: https://github.com/checkpoint-restore/criu/pull/1597
> > >
> > > Is that the main PR? 1095 changed files? Is it stale or something?
> > > Is there a way to view the actual logic that exercises these bpf iterators?
> >
> > No, there is no code exercising BPF iterator in that PR yet (since it wouldn't
> > build/run in CI). There's some code I have locally that uses these to collect
> > the necessary information, I can post that, either as a sample or selftest in
> > the next version, or separately on GH for you to take a look.
> >
> > I still rebased it so that you can see the rest of the actual code.
>
> I would like to see a working end to end solution.
>

Understood, I'll address this in v2.

> Also I'd like to hear what Jens and Pavel have to say about
> applicability of CRIU to io_uring in general.

--
Kartikeya
Pavel Begunkov Dec. 3, 2021, 3:52 p.m. UTC | #11
On 11/19/21 05:24, Alexei Starovoitov wrote:
> On Thu, Nov 18, 2021 at 9:17 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
>>
>> On Fri, Nov 19, 2021 at 10:26:59AM IST, Alexei Starovoitov wrote:
>>> On Fri, Nov 19, 2021 at 09:45:23AM +0530, Kumar Kartikeya Dwivedi wrote:
>>>>
>>>> Also, this work is part of GSoC. There is already code that is waiting for this
>>>> to fill in the missing pieces [0]. If you want me to add a sample/selftest that
>>>> demonstrates/tests how this can be used to reconstruct a task's io_uring, I can
>>>> certainly do that. We've already spent a few months contemplating on a few
>>>> approaches and this turned out to be the best/most powerful. At one point I had
>>>> to scrap some my earlier patches completely because they couldn't work with
>>>> descriptorless io_uring. Iterator seem like the best solution so far that can
>>>> adapt gracefully to feature additions in something seeing as heavy development
>>>> as io_uring.
>>>>
>>>>    [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
>>>>    [1]: https://github.com/checkpoint-restore/criu/pull/1597
>>>
>>> Is that the main PR? 1095 changed files? Is it stale or something?
>>> Is there a way to view the actual logic that exercises these bpf iterators?
>>
>> No, there is no code exercising BPF iterator in that PR yet (since it wouldn't
>> build/run in CI). There's some code I have locally that uses these to collect
>> the necessary information, I can post that, either as a sample or selftest in
>> the next version, or separately on GH for you to take a look.
>>
>> I still rebased it so that you can see the rest of the actual code.
> 
> I would like to see a working end to end solution.
> 
> Also I'd like to hear what Jens and Pavel have to say about
> applicability of CRIU to io_uring in general.

First, we have no way to know what requests are in flight; without
that, CR doesn't make much sense. The most compelling way for me is to
add a feature to fail all in-flight requests, as happens when the ring
is closed. But maybe you already solved it somehow?

There is probably a way to restore registered buffers and files, though
it may be tough considering that files may not have corresponding fds in
userspace, buffers may be unmapped, buffers may come from shmem, etc.,
and other corner cases.

There are also pieces of state not covered here: SELECT_BUFFER buffers,
personalities (aka creds), registered eventfd, io-wq configuration, etc.
I'm assuming you'll be checking for them and failing CR if any of them
are there.

And the last point: there will be some stuff for which CR is likely to
be a bad idea, e.g. registered dmabufs, pre-registered DMA mappings,
zerocopy contexts and so on.

IOW, if the first point is solved, there may be a subset of ring setups
that can probably be CR'd. That should cover a good number of cases.
I don't have a strong opinion on the whole thing; I guess it depends on
how hard it is to implement in-flight cancellation.
Kumar Kartikeya Dwivedi Dec. 3, 2021, 11:16 p.m. UTC | #12
On Fri, Dec 03, 2021 at 09:22:54PM IST, Pavel Begunkov wrote:
> On 11/19/21 05:24, Alexei Starovoitov wrote:
> > [...]
> >
> > Also I'd like to hear what Jens and Pavel have to say about
> > applicability of CRIU to io_uring in general.
>

Hi Pavel, thanks for taking a look!

> First, we have no way to know what requests are in flight, without it
> CR doesn't make much sense. The most compelling way for me is to add
> a feature to fail all in-flights as it does when is closed. But maybe,
> you already did solve it somehow?

Indeed, as you note, there is currently no way to inspect in-flight
requests; what we can do is wait for a barrier operation to synchronize
against all previous requests.

So for now, my idea is to drain the ring (waiting for all in-flight
requests to complete) by using an IOSQE_IO_DRAIN IORING_OP_NOP and then
waiting with a fixed timeout, so that if forward progress depends on a
blocked task (or on ourselves) we can fail the dump. This is of course
best effort, but it has worked well for many of the cases I tested so
far.
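
A rough user-space sketch of that drain-with-timeout step (liburing; the
timeout is a placeholder, and a real dumper would record rather than
discard any unrelated CQEs it reaps while waiting):

#include <liburing.h>

#define DRAIN_TAG 0xC0FFEEULL	/* placeholder user_data to spot our NOP */

static int drain_ring_for_dump(struct io_uring *ring)
{
	struct __kernel_timespec ts = { .tv_sec = 5 };	/* placeholder */
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	int done = 0, ret;

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -EBUSY;
	/* IOSQE_IO_DRAIN: this NOP completes only after everything
	 * submitted before it has completed.
	 */
	io_uring_prep_nop(sqe);
	sqe->flags |= IOSQE_IO_DRAIN;
	sqe->user_data = DRAIN_TAG;

	ret = io_uring_submit(ring);
	if (ret < 0)
		return ret;

	while (!done) {
		ret = io_uring_wait_cqe_timeout(ring, &cqe, &ts);
		if (ret < 0)
			return ret;	/* e.g. -ETIME: give up on the dump */
		done = (cqe->user_data == DRAIN_TAG);
		io_uring_cqe_seen(ring, cqe);
	}
	return 0;
}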

This might have some other issues, e.g. not being able to accommodate
all posted completions in the CQ ring due to unreaped completions from
when it was checkpointed. In that case we can simply give up, since
recreating the ring as it was becomes very hard if we let it trample
over unread items (it is unclear how I can inject completions at restore
that were on the overflow list at dump).

One idea I had in mind was to add support for posting a dummy CQE (by
submitting e.g. an IORING_OP_NOP) whose CQE fields are set at submission
time. This allows faking a completed request; at restore we can then
push all of these onto the overflow list and project the state as if the
CQ ring had been full. At dump time it allows us to continually reap
completion items. If we detect that the kernel doesn't support overflow,
we fail.

Adjusting the kernel-side tail is not as hard (we can use IORING_OP_NOP
completions to fill it up, then rewrite the entries).

There were other operations (like registering buffers) that had a
similar side effect of synchronizing the ring state (waiting for it to
become idle) before returning to userspace, but that was pre-5.13.

Also, we have to do this ring synchronization fairly early in the dump,
since it can lead to the addition of other resources (e.g. fds) to the
task that then need to be dumped again.

> There is probably a way to restore registered buffers and files, though
> it may be tough considering that files may not have corresponding fds in
> the userspace, buffers may be unmapped, buffers may come from
> shmem/etc. and other corner cases.

See [0] for some explanation of all that. CRIU also knows whether a
given VMA comes from shmem or not (and its restoration is already
handled separately).

>
> There are also not covered here pieces of state, SELECT_BUFFER
> buffers, personalities (aka creds), registered eventfd, io-wq
> configuration, etc. I'm assuming you'll be checking them and
> failing CR if any of them is there.

Personalities are not as hard (IIUC), because all the required state is
available through fdinfo. In the PR linked in this thread, there is code
to parse it and restore using the saved credentials (though we might
want to implement UID mapping options, or let the user do image
rewriting for that, which is a separate concern).

Ideally I'd like to be able to grab this state from the iterator as
well, but that needs at least a bpf_xa_for_each helper, since io_uring's
show_fdinfo skips some crucial data when it detects contention on
uring_lock (and doesn't indicate this at all) :(. See the conditional
printing on 'has_lock'.

SELECT_BUFFER is indeed unhandled right now. I'm contemplating how to extend
the iterator so that it can loop over all items of generic structures like
xarrays, while taking whatever locks are relevant for the specific hook. Both
personality registration and IORING_OP_PROVIDE_BUFFERS insertion use an xarray,
so it might make sense to add a bpf_xa_for_each rather than introducing another
iterator, and only mark it as safe in this iterator context (where the
appropriate locks, e.g. ctx->uring_lock, are held).
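
To make the idea concrete, something along the lines below is what I am
imagining; this is entirely hypothetical, neither the helper nor the callback
convention exists today, and vmlinux.h is assumed to come from a kernel
carrying this series:

/* Hypothetical sketch of a callback-based bpf_xa_for_each, used from a program
 * attached to this iterator (where the iterator infrastructure already holds
 * ctx->uring_lock).
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

static long personality_cb(unsigned long index, void *entry, void *seq)
{
        /* record the personality id; creds would be read through 'entry' */
        BPF_SEQ_PRINTF((struct seq_file *)seq, "personality: %lu\n", index);
        return 0;
}

SEC("iter/io_uring_buf")
int dump_personalities(struct bpf_iter__io_uring_buf *ctx)
{
        struct io_ring_ctx *ring_ctx = ctx->ctx;

        if (!ring_ctx)
                return 0;
        /* hypothetical helper: walk the xarray under uring_lock */
        bpf_xa_for_each(&ring_ctx->personalities, personality_cb,
                        ctx->meta->seq);
        return 0;
}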

For the registered eventfd and io-wq, you can look at [0] to see how I am
solving that; TL;DR, I just map the underlying structure to the open fd in the
task. eBPF is flexible enough to also allow state inspection in case e.g. the
corresponding eventfd has been closed, so that we can recreate it, register it,
and then close it again when restoring. The same goes for files directly added
to the fixed file set: the whole idea of bringing in eBPF was to make dumping
these resources possible when they are "hidden" from normal view.

[0]: https://lore.kernel.org/bpf/20211201042333.2035153-11-memxor@gmail.com

>
> And the last point, there will be some stuff CR of which is
> likely to be a bad idea. E.g. registered dmabuf's,
> pre-registered DMA mappings, zerocopy contexts and so on.
>

Yes, we can just fail the dump for these cases. There are many other cases (in
general) where we just have to give up.

> IOW, if the first point is solved, there may be a subset of ring
> setups that can probably be CR. That should cover a good amount
> of cases. I don't have a strong opinion on the whole thing,
> I guess it depends on the amount of problems to implement
> in-flight cancellations.
>
> --
> Pavel Begunkov

--
Kartikeya
diff mbox series

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b07196b4511c..46a110989155 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -81,6 +81,7 @@ 
 #include <linux/tracehook.h>
 #include <linux/audit.h>
 #include <linux/security.h>
+#include <linux/btf_ids.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -11125,3 +11126,181 @@  static int __init io_uring_init(void)
 	return 0;
 };
 __initcall(io_uring_init);
+
+#ifdef CONFIG_BPF_SYSCALL
+
+BTF_ID_LIST(btf_io_uring_ids)
+BTF_ID(struct, io_ring_ctx)
+BTF_ID(struct, io_mapped_ubuf)
+
+struct bpf_io_uring_seq_info {
+	struct io_ring_ctx *ctx;
+	unsigned long index;
+};
+
+static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	struct bpf_io_uring_seq_info *info = priv_data;
+	struct io_ring_ctx *ctx = aux->ctx;
+
+	info->ctx = ctx;
+	return 0;
+}
+
+static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
+				    union bpf_iter_link_info *linfo,
+				    struct bpf_iter_aux_info *aux)
+{
+	struct io_ring_ctx *ctx;
+	struct fd f;
+	int ret;
+
+	f = fdget(linfo->io_uring.io_uring_fd);
+	if (unlikely(!f.file))
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (unlikely(f.file->f_op != &io_uring_fops))
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (unlikely(!percpu_ref_tryget(&ctx->refs)))
+		goto out_fput;
+
+	ret = 0;
+	aux->ctx = ctx;
+
+out_fput:
+	fdput(f);
+	return ret;
+}
+
+static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
+{
+	percpu_ref_put(&aux->ctx->refs);
+}
+
+/* io_uring iterator for registered buffers */
+
+struct bpf_iter__io_uring_buf {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct io_ring_ctx *, ctx);
+	__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
+	unsigned long index;
+};
+
+static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
+{
+	if (info->index < info->ctx->nr_user_bufs)
+		return info->ctx->user_bufs[info->index++];
+	return NULL;
+}
+
+static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct io_mapped_ubuf *ubuf;
+
+	/* Indicate to userspace that the uring lock is contended */
+	if (!mutex_trylock(&info->ctx->uring_lock))
+		return ERR_PTR(-EDEADLK);
+
+	ubuf = __bpf_io_uring_buf_seq_get_next(info);
+	if (!ubuf)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+	return ubuf;
+}
+
+static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	++*pos;
+	return __bpf_io_uring_buf_seq_get_next(info);
+}
+
+DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
+		     struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
+		     unsigned long index)
+
+static int __bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct bpf_iter__io_uring_buf ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (!prog)
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.ctx = info->ctx;
+	ctx.ubuf = v;
+	ctx.index = info->index ? info->index - !in_stop : 0;
+
+	return bpf_iter_run_prog(prog, &ctx);
+}
+
+static int bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v)
+{
+	return __bpf_io_uring_buf_seq_show(seq, v, false);
+}
+
+static void bpf_io_uring_buf_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	/* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */
+	if (IS_ERR(v))
+		return;
+	if (!v)
+		__bpf_io_uring_buf_seq_show(seq, v, true);
+	else if (info->index) /* restart from index */
+		info->index--;
+	mutex_unlock(&info->ctx->uring_lock);
+}
+
+static const struct seq_operations bpf_io_uring_buf_seq_ops = {
+	.start = bpf_io_uring_buf_seq_start,
+	.next  = bpf_io_uring_buf_seq_next,
+	.stop  = bpf_io_uring_buf_seq_stop,
+	.show  = bpf_io_uring_buf_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_io_uring_buf_seq_info = {
+	.seq_ops          = &bpf_io_uring_buf_seq_ops,
+	.init_seq_private = bpf_io_uring_init_seq,
+	.fini_seq_private = NULL,
+	.seq_priv_size    = sizeof(struct bpf_io_uring_seq_info),
+};
+
+static struct bpf_iter_reg io_uring_buf_reg_info = {
+	.target            = "io_uring_buf",
+	.feature	   = BPF_ITER_RESCHED,
+	.attach_target     = bpf_io_uring_iter_attach,
+	.detach_target     = bpf_io_uring_iter_detach,
+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info	   = &bpf_io_uring_buf_seq_info,
+};
+
+static int __init io_uring_iter_init(void)
+{
+	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
+	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
+	return bpf_iter_reg_target(&io_uring_buf_reg_info);
+}
+late_initcall(io_uring_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 56098c866704..ddb9d4520a3f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1509,8 +1509,10 @@  int bpf_obj_get_user(const char __user *pathname, int flags);
 	extern int bpf_iter_ ## target(args);			\
 	int __init bpf_iter_ ## target(args) { return 0; }
 
+struct io_ring_ctx;
 struct bpf_iter_aux_info {
 	struct bpf_map *map;
+	struct io_ring_ctx *ctx;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6297eafdc40f..3323defa99a1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@  union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u32   io_uring_fd;
+	} io_uring;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6297eafdc40f..3323defa99a1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@  union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u32   io_uring_fd;
+	} io_uring;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
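
For completeness, this is roughly how the new target is meant to be consumed;
the selftests later in this series do it properly, and the snippet below is a
hand-written sketch assuming a kernel with this patch and a vmlinux.h generated
from it (program and skeleton names are made up):

/* iter.bpf.c: dump each registered buffer of the attached ring */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iter/io_uring_buf")
int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx)
{
        struct io_mapped_ubuf *ubuf = ctx->ubuf;

        if (!ubuf)
                return 0;
        BPF_SEQ_PRINTF(ctx->meta->seq, "index:%lu addr:%llu len:%llu nr_bvecs:%u\n",
                       ctx->index, ubuf->ubuf, ubuf->ubuf_end - ubuf->ubuf,
                       ubuf->nr_bvecs);
        return 0;
}

And a userspace loader that attaches it to an existing io_uring fd via the new
bpf_iter_link_info field (error handling simplified):

/* iter.c: attach the iterator to 'io_uring_fd' and print its output */
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include "iter.skel.h"		/* hypothetical skeleton from iter.bpf.c */

int read_registered_bufs(int io_uring_fd)
{
        DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
        union bpf_iter_link_info linfo = {};
        struct bpf_link *link;
        struct iter *skel;
        char buf[256];
        int iter_fd, len, ret = -1;

        skel = iter__open_and_load();
        if (!skel)
                return -1;

        linfo.io_uring.io_uring_fd = io_uring_fd;
        opts.link_info = &linfo;
        opts.link_info_len = sizeof(linfo);

        link = bpf_program__attach_iter(skel->progs.dump_io_uring_buf, &opts);
        if (!link)
                goto out;

        iter_fd = bpf_iter_create(bpf_link__fd(link));
        if (iter_fd < 0)
                goto out_link;

        while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
                fwrite(buf, 1, len, stdout);
        ret = 0;

        close(iter_fd);
out_link:
        bpf_link__destroy(link);
out:
        iter__destroy(skel);
        return ret;
}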