diff mbox series

[RESEND,x3,v9,1/9] iov_iter: add copy_struct_from_iter()

Message ID 6caae597eb20da5ea23e53e8e64ce0c4f4d9c6d2.1623972519.git.osandov@fb.com (mailing list archive)
State New
Headers show
Series fs: interface for directly reading/writing compressed data | expand

Commit Message

Omar Sandoval June 17, 2021, 11:51 p.m. UTC
From: Omar Sandoval <osandov@fb.com>

This is essentially copy_struct_from_user() but for an iov_iter.

Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 91 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+)

Comments

Linus Torvalds June 18, 2021, 6:50 p.m. UTC | #1
On Thu, Jun 17, 2021 at 4:51 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> This is essentially copy_struct_from_user() but for an iov_iter.

So I continue to think that this series looks fine - if we want this
interface at all.

I do note a few issues with this iov patch, though - partly probably
because I have been reading Al's cleanup patches that had some
optimizations in place.

And in particular, I now react to this:

> +       iov_iter_advance(i, usize);

at the end of copy_struct_from_iter().

It's very wasteful to use the generic iov_iter_advance() function,
when you just had special functions for each of the iterator cases.

Because that generic function will now just end up re-testing that
whole "what kind was it" and then do each kind separately.

So it would actually be a lot simpler and m,ore efficient to just do
that "advance" part as you go through the cases, iow just do

        iov_iter_iovec_advance(i, usize);

at the end of the iter_is_iovec/iter_is_kvec cases, and

        iov_iter_bvec_advance(i, usize)

for the bvec case.

I think that you may need it to be based on Al's series for that to
work, which might be inconvenient, though.

One other non-code issue: particularly since you only handle a subset
of the iov_iter cases, it would be nice to have an explanation for
_why_ those particular cases.

IOW, have some trivial explanation for each of the cases. "iovec" is
for regular read/write, what triggers the kvec and bvec cases?

But also, the other way around. Why doesn't the pipe case trigger? No
splice support?

              Linus
Al Viro June 18, 2021, 7:42 p.m. UTC | #2
On Fri, Jun 18, 2021 at 11:50:25AM -0700, Linus Torvalds wrote:

> I think that you may need it to be based on Al's series for that to
> work, which might be inconvenient, though.
> 
> One other non-code issue: particularly since you only handle a subset
> of the iov_iter cases, it would be nice to have an explanation for
> _why_ those particular cases.
> 
> IOW, have some trivial explanation for each of the cases. "iovec" is
> for regular read/write, what triggers the kvec and bvec cases?
> 
> But also, the other way around. Why doesn't the pipe case trigger? No
> splice support?

Pipe ones are strictly destinations - they can't be sources.  So if you
see it called for one of those, you've a bug.

Xarray ones are *not* - they can be sources, and that's missing here.

Much more unpleasant, though, is that this thing has hard dependency on
nr_seg == 1 *AND* openly suggests the use of iov_iter_single_seg_count(),
which is completely wrong.  That sucker has some weird users left (as
of #work.iov_iter), but all of them are actually due to API deficiencies
and I very much hope to kill that thing off.

Why not simply add iov_iter_check_zeroes(), that would be called after
copy_from_iter() and verified that all that's left in the iterator
consists of zeroes?  Then this copy_struct_from_...() would be
trivial to express through those two.  And check_zeroes would also
be trivial, especially on top of #work.iov_iter.  With no calls of
iov_iter_advance() at all, while we are at it...

IDGI... Omar, what semantics do you really want from that primitive?
Al Viro June 18, 2021, 7:49 p.m. UTC | #3
On Fri, Jun 18, 2021 at 07:42:41PM +0000, Al Viro wrote:

> Pipe ones are strictly destinations - they can't be sources.  So if you
> see it called for one of those, you've a bug.
> 
> Xarray ones are *not* - they can be sources, and that's missing here.
> 
> Much more unpleasant, though, is that this thing has hard dependency on
> nr_seg == 1 *AND* openly suggests the use of iov_iter_single_seg_count(),
> which is completely wrong.  That sucker has some weird users left (as
> of #work.iov_iter), but all of them are actually due to API deficiencies
> and I very much hope to kill that thing off.
> 
> Why not simply add iov_iter_check_zeroes(), that would be called after
> copy_from_iter() and verified that all that's left in the iterator
> consists of zeroes?  Then this copy_struct_from_...() would be
> trivial to express through those two.  And check_zeroes would also
> be trivial, especially on top of #work.iov_iter.  With no calls of
> iov_iter_advance() at all, while we are at it...
> 
> IDGI... Omar, what semantics do you really want from that primitive?

And for pity sake, let's not do that EXPORT_SYMBOL_GPL() posturing there.
If it's a sane general-purpose API, it doesn't matter who uses it;
if it's not, it shouldn't be exported in the first place.

It can be implemented via the already exported primitives, so it's
not as if we prevented anyone from doing an equivalent...
Omar Sandoval June 18, 2021, 8:32 p.m. UTC | #4
On Fri, Jun 18, 2021 at 07:42:41PM +0000, Al Viro wrote:
> On Fri, Jun 18, 2021 at 11:50:25AM -0700, Linus Torvalds wrote:
> 
> > I think that you may need it to be based on Al's series for that to
> > work, which might be inconvenient, though.
> > 
> > One other non-code issue: particularly since you only handle a subset
> > of the iov_iter cases, it would be nice to have an explanation for
> > _why_ those particular cases.
> > 
> > IOW, have some trivial explanation for each of the cases. "iovec" is
> > for regular read/write, what triggers the kvec and bvec cases?
> > 
> > But also, the other way around. Why doesn't the pipe case trigger? No
> > splice support?
> 
> Pipe ones are strictly destinations - they can't be sources.  So if you
> see it called for one of those, you've a bug.
> 
> Xarray ones are *not* - they can be sources, and that's missing here.

Ah, ITER_XARRAY was added recently so I missed it.

> Much more unpleasant, though, is that this thing has hard dependency on
> nr_seg == 1 *AND* openly suggests the use of iov_iter_single_seg_count(),
> which is completely wrong.  That sucker has some weird users left (as
> of #work.iov_iter), but all of them are actually due to API deficiencies
> and I very much hope to kill that thing off.
> 
> Why not simply add iov_iter_check_zeroes(), that would be called after
> copy_from_iter() and verified that all that's left in the iterator
> consists of zeroes?  Then this copy_struct_from_...() would be
> trivial to express through those two.  And check_zeroes would also
> be trivial, especially on top of #work.iov_iter.  With no calls of
> iov_iter_advance() at all, while we are at it...
> 
> IDGI... Omar, what semantics do you really want from that primitive?

RWF_ENCODED is intended to be used like this:

	struct encoded_iov encoded_iov = {
		/* compression metadata */ ...
	};
	char compressed_data[] = ...;
	struct iovec iov[] = {
		{ &encoded_iov, sizeof(encoded_iov) },
		{ compressed_data, sizeof(compressed_data) },
	};
	pwritev2(fd, iov, 2, -1, RWF_ENCODED);

Basically, we squirrel away the compression metadata in the first
element of the iovec array, and we use iov[0].iov_len so that we can
support future extensions of struct encoded_iov in the style of
copy_struct_from_user().

So this doesn't require nr_seg == 1. On the contrary, it's expected that
the rest of the iovec has the compressed payload. And to support the
copy_struct_from_user()-style versioning, we need to know the size of
the struct encoded_iov that userspace gave us, which is the reason for
the iov_iter_single_seg_count().

I know this interface isn't the prettiest. It started as a
Btrfs-specific ioctl, but this approach was suggested as a way to avoid
having a whole new I/O path:
https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
The copy_struct_from_iter() thing was proposed as a way to allow future
extensions here:
https://lore.kernel.org/linux-btrfs/20191022020215.csdwgi3ky27rfidf@yavin.dot.cyphar.com/

Please let me know if you have any suggestions for how to improve this.

Thanks,
Omar
Omar Sandoval June 18, 2021, 8:33 p.m. UTC | #5
On Fri, Jun 18, 2021 at 07:49:53PM +0000, Al Viro wrote:
> On Fri, Jun 18, 2021 at 07:42:41PM +0000, Al Viro wrote:
> 
> > Pipe ones are strictly destinations - they can't be sources.  So if you
> > see it called for one of those, you've a bug.
> > 
> > Xarray ones are *not* - they can be sources, and that's missing here.
> > 
> > Much more unpleasant, though, is that this thing has hard dependency on
> > nr_seg == 1 *AND* openly suggests the use of iov_iter_single_seg_count(),
> > which is completely wrong.  That sucker has some weird users left (as
> > of #work.iov_iter), but all of them are actually due to API deficiencies
> > and I very much hope to kill that thing off.
> > 
> > Why not simply add iov_iter_check_zeroes(), that would be called after
> > copy_from_iter() and verified that all that's left in the iterator
> > consists of zeroes?  Then this copy_struct_from_...() would be
> > trivial to express through those two.  And check_zeroes would also
> > be trivial, especially on top of #work.iov_iter.  With no calls of
> > iov_iter_advance() at all, while we are at it...
> > 
> > IDGI... Omar, what semantics do you really want from that primitive?
> 
> And for pity sake, let's not do that EXPORT_SYMBOL_GPL() posturing there.
> If it's a sane general-purpose API, it doesn't matter who uses it;
> if it's not, it shouldn't be exported in the first place.
> 
> It can be implemented via the already exported primitives, so it's
> not as if we prevented anyone from doing an equivalent...

Fair enough, I'll fix that.
Al Viro June 18, 2021, 8:58 p.m. UTC | #6
On Fri, Jun 18, 2021 at 01:32:26PM -0700, Omar Sandoval wrote:

> RWF_ENCODED is intended to be used like this:
> 
> 	struct encoded_iov encoded_iov = {
> 		/* compression metadata */ ...
> 	};
> 	char compressed_data[] = ...;
> 	struct iovec iov[] = {
> 		{ &encoded_iov, sizeof(encoded_iov) },
> 		{ compressed_data, sizeof(compressed_data) },
> 	};
> 	pwritev2(fd, iov, 2, -1, RWF_ENCODED);
> 
> Basically, we squirrel away the compression metadata in the first
> element of the iovec array, and we use iov[0].iov_len so that we can
> support future extensions of struct encoded_iov in the style of
> copy_struct_from_user().

Yecchhh...

> So this doesn't require nr_seg == 1. On the contrary, it's expected that
> the rest of the iovec has the compressed payload. And to support the
> copy_struct_from_user()-style versioning, we need to know the size of
> the struct encoded_iov that userspace gave us, which is the reason for
> the iov_iter_single_seg_count().
> 
> I know this interface isn't the prettiest. It started as a
> Btrfs-specific ioctl, but this approach was suggested as a way to avoid
> having a whole new I/O path:
> https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
> The copy_struct_from_iter() thing was proposed as a way to allow future
> extensions here:
> https://lore.kernel.org/linux-btrfs/20191022020215.csdwgi3ky27rfidf@yavin.dot.cyphar.com/
> 
> Please let me know if you have any suggestions for how to improve this.

Just put the size of the encoded part first and be done with that.
Magical effect of the iovec sizes is a bloody bad idea.

And on top of #work.iov_iter something like

bool iov_iter_check_zeroes(struct iov_iter *i, size_t bytes)
{
	bool failed = false;
        iterate_and_advance(i, bytes, base, len, off,
			failed = (check_zeroed_user(base, len) != 1),
			failed = (memchr_inv(base, 0, len) != NULL),
			)
	if (unlikely(failed))
		iov_iter_revert(i, bytes);
	return !failed;
}

would do "is that chunk all-zeroes?" just fine.  It's that simple...
Linus Torvalds June 18, 2021, 9:10 p.m. UTC | #7
On Fri, Jun 18, 2021 at 1:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Jun 18, 2021 at 01:32:26PM -0700, Omar Sandoval wrote:
>
> > RWF_ENCODED is intended to be used like this:
> >
> >       struct encoded_iov encoded_iov = {
> >               /* compression metadata */ ...
> >       };
> >       char compressed_data[] = ...;
> >       struct iovec iov[] = {
> >               { &encoded_iov, sizeof(encoded_iov) },
> >               { compressed_data, sizeof(compressed_data) },
> >       };
> >       pwritev2(fd, iov, 2, -1, RWF_ENCODED);
> >
> > Basically, we squirrel away the compression metadata in the first
> > element of the iovec array, and we use iov[0].iov_len so that we can
> > support future extensions of struct encoded_iov in the style of
> > copy_struct_from_user().
>
> Yecchhh...

Al, this has been true since the beginning, and was the whole point of the set.

> Just put the size of the encoded part first and be done with that.
> Magical effect of the iovec sizes is a bloody bad idea.

That makes everything uglier and more complicated, honestly. Then
you'd have to do it in _two_ operations ("get the size, then get the
rest"), *AND* you'd have to worry about all the corner-cases (ie
people putting the structure in pieces across multiple iov entries.

So it would be slower, more complex, and much more likely to have bugs.

So no. Not acceptable. The "in the first iov" is simple, efficient,
and avoids all the problems.

The size *is* encoded already - in the iov itself. Encoding it
anywhere else is much worse.

The only issue I have is that the issue itself is kind of ugly -
regardless of any iov issues. And the "encryption" side of it doesn't
actually seem to be relevant or solvable using this model anyway, so
that side is questionable.

The alternative would be to have an ioctl rather than make this be
about the IO operations (and then that encoded data would be
explicitly separate).

Which I suggested originally, but apparently people who want to use
this had some real reasons not to.

But encoding the structure without having the rule of "first iov only"
is entirely unacceptable to me. See above. It's objectively much much
worse.

             Linus
Al Viro June 18, 2021, 9:32 p.m. UTC | #8
On Fri, Jun 18, 2021 at 02:10:36PM -0700, Linus Torvalds wrote:

> > Just put the size of the encoded part first and be done with that.
> > Magical effect of the iovec sizes is a bloody bad idea.
> 
> That makes everything uglier and more complicated, honestly. Then
> you'd have to do it in _two_ operations ("get the size, then get the
> rest"), *AND* you'd have to worry about all the corner-cases (ie
> people putting the structure in pieces across multiple iov entries.

Huh?  All corner cases are already taken care of by copy_from_iter{,_full}().
What I'm proposing is to have the size as a field in 'encoded' and
do this
	if (!copy_from_iter_full(&encoded, sizeof(encoded), &i))
		return -EFAULT;
	if (encoded.size > sizeof(encoded)) {
		// newer than what we expect
		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
			return -EINVAL;
	} else if (encoded.size < sizeof(encoded)) {
		// older than what we expect
		iov_iter_revert(&i, sizeof(encoded) - encoded.size);
		memset((void *)&encoded + encoded.size, 0, sizoef(encoded) - encoded.size);
	}

I don't think it would be more complex, but that's a matter of taste;
I *really* doubt it would be any slower or have higher odds of bugs,
regardless of the corner cases.

And it certainly would be much smaller on the lib/iov_iter.c side -
implementation of iov_iter_check_zeroes() would be simply this:

bool iov_iter_check_zeroes(struct iov_iter *i, size_t size)
{
	bool failed = false;

	iterate_and_advance(i, bytes, base, len, off,
		failed = (check_zeroed_user(base, len) != 1),
		failed = (memchr_inv(base, 0, len) != NULL))
	if (unlikely(failed))
		iov_iter_revert(i, bytes);
	return !failed;
}

And that's it, no need to do anything special for xarray, etc.
This + EXPORT_SYMBOL + extern in uio.h + snippet above in the
user...

I could buy an argument that for userland the need to add
	encoded.size = sizeof(encoded);
or equivalent when initializing that thing would make life too complex,
but on the kernel side I'd say that Omar's variant is considerably more
complex than the above...
Linus Torvalds June 18, 2021, 9:40 p.m. UTC | #9
On Fri, Jun 18, 2021 at 2:32 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Huh?  All corner cases are already taken care of by copy_from_iter{,_full}().
> What I'm proposing is to have the size as a field in 'encoded' and
> do this

Hmm. Making it part of the structure does make it easier (also for the
sending userspace side, that doesn't now have to create yet another
iov or copy the structure or whatever).

Except your code doesn't actually handle the "smaller than expected"
case correctly, since by the time it even checks for that, it will
possibly already have failed. So you actually had a bug there - you
can't use the "xyz_full()" version and get it right.

That's fixable.

So I guess I'd be ok with that version.

             Linus
Omar Sandoval June 18, 2021, 10:10 p.m. UTC | #10
On Fri, Jun 18, 2021 at 02:40:51PM -0700, Linus Torvalds wrote:
> On Fri, Jun 18, 2021 at 2:32 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Huh?  All corner cases are already taken care of by copy_from_iter{,_full}().
> > What I'm proposing is to have the size as a field in 'encoded' and
> > do this
> 
> Hmm. Making it part of the structure does make it easier (also for the
> sending userspace side, that doesn't now have to create yet another
> iov or copy the structure or whatever).
> 
> Except your code doesn't actually handle the "smaller than expected"
> case correctly, since by the time it even checks for that, it will
> possibly already have failed. So you actually had a bug there - you
> can't use the "xyz_full()" version and get it right.
> 
> That's fixable.

Right, we either need to read the size first and then the rest:

	size_t copy_size;
        if (!copy_from_iter_full(&encoded.size, sizeof(encoded.size),
				 &i))
                return -EFAULT;
	if (encoded.size > PAGE_SIZE)
		return -E2BIG;
	if (encoded.size < ENCODED_IOV_SIZE_VER0)
		return -EINVAL;
	if (!copy_from_iter_full(&encoded.size + 1,
				 min(sizeof(encoded), encoded.size) - sizeof(encoded.size),
				 &i))
                return -EFAULT;
        if (encoded.size > sizeof(encoded)) {
                // newer than what we expect
                if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
                        return -EINVAL;
        } else if (encoded.size < sizeof(encoded)) {
                // older than what we expect
                memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
        }

Or do the same reverting thing that Al did, but with copy_from_iter()
instead of copy_from_iter_full() and being careful with the copied count
(which I'm not 100% sure I got correct here):

	size_t copied = copy_from_iter(&encoded, sizeof(encoded), &i);
	if (copied < offsetofend(struct encoded_iov, size))
		return -EFAULT;
	if (encoded.size > PAGE_SIZE)
		return -E2BIG;
	if (encoded.size < ENCODED_IOV_SIZE_VER0)
		return -EINVAL;
	if (encoded.size > sizeof(encoded)) {
		if (copied < sizeof(encoded)
			return -EFAULT;
		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
			return -EINVAL;
	} else if (encoded.size < sizeof(encoded)) {
		// older than what we expect
		if (copied < encoded.size)
			return -EFAULT;
		iov_iter_revert(&i, copied - encoded.size);
		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
	}
Al Viro June 18, 2021, 10:14 p.m. UTC | #11
On Fri, Jun 18, 2021 at 02:40:51PM -0700, Linus Torvalds wrote:
> On Fri, Jun 18, 2021 at 2:32 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Huh?  All corner cases are already taken care of by copy_from_iter{,_full}().
> > What I'm proposing is to have the size as a field in 'encoded' and
> > do this
> 
> Hmm. Making it part of the structure does make it easier (also for the
> sending userspace side, that doesn't now have to create yet another
> iov or copy the structure or whatever).
> 
> Except your code doesn't actually handle the "smaller than expected"
> case correctly, since by the time it even checks for that, it will
> possibly already have failed. So you actually had a bug there - you
> can't use the "xyz_full()" version and get it right.

Right you are - should be something along the lines of

#define MIN_ENCODED_SIZE minimal size, e.g. offsetof of the next field after .size

	size = copy_from_iter(&encoded, sizeof(encoded), &i);
	if (unlikely(size < sizeof(encoded))) {
		// the total length is less than expected
		// must be at least encoded.size, though, and it would better
		// cover the .size field itself.
	    	if (size < MIN_ENCODED_SIZE || size < encoded.size)
			sod off
	}
	if (sizeof(encoded) < encoded.size) {
		// newer than expected
		same as in previous variant
	} else if (size > encoded.size) {
		// older than expected
		iov_iter_revert(size - encoded.size);
		memset(....) as in previous variant
	}
Al Viro June 18, 2021, 10:32 p.m. UTC | #12
On Fri, Jun 18, 2021 at 03:10:03PM -0700, Omar Sandoval wrote:

> Or do the same reverting thing that Al did, but with copy_from_iter()
> instead of copy_from_iter_full() and being careful with the copied count
> (which I'm not 100% sure I got correct here):
> 
> 	size_t copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> 	if (copied < offsetofend(struct encoded_iov, size))
> 		return -EFAULT;
> 	if (encoded.size > PAGE_SIZE)
> 		return -E2BIG;
> 	if (encoded.size < ENCODED_IOV_SIZE_VER0)
> 		return -EINVAL;
> 	if (encoded.size > sizeof(encoded)) {
> 		if (copied < sizeof(encoded)
> 			return -EFAULT;
> 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> 			return -EINVAL;
> 	} else if (encoded.size < sizeof(encoded)) {
> 		// older than what we expect
> 		if (copied < encoded.size)
> 			return -EFAULT;
> 		iov_iter_revert(&i, copied - encoded.size);
> 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> 	}    

simpler than that, actually -

	copied = copy_from_iter(&encoded, sizeof(encoded), &i);
	if (unlikely(copied < sizeof(encoded))) {
		if (copied < offsetofend(struct encoded_iov, size) ||
		    copied < encoded.size)
			return iov_iter_count(i) ? -EFAULT : -EINVAL;
	}
	if (encoded.size > sizeof(encoded)) {
		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
			return -EINVAL;
	} else if (encoded.size < sizeof(encoded)) {
		// copied can't be less than encoded.size here - otherwise
		// we'd have copied < sizeof(encoded) and the check above
		// would've buggered off
		iov_iter_revert(&i, copied - encoded.size);
		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
	}

should do it.
Omar Sandoval June 19, 2021, 12:43 a.m. UTC | #13
On Fri, Jun 18, 2021 at 10:32:54PM +0000, Al Viro wrote:
> On Fri, Jun 18, 2021 at 03:10:03PM -0700, Omar Sandoval wrote:
> 
> > Or do the same reverting thing that Al did, but with copy_from_iter()
> > instead of copy_from_iter_full() and being careful with the copied count
> > (which I'm not 100% sure I got correct here):
> > 
> > 	size_t copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> > 	if (copied < offsetofend(struct encoded_iov, size))
> > 		return -EFAULT;
> > 	if (encoded.size > PAGE_SIZE)
> > 		return -E2BIG;
> > 	if (encoded.size < ENCODED_IOV_SIZE_VER0)
> > 		return -EINVAL;
> > 	if (encoded.size > sizeof(encoded)) {
> > 		if (copied < sizeof(encoded)
> > 			return -EFAULT;
> > 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> > 			return -EINVAL;
> > 	} else if (encoded.size < sizeof(encoded)) {
> > 		// older than what we expect
> > 		if (copied < encoded.size)
> > 			return -EFAULT;
> > 		iov_iter_revert(&i, copied - encoded.size);
> > 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> > 	}    
> 
> simpler than that, actually -
> 
> 	copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> 	if (unlikely(copied < sizeof(encoded))) {
> 		if (copied < offsetofend(struct encoded_iov, size) ||
> 		    copied < encoded.size)
> 			return iov_iter_count(i) ? -EFAULT : -EINVAL;
> 	}
> 	if (encoded.size > sizeof(encoded)) {
> 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> 			return -EINVAL;
> 	} else if (encoded.size < sizeof(encoded)) {
> 		// copied can't be less than encoded.size here - otherwise
> 		// we'd have copied < sizeof(encoded) and the check above
> 		// would've buggered off
> 		iov_iter_revert(&i, copied - encoded.size);
> 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> 	}
> 
> should do it.

Thanks, Al, I'll send an updated version with this approach next week.
Omar Sandoval June 21, 2021, 6:46 p.m. UTC | #14
On Fri, Jun 18, 2021 at 05:43:21PM -0700, Omar Sandoval wrote:
> On Fri, Jun 18, 2021 at 10:32:54PM +0000, Al Viro wrote:
> > On Fri, Jun 18, 2021 at 03:10:03PM -0700, Omar Sandoval wrote:
> > 
> > > Or do the same reverting thing that Al did, but with copy_from_iter()
> > > instead of copy_from_iter_full() and being careful with the copied count
> > > (which I'm not 100% sure I got correct here):
> > > 
> > > 	size_t copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> > > 	if (copied < offsetofend(struct encoded_iov, size))
> > > 		return -EFAULT;
> > > 	if (encoded.size > PAGE_SIZE)
> > > 		return -E2BIG;
> > > 	if (encoded.size < ENCODED_IOV_SIZE_VER0)
> > > 		return -EINVAL;
> > > 	if (encoded.size > sizeof(encoded)) {
> > > 		if (copied < sizeof(encoded)
> > > 			return -EFAULT;
> > > 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> > > 			return -EINVAL;
> > > 	} else if (encoded.size < sizeof(encoded)) {
> > > 		// older than what we expect
> > > 		if (copied < encoded.size)
> > > 			return -EFAULT;
> > > 		iov_iter_revert(&i, copied - encoded.size);
> > > 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> > > 	}    
> > 
> > simpler than that, actually -
> > 
> > 	copied = copy_from_iter(&encoded, sizeof(encoded), &i);
> > 	if (unlikely(copied < sizeof(encoded))) {
> > 		if (copied < offsetofend(struct encoded_iov, size) ||
> > 		    copied < encoded.size)
> > 			return iov_iter_count(i) ? -EFAULT : -EINVAL;
> > 	}
> > 	if (encoded.size > sizeof(encoded)) {
> > 		if (!iov_iter_check_zeroes(&i, encoded.size - sizeof(encoded))
> > 			return -EINVAL;
> > 	} else if (encoded.size < sizeof(encoded)) {
> > 		// copied can't be less than encoded.size here - otherwise
> > 		// we'd have copied < sizeof(encoded) and the check above
> > 		// would've buggered off
> > 		iov_iter_revert(&i, copied - encoded.size);
> > 		memset((void *)&encoded + encoded.size, 0, sizeof(encoded) - encoded.size);
> > 	}
> > 
> > should do it.
> 
> Thanks, Al, I'll send an updated version with this approach next week.

Okay, so this works for the write side of RWF_ENCODED, but it causes
problems for the read side. That currently works like so:

	struct encoded_iov encoded_iov;
	char compressed_data[...];
	struct iovec iov[] = {
		{ &encoded_iov, sizeof(encoded_iov) },
		{ compressed_data, sizeof(compressed_data) },
	};
	preadv2(fd, iov, 2, -1, RWF_ENCODED);

The kernel fills in the encoded_iov with the compression metadata and
the remaining buffers with the compressed data. The kernel needs to know
how much of the iovec is for the encoded_iov. The backwards
compatibility is similar to the write side: if the kernel size is less
than the userspace size, then we can fill in extra zeroes. If the kernel
size is greater than the userspace size and all of the extra metadata is
zero, then we can omit it. If the extra metadata is non-zero, then we
return an error.

How do we get the userspace size with the encoded_iov.size approach?
We'd have to read the size from the iov_iter before writing to the rest
of the iov_iter. Is it okay to mix the iov_iter as a source and
destination like this? From what I can tell, it's not intended to be
used like this.
Linus Torvalds June 21, 2021, 7:33 p.m. UTC | #15
On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> How do we get the userspace size with the encoded_iov.size approach?
> We'd have to read the size from the iov_iter before writing to the rest
> of the iov_iter. Is it okay to mix the iov_iter as a source and
> destination like this? From what I can tell, it's not intended to be
> used like this.

I guess it could work that way, but yes, it's ugly as hell. And I
really don't want a readv() system call - that should write to the
result buffer - to first have to read from it.

So I think the original "just make it be the first iov entry" is the
better approach, even if Al hates it.

Although I still get the feeling that using an ioctl is the *really*
correct way to go. That was my first reaction to the series
originally, and I still don't see why we'd have encoded data in a
regular read/write path.

What was the argument against ioctl's, again?

To me, this isn't all that different from the fsverity things we
added, where filesystem people were happy to try to work out some
common model and add FS_IOC_*_VERITY* ioctls.

               Linus
Omar Sandoval June 21, 2021, 8:46 p.m. UTC | #16
On Mon, Jun 21, 2021 at 12:33:17PM -0700, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > How do we get the userspace size with the encoded_iov.size approach?
> > We'd have to read the size from the iov_iter before writing to the rest
> > of the iov_iter. Is it okay to mix the iov_iter as a source and
> > destination like this? From what I can tell, it's not intended to be
> > used like this.
> 
> I guess it could work that way, but yes, it's ugly as hell. And I
> really don't want a readv() system call - that should write to the
> result buffer - to first have to read from it.
> 
> So I think the original "just make it be the first iov entry" is the
> better approach, even if Al hates it.
> 
> Although I still get the feeling that using an ioctl is the *really*
> correct way to go. That was my first reaction to the series
> originally, and I still don't see why we'd have encoded data in a
> regular read/write path.
> 
> What was the argument against ioctl's, again?

The suggestion came from Dave Chinner here:
https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/

His objection to an ioctl was two-fold:

1. This interfaces looks really similar to normal read/write, so we
   should try to use the normal read/write interface for it. Perhaps
   this trouble with iov_iter has refuted that.
2. The last time we had Btrfs-specific ioctls that eventually became
   generic (FIDEDUPERANGE and FICLONE{,RANGE}), the generalization was
   painful. Part of the problem with clone/dedupe was that the Btrfs
   ioctls were underspecified. I think I've done a better job of
   documenting all of the semantics and corner cases for the encoded I/O
   interface (and if not, I can address this). The other part of the
   problem is that there were various sanity checks in the normal
   read/write paths that were missed or drifted out of sync in the
   ioctls. That requires some vigilance going forward. Maybe starting
   this off as a generic (not Btrfs-specific) ioctl right off the bat
   will help.

If we do go the ioctl route, then we also have to decide how much of
preadv2/pwritev2 it should emulate. Should it use the fd offset, or
should that be an ioctl argument? Some of the RWF_ flags would be useful
for encoded I/O, too (RWF_DSYNC, RWF_SYNC, RWF_APPEND), should it
support those? These bring us back to Dave's first point.
Omar Sandoval June 21, 2021, 8:55 p.m. UTC | #17
On Mon, Jun 21, 2021 at 01:46:04PM -0700, Omar Sandoval wrote:
> On Mon, Jun 21, 2021 at 12:33:17PM -0700, Linus Torvalds wrote:
> > On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
> > >
> > > How do we get the userspace size with the encoded_iov.size approach?
> > > We'd have to read the size from the iov_iter before writing to the rest
> > > of the iov_iter. Is it okay to mix the iov_iter as a source and
> > > destination like this? From what I can tell, it's not intended to be
> > > used like this.
> > 
> > I guess it could work that way, but yes, it's ugly as hell. And I
> > really don't want a readv() system call - that should write to the
> > result buffer - to first have to read from it.
> > 
> > So I think the original "just make it be the first iov entry" is the
> > better approach, even if Al hates it.
> > 
> > Although I still get the feeling that using an ioctl is the *really*
> > correct way to go. That was my first reaction to the series
> > originally, and I still don't see why we'd have encoded data in a
> > regular read/write path.
> > 
> > What was the argument against ioctl's, again?
> 
> The suggestion came from Dave Chinner here:
> https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
> 
> His objection to an ioctl was two-fold:
> 
> 1. This interfaces looks really similar to normal read/write, so we
>    should try to use the normal read/write interface for it. Perhaps
>    this trouble with iov_iter has refuted that.
> 2. The last time we had Btrfs-specific ioctls that eventually became
>    generic (FIDEDUPERANGE and FICLONE{,RANGE}), the generalization was
>    painful. Part of the problem with clone/dedupe was that the Btrfs
>    ioctls were underspecified. I think I've done a better job of
>    documenting all of the semantics and corner cases for the encoded I/O
>    interface (and if not, I can address this). The other part of the
>    problem is that there were various sanity checks in the normal
>    read/write paths that were missed or drifted out of sync in the
>    ioctls. That requires some vigilance going forward. Maybe starting
>    this off as a generic (not Btrfs-specific) ioctl right off the bat
>    will help.
> 
> If we do go the ioctl route, then we also have to decide how much of
> preadv2/pwritev2 it should emulate. Should it use the fd offset, or
> should that be an ioctl argument? Some of the RWF_ flags would be useful
> for encoded I/O, too (RWF_DSYNC, RWF_SYNC, RWF_APPEND), should it
> support those? These bring us back to Dave's first point.

Oops, I dropped Dave from the Cc list at some point. Adding him back
now.
Dave Chinner June 22, 2021, 10:06 p.m. UTC | #18
On Mon, Jun 21, 2021 at 01:55:03PM -0700, Omar Sandoval wrote:
> On Mon, Jun 21, 2021 at 01:46:04PM -0700, Omar Sandoval wrote:
> > On Mon, Jun 21, 2021 at 12:33:17PM -0700, Linus Torvalds wrote:
> > > On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > >
> > > > How do we get the userspace size with the encoded_iov.size approach?
> > > > We'd have to read the size from the iov_iter before writing to the rest
> > > > of the iov_iter. Is it okay to mix the iov_iter as a source and
> > > > destination like this? From what I can tell, it's not intended to be
> > > > used like this.
> > > 
> > > I guess it could work that way, but yes, it's ugly as hell. And I
> > > really don't want a readv() system call - that should write to the
> > > result buffer - to first have to read from it.
> > > 
> > > So I think the original "just make it be the first iov entry" is the
> > > better approach, even if Al hates it.
> > > 
> > > Although I still get the feeling that using an ioctl is the *really*
> > > correct way to go. That was my first reaction to the series
> > > originally, and I still don't see why we'd have encoded data in a
> > > regular read/write path.
> > > 
> > > What was the argument against ioctl's, again?
> > 
> > The suggestion came from Dave Chinner here:
> > https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
> > 
> > His objection to an ioctl was two-fold:
> > 
> > 1. This interfaces looks really similar to normal read/write, so we
> >    should try to use the normal read/write interface for it. Perhaps
> >    this trouble with iov_iter has refuted that.
> > 2. The last time we had Btrfs-specific ioctls that eventually became
> >    generic (FIDEDUPERANGE and FICLONE{,RANGE}), the generalization was
> >    painful. Part of the problem with clone/dedupe was that the Btrfs
> >    ioctls were underspecified. I think I've done a better job of
> >    documenting all of the semantics and corner cases for the encoded I/O
> >    interface (and if not, I can address this). The other part of the
> >    problem is that there were various sanity checks in the normal
> >    read/write paths that were missed or drifted out of sync in the
> >    ioctls. That requires some vigilance going forward. Maybe starting
> >    this off as a generic (not Btrfs-specific) ioctl right off the bat
> >    will help.
> > 
> > If we do go the ioctl route, then we also have to decide how much of
> > preadv2/pwritev2 it should emulate. Should it use the fd offset, or
> > should that be an ioctl argument? Some of the RWF_ flags would be useful
> > for encoded I/O, too (RWF_DSYNC, RWF_SYNC, RWF_APPEND), should it
> > support those? These bring us back to Dave's first point.
> 
> Oops, I dropped Dave from the Cc list at some point. Adding him back
> now.

Fair summary. The only other thing that I'd add is this is an IO
interface that requires issuing physical IO. So if someone wants
high throughput for encoded IO, we really need AIO and/or io_uring
support, and we get that for free if we use readv2/writev2
interfaces.

Yes, it could be an ioctl() interface, but I think that this sort of
functionality is exactly what extensible syscalls like
preadv2/pwritev2 should be used for. It's a slight variant on normal
IO, and that's exactly what the RWF_* flags are intended to be used
for - allowing interesting per-IO variant behaviour without having
to completely re-implemnt the IO path via custom ioctls every time
we want slightly different functionality...

Cheers,

Dave.
Omar Sandoval June 23, 2021, 5:49 p.m. UTC | #19
On Wed, Jun 23, 2021 at 08:06:39AM +1000, Dave Chinner wrote:
> On Mon, Jun 21, 2021 at 01:55:03PM -0700, Omar Sandoval wrote:
> > On Mon, Jun 21, 2021 at 01:46:04PM -0700, Omar Sandoval wrote:
> > > On Mon, Jun 21, 2021 at 12:33:17PM -0700, Linus Torvalds wrote:
> > > > On Mon, Jun 21, 2021 at 11:46 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > >
> > > > > How do we get the userspace size with the encoded_iov.size approach?
> > > > > We'd have to read the size from the iov_iter before writing to the rest
> > > > > of the iov_iter. Is it okay to mix the iov_iter as a source and
> > > > > destination like this? From what I can tell, it's not intended to be
> > > > > used like this.
> > > > 
> > > > I guess it could work that way, but yes, it's ugly as hell. And I
> > > > really don't want a readv() system call - that should write to the
> > > > result buffer - to first have to read from it.
> > > > 
> > > > So I think the original "just make it be the first iov entry" is the
> > > > better approach, even if Al hates it.
> > > > 
> > > > Although I still get the feeling that using an ioctl is the *really*
> > > > correct way to go. That was my first reaction to the series
> > > > originally, and I still don't see why we'd have encoded data in a
> > > > regular read/write path.
> > > > 
> > > > What was the argument against ioctl's, again?
> > > 
> > > The suggestion came from Dave Chinner here:
> > > https://lore.kernel.org/linux-fsdevel/20190905021012.GL7777@dread.disaster.area/
> > > 
> > > His objection to an ioctl was two-fold:
> > > 
> > > 1. This interfaces looks really similar to normal read/write, so we
> > >    should try to use the normal read/write interface for it. Perhaps
> > >    this trouble with iov_iter has refuted that.
> > > 2. The last time we had Btrfs-specific ioctls that eventually became
> > >    generic (FIDEDUPERANGE and FICLONE{,RANGE}), the generalization was
> > >    painful. Part of the problem with clone/dedupe was that the Btrfs
> > >    ioctls were underspecified. I think I've done a better job of
> > >    documenting all of the semantics and corner cases for the encoded I/O
> > >    interface (and if not, I can address this). The other part of the
> > >    problem is that there were various sanity checks in the normal
> > >    read/write paths that were missed or drifted out of sync in the
> > >    ioctls. That requires some vigilance going forward. Maybe starting
> > >    this off as a generic (not Btrfs-specific) ioctl right off the bat
> > >    will help.
> > > 
> > > If we do go the ioctl route, then we also have to decide how much of
> > > preadv2/pwritev2 it should emulate. Should it use the fd offset, or
> > > should that be an ioctl argument? Some of the RWF_ flags would be useful
> > > for encoded I/O, too (RWF_DSYNC, RWF_SYNC, RWF_APPEND), should it
> > > support those? These bring us back to Dave's first point.
> > 
> > Oops, I dropped Dave from the Cc list at some point. Adding him back
> > now.
> 
> Fair summary. The only other thing that I'd add is this is an IO
> interface that requires issuing physical IO. So if someone wants
> high throughput for encoded IO, we really need AIO and/or io_uring
> support, and we get that for free if we use readv2/writev2
> interfaces.
> 
> Yes, it could be an ioctl() interface, but I think that this sort of
> functionality is exactly what extensible syscalls like
> preadv2/pwritev2 should be used for. It's a slight variant on normal
> IO, and that's exactly what the RWF_* flags are intended to be used
> for - allowing interesting per-IO variant behaviour without having
> to completely re-implemnt the IO path via custom ioctls every time
> we want slightly different functionality...

Al, Linus, what do you think? Is there a path forward for this series as
is? I'd be happy to have this functionality merged in any form, but I do
think that this approach with preadv2/pwritev2 using iov_len is decent
relative to the alternatives.
Linus Torvalds June 23, 2021, 6:28 p.m. UTC | #20
On Wed, Jun 23, 2021 at 10:49 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> Al, Linus, what do you think? Is there a path forward for this series as
> is?

So the "read from user space in order to write" is a no-go for me. It
completely violates what a "read()" system call should do. It also
entirely violates what an iovec can and should do.

And honestly, if Al hates the "first iov entry" model, I'm not sure I
want to merge that version - I personally find it fine, but Al is
effectively the iov-iter maintainer.

I do worry a bit about the "first iov entry" simply because it might
work for "writev2()" when given virtual user space addresses - but I
think it's conceptually broken for things like direct-IO which might
do things by physical address, and what is a contiguous user space
virtual address is not necessarily a contiguous physical address.

Yes, the filesystem can - and does - hide that path by basically not
doing direct-IO on the first entry at all, and just treat is very
specially in the front end of the IO access, but that only reinforces
the whole "this is not at all like read/write".

Similar issues might crop up in other situations, ie splice etc, where
it's not at all obvious that the iov_iter boundaries would be
maintained as it moves through the system.

So while I personally find the "first iov entry" model fairly
reasonable, I think Dave is being disingenuous when he says that it
looks like a normal read/write. It very much does not. The above is
quite fundamental.

>  I'd be happy to have this functionality merged in any form, but I do
> think that this approach with preadv2/pwritev2 using iov_len is decent
> relative to the alternatives.

As mentioned, I find it acceptable. I'm completely unimpressed with
Dave's argument, but ioctl's aren't perfect either, so weak or not,
that argument being bogus doesn't necessarily mean that the iovec
entry model is wrong.

That said, thinking about exactly the fact that I don't think a
translation from iovec to anything else can be truly valid, I find the
iter_is_iovec() case to be the only obviously valid one.

Which gets me back to: how can any of the non-iovec alternatives ever
be valid? You did mention having missed ITER_XARRAY, but my question
is more fundamental than that. How could a non-iter_is_iovec ever be
valid? There are no possible interfaces that can generate such a thing
sanely.

                Linus
Omar Sandoval June 23, 2021, 7:33 p.m. UTC | #21
On Wed, Jun 23, 2021 at 11:28:15AM -0700, Linus Torvalds wrote:
> On Wed, Jun 23, 2021 at 10:49 AM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > Al, Linus, what do you think? Is there a path forward for this series as
> > is?
> 
> So the "read from user space in order to write" is a no-go for me. It
> completely violates what a "read()" system call should do. It also
> entirely violates what an iovec can and should do.
> 
> And honestly, if Al hates the "first iov entry" model, I'm not sure I
> want to merge that version - I personally find it fine, but Al is
> effectively the iov-iter maintainer.
> 
> I do worry a bit about the "first iov entry" simply because it might
> work for "writev2()" when given virtual user space addresses - but I
> think it's conceptually broken for things like direct-IO which might
> do things by physical address, and what is a contiguous user space
> virtual address is not necessarily a contiguous physical address.
> 
> Yes, the filesystem can - and does - hide that path by basically not
> doing direct-IO on the first entry at all, and just treat is very
> specially in the front end of the IO access, but that only reinforces
> the whole "this is not at all like read/write".
> 
> Similar issues might crop up in other situations, ie splice etc, where
> it's not at all obvious that the iov_iter boundaries would be
> maintained as it moves through the system.
> 
> So while I personally find the "first iov entry" model fairly
> reasonable, I think Dave is being disingenuous when he says that it
> looks like a normal read/write. It very much does not. The above is
> quite fundamental.
> 
> >  I'd be happy to have this functionality merged in any form, but I do
> > think that this approach with preadv2/pwritev2 using iov_len is decent
> > relative to the alternatives.
> 
> As mentioned, I find it acceptable. I'm completely unimpressed with
> Dave's argument, but ioctl's aren't perfect either, so weak or not,
> that argument being bogus doesn't necessarily mean that the iovec
> entry model is wrong.
> 
> That said, thinking about exactly the fact that I don't think a
> translation from iovec to anything else can be truly valid, I find the
> iter_is_iovec() case to be the only obviously valid one.
> 
> Which gets me back to: how can any of the non-iovec alternatives ever
> be valid? You did mention having missed ITER_XARRAY, but my question
> is more fundamental than that. How could a non-iter_is_iovec ever be
> valid? There are no possible interfaces that can generate such a thing
> sanely.

I only implemented the bvec and kvec cases for completeness, since
copy_struct_from_iter() would appear to be a generic helper. At least
for RWF_ENCODED, a bvec seems pretty bogus, but it doesn't seem too
far-flung to imagine an in-kernel user of RWF_ENCODED that uses a kvec.

One other option that we haven't considered is ditching the
copy_struct_from_user() semantics and going the simpler route of adding
some reserved space to the end of struct encoded_iov:

struct encoded_iov {
	__aligned_u64 len;
	__aligned_u64 unencoded_len;
	__aligned_u64 unencoded_offset;
	__u32 compression;
	__u32 encryption;
	__u8 reserved[32];
};

Then we can do an unconditional copy_from_user_full(sizeof(struct
encoded_iov)) and check the reserved space in the typical fashion.

(And in the unlikely case that we use up all of that space with
extensions, I suppose we could have an RWF_ENCODED2 with a matching
struct encoded_iov2.)
Al Viro June 23, 2021, 7:45 p.m. UTC | #22
On Wed, Jun 23, 2021 at 10:49:51AM -0700, Omar Sandoval wrote:

> > Fair summary. The only other thing that I'd add is this is an IO
> > interface that requires issuing physical IO. So if someone wants
> > high throughput for encoded IO, we really need AIO and/or io_uring
> > support, and we get that for free if we use readv2/writev2
> > interfaces.
> > 
> > Yes, it could be an ioctl() interface, but I think that this sort of
> > functionality is exactly what extensible syscalls like
> > preadv2/pwritev2 should be used for. It's a slight variant on normal
> > IO, and that's exactly what the RWF_* flags are intended to be used
> > for - allowing interesting per-IO variant behaviour without having
> > to completely re-implemnt the IO path via custom ioctls every time
> > we want slightly different functionality...
> 
> Al, Linus, what do you think? Is there a path forward for this series as
> is? I'd be happy to have this functionality merged in any form, but I do
> think that this approach with preadv2/pwritev2 using iov_len is decent
> relative to the alternatives.

IMO we might be better off with explicit ioctl - this magical mystery shite
with special meaning of the first iovec length is, IMO, more than enough
to make it a bad fit for read/write family.

It's *not* just a "slightly different functionality" - it's very different
calling conventions.  And the deeper one needs to dig into the interface
details to parse what's going on, the less it differs from ioctl() mess.

Said that, why do you need a variable-length header on the read side,
in the first place?
Omar Sandoval June 23, 2021, 8:46 p.m. UTC | #23
On Wed, Jun 23, 2021 at 07:45:59PM +0000, Al Viro wrote:
> On Wed, Jun 23, 2021 at 10:49:51AM -0700, Omar Sandoval wrote:
> 
> > > Fair summary. The only other thing that I'd add is this is an IO
> > > interface that requires issuing physical IO. So if someone wants
> > > high throughput for encoded IO, we really need AIO and/or io_uring
> > > support, and we get that for free if we use readv2/writev2
> > > interfaces.
> > > 
> > > Yes, it could be an ioctl() interface, but I think that this sort of
> > > functionality is exactly what extensible syscalls like
> > > preadv2/pwritev2 should be used for. It's a slight variant on normal
> > > IO, and that's exactly what the RWF_* flags are intended to be used
> > > for - allowing interesting per-IO variant behaviour without having
> > > to completely re-implemnt the IO path via custom ioctls every time
> > > we want slightly different functionality...
> > 
> > Al, Linus, what do you think? Is there a path forward for this series as
> > is? I'd be happy to have this functionality merged in any form, but I do
> > think that this approach with preadv2/pwritev2 using iov_len is decent
> > relative to the alternatives.
> 
> IMO we might be better off with explicit ioctl - this magical mystery shite
> with special meaning of the first iovec length is, IMO, more than enough
> to make it a bad fit for read/write family.
> 
> It's *not* just a "slightly different functionality" - it's very different
> calling conventions.  And the deeper one needs to dig into the interface
> details to parse what's going on, the less it differs from ioctl() mess.
> 
> Said that, why do you need a variable-length header on the read side,
> in the first place?

Suppose we add a new field representing a new type of encoding to the
end of encoded_iov. On the write side, the caller might want to specify
that the data is encoded in that new way, of course. But on the read
side, if the data is encoded in that new way, then the kernel will want
to return that. The kernel needs to know if the user's structure
includes the new field (otherwise when it copies the full struct out, it
will write into what the user thinks is the data instead).

As I mentioned in my reply to Linus, maybe we can stick with
preadv2/pwritev2, but make the struct encoded_iov structure a fixed size
with some reserved space for future expansion. That makes this a lot
less special: just copy a fixed size structure, then read/write the
rest. And then we don't need to reinvent the rest of the
preadv2/pwritev2 path for an ioctl.

Between a fixed size structure and an ioctl, what would you prefer?
Al Viro June 23, 2021, 9:39 p.m. UTC | #24
On Wed, Jun 23, 2021 at 01:46:50PM -0700, Omar Sandoval wrote:

> Suppose we add a new field representing a new type of encoding to the
> end of encoded_iov. On the write side, the caller might want to specify
> that the data is encoded in that new way, of course. But on the read
> side, if the data is encoded in that new way, then the kernel will want
> to return that. The kernel needs to know if the user's structure
> includes the new field (otherwise when it copies the full struct out, it
> will write into what the user thinks is the data instead).

Er...  What's the problem with simply copying that extended structure out,
followed by the data?

IOW, why can't the caller pick the header out of the whole thing and
deal with it in whatever way it likes?  Why should kernel need to do
anything special here?

IDGI...  Userland had always been able to deal with that kind of stuff;
you read e.g. gzipped data into buffer, you decode the header, you figure
out how long it is and how far out does the payload begin, etc.

How is that different?
Omar Sandoval June 23, 2021, 9:58 p.m. UTC | #25
On Wed, Jun 23, 2021 at 09:39:48PM +0000, Al Viro wrote:
> On Wed, Jun 23, 2021 at 01:46:50PM -0700, Omar Sandoval wrote:
> 
> > Suppose we add a new field representing a new type of encoding to the
> > end of encoded_iov. On the write side, the caller might want to specify
> > that the data is encoded in that new way, of course. But on the read
> > side, if the data is encoded in that new way, then the kernel will want
> > to return that. The kernel needs to know if the user's structure
> > includes the new field (otherwise when it copies the full struct out, it
> > will write into what the user thinks is the data instead).
> 
> Er...  What's the problem with simply copying that extended structure out,
> followed by the data?
> 
> IOW, why can't the caller pick the header out of the whole thing and
> deal with it in whatever way it likes?  Why should kernel need to do
> anything special here?
> 
> IDGI...  Userland had always been able to deal with that kind of stuff;
> you read e.g. gzipped data into buffer, you decode the header, you figure
> out how long it is and how far out does the payload begin, etc.
> 
> How is that different?

Ah, I was stuck on thinking about this calling convention:

	struct encoded_iov encoded_iov;
	char compressed_data[...];
	struct iovec iov[] = {
		{ &encoded_iov, sizeof(encoded_iov) },
		{ compressed_data, sizeof(compressed_data) },
	};
	preadv2(fd, iov, 2, -1, RWF_ENCODED);

But what you described would look more like:

	// Needs to be large enough for maximum returned header + data.
	char buffer[...];
	struct iovec iov[] = {
		{ buffer, sizeof(buffer) },
	};
	preadv2(fd, iov, 2, -1, RWF_ENCODED);
	// We should probably align the buffer.
	struct encoded_iov *encoded_iov = (void *)buffer;
	char *data = buffer + encoded_iov->size;

That's a little uglier, but it should work, and allows for arbitrary
extensions. So, among these three alternatives (fixed size structure
with reserved space, variable size structure like above, or ioctl),
which would you prefer?
Al Viro June 23, 2021, 10:26 p.m. UTC | #26
On Wed, Jun 23, 2021 at 02:58:32PM -0700, Omar Sandoval wrote:

> Ah, I was stuck on thinking about this calling convention:
> 
> 	struct encoded_iov encoded_iov;
> 	char compressed_data[...];
> 	struct iovec iov[] = {
> 		{ &encoded_iov, sizeof(encoded_iov) },
> 		{ compressed_data, sizeof(compressed_data) },
> 	};
> 	preadv2(fd, iov, 2, -1, RWF_ENCODED);
> 
> But what you described would look more like:
> 
> 	// Needs to be large enough for maximum returned header + data.
> 	char buffer[...];
> 	struct iovec iov[] = {
> 		{ buffer, sizeof(buffer) },
> 	};
> 	preadv2(fd, iov, 2, -1, RWF_ENCODED);
> 	// We should probably align the buffer.
> 	struct encoded_iov *encoded_iov = (void *)buffer;
> 	char *data = buffer + encoded_iov->size;
> 
> That's a little uglier, but it should work, and allows for arbitrary
> extensions. So, among these three alternatives (fixed size structure
> with reserved space, variable size structure like above, or ioctl),
> which would you prefer?

Variable-sized structure would seem to be the easiest from the kernel
POV and the interface is the easiest to describe - "you read the
encoded data preceded by the header"...
Matthew Wilcox June 24, 2021, 2 a.m. UTC | #27
On Wed, Jun 23, 2021 at 02:58:32PM -0700, Omar Sandoval wrote:
> But what you described would look more like:
> 
> 	// Needs to be large enough for maximum returned header + data.
> 	char buffer[...];
> 	struct iovec iov[] = {
> 		{ buffer, sizeof(buffer) },
> 	};
> 	preadv2(fd, iov, 2, -1, RWF_ENCODED);
> 	// We should probably align the buffer.
> 	struct encoded_iov *encoded_iov = (void *)buffer;
> 	char *data = buffer + encoded_iov->size;
> 
> That's a little uglier, but it should work, and allows for arbitrary
> extensions. So, among these three alternatives (fixed size structure
> with reserved space, variable size structure like above, or ioctl),
> which would you prefer?

Does that work for O_DIRECT and the required 512-byte alignment?
Omar Sandoval June 24, 2021, 6:14 a.m. UTC | #28
On Thu, Jun 24, 2021 at 03:00:39AM +0100, Matthew Wilcox wrote:
> On Wed, Jun 23, 2021 at 02:58:32PM -0700, Omar Sandoval wrote:
> > But what you described would look more like:
> > 
> > 	// Needs to be large enough for maximum returned header + data.
> > 	char buffer[...];
> > 	struct iovec iov[] = {
> > 		{ buffer, sizeof(buffer) },
> > 	};
> > 	preadv2(fd, iov, 2, -1, RWF_ENCODED);
> > 	// We should probably align the buffer.
> > 	struct encoded_iov *encoded_iov = (void *)buffer;
> > 	char *data = buffer + encoded_iov->size;
> > 
> > That's a little uglier, but it should work, and allows for arbitrary
> > extensions. So, among these three alternatives (fixed size structure
> > with reserved space, variable size structure like above, or ioctl),
> > which would you prefer?
> 
> Does that work for O_DIRECT and the required 512-byte alignment?

I suppose the kernel could pad the encoded_iov structure with zeroes to
the next sector boundary, since zeroes are effectively noops for
encoded_iov. (As an aside, RWF_ENCODED is always "direct I/O" in the
sense that it bypasses the page cache, but not necessarily in the sense
that it does DMA to/from the user buffers. The Btrfs implementation
doesn't do the latter yet.)
Christoph Hellwig June 24, 2021, 6:41 a.m. UTC | #29
I'm also really worried with overloading the regular r/w path and
iov_iter with ever more special cases.  We already have various
performance problems in the path, and adding more special cases ain't
gonna help.
Omar Sandoval June 24, 2021, 7:50 a.m. UTC | #30
On Thu, Jun 24, 2021 at 07:41:12AM +0100, Christoph Hellwig wrote:
> I'm also really worried with overloading the regular r/w path and
> iov_iter with ever more special cases.  We already have various
> performance problems in the path, and adding more special cases ain't
> gonna help.

The changes to the normal path are:

* An extra check for RWF_ENCODED and FMODE_ENCODED_IO in kiocb_set_rw_flags().
* Splitting some of the checks in generic_write_checks() into a new
  function.
* Checks for the IOCB_ENCODED flag in the filesystem's
  read_iter/write_iter.

At least for Btrfs, the rest happens in a completely separate code path.
So, there are a couple of extra checks, but it's not as drastic as it
might first appear.
Linus Torvalds June 24, 2021, 5:52 p.m. UTC | #31
On Wed, Jun 23, 2021 at 11:15 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Thu, Jun 24, 2021 at 03:00:39AM +0100, Matthew Wilcox wrote:
> >
> > Does that work for O_DIRECT and the required 512-byte alignment?
>
> I suppose the kernel could pad the encoded_iov structure with zeroes to
> the next sector boundary, since zeroes are effectively noops for
> encoded_iov.

Ugh.

I really think the whole "embed the control structure in the stream"
is wrong. The alignment issue is just another sign of that.

Separating it out is the right thing to do. At least the "first iov
entry" thing did separate the control structure from the actual data.
I detest the whole "embed the two together".

            Linus
Omar Sandoval June 24, 2021, 6:28 p.m. UTC | #32
On Thu, Jun 24, 2021 at 10:52:17AM -0700, Linus Torvalds wrote:
> On Wed, Jun 23, 2021 at 11:15 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Thu, Jun 24, 2021 at 03:00:39AM +0100, Matthew Wilcox wrote:
> > >
> > > Does that work for O_DIRECT and the required 512-byte alignment?
> >
> > I suppose the kernel could pad the encoded_iov structure with zeroes to
> > the next sector boundary, since zeroes are effectively noops for
> > encoded_iov.
> 
> Ugh.
> 
> I really think the whole "embed the control structure in the stream"
> is wrong. The alignment issue is just another sign of that.
> 
> Separating it out is the right thing to do. At least the "first iov
> entry" thing did separate the control structure from the actual data.
> I detest the whole "embed the two together".

I'll suggest the fixed-size struct encoded_iov again, then. If we're
willing to give up some of the flexibility of a variable size, then
userspace can always put the fixed-size structure in its own iovec or
include it inline with the data, depending on what's more convenient and
whether it's using O_DIRECT. A fixed size is much easier for both the
kernel and userspace to deal with. Do we really need to support
unlimited extensions to encoded_iov, or can we stick 32-64 bytes of
reserved space at the end of the structure and call it a day?
Linus Torvalds June 24, 2021, 9:07 p.m. UTC | #33
On Thu, Jun 24, 2021 at 11:28 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> I'll suggest the fixed-size struct encoded_iov again, then. If we're
> willing to give up some of the flexibility of a variable size, then
> userspace can always put the fixed-size structure in its own iovec or
> include it inline with the data, depending on what's more convenient and
> whether it's using O_DIRECT.

I really would prefer to have the separate pointer to it.

Fixed size doesn't help. It's still "mixed in" unless you have a
clearly separate pointer. Sure, user space *could* use a separate iov
entry if it wants to, but then it becomes a user choice rather than
part of the design.

That separate data structure would be the only way to do it for a
ioctl() interface, but in the readv/writev world the whole separate
"first iov entry" does that too.

I also worry that this "raw compressed data" thing isn't the only
thing people will want to do. I could easily see some kind of
"end-to-end CRC read/write" where the user passes in not just the
data, but also checksums for it to validate it (maybe because you're
doing a file copy and had the original checksums, but also maybe
because user space simply has a known good copy and doesn't want
errors re-introduced due to memory corruption).

And I continue to think that this whole issue isn't all that different
from the FSVERITY thing.

Of course, the real take-away is that "preadv2/pwritev2()" is a
horrible interface. It should have been more extensible, rather than
the lazy "just add another flag argument".

I think we finally may have gotten a real extensible interface right
with openat2(), and that "open_how" thing, but maybe I'm being naive
and it will turn out that that wasn't so great either.

Maybe we'll some day end up with a "preadv3()" that has an extensible
"struct io_how" argument.

Interfaces are hard.

                Linus
Martin K. Petersen June 24, 2021, 10:41 p.m. UTC | #34
Linus,

> I also worry that this "raw compressed data" thing isn't the only
> thing people will want to do. I could easily see some kind of
> "end-to-end CRC read/write" where the user passes in not just the
> data, but also checksums for it to validate it (maybe because you're
> doing a file copy and had the original checksums, but also maybe
> because user space simply has a known good copy and doesn't want
> errors re-introduced due to memory corruption).

We already support passing CRCs down to be validated by the hardware for
both NVMe and SCSI. This currently only works from the block layer
down. When enabled, the checksums are generated by the block layer for
writes and the data is validated against the checksums sent by the
storage on reads.

Over the years various attempts at adding support for passing the
checksum buffers in from userland have failed for exactly the reasons
outlined in this thread (Joel, Darrick, Bob). Would love to have a
generic way of passing this kind of information...
Matthew Wilcox June 25, 2021, 3:38 a.m. UTC | #35
On Thu, Jun 24, 2021 at 06:41:52PM -0400, Martin K. Petersen wrote:
> 
> Linus,
> 
> > I also worry that this "raw compressed data" thing isn't the only
> > thing people will want to do. I could easily see some kind of
> > "end-to-end CRC read/write" where the user passes in not just the
> > data, but also checksums for it to validate it (maybe because you're
> > doing a file copy and had the original checksums, but also maybe
> > because user space simply has a known good copy and doesn't want
> > errors re-introduced due to memory corruption).
> 
> We already support passing CRCs down to be validated by the hardware for
> both NVMe and SCSI. This currently only works from the block layer
> down. When enabled, the checksums are generated by the block layer for
> writes and the data is validated against the checksums sent by the
> storage on reads.
> 
> Over the years various attempts at adding support for passing the
> checksum buffers in from userland have failed for exactly the reasons
> outlined in this thread (Joel, Darrick, Bob). Would love to have a
> generic way of passing this kind of information...

Does it make any kind of sense to talk about doing this for buffered I/O,
given that we can't generate them for (eg) mmaped files?  Or does this
only make sense to pass in for O_DIRECT accesses?
Linus Torvalds June 25, 2021, 4:16 p.m. UTC | #36
On Thu, Jun 24, 2021 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Does it make any kind of sense to talk about doing this for buffered I/O,
> given that we can't generate them for (eg) mmaped files?

Sure we can.

Or rather, some people might very well like to do it even for mutable
data. In fact, _especially_ for mutable data.

You might want to do things like "write out the state I verified just
a moment ago", and if it has changed since then, you *want* the result
to be invalid because the checksums no longer match - in case somebody
else changed the data you used for the state calculation and
verification in the meantime. It's very much why you'd want a separate
checksum in the first place.

Yeah, yeah,  you can - and people do - just do things like this with a
separate checksum. But if you know that the filesystem has internal
checksumming support _anyway_, you might want to use it, and basically
say "use this checksum, if the data doesn't match when I read it back
I want to get an IO error".

(The "data doesn't match" _could_ be just due to DRAM corruption etc,
of course. Some people care about things like that. You want
"verified" filesystem contents - it might not be about security, it
might simply be about "I have validated this data and if it's not the
same data any more it's useless and I need to re-generate it").

Am I a big believer in this model? No. Portability concerns (across
OS'es, across filesystems, even just across backups on the same exact
system) means that even if we did this, very few people would use it.

People who want this end up using an external checksum instead and do
it outside of and separately from the actual IO, because then they can
do it on existing systems.

So my argument is not "we want this". My argument is purely that some
buffered filesystem IO case isn't actually any different from the
traditional "I want access to the low-level sector hardware checksum
data". The use cases are basically exactly the same.

Of course, basically nobody does that hw sector checksum either, for
all the same reasons, even if it's been around for decades.

So my "checksum metadata interface" is not something I'm a big
believer in, but I really don't think it's really all _that_ different
from the whole "compressed format interface" that this whole patch
series is about. They are pretty much the same thing in many ways.

                Linus
Omar Sandoval June 25, 2021, 9:07 p.m. UTC | #37
On Fri, Jun 25, 2021 at 09:16:15AM -0700, Linus Torvalds wrote:
> On Thu, Jun 24, 2021 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Does it make any kind of sense to talk about doing this for buffered I/O,
> > given that we can't generate them for (eg) mmaped files?
> 
> Sure we can.
> 
> Or rather, some people might very well like to do it even for mutable
> data. In fact, _especially_ for mutable data.
> 
> You might want to do things like "write out the state I verified just
> a moment ago", and if it has changed since then, you *want* the result
> to be invalid because the checksums no longer match - in case somebody
> else changed the data you used for the state calculation and
> verification in the meantime. It's very much why you'd want a separate
> checksum in the first place.
> 
> Yeah, yeah,  you can - and people do - just do things like this with a
> separate checksum. But if you know that the filesystem has internal
> checksumming support _anyway_, you might want to use it, and basically
> say "use this checksum, if the data doesn't match when I read it back
> I want to get an IO error".
> 
> (The "data doesn't match" _could_ be just due to DRAM corruption etc,
> of course. Some people care about things like that. You want
> "verified" filesystem contents - it might not be about security, it
> might simply be about "I have validated this data and if it's not the
> same data any more it's useless and I need to re-generate it").
> 
> Am I a big believer in this model? No. Portability concerns (across
> OS'es, across filesystems, even just across backups on the same exact
> system) means that even if we did this, very few people would use it.
> 
> People who want this end up using an external checksum instead and do
> it outside of and separately from the actual IO, because then they can
> do it on existing systems.
> 
> So my argument is not "we want this". My argument is purely that some
> buffered filesystem IO case isn't actually any different from the
> traditional "I want access to the low-level sector hardware checksum
> data". The use cases are basically exactly the same.
> 
> Of course, basically nobody does that hw sector checksum either, for
> all the same reasons, even if it's been around for decades.
> 
> So my "checksum metadata interface" is not something I'm a big
> believer in, but I really don't think it's really all _that_ different
> from the whole "compressed format interface" that this whole patch
> series is about. They are pretty much the same thing in many ways.

I see the similarity in the sense that we basically want to pass some
extra metadata down with the read or write. So then do we want to add
preadv3/pwritev3 for encoded I/O now so that checksums can use it in the
future? The encoding metadata could go in this "struct io_how", either
directly or in a separate structure with a pointer in "struct io_how".
It could get messy with compat syscalls.
Omar Sandoval July 7, 2021, 5:59 p.m. UTC | #38
On Fri, Jun 25, 2021 at 02:07:59PM -0700, Omar Sandoval wrote:
> On Fri, Jun 25, 2021 at 09:16:15AM -0700, Linus Torvalds wrote:
> > On Thu, Jun 24, 2021 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > Does it make any kind of sense to talk about doing this for buffered I/O,
> > > given that we can't generate them for (eg) mmaped files?
> > 
> > Sure we can.
> > 
> > Or rather, some people might very well like to do it even for mutable
> > data. In fact, _especially_ for mutable data.
> > 
> > You might want to do things like "write out the state I verified just
> > a moment ago", and if it has changed since then, you *want* the result
> > to be invalid because the checksums no longer match - in case somebody
> > else changed the data you used for the state calculation and
> > verification in the meantime. It's very much why you'd want a separate
> > checksum in the first place.
> > 
> > Yeah, yeah,  you can - and people do - just do things like this with a
> > separate checksum. But if you know that the filesystem has internal
> > checksumming support _anyway_, you might want to use it, and basically
> > say "use this checksum, if the data doesn't match when I read it back
> > I want to get an IO error".
> > 
> > (The "data doesn't match" _could_ be just due to DRAM corruption etc,
> > of course. Some people care about things like that. You want
> > "verified" filesystem contents - it might not be about security, it
> > might simply be about "I have validated this data and if it's not the
> > same data any more it's useless and I need to re-generate it").
> > 
> > Am I a big believer in this model? No. Portability concerns (across
> > OS'es, across filesystems, even just across backups on the same exact
> > system) means that even if we did this, very few people would use it.
> > 
> > People who want this end up using an external checksum instead and do
> > it outside of and separately from the actual IO, because then they can
> > do it on existing systems.
> > 
> > So my argument is not "we want this". My argument is purely that some
> > buffered filesystem IO case isn't actually any different from the
> > traditional "I want access to the low-level sector hardware checksum
> > data". The use cases are basically exactly the same.
> > 
> > Of course, basically nobody does that hw sector checksum either, for
> > all the same reasons, even if it's been around for decades.
> > 
> > So my "checksum metadata interface" is not something I'm a big
> > believer in, but I really don't think it's really all _that_ different
> > from the whole "compressed format interface" that this whole patch
> > series is about. They are pretty much the same thing in many ways.
> 
> I see the similarity in the sense that we basically want to pass some
> extra metadata down with the read or write. So then do we want to add
> preadv3/pwritev3 for encoded I/O now so that checksums can use it in the
> future? The encoding metadata could go in this "struct io_how", either
> directly or in a separate structure with a pointer in "struct io_how".
> It could get messy with compat syscalls.

Ping. What's the path forward here? At this point, it seems like an
ioctl is the path of least resistance.
Josef Bacik July 19, 2021, 3:44 p.m. UTC | #39
On 7/7/21 1:59 PM, Omar Sandoval wrote:
> On Fri, Jun 25, 2021 at 02:07:59PM -0700, Omar Sandoval wrote:
>> On Fri, Jun 25, 2021 at 09:16:15AM -0700, Linus Torvalds wrote:
>>> On Thu, Jun 24, 2021 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> Does it make any kind of sense to talk about doing this for buffered I/O,
>>>> given that we can't generate them for (eg) mmaped files?
>>>
>>> Sure we can.
>>>
>>> Or rather, some people might very well like to do it even for mutable
>>> data. In fact, _especially_ for mutable data.
>>>
>>> You might want to do things like "write out the state I verified just
>>> a moment ago", and if it has changed since then, you *want* the result
>>> to be invalid because the checksums no longer match - in case somebody
>>> else changed the data you used for the state calculation and
>>> verification in the meantime. It's very much why you'd want a separate
>>> checksum in the first place.
>>>
>>> Yeah, yeah,  you can - and people do - just do things like this with a
>>> separate checksum. But if you know that the filesystem has internal
>>> checksumming support _anyway_, you might want to use it, and basically
>>> say "use this checksum, if the data doesn't match when I read it back
>>> I want to get an IO error".
>>>
>>> (The "data doesn't match" _could_ be just due to DRAM corruption etc,
>>> of course. Some people care about things like that. You want
>>> "verified" filesystem contents - it might not be about security, it
>>> might simply be about "I have validated this data and if it's not the
>>> same data any more it's useless and I need to re-generate it").
>>>
>>> Am I a big believer in this model? No. Portability concerns (across
>>> OS'es, across filesystems, even just across backups on the same exact
>>> system) means that even if we did this, very few people would use it.
>>>
>>> People who want this end up using an external checksum instead and do
>>> it outside of and separately from the actual IO, because then they can
>>> do it on existing systems.
>>>
>>> So my argument is not "we want this". My argument is purely that some
>>> buffered filesystem IO case isn't actually any different from the
>>> traditional "I want access to the low-level sector hardware checksum
>>> data". The use cases are basically exactly the same.
>>>
>>> Of course, basically nobody does that hw sector checksum either, for
>>> all the same reasons, even if it's been around for decades.
>>>
>>> So my "checksum metadata interface" is not something I'm a big
>>> believer in, but I really don't think it's really all _that_ different
>>> from the whole "compressed format interface" that this whole patch
>>> series is about. They are pretty much the same thing in many ways.
>>
>> I see the similarity in the sense that we basically want to pass some
>> extra metadata down with the read or write. So then do we want to add
>> preadv3/pwritev3 for encoded I/O now so that checksums can use it in the
>> future? The encoding metadata could go in this "struct io_how", either
>> directly or in a separate structure with a pointer in "struct io_how".
>> It could get messy with compat syscalls.
> 
> Ping. What's the path forward here? At this point, it seems like an
> ioctl is the path of least resistance.
> 

At this point we've been deadlocked on this for too long.  Put it in a btrfs 
IOCTL, if somebody wants to extend it generically in the future then godspeed, 
we can tie into that interface after the fact.  Thanks,

Josef
diff mbox series

Patch

diff --git a/include/linux/uio.h b/include/linux/uio.h
index d3ec87706d75..cbaf6b3bfcbc 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -129,6 +129,7 @@  size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
+int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i);
 
 size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c701b7a187f2..129f264416ff 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -995,6 +995,97 @@  size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
+/**
+ * copy_struct_from_iter - copy a struct from an iov_iter
+ * @dst: Destination buffer.
+ * @ksize: Size of @dst struct.
+ * @i: Source iterator.
+ *
+ * Copies a struct from an iov_iter in a way that guarantees
+ * backwards-compatibility for struct arguments in an iovec (as long as the
+ * rules for copy_struct_from_user() are followed).
+ *
+ * The source struct is assumed to be stored in the current segment of the
+ * iov_iter, and its size is the size of the current segment. The iov_iter must
+ * be positioned at the beginning of the current segment.
+ *
+ * The recommended usage is something like the following:
+ *
+ *   int do_foo(struct iov_iter *i)
+ *   {
+ *     size_t usize = iov_iter_single_seg_count(i);
+ *     struct foo karg;
+ *     int err;
+ *
+ *     if (usize > PAGE_SIZE)
+ *       return -E2BIG;
+ *     if (usize < FOO_SIZE_VER0)
+ *       return -EINVAL;
+ *     err = copy_struct_from_iter(&karg, sizeof(karg), i);
+ *     if (err)
+ *       return err;
+ *
+ *     // ...
+ *   }
+ *
+ * Returns 0 on success or one of the following errors:
+ *  * -E2BIG:  (size of current segment > @ksize) and there are non-zero
+ *             trailing bytes in the current segment.
+ *  * -EFAULT: access to userspace failed.
+ *  * -EINVAL: the iterator is not at the beginning of the current segment.
+ *
+ * On success, the iterator is advanced to the next segment. On error, the
+ * iterator is not advanced.
+ */
+int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i)
+{
+	size_t usize;
+	int ret;
+
+	if (i->iov_offset != 0)
+		return -EINVAL;
+	if (iter_is_iovec(i)) {
+		usize = i->iov->iov_len;
+		might_fault();
+		if (copyin(dst, i->iov->iov_base, min(ksize, usize)))
+			return -EFAULT;
+		if (usize > ksize) {
+			ret = check_zeroed_user(i->iov->iov_base + ksize,
+						usize - ksize);
+			if (ret < 0)
+				return ret;
+			else if (ret == 0)
+				return -E2BIG;
+		}
+	} else if (iov_iter_is_kvec(i)) {
+		usize = i->kvec->iov_len;
+		memcpy(dst, i->kvec->iov_base, min(ksize, usize));
+		if (usize > ksize &&
+		    memchr_inv(i->kvec->iov_base + ksize, 0, usize - ksize))
+			return -E2BIG;
+	} else if (iov_iter_is_bvec(i)) {
+		char *p;
+
+		usize = i->bvec->bv_len;
+		p = kmap_local_page(i->bvec->bv_page);
+		memcpy(dst, p + i->bvec->bv_offset, min(ksize, usize));
+		if (usize > ksize &&
+		    memchr_inv(p + i->bvec->bv_offset + ksize, 0,
+			       usize - ksize)) {
+			kunmap_local(p);
+			return -E2BIG;
+		}
+		kunmap_local(p);
+	} else {
+		return -EFAULT;
+	}
+	if (usize < ksize)
+		memset(dst + usize, 0, ksize - usize);
+	iov_iter_advance(i, usize);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(copy_struct_from_iter);
+
 static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;