diff mbox series

[v4] btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl

Message ID 20200629075329.36969-1-johannes.thumshirn@wdc.com (mailing list archive)
State New, archived
Headers show
Series [v4] btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl | expand

Commit Message

Johannes Thumshirn June 29, 2020, 7:53 a.m. UTC
With the recent addition of filesystem checksum types other than CRC32c,
it is not anymore hard-coded which checksum type a btrfs filesystem uses.

Up to now there is no good way to read the filesystem checksum, apart from
reading the filesystem UUID and then query sysfs for the checksum type.

Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
command which usually is used to query filesystem features. Also add a
flags member indicating that the kernel responded with a set csum_type and
csum_size field.

For compatibility reasons, only return the csum_type and csum_size if the
BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE flag was passed to the kernel. Also
clear any unknown flags so we don't pass false positives to user-space
newer than the kernel.

To simplify further additions to the ioctl, also switch the padding to a
u8 array. Pahole was used to verify the result of this switch:

pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
struct btrfs_ioctl_fs_info_args {
        __u64                      max_id;               /*     0     8 */
        __u64                      num_devices;          /*     8     8 */
        __u8                       fsid[16];             /*    16    16 */
        __u32                      nodesize;             /*    32     4 */
        __u32                      sectorsize;           /*    36     4 */
        __u32                      clone_alignment;      /*    40     4 */
        __u32                      flags;                /*    44     4 */
        __u16                      csum_type;            /*    48     2 */
        __u16                      csum_size;            /*    50     2 */
        __u8                       reserved[972];        /*    52   972 */

        /* size: 1024, cachelines: 16, members: 10 */
};

Fixes: 3951e7f050ac ("btrfs: add xxhash64 to checksumming algorithms")
Fixes: 3831bf0094ab ("btrfs: add sha256 to checksumming algorithm")
CC: stable@vger.kernel.org # 5.5+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
Changes to v3:
* make flags in/out (David)
* make csum return opt-in (Hans)

Changes to v2:
* add additional csum_size (David)
* rename flag value to BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE to reflect
  additional size

Changes to v1:
* add 'out' comment to be consistent (Hans)
* remove le16_to_cpu() (kbuild robot)
* switch padding to be all u8 (David)
---
 fs/btrfs/ioctl.c           | 16 +++++++++++++---
 include/uapi/linux/btrfs.h | 14 ++++++++++++--
 2 files changed, 25 insertions(+), 5 deletions(-)

Comments

Hans van Kranenburg June 30, 2020, 12:25 a.m. UTC | #1
Hi,

Ok, this is more complicated than I thought, again.

On 6/29/20 9:53 AM, Johannes Thumshirn wrote:
> With the recent addition of filesystem checksum types other than CRC32c,
> it is not anymore hard-coded which checksum type a btrfs filesystem uses.
> 
> Up to now there is no good way to read the filesystem checksum, apart from
> reading the filesystem UUID and then query sysfs for the checksum type.
> 
> Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
> command which usually is used to query filesystem features. Also add a
> flags member indicating that the kernel responded with a set csum_type and
> csum_size field.
> 
> For compatibility reasons, only return the csum_type and csum_size if the
> BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE flag was passed to the kernel.

No, that's not for 'compatibility reasons'. It's to allow "fine grained
data retrieval".

My question for David is still relevant: "If we think that with whatever
being added in the future the output will still only contain a bunch of
values that are very cheap to collect when filling the response buffer,
then why would you still want to have this?"

(The 'for compatibility reason' is about the V1 and V2 of the patch
behavior, that sets the very useful flags output field so that the
caller knows which parts of the result actually have a meaning.)

---- 8< ----

The thing is that accepting input opens cans (tm) of worms (tm).

First of all, there is the issue of the lack of checking for a zeroed
buffer from user space.

As we found out, that's actually a bigger problem for an ioctl where the
flags (and garbage in them - possibly silently) modify behavior of
execution and subsequent result (LOGICAL_INO case) than in the case of
just a data collecting ioctl which returns a bunch of values: garbage
in, garbage out, and the (old) calling code is probably ignoring both
input and output fields.

However, demanding unused fields (and unused input flags) to be zero
(like LOGICAL_INO_V2 does now) has an interesting side effect... If I
ask for flags 11110011011101111, and I get -EINVAL, then apparently I'm
running some kernel that is missing support for one of the bits...
¯\_(ツ)_/¯ Good luck finding out which one of the 1s is problematic. Try
all permutations? Try them one by one? Or was there any other path in
the code that returned the -EINVAL?

(We had a discussion about this on IRC earlier this evening.)

So, to help user space a bit, you could then use a flags /* out */ field
that, in combination with -EINVAL will contain the flags of all
functionality that the running kernel can actually support. Then the
calling code can decide if there's something acceptable in there, or to
error out and tell the user to upgrade. This needs to be there from the
early days of the ioctl of course, or when starting to use more
previously unused fields that were properly zeroed out in the output before.

Also, using the same flags field for input as well as for output seems
to be clever because it saves some bytes which can be used for other
stuff in the future, but an interesting remark made was that if anyone
ever wants to write support for a tool like strace for the ioctl, then
it would be oh so nice if that could be stateless, so it can tell you
something like "ha! the user requested features WXYZ and it was reported
back that only X and Z are supported", instead of having to jump through
inconvenient hoops trying to save state from earlier data seen flying by.

I know that in general ioctl stuff is a big mess already, but it would
be fun to think of things that can be done to decrease the rate in which
it's getting even worse, for example by collecting some best practices
and been-there-done-that-oops examples in some guide or whatever... "Ok,
I have this input buffer which was never checked, what options do I have
left before having to do _V+1?". And I do not think that should be done
in a hurry while we want to have this patch in now.

---- >8 ----

So, (apologies if I'm already causing you a headache right at the
beginning of the week), I'd say... Let's for now park everything I just
wrote and drop the whole idea of using input flags for FS_INFO, and keep
it output-only and just properly document the output values that have a
meaning with the flags bits. Because at least we know that the unused
output fields were zeroed before (phew!!).

> Also
> clear any unknown flags so we don't pass false positives to user-space
> newer than the kernel.

^^ about the /* in/out */

> To simplify further additions to the ioctl, also switch the padding to a
> u8 array. Pahole was used to verify the result of this switch:
> 
> pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
> struct btrfs_ioctl_fs_info_args {
>         __u64                      max_id;               /*     0     8 */
>         __u64                      num_devices;          /*     8     8 */
>         __u8                       fsid[16];             /*    16    16 */
>         __u32                      nodesize;             /*    32     4 */
>         __u32                      sectorsize;           /*    36     4 */
>         __u32                      clone_alignment;      /*    40     4 */
>         __u32                      flags;                /*    44     4 */
>         __u16                      csum_type;            /*    48     2 */
>         __u16                      csum_size;            /*    50     2 */
>         __u8                       reserved[972];        /*    52   972 */
> 
>         /* size: 1024, cachelines: 16, members: 10 */
> };
> 
> Fixes: 3951e7f050ac ("btrfs: add xxhash64 to checksumming algorithms")
> Fixes: 3831bf0094ab ("btrfs: add sha256 to checksumming algorithm")
> CC: stable@vger.kernel.org # 5.5+
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
> Changes to v3:
> * make flags in/out (David)
> * make csum return opt-in (Hans)
> 
> Changes to v2:
> * add additional csum_size (David)
> * rename flag value to BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE to reflect
>   additional size
> 
> Changes to v1:
> * add 'out' comment to be consistent (Hans)
> * remove le16_to_cpu() (kbuild robot)
> * switch padding to be all u8 (David)
> ---
>  fs/btrfs/ioctl.c           | 16 +++++++++++++---
>  include/uapi/linux/btrfs.h | 14 ++++++++++++--
>  2 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index b3e4c632d80c..4d70b918f656 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3198,11 +3198,15 @@ static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info,
>  	struct btrfs_ioctl_fs_info_args *fi_args;
>  	struct btrfs_device *device;
>  	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> +	u32 inflags;
>  	int ret = 0;
>  
> -	fi_args = kzalloc(sizeof(*fi_args), GFP_KERNEL);
> -	if (!fi_args)
> -		return -ENOMEM;
> +	fi_args = memdup_user(arg, sizeof(*fi_args));
> +	if (IS_ERR(fi_args))
> +		return PTR_ERR(fi_args);
> +
> +	inflags = fi_args->flags;
> +	fi_args->flags = 0;

Now you got the flags, but the user provided contents are still present
in parts of fi_args that are not going to be overwritten. I don't think
that's what you would have wanted.

What about zeroing the whole fi_args again after getting the input args
out? At least in the past all unused output was zeroed out. Now it's
garbage in, garbage out.

But if we go back to out-only, that's not relevant any more. :)

>  	rcu_read_lock();
>  	fi_args->num_devices = fs_devices->num_devices;
> @@ -3218,6 +3222,12 @@ static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info,
>  	fi_args->sectorsize = fs_info->sectorsize;
>  	fi_args->clone_alignment = fs_info->sectorsize;
>  
> +	if (inflags & BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE) {
> +		fi_args->csum_type = btrfs_super_csum_type(fs_info->super_copy);
> +		fi_args->csum_size = btrfs_super_csum_size(fs_info->super_copy);
> +		fi_args->flags |= BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE;
> +	}
> +
>  	if (copy_to_user(arg, fi_args, sizeof(*fi_args)))
>  		ret = -EFAULT;
>  
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index e6b6cb0f8bc6..c130eaea416e 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -250,10 +250,20 @@ struct btrfs_ioctl_fs_info_args {
>  	__u32 nodesize;				/* out */
>  	__u32 sectorsize;			/* out */
>  	__u32 clone_alignment;			/* out */
> -	__u32 reserved32;
> -	__u64 reserved[122];			/* pad to 1k */
> +	__u32 flags;				/* in/out */
> +	__u16 csum_type;			/* out */
> +	__u16 csum_size;			/* out */
> +	__u8 reserved[972];			/* pad to 1k */
>  };
>  
> +/*
> + * fs_info ioctl flags
> + *
> + * Used by:
> + * struct btrfs_ioctl_fs_info_args
> + */
> +#define BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE		(1 << 0)
> +
>  /*
>   * feature flags
>   *
> 

K
Johannes Thumshirn June 30, 2020, 1:09 p.m. UTC | #2
On 30/06/2020 02:25, Hans van Kranenburg wrote:
> So, (apologies if I'm already causing you a headache right at the
> beginning of the week), I'd say... Let's for now park everything I just
> wrote and drop the whole idea of using input flags for FS_INFO, and keep
> it output-only and just properly document the output values that have a
> meaning with the flags bits. Because at least we know that the unused
> output fields were zeroed before (phew!!).

So basically you say we're going back to v3? David, do you want me to 
resend v3 as v5?
Hans van Kranenburg June 30, 2020, 2:08 p.m. UTC | #3
On 6/30/20 3:09 PM, Johannes Thumshirn wrote:
> On 30/06/2020 02:25, Hans van Kranenburg wrote:
>> So, (apologies if I'm already causing you a headache right at the
>> beginning of the week), I'd say... Let's for now park everything I just
>> wrote and drop the whole idea of using input flags for FS_INFO, and keep
>> it output-only and just properly document the output values that have a
>> meaning with the flags bits. Because at least we know that the unused
>> output fields were zeroed before (phew!!).
> 
> So basically you say we're going back to v3? David, do you want me to 
> resend v3 as v5?

I have been thinking out loud, mostly, and then after a night of sleep I
would think: "wait, no, what I thought was a very good idea isn't at
all, actually".

I hope that does not come across as "telling you what to do". ;] I'd
rather hear if someone can spot new flaws in my reasoning.

Hans
diff mbox series

Patch

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b3e4c632d80c..4d70b918f656 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3198,11 +3198,15 @@  static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info,
 	struct btrfs_ioctl_fs_info_args *fi_args;
 	struct btrfs_device *device;
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	u32 inflags;
 	int ret = 0;
 
-	fi_args = kzalloc(sizeof(*fi_args), GFP_KERNEL);
-	if (!fi_args)
-		return -ENOMEM;
+	fi_args = memdup_user(arg, sizeof(*fi_args));
+	if (IS_ERR(fi_args))
+		return PTR_ERR(fi_args);
+
+	inflags = fi_args->flags;
+	fi_args->flags = 0;
 
 	rcu_read_lock();
 	fi_args->num_devices = fs_devices->num_devices;
@@ -3218,6 +3222,12 @@  static long btrfs_ioctl_fs_info(struct btrfs_fs_info *fs_info,
 	fi_args->sectorsize = fs_info->sectorsize;
 	fi_args->clone_alignment = fs_info->sectorsize;
 
+	if (inflags & BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE) {
+		fi_args->csum_type = btrfs_super_csum_type(fs_info->super_copy);
+		fi_args->csum_size = btrfs_super_csum_size(fs_info->super_copy);
+		fi_args->flags |= BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE;
+	}
+
 	if (copy_to_user(arg, fi_args, sizeof(*fi_args)))
 		ret = -EFAULT;
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index e6b6cb0f8bc6..c130eaea416e 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -250,10 +250,20 @@  struct btrfs_ioctl_fs_info_args {
 	__u32 nodesize;				/* out */
 	__u32 sectorsize;			/* out */
 	__u32 clone_alignment;			/* out */
-	__u32 reserved32;
-	__u64 reserved[122];			/* pad to 1k */
+	__u32 flags;				/* in/out */
+	__u16 csum_type;			/* out */
+	__u16 csum_size;			/* out */
+	__u8 reserved[972];			/* pad to 1k */
 };
 
+/*
+ * fs_info ioctl flags
+ *
+ * Used by:
+ * struct btrfs_ioctl_fs_info_args
+ */
+#define BTRFS_FS_INFO_FLAG_CSUM_TYPE_SIZE		(1 << 0)
+
 /*
  * feature flags
  *