[09/10,V2] Use FIEMAP for FIBMAP calls

Message ID	20181205091728.29903-10-cmaiolino@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Carlos Maiolino <cmaiolino@redhat.com> To: linux-fsdevel@vger.kernel.org Cc: hch@lst.de, adilger@dilger.ca, sandeen@redhat.com, david@fromorbit.com Subject: [PATCH 09/10 V2] Use FIEMAP for FIBMAP calls Date: Wed, 5 Dec 2018 10:17:27 +0100 Message-Id: <20181205091728.29903-10-cmaiolino@redhat.com> In-Reply-To: <20181205091728.29903-1-cmaiolino@redhat.com> References: <20181205091728.29903-1-cmaiolino@redhat.com> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	New ->fiemap infrastructure and ->bmap removal \| expand [00/10,V2] New ->fiemap infrastructure and ->bmap removal [01/10] fs: Enable bmap() function to properly return errors [02/10] cachefiles: drop direct usage of ->bmap method. [03/10] ecryptfs: drop direct calls to ->bmap [04/10,V2] fibmap: Use bmap instead of ->bmap method in ioctl_fibmap [05/10] fs: Move start and length fiemap fields into fiemap_extent_info [06/10] iomap: Remove length and start fields from iomap_fiemap [07/10] fs: Use a void pointer to store fiemap_extent [08/10,V2] fiemap: Use a callback to fill fiemap extents [09/10,V2] Use FIEMAP for FIBMAP calls [10/10] xfs: Get rid of ->bmap

Carlos Maiolino Dec. 5, 2018, 9:17 a.m. UTC

Enables the usage of FIEMAP ioctl infrastructure to handle FIBMAP calls.
From now on, ->bmap() methods can start to be removed from filesystems
which already provides ->fiemap().

This adds a new helper - bmap_fiemap() - which is used to fill in the
fiemap request, call into the underlying filesystem and check the flags
set in the extent requested.

Add a new fiemap fill extent callback to handl the in-kernel only
fiemap_extent structure used for FIBMAP.

V2:
	- Now based on the updated fiemap_extent_info,
	- move the fiemap call itself to a new helper

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
 fs/inode.c         | 42 ++++++++++++++++++++++++++++++++++++++++--
 fs/ioctl.c         | 32 ++++++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 3 files changed, 74 insertions(+), 2 deletions(-)

Darrick J. Wong Dec. 5, 2018, 5:36 p.m. UTC | #1

On Wed, Dec 05, 2018 at 10:17:27AM +0100, Carlos Maiolino wrote:
> Enables the usage of FIEMAP ioctl infrastructure to handle FIBMAP calls.
> From now on, ->bmap() methods can start to be removed from filesystems
> which already provides ->fiemap().
> 
> This adds a new helper - bmap_fiemap() - which is used to fill in the
> fiemap request, call into the underlying filesystem and check the flags
> set in the extent requested.
> 
> Add a new fiemap fill extent callback to handl the in-kernel only
> fiemap_extent structure used for FIBMAP.
> 
> V2:
> 	- Now based on the updated fiemap_extent_info,
> 	- move the fiemap call itself to a new helper
> 
> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> ---
>  fs/inode.c         | 42 ++++++++++++++++++++++++++++++++++++++++--
>  fs/ioctl.c         | 32 ++++++++++++++++++++++++++++++++
>  include/linux/fs.h |  2 ++
>  3 files changed, 74 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index db681d310465..f07cc183ddbd 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1578,6 +1578,40 @@ void iput(struct inode *inode)
>  }
>  EXPORT_SYMBOL(iput);
>  
> +static int bmap_fiemap(struct inode *inode, sector_t *block)
> +{
> +	struct fiemap_extent_info fieinfo = { 0, };
> +	struct fiemap_extent fextent;
> +	u64 start = *block << inode->i_blkbits;
> +	int error = -EINVAL;
> +
> +	fextent.fe_logical = 0;
> +	fextent.fe_physical = 0;
> +	fieinfo.fi_extents_max = 1;
> +	fieinfo.fi_extents_mapped = 0;
> +	fieinfo.fi_extents_start = &fextent;
> +	fieinfo.fi_start = start;
> +	fieinfo.fi_len = 1 << inode->i_blkbits;
> +	fieinfo.fi_flags = 0;
> +	fieinfo.fi_cb = fiemap_fill_kernel_extent;
> +
> +	error = inode->i_op->fiemap(inode, &fieinfo);
> +
> +	if (error)
> +		return error;
> +
> +	if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN |
> +				FIEMAP_EXTENT_ENCODED |
> +				FIEMAP_EXTENT_DATA_INLINE |
> +				FIEMAP_EXTENT_UNWRITTEN))
> +		return -EINVAL;

AFAICT, three of the filesystems that support COW writes (xfs, ocfs2,
and btrfs) do not return bmap results for files with shared blocks.
This check here should include FIEMAP_EXTENT_SHARED since external
overwrites of a COW file block are bad news on btrfs (and ocfs2 and
xfs).

> +
> +	*block = (fextent.fe_physical +
> +		  (start - fextent.fe_logical)) >> inode->i_blkbits;

Hmmm, so there's nothing here checking that the physical device fiemap
reports is the same device that was passed into the mount.  This is
trivially true for most of the filesystems that implement bmap and
fiemap, but definitely not true for xfs or btrfs.  I would bet most
userspace callers of bmap (since it's an ext2-era ioctl) make that
assumption and don't even know how to find the device.

On xfs, the bmap implementation won't return any results for realtime
files, but it looks as though we suddenly will start doing that here,
because in the new bmap implementation we will use fiemap, and fiemap
reports extents without providing any context about which device they're
on, and that context-less extent gets passed back to bmap_fiemap.

In any case, I think a better solution to the multi-device problem is to
start returning device information via struct fiemap_extent, at least
inside the kernel.  Use one of the reserved fields to declare a new
'__u32 fe_device' field in struct fiemap_extent which can be the dev_t
device number, and then you can check that against inode->i_sb->s_bdev
to avoid returning results for the non-primary device of a multi-device
filesystem.

> +
> +	return error;
> +}
> +
>  /**
>   *	bmap	- find a block number in a file
>   *	@inode:  inode owning the block number being requested
> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
>   */
>  int bmap(struct inode *inode, sector_t *block)
>  {
> -	if (!inode->i_mapping->a_ops->bmap)
> +	if (inode->i_op->fiemap)
> +		return bmap_fiemap(inode, block);
> +	else if (inode->i_mapping->a_ops->bmap)
> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> +						       *block);
> +	else
>  		return -EINVAL;

Waitaminute.  btrfs currently supports fiemap but not bmap, and now
suddenly it will support this legacy interface they've never supported
before.  Are they on board with this?

--D

>  
> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
>  	return 0;
>  }
>  EXPORT_SYMBOL(bmap);
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 6086978fe01e..bfa59df332bf 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>  }
>  
> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> +			    u64 phys, u64 len, u32 flags)
> +{
> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> +
> +	/* only count the extents */
> +	if (fieinfo->fi_extents_max == 0) {
> +		fieinfo->fi_extents_mapped++;
> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> +	}
> +
> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> +		return 1;
> +
> +	if (flags & SET_UNKNOWN_FLAGS)
> +		flags |= FIEMAP_EXTENT_UNKNOWN;
> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> +		flags |= FIEMAP_EXTENT_ENCODED;
> +	if (flags & SET_NOT_ALIGNED_FLAGS)
> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> +
> +	extent->fe_logical = logical;
> +	extent->fe_physical = phys;
> +	extent->fe_length = len;
> +	extent->fe_flags = flags;
> +
> +	fieinfo->fi_extents_mapped++;
> +
> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> +		return 1;
> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> +}
>  /**
>   * fiemap_fill_next_extent - Fiemap helper function
>   * @fieinfo:	Fiemap context passed into ->fiemap
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7a434979201c..28bb523d532a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
>  	fiemap_fill_cb	fi_cb;
>  };
>  
> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> +			      u64 phys, u64 len, u32 flags);
>  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
>  			    u64 phys, u64 len, u32 flags);
>  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> -- 
> 2.17.2
>

Carlos Maiolino Dec. 7, 2018, 9:09 a.m. UTC | #2

On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote:
> On Wed, Dec 05, 2018 at 10:17:27AM +0100, Carlos Maiolino wrote:
> > Enables the usage of FIEMAP ioctl infrastructure to handle FIBMAP calls.
> > From now on, ->bmap() methods can start to be removed from filesystems
> > which already provides ->fiemap().
> > 
> > This adds a new helper - bmap_fiemap() - which is used to fill in the
> > fiemap request, call into the underlying filesystem and check the flags
> > set in the extent requested.
> > 
> > Add a new fiemap fill extent callback to handl the in-kernel only
> > fiemap_extent structure used for FIBMAP.
> > 
> > V2:
> > 	- Now based on the updated fiemap_extent_info,
> > 	- move the fiemap call itself to a new helper
> > 
> > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
> > ---
> >  fs/inode.c         | 42 ++++++++++++++++++++++++++++++++++++++++--
> >  fs/ioctl.c         | 32 ++++++++++++++++++++++++++++++++
> >  include/linux/fs.h |  2 ++
> >  3 files changed, 74 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index db681d310465..f07cc183ddbd 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1578,6 +1578,40 @@ void iput(struct inode *inode)
> >  }
> >  EXPORT_SYMBOL(iput);
> >  
> > +static int bmap_fiemap(struct inode *inode, sector_t *block)
> > +{
> > +	struct fiemap_extent_info fieinfo = { 0, };
> > +	struct fiemap_extent fextent;
> > +	u64 start = *block << inode->i_blkbits;
> > +	int error = -EINVAL;
> > +
> > +	fextent.fe_logical = 0;
> > +	fextent.fe_physical = 0;
> > +	fieinfo.fi_extents_max = 1;
> > +	fieinfo.fi_extents_mapped = 0;
> > +	fieinfo.fi_extents_start = &fextent;
> > +	fieinfo.fi_start = start;
> > +	fieinfo.fi_len = 1 << inode->i_blkbits;
> > +	fieinfo.fi_flags = 0;
> > +	fieinfo.fi_cb = fiemap_fill_kernel_extent;
> > +
> > +	error = inode->i_op->fiemap(inode, &fieinfo);
> > +
> > +	if (error)
> > +		return error;
> > +
> > +	if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN |
> > +				FIEMAP_EXTENT_ENCODED |
> > +				FIEMAP_EXTENT_DATA_INLINE |
> > +				FIEMAP_EXTENT_UNWRITTEN))
> > +		return -EINVAL;
> 
> AFAICT, three of the filesystems that support COW writes (xfs, ocfs2,
> and btrfs) do not return bmap results for files with shared blocks.
> This check here should include FIEMAP_EXTENT_SHARED since external
> overwrites of a COW file block are bad news on btrfs (and ocfs2 and
> xfs).

Yes, it does need to check for FIEMAP_EXTENT_SHARED too, I had it on my plans
but I forgot to add it when setting up the flags. Thanks for reminding me.

> 
> > +
> > +	*block = (fextent.fe_physical +
> > +		  (start - fextent.fe_logical)) >> inode->i_blkbits;
> 
> Hmmm, so there's nothing here checking that the physical device fiemap
> reports is the same device that was passed into the mount.  This is
> trivially true for most of the filesystems that implement bmap and
> fiemap, but definitely not true for xfs or btrfs.  I would bet most
> userspace callers of bmap (since it's an ext2-era ioctl) make that
> assumption and don't even know how to find the device.
> 
> On xfs, the bmap implementation won't return any results for realtime
> files, but it looks as though we suddenly will start doing that here,
> because in the new bmap implementation we will use fiemap, and fiemap
> reports extents without providing any context about which device they're
> on, and that context-less extent gets passed back to bmap_fiemap.
> 
> In any case, I think a better solution to the multi-device problem is to
> start returning device information via struct fiemap_extent, at least
> inside the kernel.  Use one of the reserved fields to declare a new
> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> device number, and then you can check that against inode->i_sb->s_bdev
> to avoid returning results for the non-primary device of a multi-device
> filesystem.

Yes, you are right, I haven't thought about multi-dev filesystems. I checked
btrfs code and it doesn't even support fibmap, exactly because of this problem,
I wonder though, why it does support FIEMAP then, maybe because the fiemap idea
isn't provide a way to userspace do IO directly to the device?!

I'm not sure if crossing dev information is enough though, I did a quick read of
btrfs code, and the assumption that the block/extent location won't change on
time, could lead to a time bomb in the future. I wonder if it wouldn't maybe be
better to add a flag, to identify the usage type, and the filesystem itself
would define if it should return anything, or not. Like, for example, passing in
fieinfo->fi_flags, something like FIEMAP_FIBMAP, and check inside the filesystem
for it.

From my shallow understanding of btrfs, looks like the location of the data, can
be moved inside the same device, so, even if the devices are the same as you
suggested, there is no guarantee the offset will be the same.

Cheers.

> 
> > +
> > +	return error;
> > +}
> > +
> >  /**
> >   *	bmap	- find a block number in a file
> >   *	@inode:  inode owning the block number being requested
> > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> >   */
> >  int bmap(struct inode *inode, sector_t *block)
> >  {
> > -	if (!inode->i_mapping->a_ops->bmap)
> > +	if (inode->i_op->fiemap)
> > +		return bmap_fiemap(inode, block);
> > +	else if (inode->i_mapping->a_ops->bmap)
> > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > +						       *block);
> > +	else
> >  		return -EINVAL;
> 
> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> suddenly it will support this legacy interface they've never supported
> before.  Are they on board with this?
> 
> --D
> 
> >  
> > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL(bmap);
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 6086978fe01e..bfa59df332bf 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >  }
> >  
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > +			    u64 phys, u64 len, u32 flags)
> > +{
> > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > +
> > +	/* only count the extents */
> > +	if (fieinfo->fi_extents_max == 0) {
> > +		fieinfo->fi_extents_mapped++;
> > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > +	}
> > +
> > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > +		return 1;
> > +
> > +	if (flags & SET_UNKNOWN_FLAGS)
> > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > +		flags |= FIEMAP_EXTENT_ENCODED;
> > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > +
> > +	extent->fe_logical = logical;
> > +	extent->fe_physical = phys;
> > +	extent->fe_length = len;
> > +	extent->fe_flags = flags;
> > +
> > +	fieinfo->fi_extents_mapped++;
> > +
> > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > +		return 1;
> > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > +}
> >  /**
> >   * fiemap_fill_next_extent - Fiemap helper function
> >   * @fieinfo:	Fiemap context passed into ->fiemap
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 7a434979201c..28bb523d532a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> >  	fiemap_fill_cb	fi_cb;
> >  };
> >  
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > +			      u64 phys, u64 len, u32 flags);
> >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> >  			    u64 phys, u64 len, u32 flags);
> >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > -- 
> > 2.17.2
> >

Andreas Dilger Dec. 7, 2018, 8:14 p.m. UTC | #3

On Dec 7, 2018, at 2:09 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
> On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote:
>> 
>>> +
>>> +	*block = (fextent.fe_physical +
>>> +		  (start - fextent.fe_logical)) >> inode->i_blkbits;
>> 
>> Hmmm, so there's nothing here checking that the physical device fiemap
>> reports is the same device that was passed into the mount.  This is
>> trivially true for most of the filesystems that implement bmap and
>> fiemap, but definitely not true for xfs or btrfs.  I would bet most
>> userspace callers of bmap (since it's an ext2-era ioctl) make that
>> assumption and don't even know how to find the device.
>> 
>> On xfs, the bmap implementation won't return any results for realtime
>> files, but it looks as though we suddenly will start doing that here,
>> because in the new bmap implementation we will use fiemap, and fiemap
>> reports extents without providing any context about which device they're
>> on, and that context-less extent gets passed back to bmap_fiemap.
>> 
>> In any case, I think a better solution to the multi-device problem is to
>> start returning device information via struct fiemap_extent, at least
>> inside the kernel.  Use one of the reserved fields to declare a new
>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>> device number, and then you can check that against inode->i_sb->s_bdev
>> to avoid returning results for the non-primary device of a multi-device
>> filesystem.

We're using fe_device = fe_reserved[0] to return the device number for Lustre.
For Lustre, the "device number" is just a server index, since the server's
block device number is irrelevant on the client.  For local filesystems, it
should return the 32-bit st_rdev device number to distinguish devices.

I have patches for e2fsprogs filefrag to print the fe_device field.

> Yes, you are right, I haven't thought about multi-dev filesystems. I checked
> btrfs code and it doesn't even support fibmap, exactly because of this problem,
> I wonder though, why it does support FIEMAP then, maybe because the fiemap idea
> isn't provide a way to userspace do IO directly to the device?!
> 
> I'm not sure if crossing dev information is enough though, I did a quick read of
> btrfs code, and the assumption that the block/extent location won't change on
> time, could lead to a time bomb in the future. I wonder if it wouldn't maybe be
> better to add a flag, to identify the usage type, and the filesystem itself
> would define if it should return anything, or not. Like, for example, passing in
> fieinfo->fi_flags, something like FIEMAP_FIBMAP, and check inside the filesystem
> for it.

The FIEMAP_EXTENT_ENCODED flag is meant to be returned when the extent cannot be
read directly from the block device.

For FIBMAP, it should return an error if ENCODED is set, since this file is not
suitable for directly booting a kernel (LILO is the only user of FIBMAP that I'm
aware of).  The filefrag utility prefers to use FIEMAP for efficiency, and only
falls back to FIBMAP if FIEMAP fails.

> From my shallow understanding of btrfs, looks like the location of the data, can
> be moved inside the same device, so, even if the devices are the same as you
> suggested, there is no guarantee the offset will be the same.

On a related note, btrfs also supports compressed extents, which isn't handled
by the current FIEMAP ioctl properly.  There was a patch proposed ages ago to
add FIEMAP_EXTENT_DATA_COMPRESSED, but didn't _quite_ make it over the finish
line, https://www.spinics.net/lists/xfs/msg24629.html has the last discussion.

It added EXTENT_DATA_COMPRESSED and used fe_reserved64[0] as fe_phys_length to
return the on-disk extent size, while fe_length would rename to fe_logi_length
(with a compat macro) to still represented the logical extent length.

 #define FIEMAP_EXTENT_DATA_COMPRESSED	0x00000040 /* Data compressed by fs.
 						    * Sets EXTENT_DATA_ENCODED */
 -	__u64 fe_reserved64[2];
 +	__u64 fe_phys_length; /* physical length in bytes, undefined if
 +			       * DATA_COMPRESSED not set */

There was some discussion on whether there should be a second flag like
FIEMAP_EXTENT_PHYS_LENGTH that is set when the fe_phys_length field is
valid, independent of whether the data is compressed or not.

Since you are reworking the FIEMAP code anyway, would you be interested to
revive that patch series?

Cheers, Andreas

Christoph Hellwig Jan. 14, 2019, 4:56 p.m. UTC | #4

> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> +			    u64 phys, u64 len, u32 flags)

Any reason this function isn't in inode.c next to the caller and marked
static?

Otherwise looks fine except for the additional sanity checking pointed
out by Darrick.

> +	/* only count the extents */
> +	if (fieinfo->fi_extents_max == 0) {
> +		fieinfo->fi_extents_mapped++;
> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;

Maybe do a 'goto out' here?

> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;

And reuse this return.   Bonus points for using a good old
if here:

	if (flags & FIEMAP_EXTENT_LAST)
		return 1;
	return 0;

Carlos Maiolino Feb. 4, 2019, 3:11 p.m. UTC | #5

Hi, Sorry for the long delay Darrick.

> > +	fextent.fe_logical = 0;
> > +	fextent.fe_physical = 0;
> > +	fieinfo.fi_extents_max = 1;
> > +	fieinfo.fi_extents_mapped = 0;
> > +	fieinfo.fi_extents_start = &fextent;
> > +	fieinfo.fi_start = start;
> > +	fieinfo.fi_len = 1 << inode->i_blkbits;
> > +	fieinfo.fi_flags = 0;
> > +	fieinfo.fi_cb = fiemap_fill_kernel_extent;
> > +
> > +	error = inode->i_op->fiemap(inode, &fieinfo);
> > +
> > +	if (error)
> > +		return error;
> > +
> > +	if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN |
> > +				FIEMAP_EXTENT_ENCODED |
> > +				FIEMAP_EXTENT_DATA_INLINE |
> > +				FIEMAP_EXTENT_UNWRITTEN))
> > +		return -EINVAL;
> 
> AFAICT, three of the filesystems that support COW writes (xfs, ocfs2,
> and btrfs) do not return bmap results for files with shared blocks.
> This check here should include FIEMAP_EXTENT_SHARED since external
> overwrites of a COW file block are bad news on btrfs (and ocfs2 and
> xfs).

ok, np

> 
> > +
> > +	*block = (fextent.fe_physical +
> > +		  (start - fextent.fe_logical)) >> inode->i_blkbits;
> 
> Hmmm, so there's nothing here checking that the physical device fiemap
> reports is the same device that was passed into the mount.  This is
> trivially true for most of the filesystems that implement bmap and
> fiemap, but definitely not true for xfs or btrfs.  I would bet most
> userspace callers of bmap (since it's an ext2-era ioctl) make that
> assumption and don't even know how to find the device.

Makes sense.

> 
> On xfs, the bmap implementation won't return any results for realtime
> files, but it looks as though we suddenly will start doing that here,
> because in the new bmap implementation we will use fiemap, and fiemap
> reports extents without providing any context about which device they're
> on, and that context-less extent gets passed back to bmap_fiemap.
> 
> In any case, I think a better solution to the multi-device problem is to
> start returning device information via struct fiemap_extent, at least
> inside the kernel.  Use one of the reserved fields to declare a new
> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> device number, and then you can check that against inode->i_sb->s_bdev
> to avoid returning results for the non-primary device of a multi-device
> filesystem.

I agree we should address it here, but I don't think fiemap_extent is the right
place for it, it is linked to the UAPI, and changing it is usually not a good
idea.

I think I got your idea anyway, but, what if, instead returning the bdev in
fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
with such information?

> 
> > +
> > +	return error;
> > +}
> > +
> >  /**
> >   *	bmap	- find a block number in a file
> >   *	@inode:  inode owning the block number being requested
> > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> >   */
> >  int bmap(struct inode *inode, sector_t *block)
> >  {
> > -	if (!inode->i_mapping->a_ops->bmap)
> > +	if (inode->i_op->fiemap)
> > +		return bmap_fiemap(inode, block);
> > +	else if (inode->i_mapping->a_ops->bmap)
> > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > +						       *block);
> > +	else
> >  		return -EINVAL;
> 
> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> suddenly it will support this legacy interface they've never supported
> before.  Are they on board with this?
> 
> --D
> 
> >  
> > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL(bmap);
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 6086978fe01e..bfa59df332bf 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >  }
> >  
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > +			    u64 phys, u64 len, u32 flags)
> > +{
> > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > +
> > +	/* only count the extents */
> > +	if (fieinfo->fi_extents_max == 0) {
> > +		fieinfo->fi_extents_mapped++;
> > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > +	}
> > +
> > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > +		return 1;
> > +
> > +	if (flags & SET_UNKNOWN_FLAGS)
> > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > +		flags |= FIEMAP_EXTENT_ENCODED;
> > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > +
> > +	extent->fe_logical = logical;
> > +	extent->fe_physical = phys;
> > +	extent->fe_length = len;
> > +	extent->fe_flags = flags;
> > +
> > +	fieinfo->fi_extents_mapped++;
> > +
> > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > +		return 1;
> > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > +}
> >  /**
> >   * fiemap_fill_next_extent - Fiemap helper function
> >   * @fieinfo:	Fiemap context passed into ->fiemap
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 7a434979201c..28bb523d532a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> >  	fiemap_fill_cb	fi_cb;
> >  };
> >  
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > +			      u64 phys, u64 len, u32 flags);
> >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> >  			    u64 phys, u64 len, u32 flags);
> >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > -- 
> > 2.17.2
> >

Darrick J. Wong Feb. 4, 2019, 6:27 p.m. UTC | #6

On Mon, Feb 04, 2019 at 04:11:47PM +0100, Carlos Maiolino wrote:
> Hi, Sorry for the long delay Darrick.
> 
> > > +	fextent.fe_logical = 0;
> > > +	fextent.fe_physical = 0;
> > > +	fieinfo.fi_extents_max = 1;
> > > +	fieinfo.fi_extents_mapped = 0;
> > > +	fieinfo.fi_extents_start = &fextent;
> > > +	fieinfo.fi_start = start;
> > > +	fieinfo.fi_len = 1 << inode->i_blkbits;
> > > +	fieinfo.fi_flags = 0;
> > > +	fieinfo.fi_cb = fiemap_fill_kernel_extent;
> > > +
> > > +	error = inode->i_op->fiemap(inode, &fieinfo);
> > > +
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN |
> > > +				FIEMAP_EXTENT_ENCODED |
> > > +				FIEMAP_EXTENT_DATA_INLINE |
> > > +				FIEMAP_EXTENT_UNWRITTEN))
> > > +		return -EINVAL;
> > 
> > AFAICT, three of the filesystems that support COW writes (xfs, ocfs2,
> > and btrfs) do not return bmap results for files with shared blocks.
> > This check here should include FIEMAP_EXTENT_SHARED since external
> > overwrites of a COW file block are bad news on btrfs (and ocfs2 and
> > xfs).
> 
> ok, np
> 
> > 
> > > +
> > > +	*block = (fextent.fe_physical +
> > > +		  (start - fextent.fe_logical)) >> inode->i_blkbits;
> > 
> > Hmmm, so there's nothing here checking that the physical device fiemap
> > reports is the same device that was passed into the mount.  This is
> > trivially true for most of the filesystems that implement bmap and
> > fiemap, but definitely not true for xfs or btrfs.  I would bet most
> > userspace callers of bmap (since it's an ext2-era ioctl) make that
> > assumption and don't even know how to find the device.
> 
> Makes sense.
> 
> > 
> > On xfs, the bmap implementation won't return any results for realtime
> > files, but it looks as though we suddenly will start doing that here,
> > because in the new bmap implementation we will use fiemap, and fiemap
> > reports extents without providing any context about which device they're
> > on, and that context-less extent gets passed back to bmap_fiemap.
> > 
> > In any case, I think a better solution to the multi-device problem is to
> > start returning device information via struct fiemap_extent, at least
> > inside the kernel.  Use one of the reserved fields to declare a new
> > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > device number, and then you can check that against inode->i_sb->s_bdev
> > to avoid returning results for the non-primary device of a multi-device
> > filesystem.
> 
> I agree we should address it here, but I don't think fiemap_extent is the right
> place for it, it is linked to the UAPI, and changing it is usually not a good
> idea.

Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
into some sort of dev_t/per-device cookie should be fine.  Userspace
shouldn't be expecting any meaning in reserved areas.

> I think I got your idea anyway, but, what if, instead returning the bdev in
> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> with such information?

I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.

--D

> > 
> > > +
> > > +	return error;
> > > +}
> > > +
> > >  /**
> > >   *	bmap	- find a block number in a file
> > >   *	@inode:  inode owning the block number being requested
> > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > >   */
> > >  int bmap(struct inode *inode, sector_t *block)
> > >  {
> > > -	if (!inode->i_mapping->a_ops->bmap)
> > > +	if (inode->i_op->fiemap)
> > > +		return bmap_fiemap(inode, block);
> > > +	else if (inode->i_mapping->a_ops->bmap)
> > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > +						       *block);
> > > +	else
> > >  		return -EINVAL;
> > 
> > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > suddenly it will support this legacy interface they've never supported
> > before.  Are they on board with this?
> > 
> > --D
> > 
> > >  
> > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > >  	return 0;
> > >  }
> > >  EXPORT_SYMBOL(bmap);
> > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > index 6086978fe01e..bfa59df332bf 100644
> > > --- a/fs/ioctl.c
> > > +++ b/fs/ioctl.c
> > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > >  }
> > >  
> > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > +			    u64 phys, u64 len, u32 flags)
> > > +{
> > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > +
> > > +	/* only count the extents */
> > > +	if (fieinfo->fi_extents_max == 0) {
> > > +		fieinfo->fi_extents_mapped++;
> > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > +	}
> > > +
> > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > +		return 1;
> > > +
> > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > +
> > > +	extent->fe_logical = logical;
> > > +	extent->fe_physical = phys;
> > > +	extent->fe_length = len;
> > > +	extent->fe_flags = flags;
> > > +
> > > +	fieinfo->fi_extents_mapped++;
> > > +
> > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > +		return 1;
> > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > +}
> > >  /**
> > >   * fiemap_fill_next_extent - Fiemap helper function
> > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 7a434979201c..28bb523d532a 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > >  	fiemap_fill_cb	fi_cb;
> > >  };
> > >  
> > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > +			      u64 phys, u64 len, u32 flags);
> > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > >  			    u64 phys, u64 len, u32 flags);
> > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > -- 
> > > 2.17.2
> > > 
> 
> -- 
> Carlos

Carlos Maiolino Feb. 5, 2019, 9:56 a.m. UTC | #7

Hi Christoph.

On Mon, Jan 14, 2019 at 05:56:17PM +0100, Christoph Hellwig wrote:
> > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > +			    u64 phys, u64 len, u32 flags)
> 
> Any reason this function isn't in inode.c next to the caller and marked
> static?
> 

No reason other than to keep it close to its peer fiemap_fill_user_extent(), I
honestly do prefer to keep both together than in separated files. But, I'm up
to move it to fs/inode.c if required.

> Otherwise looks fine except for the additional sanity checking pointed
> out by Darrick.

Working on that.

> 
> > +	/* only count the extents */
> > +	if (fieinfo->fi_extents_max == 0) {
> > +		fieinfo->fi_extents_mapped++;
> > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> 
> Maybe do a 'goto out' here?
> 
> > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> 
> And reuse this return.   Bonus points for using a good old
> if here:
> 
> 	if (flags & FIEMAP_EXTENT_LAST)
> 		return 1;
> 	return 0;

Ok, will be in the new version, thanks for the review :)

Christoph Hellwig Feb. 5, 2019, 6:25 p.m. UTC | #8

On Tue, Feb 05, 2019 at 10:56:01AM +0100, Carlos Maiolino wrote:
> > Any reason this function isn't in inode.c next to the caller and marked
> > static?
> > 
> 
> No reason other than to keep it close to its peer fiemap_fill_user_extent(), I
> honestly do prefer to keep both together than in separated files. But, I'm up
> to move it to fs/inode.c if required.

After your series fiemap_fill_user_extent should be static and close
to it's caller, so with the kernel one in inode.c everything should
be neat and symmetric.

Carlos Maiolino Feb. 6, 2019, 9:50 a.m. UTC | #9

On Tue, Feb 05, 2019 at 07:25:18PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 05, 2019 at 10:56:01AM +0100, Carlos Maiolino wrote:
> > > Any reason this function isn't in inode.c next to the caller and marked
> > > static?
> > > 
> > 
> > No reason other than to keep it close to its peer fiemap_fill_user_extent(), I
> > honestly do prefer to keep both together than in separated files. But, I'm up
> > to move it to fs/inode.c if required.
> 
> After your series fiemap_fill_user_extent should be static and close
> to it's caller, so with the kernel one in inode.c everything should
> be neat and symmetric.

You are right, I didn't pay attention to that, thanks for the heads up, I'll fix
it on the next version

Carlos Maiolino Feb. 6, 2019, 1:37 p.m. UTC | #10

> > > In any case, I think a better solution to the multi-device problem is to
> > > start returning device information via struct fiemap_extent, at least
> > > inside the kernel.  Use one of the reserved fields to declare a new
> > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > device number, and then you can check that against inode->i_sb->s_bdev
> > > to avoid returning results for the non-primary device of a multi-device
> > > filesystem.
> > 
> > I agree we should address it here, but I don't think fiemap_extent is the right
> > place for it, it is linked to the UAPI, and changing it is usually not a good
> > idea.
> 
> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> into some sort of dev_t/per-device cookie should be fine.  Userspace
> shouldn't be expecting any meaning in reserved areas.
> 
> > I think I got your idea anyway, but, what if, instead returning the bdev in
> > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > with such information?
> 
> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.

Ok, may I ask why not?

My apologies if I am wrong, but, per my understanding, there is nothing today,
which tells userspace which device belongs the extent map reported by FIEMAP.
If it belongs to the RT device in XFS, or whatever disk in a raid in BTRFS, we
simply do not provide such information. So, the goal is to provide a way to tell
the filesystem if a FIEMAP or a FIBMAP has been requested, so the current
behavior of both ioctls won't change.

Enabling filesystems to return device information into fiemap_extent requires
modification of all filesystems to provide such information, which will not have
any use other than matching the mounted device to the device where the extent
is.

A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive than the
device id in fiemap_extent. I don't see much advantage in adding the device id
instead of using the flag.

A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
userspace, so, it would require a check to make sure it didn't come from
userspace if ioctl_fiemap() was used.

I think there are 2 other possibilities which can be used to fix this.

- Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
- If the device id is a must for you, maybe add the device id into
  fiemap_extent_info instead of fiemap_extent. So we don't mess with a UAPI
  exported data structure and still provides a way to the filesystems to provide
  which device the mapped extent is in.

What you think?

Cheers


> 
> --D
> 
> > > 
> > > > +
> > > > +	return error;
> > > > +}
> > > > +
> > > >  /**
> > > >   *	bmap	- find a block number in a file
> > > >   *	@inode:  inode owning the block number being requested
> > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > >   */
> > > >  int bmap(struct inode *inode, sector_t *block)
> > > >  {
> > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > +	if (inode->i_op->fiemap)
> > > > +		return bmap_fiemap(inode, block);
> > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > +						       *block);
> > > > +	else
> > > >  		return -EINVAL;
> > > 
> > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > suddenly it will support this legacy interface they've never supported
> > > before.  Are they on board with this?
> > > 
> > > --D
> > > 
> > > >  
> > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > >  	return 0;
> > > >  }
> > > >  EXPORT_SYMBOL(bmap);
> > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > index 6086978fe01e..bfa59df332bf 100644
> > > > --- a/fs/ioctl.c
> > > > +++ b/fs/ioctl.c
> > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > >  }
> > > >  
> > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > +			    u64 phys, u64 len, u32 flags)
> > > > +{
> > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > +
> > > > +	/* only count the extents */
> > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > +		fieinfo->fi_extents_mapped++;
> > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > +	}
> > > > +
> > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > +		return 1;
> > > > +
> > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > +
> > > > +	extent->fe_logical = logical;
> > > > +	extent->fe_physical = phys;
> > > > +	extent->fe_length = len;
> > > > +	extent->fe_flags = flags;
> > > > +
> > > > +	fieinfo->fi_extents_mapped++;
> > > > +
> > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > +		return 1;
> > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > +}
> > > >  /**
> > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index 7a434979201c..28bb523d532a 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > >  	fiemap_fill_cb	fi_cb;
> > > >  };
> > > >  
> > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > +			      u64 phys, u64 len, u32 flags);
> > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > >  			    u64 phys, u64 len, u32 flags);
> > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > -- 
> > > > 2.17.2
> > > > 
> > 
> > -- 
> > Carlos

Darrick J. Wong Feb. 6, 2019, 8:44 p.m. UTC | #11

On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > In any case, I think a better solution to the multi-device problem is to
> > > > start returning device information via struct fiemap_extent, at least
> > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > to avoid returning results for the non-primary device of a multi-device
> > > > filesystem.
> > > 
> > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > idea.
> > 
> > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > shouldn't be expecting any meaning in reserved areas.
> > 
> > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > with such information?
> > 
> > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> 
> Ok, may I ask why not?

I think it's a bad idea to add a flag to FIEMAP to change its behavior
to suit an older and even crappier legacy interface (i.e. FIBMAP).

FIBMAP is architecturally broken in that we can't /ever/ provide the
context of "which device does this map to?"

FIEMAP is architecturally deficient as well, but its ioctl structure
definition is flexible enough that we can report "which device does this
map to".

I want to enhance FIEMAP to deal with multi-device filesystems
correctly, and as much as I want to kill FIBMAP, I can't because of zipl
and *lilo.

> My apologies if I am wrong, but, per my understanding, there is
> nothing today, which tells userspace which device belongs the extent
> map reported by FIEMAP.

Right...

> If it belongs to the RT device in XFS, or whatever disk in a raid in
> BTRFS, we simply do not provide such information.

Right...

> So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> a FIBMAP has been requested, so the current behavior of both ioctls
> won't change.

...but from my point of view, the FIEMAP behavior *ought* to change to
be more expressive.  Once that's done, we can use the more expressive
FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.

The whole point of having fe_reserved* fields in struct fiemap_extent is
so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
start returning data in a reserved field.  New userspace programs that
know about the flag can start reading information from the new field if
they see the flag, and old userspace programs don't know about the flag
and won't be any worse off.

> Enabling filesystems to return device information into fiemap_extent
> requires modification of all filesystems to provide such information,
> which will not have any use other than matching the mounted device to
> the device where the extent is.

Perhaps it would help for me to present a more concrete proposal:

--- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
+++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
@@ -22,7 +22,19 @@ struct fiemap_extent {
 	__u64 fe_length;   /* length in bytes for this extent */
 	__u64 fe_reserved64[2];
 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
-	__u32 fe_reserved[3];
+
+	/*
+	 * Underlying device that this extent is stored on.
+	 *
+	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
+	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
+	 * set, this field is a 32-bit cookie that can be used to distinguish
+	 * between backing devices but has no intrinsic meaning.  If neither
+	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
+	 * EXTENT_DEV flags may be set at any time.
+	 */
+	__u32 fe_device;
+	__u32 fe_reserved[2];
 };
 
 struct fiemap {
@@ -66,5 +78,14 @@ struct fiemap {
 						    * merged for efficiency. */
 #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
 						    * files. */
+#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
+						    * structure containing the
+						    * major and minor numbers
+						    * of a block device. */
+#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
+						    * cookie that can be used
+						    * to distinguish physical
+						    * devices but otherwise
+						    * has no meaning. */
 
 #endif /* _LINUX_FIEMAP_H */

Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
encoding fe_device = new_encode_dev(xfs_get_device_for_file()).

Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
and encode the replica number in fe_device.

Existing filesystems can be left unchanged, in which case neither
EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
meaningless, the same as they are today.  Reporting fe_device is entirely
optional.

Userspace programs will now be able to tell which device the file data
lives on, which has been sort-of requested for years, if the filesystem
chooses to start exporting that information.

Your FIBMAP-via-FIEMAP backend can do something like:

/* FIBMAP only returns results for the same block device backing the fs. */
if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
	return 0;

/* Can't tell what is the backing device, bail out. */
if (fe->fe_flags & EXTENT_DEV_COOKIE)
	return 0;

/*
 * Either fe_device matches the backing device or the implementation
 * doesn't tell us about the backing device, so assume it's ok.
 */
<return FIBMAP results>

So that's how I'd solve a longstanding design problem of FIEMAP and then
take advantage of that solution to remedy my objections to the proposed
"Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
behavior flag that userspace knows about but isn't allowed to pass in.

> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> than the device id in fiemap_extent. I don't see much advantage in
> adding the device id instead of using the flag.
> 
> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> userspace, so, it would require a check to make sure it didn't come from
> userspace if ioctl_fiemap() was used.
> 
> I think there are 2 other possibilities which can be used to fix this.
> 
> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> - If the device id is a must for you, maybe add the device id into
>   fiemap_extent_info instead of fiemap_extent.

That won't work with btrfs, which can store file extents on multiple
different physical devices.

>   So we don't mess with a UAPI exported data structure and still
>   provides a way to the filesystems to provide which device the mapped
>   extent is in.
> 
> What you think?
> 
> Cheers
> 
> 
> > 
> > --D
> > 
> > > > 
> > > > > +
> > > > > +	return error;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   *	bmap	- find a block number in a file
> > > > >   *	@inode:  inode owning the block number being requested
> > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > >   */
> > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > >  {
> > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > +	if (inode->i_op->fiemap)
> > > > > +		return bmap_fiemap(inode, block);
> > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > +						       *block);
> > > > > +	else
> > > > >  		return -EINVAL;
> > > > 
> > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > suddenly it will support this legacy interface they've never supported
> > > > before.  Are they on board with this?
> > > > 
> > > > --D
> > > > 
> > > > >  
> > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > >  	return 0;
> > > > >  }
> > > > >  EXPORT_SYMBOL(bmap);
> > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > --- a/fs/ioctl.c
> > > > > +++ b/fs/ioctl.c
> > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > >  }
> > > > >  
> > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > +{
> > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > +
> > > > > +	/* only count the extents */
> > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > +		fieinfo->fi_extents_mapped++;
> > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > +	}
> > > > > +
> > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > +		return 1;
> > > > > +
> > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > +
> > > > > +	extent->fe_logical = logical;
> > > > > +	extent->fe_physical = phys;
> > > > > +	extent->fe_length = len;
> > > > > +	extent->fe_flags = flags;
> > > > > +
> > > > > +	fieinfo->fi_extents_mapped++;
> > > > > +
> > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > +		return 1;
> > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > +}
> > > > >  /**
> > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index 7a434979201c..28bb523d532a 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > >  	fiemap_fill_cb	fi_cb;
> > > > >  };
> > > > >  
> > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > +			      u64 phys, u64 len, u32 flags);
> > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > >  			    u64 phys, u64 len, u32 flags);
> > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > -- 
> > > > > 2.17.2
> > > > > 
> > > 
> > > -- 
> > > Carlos
> 
> -- 
> Carlos

Andreas Dilger Feb. 6, 2019, 9:04 p.m. UTC | #12

On Feb 6, 2019, at 6:37 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
>>> On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote:
>>>> In any case, I think a better solution to the multi-device problem is to
>>>> start returning device information via struct fiemap_extent, at least
>>>> inside the kernel.  Use one of the reserved fields to declare a new
>>>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>>>> device number, and then you can check that against inode->i_sb->s_bdev
>>>> to avoid returning results for the non-primary device of a multi-device
>>>> filesystem.
>>> 
>>> I agree we should address it here, but I don't think fiemap_extent is the right
>>> place for it, it is linked to the UAPI, and changing it is usually not a good
>>> idea.
>> 
>> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
>> into some sort of dev_t/per-device cookie should be fine.  Userspace
>> shouldn't be expecting any meaning in reserved areas.

We are already using the __u32 fiemap_extent::fe_reserved[0] as fe_device for
Lustre, to return the server index to userspace for filefrag with suitable
patches.  That is needed because a single file may be striped across multiple
servers, and could instead return the dev_t for local multi-device filesystems.

>>> I think I got your idea anyway, but, what if, instead returning the bdev in
>>> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
>>> idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
>>> with such information?
>> 
>> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> 
> Ok, may I ask why not?
> 
> My apologies if I am wrong, but, per my understanding, there is nothing today,
> which tells userspace which device belongs the extent map reported by FIEMAP.
> If it belongs to the RT device in XFS, or whatever disk in a raid in BTRFS, we
> simply do not provide such information. So, the goal is to provide a way to tell
> the filesystem if a FIEMAP or a FIBMAP has been requested, so the current
> behavior of both ioctls won't change.
> 
> Enabling filesystems to return device information into fiemap_extent requires
> modification of all filesystems to provide such information, which will not have
> any use other than matching the mounted device to the device where the extent
> is.

Filling in the fe_device field is not harmful for existing filesystems, since it
has virtually zero cost (not more than zeroing the field to avoid leaking kernel
data) and older userspace tools would just ignore it.  What would be better than
just filling in the fe_device field would be to also add:

    #define FIEMAP_EXTENT_DEVICE 0x2000

to indicate that fe_device contains a valid value.  That tells userspace that the
filesystem is filling in the field, and allows compatibility with older kernels
and allows incremental addition for filesystems that can handle this (XFS, BtrFS).

We haven't added the FIEMAP_EXTENT_DEVICE flag for Lustre, but it would make sense
to do so.

> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive than the
> device id in fiemap_extent. I don't see much advantage in adding the device id
> instead of using the flag.

We also have for Lustre:

    #define FIEMAP_FLAG_DEVICE_ORDER 0x40000000

which requests that the kernel FIEMAP return the extents for each block device
first rather than in file logical block order.  That avoids interleaving the
extents across all of the devices in e.g. 1MB chunks (think RAID-0) which would
force the maximum returned extent size to 1MB even though there are much larger
contiguous extents allocated on each device.  Instead, DEVICE_ORDER returns
all of the extents for device 0 first, then device 1 next, etc.  This shows if
the on-disk allocation is good or bad, and also fills in the fe_device field.

> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> userspace, so, it would require a check to make sure it didn't come from
> userspace if ioctl_fiemap() was used.

Are you talking about a FIEMAP_FLAG_FIBMAP flag, or about returning the fe_device
field?  I think that passing a flag like FIEMAP_FLAG_DEVICE_ORDER from userspace
is fine in this case, because it has a concrete meaning and is not just an internal
flag.

> I think there are 2 other possibilities which can be used to fix this.
> 
> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> - If the device id is a must for you, maybe add the device id into
>  fiemap_extent_info instead of fiemap_extent. So we don't mess with a UAPI
>  exported data structure and still provides a way to the filesystems to provide
>  which device the mapped extent is in.

No, that would mean all of the change, without making it more useful to userspace.

Also, if with only a device per fiemap_extent_info then it won't handle filesystems
that may allocate a single file on multiple devices, such as BtrFS and Lustre.

Cheers, Andreas

>>>> 
>>>>> +
>>>>> +	return error;
>>>>> +}
>>>>> +
>>>>> /**
>>>>>  *	bmap	- find a block number in a file
>>>>>  *	@inode:  inode owning the block number being requested
>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
>>>>>  */
>>>>> int bmap(struct inode *inode, sector_t *block)
>>>>> {
>>>>> -	if (!inode->i_mapping->a_ops->bmap)
>>>>> +	if (inode->i_op->fiemap)
>>>>> +		return bmap_fiemap(inode, block);
>>>>> +	else if (inode->i_mapping->a_ops->bmap)
>>>>> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
>>>>> +						       *block);
>>>>> +	else
>>>>> 		return -EINVAL;
>>>> 
>>>> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
>>>> suddenly it will support this legacy interface they've never supported
>>>> before.  Are they on board with this?
>>>> 
>>>> --D
>>>> 
>>>>> 
>>>>> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
>>>>> 	return 0;
>>>>> }
>>>>> EXPORT_SYMBOL(bmap);
>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
>>>>> index 6086978fe01e..bfa59df332bf 100644
>>>>> --- a/fs/ioctl.c
>>>>> +++ b/fs/ioctl.c
>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>> 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>> }
>>>>> 
>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>> +			    u64 phys, u64 len, u32 flags)
>>>>> +{
>>>>> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
>>>>> +
>>>>> +	/* only count the extents */
>>>>> +	if (fieinfo->fi_extents_max == 0) {
>>>>> +		fieinfo->fi_extents_mapped++;
>>>>> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>> +	}
>>>>> +
>>>>> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
>>>>> +		return 1;
>>>>> +
>>>>> +	if (flags & SET_UNKNOWN_FLAGS)
>>>>> +		flags |= FIEMAP_EXTENT_UNKNOWN;
>>>>> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
>>>>> +		flags |= FIEMAP_EXTENT_ENCODED;
>>>>> +	if (flags & SET_NOT_ALIGNED_FLAGS)
>>>>> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>>>> +
>>>>> +	extent->fe_logical = logical;
>>>>> +	extent->fe_physical = phys;
>>>>> +	extent->fe_length = len;
>>>>> +	extent->fe_flags = flags;
>>>>> +
>>>>> +	fieinfo->fi_extents_mapped++;
>>>>> +
>>>>> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
>>>>> +		return 1;
>>>>> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>> +}
>>>>> /**
>>>>>  * fiemap_fill_next_extent - Fiemap helper function
>>>>>  * @fieinfo:	Fiemap context passed into ->fiemap
>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>> index 7a434979201c..28bb523d532a 100644
>>>>> --- a/include/linux/fs.h
>>>>> +++ b/include/linux/fs.h
>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
>>>>> 	fiemap_fill_cb	fi_cb;
>>>>> };
>>>>> 
>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
>>>>> +			      u64 phys, u64 len, u32 flags);
>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
>>>>> 			    u64 phys, u64 len, u32 flags);
>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
>>>>> --
>>>>> 2.17.2
>>>>> 
>>> 
>>> --
>>> Carlos
> 
> --
> Carlos


Cheers, Andreas

Andreas Dilger Feb. 6, 2019, 9:13 p.m. UTC | #13

On Feb 6, 2019, at 1:44 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
>>>>> In any case, I think a better solution to the multi-device problem is to
>>>>> start returning device information via struct fiemap_extent, at least
>>>>> inside the kernel.  Use one of the reserved fields to declare a new
>>>>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
>>>>> device number, and then you can check that against inode->i_sb->s_bdev
>>>>> to avoid returning results for the non-primary device of a multi-device
>>>>> filesystem.
>>>> 
>>>> I agree we should address it here, but I don't think fiemap_extent is the right
>>>> place for it, it is linked to the UAPI, and changing it is usually not a good
>>>> idea.
>>> 
>>> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
>>> into some sort of dev_t/per-device cookie should be fine.  Userspace
>>> shouldn't be expecting any meaning in reserved areas.
>>> 
>>>> I think I got your idea anyway, but, what if, instead returning the bdev in
>>>> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
>>>> idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
>>>> with such information?
>>> 
>>> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
>> 
>> Ok, may I ask why not?
> 
> I think it's a bad idea to add a flag to FIEMAP to change its behavior
> to suit an older and even crappier legacy interface (i.e. FIBMAP).
> 
> FIBMAP is architecturally broken in that we can't /ever/ provide the
> context of "which device does this map to?"
> 
> FIEMAP is architecturally deficient as well, but its ioctl structure
> definition is flexible enough that we can report "which device does this
> map to".
> 
> I want to enhance FIEMAP to deal with multi-device filesystems
> correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> and *lilo.
> 
>> My apologies if I am wrong, but, per my understanding, there is
>> nothing today, which tells userspace which device belongs the extent
>> map reported by FIEMAP.
> 
> Right...
> 
>> If it belongs to the RT device in XFS, or whatever disk in a raid in
>> BTRFS, we simply do not provide such information.
> 
> Right...
> 
>> So, the goal is to provide a way to tell the filesystem if a FIEMAP or
>> a FIBMAP has been requested, so the current behavior of both ioctls
>> won't change.
> 
> ...but from my point of view, the FIEMAP behavior *ought* to change to
> be more expressive.  Once that's done, we can use the more expressive
> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> 
> The whole point of having fe_reserved* fields in struct fiemap_extent is
> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> start returning data in a reserved field.  New userspace programs that
> know about the flag can start reading information from the new field if
> they see the flag, and old userspace programs don't know about the flag
> and won't be any worse off.

Exactly correct.

>> Enabling filesystems to return device information into fiemap_extent
>> requires modification of all filesystems to provide such information,
>> which will not have any use other than matching the mounted device to
>> the device where the extent is.
> 
> Perhaps it would help for me to present a more concrete proposal:
> 
> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> @@ -22,7 +22,19 @@ struct fiemap_extent {
> 	__u64 fe_length;   /* length in bytes for this extent */
> 	__u64 fe_reserved64[2];
> 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> -	__u32 fe_reserved[3];
> +
> +	/*
> +	 * Underlying device that this extent is stored on.
> +	 *
> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> +	 * set, this field is a 32-bit cookie that can be used to distinguish
> +	 * between backing devices but has no intrinsic meaning.  If neither
> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> +	 * EXTENT_DEV flags may be set at any time.
> +	 */
> +	__u32 fe_device;
> +	__u32 fe_reserved[2];
> };
> 
> struct fiemap {
> @@ -66,5 +78,14 @@ struct fiemap {
> 						    * merged for efficiency. */
> #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> 						    * files. */
> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> +						    * structure containing the
> +						    * major and minor numbers
> +						    * of a block device. */
> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> +						    * cookie that can be used
> +						    * to distinguish physical
> +						    * devices but otherwise
> +						    * has no meaning. */
> 
> #endif /* _LINUX_FIEMAP_H */
> 
> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> encoding:
> 
>         fe_device = new_encode_dev(xfs_get_device_for_file());
> 
> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> and encode the replica number in fe_device.
> 
> Existing filesystems can be left unchanged, in which case neither
> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> meaningless, the same as they are today.  Reporting fe_device is entirely
> optional.

I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since it
allows userspace to distinguish between an actual dev_t a unique-but-
locally-meaninless identifier that is needed for network filesystems.

Cheers, Andreas

> Userspace programs will now be able to tell which device the file data
> lives on, which has been sort-of requested for years, if the filesystem
> chooses to start exporting that information.
> 
> Your FIBMAP-via-FIEMAP backend can do something like:
> 
>     /* FIBMAP only returns results for the same block device backing the fs. */
>     if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> 	return 0;
> 
>     /* Can't tell what is the backing device, bail out. */
>     if (fe->fe_flags & EXTENT_DEV_COOKIE)
> 	return 0;
> 
>     /*
>      * Either fe_device matches the backing device or the implementation
>      * doesn't tell us about the backing device, so assume it's ok.
>      */
>     <return FIBMAP results>
> 
> So that's how I'd solve a longstanding design problem of FIEMAP and then
> take advantage of that solution to remedy my objections to the proposed
> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> behavior flag that userspace knows about but isn't allowed to pass in.
> 
>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
>> than the device id in fiemap_extent. I don't see much advantage in
>> adding the device id instead of using the flag.
>> 
>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
>> userspace, so, it would require a check to make sure it didn't come from
>> userspace if ioctl_fiemap() was used.
>> 
>> I think there are 2 other possibilities which can be used to fix this.
>> 
>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
>> - If the device id is a must for you, maybe add the device id into
>>  fiemap_extent_info instead of fiemap_extent.
> 
> That won't work with btrfs, which can store file extents on multiple
> different physical devices.
> 
>>  So we don't mess with a UAPI exported data structure and still
>>  provides a way to the filesystems to provide which device the mapped
>>  extent is in.
>> 
>> What you think?
>> 
>> Cheers
>> 
>> 
>>> 
>>> --D
>>> 
>>>>> 
>>>>>> +
>>>>>> +	return error;
>>>>>> +}
>>>>>> +
>>>>>> /**
>>>>>>  *	bmap	- find a block number in a file
>>>>>>  *	@inode:  inode owning the block number being requested
>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
>>>>>>  */
>>>>>> int bmap(struct inode *inode, sector_t *block)
>>>>>> {
>>>>>> -	if (!inode->i_mapping->a_ops->bmap)
>>>>>> +	if (inode->i_op->fiemap)
>>>>>> +		return bmap_fiemap(inode, block);
>>>>>> +	else if (inode->i_mapping->a_ops->bmap)
>>>>>> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
>>>>>> +						       *block);
>>>>>> +	else
>>>>>> 		return -EINVAL;
>>>>> 
>>>>> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
>>>>> suddenly it will support this legacy interface they've never supported
>>>>> before.  Are they on board with this?
>>>>> 
>>>>> --D
>>>>> 
>>>>>> 
>>>>>> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
>>>>>> 	return 0;
>>>>>> }
>>>>>> EXPORT_SYMBOL(bmap);
>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
>>>>>> index 6086978fe01e..bfa59df332bf 100644
>>>>>> --- a/fs/ioctl.c
>>>>>> +++ b/fs/ioctl.c
>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>> 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>> }
>>>>>> 
>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>> +			    u64 phys, u64 len, u32 flags)
>>>>>> +{
>>>>>> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
>>>>>> +
>>>>>> +	/* only count the extents */
>>>>>> +	if (fieinfo->fi_extents_max == 0) {
>>>>>> +		fieinfo->fi_extents_mapped++;
>>>>>> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>> +	}
>>>>>> +
>>>>>> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
>>>>>> +		return 1;
>>>>>> +
>>>>>> +	if (flags & SET_UNKNOWN_FLAGS)
>>>>>> +		flags |= FIEMAP_EXTENT_UNKNOWN;
>>>>>> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
>>>>>> +		flags |= FIEMAP_EXTENT_ENCODED;
>>>>>> +	if (flags & SET_NOT_ALIGNED_FLAGS)
>>>>>> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>>>>> +
>>>>>> +	extent->fe_logical = logical;
>>>>>> +	extent->fe_physical = phys;
>>>>>> +	extent->fe_length = len;
>>>>>> +	extent->fe_flags = flags;
>>>>>> +
>>>>>> +	fieinfo->fi_extents_mapped++;
>>>>>> +
>>>>>> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
>>>>>> +		return 1;
>>>>>> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>> +}
>>>>>> /**
>>>>>>  * fiemap_fill_next_extent - Fiemap helper function
>>>>>>  * @fieinfo:	Fiemap context passed into ->fiemap
>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>> index 7a434979201c..28bb523d532a 100644
>>>>>> --- a/include/linux/fs.h
>>>>>> +++ b/include/linux/fs.h
>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
>>>>>> 	fiemap_fill_cb	fi_cb;
>>>>>> };
>>>>>> 
>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>> +			      u64 phys, u64 len, u32 flags);
>>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>> 			    u64 phys, u64 len, u32 flags);
>>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
>>>>>> --
>>>>>> 2.17.2
>>>>>> 
>>>> 
>>>> --
>>>> Carlos
>> 
>> --
>> Carlos


Cheers, Andreas

Carlos Maiolino Feb. 7, 2019, 9:52 a.m. UTC | #14

> >> If it belongs to the RT device in XFS, or whatever disk in a raid in
> >> BTRFS, we simply do not provide such information.
> > 
> > Right...
> > 
> >> So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> >> a FIBMAP has been requested, so the current behavior of both ioctls
> >> won't change.
> > 
> > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > be more expressive.  Once that's done, we can use the more expressive
> > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > 
> > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > start returning data in a reserved field.  New userspace programs that
> > know about the flag can start reading information from the new field if
> > they see the flag, and old userspace programs don't know about the flag
> > and won't be any worse off.
> 

Btw, I am not saying I don't like the idea, I like it. What I was trying to do
was to avoid touching UAPI in this patchset. But... I'll try to implement your
idea here, send it to the list and raise my shields.

Thanks for the help Andreas/Darrick.

> Exactly correct.
> 
> >> Enabling filesystems to return device information into fiemap_extent
> >> requires modification of all filesystems to provide such information,
> >> which will not have any use other than matching the mounted device to
> >> the device where the extent is.
> > 
> > Perhaps it would help for me to present a more concrete proposal:
> > 
> > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > @@ -22,7 +22,19 @@ struct fiemap_extent {
> > 	__u64 fe_length;   /* length in bytes for this extent */
> > 	__u64 fe_reserved64[2];
> > 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > -	__u32 fe_reserved[3];
> > +
> > +	/*
> > +	 * Underlying device that this extent is stored on.
> > +	 *
> > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > +	 * between backing devices but has no intrinsic meaning.  If neither
> > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > +	 * EXTENT_DEV flags may be set at any time.
> > +	 */
> > +	__u32 fe_device;
> > +	__u32 fe_reserved[2];
> > };
> > 
> > struct fiemap {
> > @@ -66,5 +78,14 @@ struct fiemap {
> > 						    * merged for efficiency. */
> > #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> > 						    * files. */
> > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > +						    * structure containing the
> > +						    * major and minor numbers
> > +						    * of a block device. */
> > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > +						    * cookie that can be used
> > +						    * to distinguish physical
> > +						    * devices but otherwise
> > +						    * has no meaning. */
> > 
> > #endif /* _LINUX_FIEMAP_H */
> > 
> > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > encoding:
> > 
> >         fe_device = new_encode_dev(xfs_get_device_for_file());
> > 
> > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > and encode the replica number in fe_device.
> > 
> > Existing filesystems can be left unchanged, in which case neither
> > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > meaningless, the same as they are today.  Reporting fe_device is entirely
> > optional.
> 
> I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since it
> allows userspace to distinguish between an actual dev_t a unique-but-
> locally-meaninless identifier that is needed for network filesystems.
> 
> Cheers, Andreas
> 
> > Userspace programs will now be able to tell which device the file data
> > lives on, which has been sort-of requested for years, if the filesystem
> > chooses to start exporting that information.
> > 
> > Your FIBMAP-via-FIEMAP backend can do something like:
> > 
> >     /* FIBMAP only returns results for the same block device backing the fs. */
> >     if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > 	return 0;
> > 
> >     /* Can't tell what is the backing device, bail out. */
> >     if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > 	return 0;
> > 
> >     /*
> >      * Either fe_device matches the backing device or the implementation
> >      * doesn't tell us about the backing device, so assume it's ok.
> >      */
> >     <return FIBMAP results>
> > 
> > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > take advantage of that solution to remedy my objections to the proposed
> > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > behavior flag that userspace knows about but isn't allowed to pass in.
> > 
> >> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> >> than the device id in fiemap_extent. I don't see much advantage in
> >> adding the device id instead of using the flag.
> >> 
> >> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> >> userspace, so, it would require a check to make sure it didn't come from
> >> userspace if ioctl_fiemap() was used.
> >> 
> >> I think there are 2 other possibilities which can be used to fix this.
> >> 
> >> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> >> - If the device id is a must for you, maybe add the device id into
> >>  fiemap_extent_info instead of fiemap_extent.
> > 
> > That won't work with btrfs, which can store file extents on multiple
> > different physical devices.
> > 
> >>  So we don't mess with a UAPI exported data structure and still
> >>  provides a way to the filesystems to provide which device the mapped
> >>  extent is in.
> >> 
> >> What you think?
> >> 
> >> Cheers
> >> 
> >> 
> >>> 
> >>> --D
> >>> 
> >>>>> 
> >>>>>> +
> >>>>>> +	return error;
> >>>>>> +}
> >>>>>> +
> >>>>>> /**
> >>>>>>  *	bmap	- find a block number in a file
> >>>>>>  *	@inode:  inode owning the block number being requested
> >>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> >>>>>>  */
> >>>>>> int bmap(struct inode *inode, sector_t *block)
> >>>>>> {
> >>>>>> -	if (!inode->i_mapping->a_ops->bmap)
> >>>>>> +	if (inode->i_op->fiemap)
> >>>>>> +		return bmap_fiemap(inode, block);
> >>>>>> +	else if (inode->i_mapping->a_ops->bmap)
> >>>>>> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> >>>>>> +						       *block);
> >>>>>> +	else
> >>>>>> 		return -EINVAL;
> >>>>> 
> >>>>> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> >>>>> suddenly it will support this legacy interface they've never supported
> >>>>> before.  Are they on board with this?
> >>>>> 
> >>>>> --D
> >>>>> 
> >>>>>> 
> >>>>>> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> >>>>>> 	return 0;
> >>>>>> }
> >>>>>> EXPORT_SYMBOL(bmap);
> >>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
> >>>>>> index 6086978fe01e..bfa59df332bf 100644
> >>>>>> --- a/fs/ioctl.c
> >>>>>> +++ b/fs/ioctl.c
> >>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >>>>>> 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>> }
> >>>>>> 
> >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >>>>>> +			    u64 phys, u64 len, u32 flags)
> >>>>>> +{
> >>>>>> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> >>>>>> +
> >>>>>> +	/* only count the extents */
> >>>>>> +	if (fieinfo->fi_extents_max == 0) {
> >>>>>> +		fieinfo->fi_extents_mapped++;
> >>>>>> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> >>>>>> +		return 1;
> >>>>>> +
> >>>>>> +	if (flags & SET_UNKNOWN_FLAGS)
> >>>>>> +		flags |= FIEMAP_EXTENT_UNKNOWN;
> >>>>>> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> >>>>>> +		flags |= FIEMAP_EXTENT_ENCODED;
> >>>>>> +	if (flags & SET_NOT_ALIGNED_FLAGS)
> >>>>>> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> >>>>>> +
> >>>>>> +	extent->fe_logical = logical;
> >>>>>> +	extent->fe_physical = phys;
> >>>>>> +	extent->fe_length = len;
> >>>>>> +	extent->fe_flags = flags;
> >>>>>> +
> >>>>>> +	fieinfo->fi_extents_mapped++;
> >>>>>> +
> >>>>>> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> >>>>>> +		return 1;
> >>>>>> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>> +}
> >>>>>> /**
> >>>>>>  * fiemap_fill_next_extent - Fiemap helper function
> >>>>>>  * @fieinfo:	Fiemap context passed into ->fiemap
> >>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>>>>> index 7a434979201c..28bb523d532a 100644
> >>>>>> --- a/include/linux/fs.h
> >>>>>> +++ b/include/linux/fs.h
> >>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> >>>>>> 	fiemap_fill_cb	fi_cb;
> >>>>>> };
> >>>>>> 
> >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> >>>>>> +			      u64 phys, u64 len, u32 flags);
> >>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> >>>>>> 			    u64 phys, u64 len, u32 flags);
> >>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> >>>>>> --
> >>>>>> 2.17.2
> >>>>>> 
> >>>> 
> >>>> --
> >>>> Carlos
> >> 
> >> --
> >> Carlos
> 
> 
> Cheers, Andreas
> 
> 
> 
> 
>

Carlos Maiolino Feb. 7, 2019, 11:59 a.m. UTC | #15

On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > start returning device information via struct fiemap_extent, at least
> > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > filesystem.
> > > > 
> > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > idea.
> > > 
> > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > shouldn't be expecting any meaning in reserved areas.
> > > 
> > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > with such information?
> > > 
> > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > 
> > Ok, may I ask why not?
> 
> I think it's a bad idea to add a flag to FIEMAP to change its behavior
> to suit an older and even crappier legacy interface (i.e. FIBMAP).
> 
> FIBMAP is architecturally broken in that we can't /ever/ provide the
> context of "which device does this map to?"
> 
> FIEMAP is architecturally deficient as well, but its ioctl structure
> definition is flexible enough that we can report "which device does this
> map to".
> 
> I want to enhance FIEMAP to deal with multi-device filesystems
> correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> and *lilo.
> 
> > My apologies if I am wrong, but, per my understanding, there is
> > nothing today, which tells userspace which device belongs the extent
> > map reported by FIEMAP.
> 
> Right...
> 
> > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > BTRFS, we simply do not provide such information.
> 
> Right...
> 
> > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > a FIBMAP has been requested, so the current behavior of both ioctls
> > won't change.
> 
> ...but from my point of view, the FIEMAP behavior *ought* to change to
> be more expressive.  Once that's done, we can use the more expressive
> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> 
> The whole point of having fe_reserved* fields in struct fiemap_extent is
> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> start returning data in a reserved field.  New userspace programs that
> know about the flag can start reading information from the new field if
> they see the flag, and old userspace programs don't know about the flag
> and won't be any worse off.
> 
> > Enabling filesystems to return device information into fiemap_extent
> > requires modification of all filesystems to provide such information,
> > which will not have any use other than matching the mounted device to
> > the device where the extent is.
> 
> Perhaps it would help for me to present a more concrete proposal:
> 
> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> @@ -22,7 +22,19 @@ struct fiemap_extent {
>  	__u64 fe_length;   /* length in bytes for this extent */
>  	__u64 fe_reserved64[2];
>  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> -	__u32 fe_reserved[3];
> +
> +	/*
> +	 * Underlying device that this extent is stored on.
> +	 *
> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> +	 * set, this field is a 32-bit cookie that can be used to distinguish
> +	 * between backing devices but has no intrinsic meaning.  If neither
> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> +	 * EXTENT_DEV flags may be set at any time.
> +	 */
> +	__u32 fe_device;
> +	__u32 fe_reserved[2];
>  };
>  
>  struct fiemap {
> @@ -66,5 +78,14 @@ struct fiemap {
>  						    * merged for efficiency. */
>  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
>  						    * files. */
> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> +						    * structure containing the
> +						    * major and minor numbers
> +						    * of a block device. */
> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> +						    * cookie that can be used
> +						    * to distinguish physical
> +						    * devices but otherwise
> +						    * has no meaning. */
>  
>  #endif /* _LINUX_FIEMAP_H */
> 
> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> 
> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> and encode the replica number in fe_device.
> 

All of this makes sense, but I'm struggling to understand what you mean by
replica number here, and why it justify a second flag.

> Existing filesystems can be left unchanged, in which case neither
> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> meaningless, the same as they are today.  Reporting fe_device is entirely
> optional.
> 
> Userspace programs will now be able to tell which device the file data
> lives on, which has been sort-of requested for years, if the filesystem
> chooses to start exporting that information.
> 
> Your FIBMAP-via-FIEMAP backend can do something like:
> 
> /* FIBMAP only returns results for the same block device backing the fs. */
> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> 	return 0;
> 
> /* Can't tell what is the backing device, bail out. */
> if (fe->fe_flags & EXTENT_DEV_COOKIE)
> 	return 0;
> 

Ok, the first conditional, is ok, the second one is not making sense to me.
Looks like you are basically using it to flag the filesystem can't tell
exactly which device the current extent is, let's say for example, distributed
filesystems, where the physical extent can actually be on a different machine.
But I can't say for sure, can you give me more details about what you are trying
to achieve here?



> /*
>  * Either fe_device matches the backing device or the implementation
>  * doesn't tell us about the backing device, so assume it's ok.
>  */
> <return FIBMAP results>
>

This actually looks to contradict what you have been complaining, about some
filesystems which doesn't support FIBMAP currently, will now suddenly start to
support. Assuming it's ok if the implementation doesn't tell us about the
backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the
backing device, assuming it's ok will just fall into your first complain.

Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
you mentioned.

> So that's how I'd solve a longstanding design problem of FIEMAP and then
> take advantage of that solution to remedy my objections to the proposed
> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> behavior flag that userspace knows about but isn't allowed to pass in.
>

> > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > than the device id in fiemap_extent. I don't see much advantage in
> > adding the device id instead of using the flag.
> > 
> > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > userspace, so, it would require a check to make sure it didn't come from
> > userspace if ioctl_fiemap() was used.
> > 
> > I think there are 2 other possibilities which can be used to fix this.
> > 
> > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > - If the device id is a must for you, maybe add the device id into
> >   fiemap_extent_info instead of fiemap_extent.
> 
> That won't work with btrfs, which can store file extents on multiple
> different physical devices.
> 
> >   So we don't mess with a UAPI exported data structure and still
> >   provides a way to the filesystems to provide which device the mapped
> >   extent is in.
> > 
> > What you think?
> > 
> > Cheers
> > 
> > 
> > > 
> > > --D
> > > 
> > > > > 
> > > > > > +
> > > > > > +	return error;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   *	bmap	- find a block number in a file
> > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > >   */
> > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > >  {
> > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > +	if (inode->i_op->fiemap)
> > > > > > +		return bmap_fiemap(inode, block);
> > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > +						       *block);
> > > > > > +	else
> > > > > >  		return -EINVAL;
> > > > > 
> > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > suddenly it will support this legacy interface they've never supported
> > > > > before.  Are they on board with this?
> > > > > 
> > > > > --D
> > > > > 
> > > > > >  
> > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > --- a/fs/ioctl.c
> > > > > > +++ b/fs/ioctl.c
> > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > >  }
> > > > > >  
> > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > +{
> > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > +
> > > > > > +	/* only count the extents */
> > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > +		return 1;
> > > > > > +
> > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > +
> > > > > > +	extent->fe_logical = logical;
> > > > > > +	extent->fe_physical = phys;
> > > > > > +	extent->fe_length = len;
> > > > > > +	extent->fe_flags = flags;
> > > > > > +
> > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > +
> > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > +		return 1;
> > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > +}
> > > > > >  /**
> > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > >  };
> > > > > >  
> > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > -- 
> > > > > > 2.17.2
> > > > > > 
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

Carlos Maiolino Feb. 7, 2019, 12:36 p.m. UTC | #16

Apologies, I forgot to mention another thing..

On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > start returning device information via struct fiemap_extent, at least
> > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > filesystem.
> > > > 
> > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > idea.
> > > 
> > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > shouldn't be expecting any meaning in reserved areas.
> > > 
> > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > with such information?
> > > 
> > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > 
> > Ok, may I ask why not?
> 
> I think it's a bad idea to add a flag to FIEMAP to change its behavior
> to suit an older and even crappier legacy interface (i.e. FIBMAP).
> 
> FIBMAP is architecturally broken in that we can't /ever/ provide the
> context of "which device does this map to?"
> 
> FIEMAP is architecturally deficient as well, but its ioctl structure
> definition is flexible enough that we can report "which device does this
> map to".
> 
> I want to enhance FIEMAP to deal with multi-device filesystems
> correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> and *lilo.
> 
> > My apologies if I am wrong, but, per my understanding, there is
> > nothing today, which tells userspace which device belongs the extent
> > map reported by FIEMAP.
> 
> Right...
> 
> > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > BTRFS, we simply do not provide such information.
> 
> Right...
> 
> > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > a FIBMAP has been requested, so the current behavior of both ioctls
> > won't change.
> 
> ...but from my point of view, the FIEMAP behavior *ought* to change to
> be more expressive.  Once that's done, we can use the more expressive
> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> 
> The whole point of having fe_reserved* fields in struct fiemap_extent is
> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> start returning data in a reserved field.  New userspace programs that
> know about the flag can start reading information from the new field if
> they see the flag, and old userspace programs don't know about the flag
> and won't be any worse off.
> 
> > Enabling filesystems to return device information into fiemap_extent
> > requires modification of all filesystems to provide such information,
> > which will not have any use other than matching the mounted device to
> > the device where the extent is.
> 
> Perhaps it would help for me to present a more concrete proposal:
> 
> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> @@ -22,7 +22,19 @@ struct fiemap_extent {
>  	__u64 fe_length;   /* length in bytes for this extent */
>  	__u64 fe_reserved64[2];
>  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> -	__u32 fe_reserved[3];
> +
> +	/*
> +	 * Underlying device that this extent is stored on.
> +	 *
> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> +	 * set, this field is a 32-bit cookie that can be used to distinguish
> +	 * between backing devices but has no intrinsic meaning.  If neither
> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> +	 * EXTENT_DEV flags may be set at any time.
> +	 */
> +	__u32 fe_device;
> +	__u32 fe_reserved[2];
>  };
>  
>  struct fiemap {
> @@ -66,5 +78,14 @@ struct fiemap {
>  						    * merged for efficiency. */
>  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
>  						    * files. */
> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> +						    * structure containing the
> +						    * major and minor numbers
> +						    * of a block device. */
> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> +						    * cookie that can be used
> +						    * to distinguish physical
> +						    * devices but otherwise
> +						    * has no meaning. */
>  
>  #endif /* _LINUX_FIEMAP_H */
> 
> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> encoding fe_device = new_encode_dev(xfs_get_device_for_file()).

Here, I believe you are forgetting that filesystems do not touch fiemap_extent
directly. We call fiemap_fell_next_extent() helper to fill each extent found by
fiemap. So, in either way, we'd need to modify fiemap_fill_next_extent() and the
callbacks being used to accommodate this new field or create a new helper to
modify the device which doesn't sound reasonable. So, either way, we will end up
needing to modify all filesystems.

So, although I really like the idea of improving the FIEMAP interface, I'm
starting to consider another patchset for it. I think it requires an interface
change big enough to fit in this patchset, which actually has a different
purpose. Or, maybe, address this at the end of this patchset, leaving different
interface changes in different patchsets, instead of making many changes all at
once, mixed together.

> 
> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> and encode the replica number in fe_device.
> 
> Existing filesystems can be left unchanged, in which case neither
> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> meaningless, the same as they are today.  Reporting fe_device is entirely
> optional.
> 
> Userspace programs will now be able to tell which device the file data
> lives on, which has been sort-of requested for years, if the filesystem
> chooses to start exporting that information.
> 
> Your FIBMAP-via-FIEMAP backend can do something like:
> 
> /* FIBMAP only returns results for the same block device backing the fs. */
> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> 	return 0;
> 
> /* Can't tell what is the backing device, bail out. */
> if (fe->fe_flags & EXTENT_DEV_COOKIE)
> 	return 0;
> 
> /*
>  * Either fe_device matches the backing device or the implementation
>  * doesn't tell us about the backing device, so assume it's ok.
>  */
> <return FIBMAP results>
> 
> So that's how I'd solve a longstanding design problem of FIEMAP and then
> take advantage of that solution to remedy my objections to the proposed
> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> behavior flag that userspace knows about but isn't allowed to pass in.
> 
> > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > than the device id in fiemap_extent. I don't see much advantage in
> > adding the device id instead of using the flag.
> > 
> > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > userspace, so, it would require a check to make sure it didn't come from
> > userspace if ioctl_fiemap() was used.
> > 
> > I think there are 2 other possibilities which can be used to fix this.
> > 
> > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > - If the device id is a must for you, maybe add the device id into
> >   fiemap_extent_info instead of fiemap_extent.
> 
> That won't work with btrfs, which can store file extents on multiple
> different physical devices.
> 
> >   So we don't mess with a UAPI exported data structure and still
> >   provides a way to the filesystems to provide which device the mapped
> >   extent is in.
> > 
> > What you think?
> > 
> > Cheers
> > 
> > 
> > > 
> > > --D
> > > 
> > > > > 
> > > > > > +
> > > > > > +	return error;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   *	bmap	- find a block number in a file
> > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > >   */
> > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > >  {
> > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > +	if (inode->i_op->fiemap)
> > > > > > +		return bmap_fiemap(inode, block);
> > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > +						       *block);
> > > > > > +	else
> > > > > >  		return -EINVAL;
> > > > > 
> > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > suddenly it will support this legacy interface they've never supported
> > > > > before.  Are they on board with this?
> > > > > 
> > > > > --D
> > > > > 
> > > > > >  
> > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > --- a/fs/ioctl.c
> > > > > > +++ b/fs/ioctl.c
> > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > >  }
> > > > > >  
> > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > +{
> > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > +
> > > > > > +	/* only count the extents */
> > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > +		return 1;
> > > > > > +
> > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > +
> > > > > > +	extent->fe_logical = logical;
> > > > > > +	extent->fe_physical = phys;
> > > > > > +	extent->fe_length = len;
> > > > > > +	extent->fe_flags = flags;
> > > > > > +
> > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > +
> > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > +		return 1;
> > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > +}
> > > > > >  /**
> > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > >  };
> > > > > >  
> > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > -- 
> > > > > > 2.17.2
> > > > > > 
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

Darrick J. Wong Feb. 7, 2019, 5:02 p.m. UTC | #17

On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote:
> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > filesystem.
> > > > > 
> > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > idea.
> > > > 
> > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > shouldn't be expecting any meaning in reserved areas.
> > > > 
> > > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > with such information?
> > > > 
> > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > 
> > > Ok, may I ask why not?
> > 
> > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > 
> > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > context of "which device does this map to?"
> > 
> > FIEMAP is architecturally deficient as well, but its ioctl structure
> > definition is flexible enough that we can report "which device does this
> > map to".
> > 
> > I want to enhance FIEMAP to deal with multi-device filesystems
> > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > and *lilo.
> > 
> > > My apologies if I am wrong, but, per my understanding, there is
> > > nothing today, which tells userspace which device belongs the extent
> > > map reported by FIEMAP.
> > 
> > Right...
> > 
> > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > BTRFS, we simply do not provide such information.
> > 
> > Right...
> > 
> > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > won't change.
> > 
> > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > be more expressive.  Once that's done, we can use the more expressive
> > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > 
> > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > start returning data in a reserved field.  New userspace programs that
> > know about the flag can start reading information from the new field if
> > they see the flag, and old userspace programs don't know about the flag
> > and won't be any worse off.
> > 
> > > Enabling filesystems to return device information into fiemap_extent
> > > requires modification of all filesystems to provide such information,
> > > which will not have any use other than matching the mounted device to
> > > the device where the extent is.
> > 
> > Perhaps it would help for me to present a more concrete proposal:
> > 
> > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > @@ -22,7 +22,19 @@ struct fiemap_extent {
> >  	__u64 fe_length;   /* length in bytes for this extent */
> >  	__u64 fe_reserved64[2];
> >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > -	__u32 fe_reserved[3];
> > +
> > +	/*
> > +	 * Underlying device that this extent is stored on.
> > +	 *
> > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > +	 * between backing devices but has no intrinsic meaning.  If neither
> > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > +	 * EXTENT_DEV flags may be set at any time.
> > +	 */
> > +	__u32 fe_device;
> > +	__u32 fe_reserved[2];
> >  };
> >  
> >  struct fiemap {
> > @@ -66,5 +78,14 @@ struct fiemap {
> >  						    * merged for efficiency. */
> >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> >  						    * files. */
> > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > +						    * structure containing the
> > +						    * major and minor numbers
> > +						    * of a block device. */
> > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > +						    * cookie that can be used
> > +						    * to distinguish physical
> > +						    * devices but otherwise
> > +						    * has no meaning. */
> >  
> >  #endif /* _LINUX_FIEMAP_H */
> > 
> > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> > 
> > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > and encode the replica number in fe_device.
> > 
> 
> All of this makes sense, but I'm struggling to understand what you mean by
> replica number here, and why it justify a second flag.

I left in the "device cookie" thing in the proposal to accomodate a
request from the Lustre folks to be able to report which replica is
storing a particular extent map.  Apparently the replica id is simply a
32-bit number that isn't inherently useful, hence the vagueness around
what "cookie" really means...

...oh, right, lustre fell out of drivers/staging/.  You could probably
leave it out then.

> > Existing filesystems can be left unchanged, in which case neither
> > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > meaningless, the same as they are today.  Reporting fe_device is entirely
> > optional.
> > 
> > Userspace programs will now be able to tell which device the file data
> > lives on, which has been sort-of requested for years, if the filesystem
> > chooses to start exporting that information.
> > 
> > Your FIBMAP-via-FIEMAP backend can do something like:
> > 
> > /* FIBMAP only returns results for the same block device backing the fs. */
> > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > 	return 0;
> > 
> > /* Can't tell what is the backing device, bail out. */
> > if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > 	return 0;
> > 
> 
> Ok, the first conditional, is ok, the second one is not making sense to me.
> Looks like you are basically using it to flag the filesystem can't tell
> exactly which device the current extent is, let's say for example, distributed
> filesystems, where the physical extent can actually be on a different machine.
> But I can't say for sure, can you give me more details about what you are trying
> to achieve here?

You've understood me correctly. :)

> > /*
> >  * Either fe_device matches the backing device or the implementation
> >  * doesn't tell us about the backing device, so assume it's ok.
> >  */
> > <return FIBMAP results>
> >
> 
> This actually looks to contradict what you have been complaining, about some
> filesystems which doesn't support FIBMAP currently, will now suddenly start to
> support. Assuming it's ok if the implementation doesn't tell us about the
> backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the
> backing device, assuming it's ok will just fall into your first complain.

Sorry, this thread has been going on so long that I forgot your goal for
this series. :/

Specifically, I had forgotten that you're removing the ->bmap pointer,
which means that filesystems don't have any particular way to signal
"Yes on FIEMAP, no on FIBMAP".  Somehow I had thought that you were
merely creating a generic_file_bmap() that would call FIEMAP and ripping
out all the adhoc bmap implementations.

Hmm, how many filesystems support FIEMAP and not FIBMAP?

btrfs, nilfs2, and overlayfs.  Also bad_inode.c...?

Hmm, how many filesystems support FIBMAP and not FIEMAP?

adfs, affs, befs, bfs, efs, exofs, fat, freevxfs, fuse(?), nfs, hfsplus,
isofs, jfs, minixfs, ntfs, qnx[46], reiserfs, sysv, udf, and ufs.

> Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
> you mentioned.

I think you've understood it about as well as I can explain it.  Maybe
Andreas will have more to say about the lustre replica id, but OTOH it's
gone and so there's no user of it, so we could just drop it until lustre
comes back.

> > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > take advantage of that solution to remedy my objections to the proposed
> > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > behavior flag that userspace knows about but isn't allowed to pass in.
> >
> 
> > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > > than the device id in fiemap_extent. I don't see much advantage in
> > > adding the device id instead of using the flag.
> > > 
> > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > > userspace, so, it would require a check to make sure it didn't come from
> > > userspace if ioctl_fiemap() was used.
> > > 
> > > I think there are 2 other possibilities which can be used to fix this.
> > > 
> > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > > - If the device id is a must for you, maybe add the device id into
> > >   fiemap_extent_info instead of fiemap_extent.
> > 
> > That won't work with btrfs, which can store file extents on multiple
> > different physical devices.
> > 
> > >   So we don't mess with a UAPI exported data structure and still
> > >   provides a way to the filesystems to provide which device the mapped
> > >   extent is in.
> > > 
> > > What you think?
> > > 
> > > Cheers
> > > 
> > > 
> > > > 
> > > > --D
> > > > 
> > > > > > 
> > > > > > > +
> > > > > > > +	return error;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /**
> > > > > > >   *	bmap	- find a block number in a file
> > > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > > >   */
> > > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > > >  {
> > > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > > +	if (inode->i_op->fiemap)
> > > > > > > +		return bmap_fiemap(inode, block);
> > > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > > +						       *block);
> > > > > > > +	else
> > > > > > >  		return -EINVAL;
> > > > > > 
> > > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > > suddenly it will support this legacy interface they've never supported
> > > > > > before.  Are they on board with this?
> > > > > > 
> > > > > > --D
> > > > > > 
> > > > > > >  
> > > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > > >  	return 0;
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > > --- a/fs/ioctl.c
> > > > > > > +++ b/fs/ioctl.c
> > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > >  }
> > > > > > >  
> > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > > +{
> > > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > > +
> > > > > > > +	/* only count the extents */
> > > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > > +		return 1;
> > > > > > > +
> > > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > > +
> > > > > > > +	extent->fe_logical = logical;
> > > > > > > +	extent->fe_physical = phys;
> > > > > > > +	extent->fe_length = len;
> > > > > > > +	extent->fe_flags = flags;
> > > > > > > +
> > > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > > +
> > > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > > +		return 1;
> > > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > +}
> > > > > > >  /**
> > > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > > --- a/include/linux/fs.h
> > > > > > > +++ b/include/linux/fs.h
> > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > > >  };
> > > > > > >  
> > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > > -- 
> > > > > > > 2.17.2
> > > > > > > 
> > > > > 
> > > > > -- 
> > > > > Carlos
> > > 
> > > -- 
> > > Carlos
> 
> -- 
> Carlos

Darrick J. Wong Feb. 7, 2019, 6:16 p.m. UTC | #18

On Thu, Feb 07, 2019 at 01:36:41PM +0100, Carlos Maiolino wrote:
> Apologies, I forgot to mention another thing..
> 
> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > filesystem.
> > > > > 
> > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > idea.
> > > > 
> > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > shouldn't be expecting any meaning in reserved areas.
> > > > 
> > > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > with such information?
> > > > 
> > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > 
> > > Ok, may I ask why not?
> > 
> > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > 
> > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > context of "which device does this map to?"
> > 
> > FIEMAP is architecturally deficient as well, but its ioctl structure
> > definition is flexible enough that we can report "which device does this
> > map to".
> > 
> > I want to enhance FIEMAP to deal with multi-device filesystems
> > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > and *lilo.
> > 
> > > My apologies if I am wrong, but, per my understanding, there is
> > > nothing today, which tells userspace which device belongs the extent
> > > map reported by FIEMAP.
> > 
> > Right...
> > 
> > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > BTRFS, we simply do not provide such information.
> > 
> > Right...
> > 
> > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > won't change.
> > 
> > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > be more expressive.  Once that's done, we can use the more expressive
> > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > 
> > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > start returning data in a reserved field.  New userspace programs that
> > know about the flag can start reading information from the new field if
> > they see the flag, and old userspace programs don't know about the flag
> > and won't be any worse off.
> > 
> > > Enabling filesystems to return device information into fiemap_extent
> > > requires modification of all filesystems to provide such information,
> > > which will not have any use other than matching the mounted device to
> > > the device where the extent is.
> > 
> > Perhaps it would help for me to present a more concrete proposal:
> > 
> > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > @@ -22,7 +22,19 @@ struct fiemap_extent {
> >  	__u64 fe_length;   /* length in bytes for this extent */
> >  	__u64 fe_reserved64[2];
> >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > -	__u32 fe_reserved[3];
> > +
> > +	/*
> > +	 * Underlying device that this extent is stored on.
> > +	 *
> > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > +	 * between backing devices but has no intrinsic meaning.  If neither
> > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > +	 * EXTENT_DEV flags may be set at any time.
> > +	 */
> > +	__u32 fe_device;
> > +	__u32 fe_reserved[2];
> >  };
> >  
> >  struct fiemap {
> > @@ -66,5 +78,14 @@ struct fiemap {
> >  						    * merged for efficiency. */
> >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> >  						    * files. */
> > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > +						    * structure containing the
> > +						    * major and minor numbers
> > +						    * of a block device. */
> > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > +						    * cookie that can be used
> > +						    * to distinguish physical
> > +						    * devices but otherwise
> > +						    * has no meaning. */
> >  
> >  #endif /* _LINUX_FIEMAP_H */
> > 
> > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> 
> Here, I believe you are forgetting that filesystems do not touch fiemap_extent
> directly. We call fiemap_fell_next_extent() helper to fill each extent found by
> fiemap. So, in either way, we'd need to modify fiemap_fill_next_extent() and the
> callbacks being used to accommodate this new field or create a new helper to
> modify the device which doesn't sound reasonable. So, either way, we will end up
> needing to modify all filesystems.

Yep.  Drat.  I guess you could add a bdev parameter to
fiemap_fill_next_extent, and we'd use that to encode fe_device.  If the
fs passes NULL then we just get it from the superblock or something.

> So, although I really like the idea of improving the FIEMAP interface, I'm
> starting to consider another patchset for it. I think it requires an interface
> change big enough to fit in this patchset, which actually has a different
> purpose. Or, maybe, address this at the end of this patchset, leaving different
> interface changes in different patchsets, instead of making many changes all at
> once, mixed together.

<nod> I think you're right, fiemap upgrades as one series and then
fibmap-via-fiemap as the second one.

--D

> > 
> > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > and encode the replica number in fe_device.
> > 
> > Existing filesystems can be left unchanged, in which case neither
> > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > meaningless, the same as they are today.  Reporting fe_device is entirely
> > optional.
> > 
> > Userspace programs will now be able to tell which device the file data
> > lives on, which has been sort-of requested for years, if the filesystem
> > chooses to start exporting that information.
> > 
> > Your FIBMAP-via-FIEMAP backend can do something like:
> > 
> > /* FIBMAP only returns results for the same block device backing the fs. */
> > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > 	return 0;
> > 
> > /* Can't tell what is the backing device, bail out. */
> > if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > 	return 0;
> > 
> > /*
> >  * Either fe_device matches the backing device or the implementation
> >  * doesn't tell us about the backing device, so assume it's ok.
> >  */
> > <return FIBMAP results>
> > 
> > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > take advantage of that solution to remedy my objections to the proposed
> > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > behavior flag that userspace knows about but isn't allowed to pass in.
> > 
> > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > > than the device id in fiemap_extent. I don't see much advantage in
> > > adding the device id instead of using the flag.
> > > 
> > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > > userspace, so, it would require a check to make sure it didn't come from
> > > userspace if ioctl_fiemap() was used.
> > > 
> > > I think there are 2 other possibilities which can be used to fix this.
> > > 
> > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > > - If the device id is a must for you, maybe add the device id into
> > >   fiemap_extent_info instead of fiemap_extent.
> > 
> > That won't work with btrfs, which can store file extents on multiple
> > different physical devices.
> > 
> > >   So we don't mess with a UAPI exported data structure and still
> > >   provides a way to the filesystems to provide which device the mapped
> > >   extent is in.
> > > 
> > > What you think?
> > > 
> > > Cheers
> > > 
> > > 
> > > > 
> > > > --D
> > > > 
> > > > > > 
> > > > > > > +
> > > > > > > +	return error;
> > > > > > > +}
> > > > > > > +
> > > > > > >  /**
> > > > > > >   *	bmap	- find a block number in a file
> > > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > > >   */
> > > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > > >  {
> > > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > > +	if (inode->i_op->fiemap)
> > > > > > > +		return bmap_fiemap(inode, block);
> > > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > > +						       *block);
> > > > > > > +	else
> > > > > > >  		return -EINVAL;
> > > > > > 
> > > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > > suddenly it will support this legacy interface they've never supported
> > > > > > before.  Are they on board with this?
> > > > > > 
> > > > > > --D
> > > > > > 
> > > > > > >  
> > > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > > >  	return 0;
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > > --- a/fs/ioctl.c
> > > > > > > +++ b/fs/ioctl.c
> > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > >  }
> > > > > > >  
> > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > > +{
> > > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > > +
> > > > > > > +	/* only count the extents */
> > > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > > +		return 1;
> > > > > > > +
> > > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > > +
> > > > > > > +	extent->fe_logical = logical;
> > > > > > > +	extent->fe_physical = phys;
> > > > > > > +	extent->fe_length = len;
> > > > > > > +	extent->fe_flags = flags;
> > > > > > > +
> > > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > > +
> > > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > > +		return 1;
> > > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > +}
> > > > > > >  /**
> > > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > > --- a/include/linux/fs.h
> > > > > > > +++ b/include/linux/fs.h
> > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > > >  };
> > > > > > >  
> > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > > -- 
> > > > > > > 2.17.2
> > > > > > > 
> > > > > 
> > > > > -- 
> > > > > Carlos
> > > 
> > > -- 
> > > Carlos
> 
> -- 
> Carlos

Andreas Dilger Feb. 7, 2019, 9:25 p.m. UTC | #19

On Feb 7, 2019, at 10:02 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote:
>> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
>>> 
>>> ...but from my point of view, the FIEMAP behavior *ought* to change to
>>> be more expressive.  Once that's done, we can use the more expressive
>>> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
>>> 
>>> The whole point of having fe_reserved* fields in struct fiemap_extent is
>>> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
>>> start returning data in a reserved field.  New userspace programs that
>>> know about the flag can start reading information from the new field if
>>> they see the flag, and old userspace programs don't know about the flag
>>> and won't be any worse off.
>>> 
>>> Perhaps it would help for me to present a more concrete proposal:
>>> 
>>> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
>>> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
>>> @@ -22,7 +22,19 @@ struct fiemap_extent {
>>> 	__u64 fe_length;   /* length in bytes for this extent */
>>> 	__u64 fe_reserved64[2];
>>> 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
>>> -	__u32 fe_reserved[3];
>>> +
>>> +	/*
>>> +	 * Underlying device that this extent is stored on.
>>> +	 *
>>> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
>>> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
>>> +	 * set, this field is a 32-bit cookie that can be used to distinguish
>>> +	 * between backing devices but has no intrinsic meaning.  If neither
>>> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
>>> +	 * EXTENT_DEV flags may be set at any time.
>>> +	 */
>>> +	__u32 fe_device;
>>> +	__u32 fe_reserved[2];
>>> };
>>> 
>>> struct fiemap {
>>> @@ -66,5 +78,14 @@ struct fiemap {
>>> 						    * merged for efficiency. */
>>> #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
>>> 						    * files. */
>>> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
>>> +						    * structure containing the
>>> +						    * major and minor numbers
>>> +						    * of a block device. */
>>> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
>>> +						    * cookie that can be used
>>> +						    * to distinguish physical
>>> +						    * devices but otherwise
>>> +						    * has no meaning. */
>>> 
>>> #endif /* _LINUX_FIEMAP_H */
>>> 
>>> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
>>> encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
>>> 
>>> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
>>> and encode the replica number in fe_device.
>>> 
>> 
>> All of this makes sense, but I'm struggling to understand what you mean by
>> replica number here, and why it justify a second flag.
> 
> I left in the "device cookie" thing in the proposal to accomodate a
> request from the Lustre folks to be able to report which replica is
> storing a particular extent map.  Apparently the replica id is simply a
> 32-bit number that isn't inherently useful, hence the vagueness around
> what "cookie" really means...
> 
> ...oh, right, lustre fell out of drivers/staging/.  You could probably
> leave it out then.

Do we really need to be this way, about reserving a single flag for Lustre,
which will likely also be useful for other filesystems?  It's not like
Lustre is some closed-source binary module for which we need to make life
difficult, it is used by many thousands of the largest computers at labs
and universities and companies around the world.  We are working to clean
up the code outside the staging tree and resubmit it.  Not reserving a flag
just means we will continue to use random values in Lustre before it can
be merged, which will make life harder when we try to merge again.


In the case of Lustre, the proposed DEV_COOKIE would indicate fe_device is
the integer index number of the server on which each extent of the file is
located (Darrick's "replica number" term is not really correct).  The server
index is familiar to all Lustre users, so having filefrag print out the
device number of "0000" or "0009" is totally clear to them.

For pNFS or Ceph or other network filesystems (if they implement filefrag)
it could be an index or some other number (e.g. the IP address of the server
or low bits of the  UUID or whatever).  Reading back in the archives of the
original FIEMAP discussion, it seems BtrFS would prefer to use DEV_COOKIE
instead of DEV_T, because it uses internal RAID encoding and not plain block
devices, but I'm not familiar with the details there.


Alternately, or in addition to, a DEV_COOKIE flag which indicates that the
same fe_device field is "not a device", it would be possible to add:

    #define FIEMAP_NO_DIRECT      0x40000000

and/or:

    #define FIEMAP_EXTENT_NET     0x80000000   /* Data stored remotely.
                                                * Sets NO_DIRECT flag */

returned by the filesystem that indicates the extent blocks are not local
to the node, so FIBMAP should return an error (-EOPNOTSUP or -EREMOTE or
whatever) because the file can't be booted from.  In that case, we could
return FIEMAP_EXTENT_DEVICE to indicate the fe_device field is valid, and
return FIEMAP_EXTENT_NET to indicate the values in fe_device are not local
block devices, just filesystem-specific values to distinguish devices.

However, I'm open to both _DEV_COOKIE and _NET flag if that is preferred,
since I think the two are somewhat complementary.

>> This actually looks to contradict what you have been complaining, about some
>> filesystems which doesn't support FIBMAP currently, will now suddenly start to
>> support. Assuming it's ok if the implementation doesn't tell us about the
>> backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the
>> backing device, assuming it's ok will just fall into your first complain.
> 
> Sorry, this thread has been going on so long that I forgot your goal for
> this series. :/
> 
> Specifically, I had forgotten that you're removing the ->bmap pointer,
> which means that filesystems don't have any particular way to signal
> "Yes on FIEMAP, no on FIBMAP".  Somehow I had thought that you were
> merely creating a generic_file_bmap() that would call FIEMAP and ripping
> out all the adhoc bmap implementations.

Just a reminder here, you should set FIEMAP_FLAG_SYNC when mapping FIBMAP
to FIEMAP so that the data on that file is flushed to disk before returning,
since the block mapping may not be assigned yet or may be unstable, which
could lead to an unbootable system if used for LILO.

Cheers, Andreas

>>> Existing filesystems can be left unchanged, in which case neither
>>> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
>>> meaningless, the same as they are today.  Reporting fe_device is entirely
>>> optional.
>>> 
>>> Userspace programs will now be able to tell which device the file data
>>> lives on, which has been sort-of requested for years, if the filesystem
>>> chooses to start exporting that information.
>>> 
>>> Your FIBMAP-via-FIEMAP backend can do something like:
>>> 
>>> /* FIBMAP only returns results for the same block device backing the fs. */
>>> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
>>> 	return 0;
>>> 
>>> /* Can't tell what is the backing device, bail out. */
>>> if (fe->fe_flags & EXTENT_DEV_COOKIE)
>>> 	return 0;
>>> 
>> 
>> Ok, the first conditional, is ok, the second one is not making sense to me.
>> Looks like you are basically using it to flag the filesystem can't tell
>> exactly which device the current extent is, let's say for example, distributed
>> filesystems, where the physical extent can actually be on a different machine.
>> But I can't say for sure, can you give me more details about what you are trying
>> to achieve here?
> 
> You've understood me correctly. :)
> 
>>> /*
>>> * Either fe_device matches the backing device or the implementation
>>> * doesn't tell us about the backing device, so assume it's ok.
>>> */
>>> <return FIBMAP results>
>>> 
>> 
>> Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
>> you mentioned.
> 
> I think you've understood it about as well as I can explain it.  Maybe
> Andreas will have more to say about the lustre replica id, but OTOH it's
> gone and so there's no user of it, so we could just drop it until lustre
> comes back.
> 

>>> So that's how I'd solve a longstanding design problem of FIEMAP and then
>>> take advantage of that solution to remedy my objections to the proposed
>>> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
>>> behavior flag that userspace knows about but isn't allowed to pass in.
>>> 
>> 
>>>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
>>>> than the device id in fiemap_extent. I don't see much advantage in
>>>> adding the device id instead of using the flag.
>>>> 
>>>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
>>>> userspace, so, it would require a check to make sure it didn't come from
>>>> userspace if ioctl_fiemap() was used.
>>>> 
>>>> I think there are 2 other possibilities which can be used to fix this.
>>>> 
>>>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
>>>> - If the device id is a must for you, maybe add the device id into
>>>>  fiemap_extent_info instead of fiemap_extent.
>>> 
>>> That won't work with btrfs, which can store file extents on multiple
>>> different physical devices.
>>> 
>>>>  So we don't mess with a UAPI exported data structure and still
>>>>  provides a way to the filesystems to provide which device the mapped
>>>>  extent is in.
>>>> 
>>>> What you think?
>>>> 
>>>> Cheers
>>>> 
>>>> 
>>>>> 
>>>>> --D
>>>>> 
>>>>>>> 
>>>>>>>> +
>>>>>>>> +	return error;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> /**
>>>>>>>>  *	bmap	- find a block number in a file
>>>>>>>>  *	@inode:  inode owning the block number being requested
>>>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
>>>>>>>>  */
>>>>>>>> int bmap(struct inode *inode, sector_t *block)
>>>>>>>> {
>>>>>>>> -	if (!inode->i_mapping->a_ops->bmap)
>>>>>>>> +	if (inode->i_op->fiemap)
>>>>>>>> +		return bmap_fiemap(inode, block);
>>>>>>>> +	else if (inode->i_mapping->a_ops->bmap)
>>>>>>>> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
>>>>>>>> +						       *block);
>>>>>>>> +	else
>>>>>>>> 		return -EINVAL;
>>>>>>> 
>>>>>>> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
>>>>>>> suddenly it will support this legacy interface they've never supported
>>>>>>> before.  Are they on board with this?
>>>>>>> 
>>>>>>> --D
>>>>>>> 
>>>>>>>> 
>>>>>>>> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
>>>>>>>> 	return 0;
>>>>>>>> }
>>>>>>>> EXPORT_SYMBOL(bmap);
>>>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
>>>>>>>> index 6086978fe01e..bfa59df332bf 100644
>>>>>>>> --- a/fs/ioctl.c
>>>>>>>> +++ b/fs/ioctl.c
>>>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>>>> 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
>>>>>>>> +			    u64 phys, u64 len, u32 flags)
>>>>>>>> +{
>>>>>>>> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
>>>>>>>> +
>>>>>>>> +	/* only count the extents */
>>>>>>>> +	if (fieinfo->fi_extents_max == 0) {
>>>>>>>> +		fieinfo->fi_extents_mapped++;
>>>>>>>> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
>>>>>>>> +		return 1;
>>>>>>>> +
>>>>>>>> +	if (flags & SET_UNKNOWN_FLAGS)
>>>>>>>> +		flags |= FIEMAP_EXTENT_UNKNOWN;
>>>>>>>> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
>>>>>>>> +		flags |= FIEMAP_EXTENT_ENCODED;
>>>>>>>> +	if (flags & SET_NOT_ALIGNED_FLAGS)
>>>>>>>> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
>>>>>>>> +
>>>>>>>> +	extent->fe_logical = logical;
>>>>>>>> +	extent->fe_physical = phys;
>>>>>>>> +	extent->fe_length = len;
>>>>>>>> +	extent->fe_flags = flags;
>>>>>>>> +
>>>>>>>> +	fieinfo->fi_extents_mapped++;
>>>>>>>> +
>>>>>>>> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
>>>>>>>> +		return 1;
>>>>>>>> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
>>>>>>>> +}
>>>>>>>> /**
>>>>>>>>  * fiemap_fill_next_extent - Fiemap helper function
>>>>>>>>  * @fieinfo:	Fiemap context passed into ->fiemap
>>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>>>> index 7a434979201c..28bb523d532a 100644
>>>>>>>> --- a/include/linux/fs.h
>>>>>>>> +++ b/include/linux/fs.h
>>>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
>>>>>>>> 	fiemap_fill_cb	fi_cb;
>>>>>>>> };
>>>>>>>> 
>>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>>>> +			      u64 phys, u64 len, u32 flags);
>>>>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
>>>>>>>> 			    u64 phys, u64 len, u32 flags);
>>>>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
>>>>>>>> --
>>>>>>>> 2.17.2
>>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Carlos
>>>> 
>>>> --
>>>> Carlos
>> 
>> --
>> Carlos


Cheers, Andreas

Christoph Hellwig Feb. 8, 2019, 8:43 a.m. UTC | #20

On Thu, Feb 07, 2019 at 10:52:33AM +0100, Carlos Maiolino wrote:
> Btw, I am not saying I don't like the idea, I like it. What I was trying to do
> was to avoid touching UAPI in this patchset. But... I'll try to implement your
> idea here, send it to the list and raise my shields.

Agreed.  Please don't change the FIEMAP uapi.  If we need to check
for a request coming from bmap just defined an internal FIEMAP flag
as the last available flag in the flags word, and reject it when
it comes from userspace in fiemap.

Christoph Hellwig Feb. 8, 2019, 8:46 a.m. UTC | #21

On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote:
> Do we really need to be this way, about reserving a single flag for Lustre,
> which will likely also be useful for other filesystems?  It's not like
> Lustre is some closed-source binary module for which we need to make life
> difficult, it is used by many thousands of the largest computers at labs
> and universities and companies around the world.  We are working to clean
> up the code outside the staging tree and resubmit it.  Not reserving a flag
> just means we will continue to use random values in Lustre before it can
> be merged, which will make life harder when we try to merge again.

No, it is available in source, but otherwise just as bad.  And we generally
only define APIs for in-kernel usage.

If we can come up with a good API for in-kernel filesystems we can do
that, otherwise hell no.  And staging for that matter qualifies as out
of tree.

That being said I'm really worried about these FIEMAP extensions as
userspace has no business poking into details of the placement (vs
just the layout).

But all that belongs into a separate dicussion instead of dragging down
this series where it does not belong at all.

Carlos Maiolino Feb. 8, 2019, 8:58 a.m. UTC | #22

On Thu, Feb 07, 2019 at 10:16:55AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 07, 2019 at 01:36:41PM +0100, Carlos Maiolino wrote:
> > Apologies, I forgot to mention another thing..
> > 
> > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > > filesystem.
> > > > > > 
> > > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > > idea.
> > > > > 
> > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > > shouldn't be expecting any meaning in reserved areas.
> > > > > 
> > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > > with such information?
> > > > > 
> > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > > 
> > > > Ok, may I ask why not?
> > > 
> > > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > > 
> > > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > > context of "which device does this map to?"
> > > 
> > > FIEMAP is architecturally deficient as well, but its ioctl structure
> > > definition is flexible enough that we can report "which device does this
> > > map to".
> > > 
> > > I want to enhance FIEMAP to deal with multi-device filesystems
> > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > > and *lilo.
> > > 
> > > > My apologies if I am wrong, but, per my understanding, there is
> > > > nothing today, which tells userspace which device belongs the extent
> > > > map reported by FIEMAP.
> > > 
> > > Right...
> > > 
> > > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > > BTRFS, we simply do not provide such information.
> > > 
> > > Right...
> > > 
> > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > > won't change.
> > > 
> > > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > > be more expressive.  Once that's done, we can use the more expressive
> > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > > 
> > > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > > start returning data in a reserved field.  New userspace programs that
> > > know about the flag can start reading information from the new field if
> > > they see the flag, and old userspace programs don't know about the flag
> > > and won't be any worse off.
> > > 
> > > > Enabling filesystems to return device information into fiemap_extent
> > > > requires modification of all filesystems to provide such information,
> > > > which will not have any use other than matching the mounted device to
> > > > the device where the extent is.
> > > 
> > > Perhaps it would help for me to present a more concrete proposal:
> > > 
> > > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > > @@ -22,7 +22,19 @@ struct fiemap_extent {
> > >  	__u64 fe_length;   /* length in bytes for this extent */
> > >  	__u64 fe_reserved64[2];
> > >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > > -	__u32 fe_reserved[3];
> > > +
> > > +	/*
> > > +	 * Underlying device that this extent is stored on.
> > > +	 *
> > > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > > +	 * between backing devices but has no intrinsic meaning.  If neither
> > > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > > +	 * EXTENT_DEV flags may be set at any time.
> > > +	 */
> > > +	__u32 fe_device;
> > > +	__u32 fe_reserved[2];
> > >  };
> > >  
> > >  struct fiemap {
> > > @@ -66,5 +78,14 @@ struct fiemap {
> > >  						    * merged for efficiency. */
> > >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> > >  						    * files. */
> > > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > > +						    * structure containing the
> > > +						    * major and minor numbers
> > > +						    * of a block device. */
> > > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > > +						    * cookie that can be used
> > > +						    * to distinguish physical
> > > +						    * devices but otherwise
> > > +						    * has no meaning. */
> > >  
> > >  #endif /* _LINUX_FIEMAP_H */
> > > 
> > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> > 
> > Here, I believe you are forgetting that filesystems do not touch fiemap_extent
> > directly. We call fiemap_fell_next_extent() helper to fill each extent found by
> > fiemap. So, in either way, we'd need to modify fiemap_fill_next_extent() and the
> > callbacks being used to accommodate this new field or create a new helper to
> > modify the device which doesn't sound reasonable. So, either way, we will end up
> > needing to modify all filesystems.
> 
> Yep.  Drat.  I guess you could add a bdev parameter to
> fiemap_fill_next_extent, and we'd use that to encode fe_device.  If the
> fs passes NULL then we just get it from the superblock or something.
> 
> > So, although I really like the idea of improving the FIEMAP interface, I'm
> > starting to consider another patchset for it. I think it requires an interface
> > change big enough to fit in this patchset, which actually has a different
> > purpose. Or, maybe, address this at the end of this patchset, leaving different
> > interface changes in different patchsets, instead of making many changes all at
> > once, mixed together.
> 
> <nod> I think you're right, fiemap upgrades as one series and then
> fibmap-via-fiemap as the second one.
> 

Ok, fair enough, looks like we have an agreement :P I'll work on this direction
now, and set aside this patchset while we can improve FIEMAP to return the
device id, and then rebase this patchset on top of that.

Thanks for the review

> --D
> 
> > > 
> > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > > and encode the replica number in fe_device.
> > > 
> > > Existing filesystems can be left unchanged, in which case neither
> > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > > meaningless, the same as they are today.  Reporting fe_device is entirely
> > > optional.
> > > 
> > > Userspace programs will now be able to tell which device the file data
> > > lives on, which has been sort-of requested for years, if the filesystem
> > > chooses to start exporting that information.
> > > 
> > > Your FIBMAP-via-FIEMAP backend can do something like:
> > > 
> > > /* FIBMAP only returns results for the same block device backing the fs. */
> > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > > 	return 0;
> > > 
> > > /* Can't tell what is the backing device, bail out. */
> > > if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > > 	return 0;
> > > 
> > > /*
> > >  * Either fe_device matches the backing device or the implementation
> > >  * doesn't tell us about the backing device, so assume it's ok.
> > >  */
> > > <return FIBMAP results>
> > > 
> > > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > > take advantage of that solution to remedy my objections to the proposed
> > > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > > behavior flag that userspace knows about but isn't allowed to pass in.
> > > 
> > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > > > than the device id in fiemap_extent. I don't see much advantage in
> > > > adding the device id instead of using the flag.
> > > > 
> > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > > > userspace, so, it would require a check to make sure it didn't come from
> > > > userspace if ioctl_fiemap() was used.
> > > > 
> > > > I think there are 2 other possibilities which can be used to fix this.
> > > > 
> > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > > > - If the device id is a must for you, maybe add the device id into
> > > >   fiemap_extent_info instead of fiemap_extent.
> > > 
> > > That won't work with btrfs, which can store file extents on multiple
> > > different physical devices.
> > > 
> > > >   So we don't mess with a UAPI exported data structure and still
> > > >   provides a way to the filesystems to provide which device the mapped
> > > >   extent is in.
> > > > 
> > > > What you think?
> > > > 
> > > > Cheers
> > > > 
> > > > 
> > > > > 
> > > > > --D
> > > > > 
> > > > > > > 
> > > > > > > > +
> > > > > > > > +	return error;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  /**
> > > > > > > >   *	bmap	- find a block number in a file
> > > > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > > > >   */
> > > > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > > > >  {
> > > > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > > > +	if (inode->i_op->fiemap)
> > > > > > > > +		return bmap_fiemap(inode, block);
> > > > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > > > +						       *block);
> > > > > > > > +	else
> > > > > > > >  		return -EINVAL;
> > > > > > > 
> > > > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > > > suddenly it will support this legacy interface they've never supported
> > > > > > > before.  Are they on board with this?
> > > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > >  
> > > > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > > > >  	return 0;
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > > > --- a/fs/ioctl.c
> > > > > > > > +++ b/fs/ioctl.c
> > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > > > +{
> > > > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > > > +
> > > > > > > > +	/* only count the extents */
> > > > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +
> > > > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > > > +
> > > > > > > > +	extent->fe_logical = logical;
> > > > > > > > +	extent->fe_physical = phys;
> > > > > > > > +	extent->fe_length = len;
> > > > > > > > +	extent->fe_flags = flags;
> > > > > > > > +
> > > > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +}
> > > > > > > >  /**
> > > > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > > > --- a/include/linux/fs.h
> > > > > > > > +++ b/include/linux/fs.h
> > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > > > >  };
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > > > -- 
> > > > > > > > 2.17.2
> > > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Carlos
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

Carlos Maiolino Feb. 8, 2019, 9:03 a.m. UTC | #23

On Thu, Feb 07, 2019 at 09:02:10AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote:
> > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote:
> > > > > > > In any case, I think a better solution to the multi-device problem is to
> > > > > > > start returning device information via struct fiemap_extent, at least
> > > > > > > inside the kernel.  Use one of the reserved fields to declare a new
> > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t
> > > > > > > device number, and then you can check that against inode->i_sb->s_bdev
> > > > > > > to avoid returning results for the non-primary device of a multi-device
> > > > > > > filesystem.
> > > > > > 
> > > > > > I agree we should address it here, but I don't think fiemap_extent is the right
> > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good
> > > > > > idea.
> > > > > 
> > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields
> > > > > into some sort of dev_t/per-device cookie should be fine.  Userspace
> > > > > shouldn't be expecting any meaning in reserved areas.
> > > > > 
> > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in
> > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to
> > > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do
> > > > > > with such information?
> > > > > 
> > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers.
> > > > 
> > > > Ok, may I ask why not?
> > > 
> > > I think it's a bad idea to add a flag to FIEMAP to change its behavior
> > > to suit an older and even crappier legacy interface (i.e. FIBMAP).
> > > 
> > > FIBMAP is architecturally broken in that we can't /ever/ provide the
> > > context of "which device does this map to?"
> > > 
> > > FIEMAP is architecturally deficient as well, but its ioctl structure
> > > definition is flexible enough that we can report "which device does this
> > > map to".
> > > 
> > > I want to enhance FIEMAP to deal with multi-device filesystems
> > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl
> > > and *lilo.
> > > 
> > > > My apologies if I am wrong, but, per my understanding, there is
> > > > nothing today, which tells userspace which device belongs the extent
> > > > map reported by FIEMAP.
> > > 
> > > Right...
> > > 
> > > > If it belongs to the RT device in XFS, or whatever disk in a raid in
> > > > BTRFS, we simply do not provide such information.
> > > 
> > > Right...
> > > 
> > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or
> > > > a FIBMAP has been requested, so the current behavior of both ioctls
> > > > won't change.
> > > 
> > > ...but from my point of view, the FIEMAP behavior *ought* to change to
> > > be more expressive.  Once that's done, we can use the more expressive
> > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> > > 
> > > The whole point of having fe_reserved* fields in struct fiemap_extent is
> > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> > > start returning data in a reserved field.  New userspace programs that
> > > know about the flag can start reading information from the new field if
> > > they see the flag, and old userspace programs don't know about the flag
> > > and won't be any worse off.
> > > 
> > > > Enabling filesystems to return device information into fiemap_extent
> > > > requires modification of all filesystems to provide such information,
> > > > which will not have any use other than matching the mounted device to
> > > > the device where the extent is.
> > > 
> > > Perhaps it would help for me to present a more concrete proposal:
> > > 
> > > --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> > > +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> > > @@ -22,7 +22,19 @@ struct fiemap_extent {
> > >  	__u64 fe_length;   /* length in bytes for this extent */
> > >  	__u64 fe_reserved64[2];
> > >  	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > > -	__u32 fe_reserved[3];
> > > +
> > > +	/*
> > > +	 * Underlying device that this extent is stored on.
> > > +	 *
> > > +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> > > +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> > > +	 * set, this field is a 32-bit cookie that can be used to distinguish
> > > +	 * between backing devices but has no intrinsic meaning.  If neither
> > > +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> > > +	 * EXTENT_DEV flags may be set at any time.
> > > +	 */
> > > +	__u32 fe_device;
> > > +	__u32 fe_reserved[2];
> > >  };
> > >  
> > >  struct fiemap {
> > > @@ -66,5 +78,14 @@ struct fiemap {
> > >  						    * merged for efficiency. */
> > >  #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> > >  						    * files. */
> > > +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> > > +						    * structure containing the
> > > +						    * major and minor numbers
> > > +						    * of a block device. */
> > > +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> > > +						    * cookie that can be used
> > > +						    * to distinguish physical
> > > +						    * devices but otherwise
> > > +						    * has no meaning. */
> > >  
> > >  #endif /* _LINUX_FIEMAP_H */
> > > 
> > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> > > 
> > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> > > and encode the replica number in fe_device.
> > > 
> > 
> > All of this makes sense, but I'm struggling to understand what you mean by
> > replica number here, and why it justify a second flag.
> 
> I left in the "device cookie" thing in the proposal to accomodate a
> request from the Lustre folks to be able to report which replica is
> storing a particular extent map.  Apparently the replica id is simply a
> 32-bit number that isn't inherently useful, hence the vagueness around
> what "cookie" really means...
> 
> ...oh, right, lustre fell out of drivers/staging/.  You could probably
> leave it out then.
> 
> > > Existing filesystems can be left unchanged, in which case neither
> > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> > > meaningless, the same as they are today.  Reporting fe_device is entirely
> > > optional.
> > > 
> > > Userspace programs will now be able to tell which device the file data
> > > lives on, which has been sort-of requested for years, if the filesystem
> > > chooses to start exporting that information.
> > > 
> > > Your FIBMAP-via-FIEMAP backend can do something like:
> > > 
> > > /* FIBMAP only returns results for the same block device backing the fs. */
> > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> > > 	return 0;
> > > 
> > > /* Can't tell what is the backing device, bail out. */
> > > if (fe->fe_flags & EXTENT_DEV_COOKIE)
> > > 	return 0;
> > > 
> > 
> > Ok, the first conditional, is ok, the second one is not making sense to me.
> > Looks like you are basically using it to flag the filesystem can't tell
> > exactly which device the current extent is, let's say for example, distributed
> > filesystems, where the physical extent can actually be on a different machine.
> > But I can't say for sure, can you give me more details about what you are trying
> > to achieve here?
> 
> You've understood me correctly. :)
> 
> > > /*
> > >  * Either fe_device matches the backing device or the implementation
> > >  * doesn't tell us about the backing device, so assume it's ok.
> > >  */
> > > <return FIBMAP results>
> > >
> > 
> > This actually looks to contradict what you have been complaining, about some
> > filesystems which doesn't support FIBMAP currently, will now suddenly start to
> > support. Assuming it's ok if the implementation doesn't tell us about the
> > backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the
> > backing device, assuming it's ok will just fall into your first complain.
> 
> Sorry, this thread has been going on so long that I forgot your goal for
> this series. :/
> 
> Specifically, I had forgotten that you're removing the ->bmap pointer,
> which means that filesystems don't have any particular way to signal
> "Yes on FIEMAP, no on FIBMAP".  Somehow I had thought that you were
> merely creating a generic_file_bmap() that would call FIEMAP and ripping
> out all the adhoc bmap implementations.
> 
> Hmm, how many filesystems support FIEMAP and not FIBMAP?
> 
> btrfs, nilfs2, and overlayfs.  Also bad_inode.c...?
> 
> Hmm, how many filesystems support FIBMAP and not FIEMAP?
> 
> adfs, affs, befs, bfs, efs, exofs, fat, freevxfs, fuse(?), nfs, hfsplus,
> isofs, jfs, minixfs, ntfs, qnx[46], reiserfs, sysv, udf, and ufs.

Eh, that's why we should keep:

if (inode->i_op->fiemap)
	return bmap_fiemap(inode, block);
else if (..a_ops->bmap)
	->a_ops->bmap(...)
else
	return -EINVAL;


> 
> > Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
> > you mentioned.
> 
> I think you've understood it about as well as I can explain it.  Maybe
> Andreas will have more to say about the lustre replica id, but OTOH it's
> gone and so there's no user of it, so we could just drop it until lustre
> comes back.
>

Ok, thanks for confirming.

> > > So that's how I'd solve a longstanding design problem of FIEMAP and then
> > > take advantage of that solution to remedy my objections to the proposed
> > > "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> > > behavior flag that userspace knows about but isn't allowed to pass in.
> > >
> > 
> > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> > > > than the device id in fiemap_extent. I don't see much advantage in
> > > > adding the device id instead of using the flag.
> > > > 
> > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> > > > userspace, so, it would require a check to make sure it didn't come from
> > > > userspace if ioctl_fiemap() was used.
> > > > 
> > > > I think there are 2 other possibilities which can be used to fix this.
> > > > 
> > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> > > > - If the device id is a must for you, maybe add the device id into
> > > >   fiemap_extent_info instead of fiemap_extent.
> > > 
> > > That won't work with btrfs, which can store file extents on multiple
> > > different physical devices.
> > > 
> > > >   So we don't mess with a UAPI exported data structure and still
> > > >   provides a way to the filesystems to provide which device the mapped
> > > >   extent is in.
> > > > 
> > > > What you think?
> > > > 
> > > > Cheers
> > > > 
> > > > 
> > > > > 
> > > > > --D
> > > > > 
> > > > > > > 
> > > > > > > > +
> > > > > > > > +	return error;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  /**
> > > > > > > >   *	bmap	- find a block number in a file
> > > > > > > >   *	@inode:  inode owning the block number being requested
> > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> > > > > > > >   */
> > > > > > > >  int bmap(struct inode *inode, sector_t *block)
> > > > > > > >  {
> > > > > > > > -	if (!inode->i_mapping->a_ops->bmap)
> > > > > > > > +	if (inode->i_op->fiemap)
> > > > > > > > +		return bmap_fiemap(inode, block);
> > > > > > > > +	else if (inode->i_mapping->a_ops->bmap)
> > > > > > > > +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> > > > > > > > +						       *block);
> > > > > > > > +	else
> > > > > > > >  		return -EINVAL;
> > > > > > > 
> > > > > > > Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> > > > > > > suddenly it will support this legacy interface they've never supported
> > > > > > > before.  Are they on board with this?
> > > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > >  
> > > > > > > > -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > > > > > > >  	return 0;
> > > > > > > >  }
> > > > > > > >  EXPORT_SYMBOL(bmap);
> > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > > > > > > index 6086978fe01e..bfa59df332bf 100644
> > > > > > > > --- a/fs/ioctl.c
> > > > > > > > +++ b/fs/ioctl.c
> > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > >  	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> > > > > > > > +			    u64 phys, u64 len, u32 flags)
> > > > > > > > +{
> > > > > > > > +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> > > > > > > > +
> > > > > > > > +	/* only count the extents */
> > > > > > > > +	if (fieinfo->fi_extents_max == 0) {
> > > > > > > > +		fieinfo->fi_extents_mapped++;
> > > > > > > > +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +
> > > > > > > > +	if (flags & SET_UNKNOWN_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_UNKNOWN;
> > > > > > > > +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_ENCODED;
> > > > > > > > +	if (flags & SET_NOT_ALIGNED_FLAGS)
> > > > > > > > +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> > > > > > > > +
> > > > > > > > +	extent->fe_logical = logical;
> > > > > > > > +	extent->fe_physical = phys;
> > > > > > > > +	extent->fe_length = len;
> > > > > > > > +	extent->fe_flags = flags;
> > > > > > > > +
> > > > > > > > +	fieinfo->fi_extents_mapped++;
> > > > > > > > +
> > > > > > > > +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> > > > > > > > +		return 1;
> > > > > > > > +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> > > > > > > > +}
> > > > > > > >  /**
> > > > > > > >   * fiemap_fill_next_extent - Fiemap helper function
> > > > > > > >   * @fieinfo:	Fiemap context passed into ->fiemap
> > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > > > index 7a434979201c..28bb523d532a 100644
> > > > > > > > --- a/include/linux/fs.h
> > > > > > > > +++ b/include/linux/fs.h
> > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> > > > > > > >  	fiemap_fill_cb	fi_cb;
> > > > > > > >  };
> > > > > > > >  
> > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > > +			      u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > > > > > > >  			    u64 phys, u64 len, u32 flags);
> > > > > > > >  int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> > > > > > > > -- 
> > > > > > > > 2.17.2
> > > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Carlos
> > > > 
> > > > -- 
> > > > Carlos
> > 
> > -- 
> > Carlos

Carlos Maiolino Feb. 8, 2019, 9:08 a.m. UTC | #24

On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote:
> On Feb 7, 2019, at 10:02 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote:
> >> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote:
> >>> 
> >>> ...but from my point of view, the FIEMAP behavior *ought* to change to
> >>> be more expressive.  Once that's done, we can use the more expressive
> >>> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems.
> >>> 
> >>> The whole point of having fe_reserved* fields in struct fiemap_extent is
> >>> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can
> >>> start returning data in a reserved field.  New userspace programs that
> >>> know about the flag can start reading information from the new field if
> >>> they see the flag, and old userspace programs don't know about the flag
> >>> and won't be any worse off.
> >>> 
> >>> Perhaps it would help for me to present a more concrete proposal:
> >>> 
> >>> --- a/include/uapi/linux/fiemap.h	2019-01-18 10:53:44.000000000 -0800
> >>> +++ b/include/uapi/linux/fiemap.h	2019-02-06 12:25:52.813935941 -0800
> >>> @@ -22,7 +22,19 @@ struct fiemap_extent {
> >>> 	__u64 fe_length;   /* length in bytes for this extent */
> >>> 	__u64 fe_reserved64[2];
> >>> 	__u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> >>> -	__u32 fe_reserved[3];
> >>> +
> >>> +	/*
> >>> +	 * Underlying device that this extent is stored on.
> >>> +	 *
> >>> +	 * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the
> >>> +	 * major and minor numbers of a device.  If FIEMAP_EXTENT_DEV_COOKIE is
> >>> +	 * set, this field is a 32-bit cookie that can be used to distinguish
> >>> +	 * between backing devices but has no intrinsic meaning.  If neither
> >>> +	 * EXTENT_DEV flag is set, this field is meaningless.  Only one of the
> >>> +	 * EXTENT_DEV flags may be set at any time.
> >>> +	 */
> >>> +	__u32 fe_device;
> >>> +	__u32 fe_reserved[2];
> >>> };
> >>> 
> >>> struct fiemap {
> >>> @@ -66,5 +78,14 @@ struct fiemap {
> >>> 						    * merged for efficiency. */
> >>> #define FIEMAP_EXTENT_SHARED		0x00002000 /* Space shared with other
> >>> 						    * files. */
> >>> +#define FIEMAP_EXTENT_DEV_T		0x00004000 /* fe_device is a dev_t
> >>> +						    * structure containing the
> >>> +						    * major and minor numbers
> >>> +						    * of a block device. */
> >>> +#define FIEMAP_EXTENT_DEV_COOKIE	0x00008000 /* fe_device is a 32-bit
> >>> +						    * cookie that can be used
> >>> +						    * to distinguish physical
> >>> +						    * devices but otherwise
> >>> +						    * has no meaning. */
> >>> 
> >>> #endif /* _LINUX_FIEMAP_H */
> >>> 
> >>> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start
> >>> encoding fe_device = new_encode_dev(xfs_get_device_for_file()).
> >>> 
> >>> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE
> >>> and encode the replica number in fe_device.
> >>> 
> >> 
> >> All of this makes sense, but I'm struggling to understand what you mean by
> >> replica number here, and why it justify a second flag.
> > 
> > I left in the "device cookie" thing in the proposal to accomodate a
> > request from the Lustre folks to be able to report which replica is
> > storing a particular extent map.  Apparently the replica id is simply a
> > 32-bit number that isn't inherently useful, hence the vagueness around
> > what "cookie" really means...
> > 
> > ...oh, right, lustre fell out of drivers/staging/.  You could probably
> > leave it out then.
> 
> Do we really need to be this way, about reserving a single flag for Lustre,
> which will likely also be useful for other filesystems?  It's not like
> Lustre is some closed-source binary module for which we need to make life
> difficult, it is used by many thousands of the largest computers at labs
> and universities and companies around the world.  We are working to clean
> up the code outside the staging tree and resubmit it.  Not reserving a flag
> just means we will continue to use random values in Lustre before it can
> be merged, which will make life harder when we try to merge again.
> 

Agreed, it's a flag that may benefit different filesystems, not only lustre.

> 
> In the case of Lustre, the proposed DEV_COOKIE would indicate fe_device is
> the integer index number of the server on which each extent of the file is
> located (Darrick's "replica number" term is not really correct).  The server
> index is familiar to all Lustre users, so having filefrag print out the
> device number of "0000" or "0009" is totally clear to them.
> 

Thanks for the info. /me doesn't know LustreFS.

> For pNFS or Ceph or other network filesystems (if they implement filefrag)
> it could be an index or some other number (e.g. the IP address of the server
> or low bits of the  UUID or whatever).  Reading back in the archives of the
> original FIEMAP discussion, it seems BtrFS would prefer to use DEV_COOKIE
> instead of DEV_T, because it uses internal RAID encoding and not plain block
> devices, but I'm not familiar with the details there.
> 
> 
> Alternately, or in addition to, a DEV_COOKIE flag which indicates that the
> same fe_device field is "not a device", it would be possible to add:
> 
>     #define FIEMAP_NO_DIRECT      0x40000000
> 
> and/or:
> 
>     #define FIEMAP_EXTENT_NET     0x80000000   /* Data stored remotely.
>                                                 * Sets NO_DIRECT flag */
> 
> returned by the filesystem that indicates the extent blocks are not local
> to the node, so FIBMAP should return an error (-EOPNOTSUP or -EREMOTE or
> whatever) because the file can't be booted from.  In that case, we could
> return FIEMAP_EXTENT_DEVICE to indicate the fe_device field is valid, and
> return FIEMAP_EXTENT_NET to indicate the values in fe_device are not local
> block devices, just filesystem-specific values to distinguish devices.
> 
> However, I'm open to both _DEV_COOKIE and _NET flag if that is preferred,
> since I think the two are somewhat complementary.
> 

I'd rather avoid going down this far in the rabbit hole. Once we have fe_device
field and the basic flags, would be relatively easy to propose new flags, but
now, new flags should be discussed on a patch proposal I believe. Discussing
which flags should/shouldn't be added here, will be pointless.

Let me work on the patchset to update FIEMAP, and then we can discuss such thing
there. I'll Cc you on the patches if you want to.

Cheers

> >> This actually looks to contradict what you have been complaining, about some
> >> filesystems which doesn't support FIBMAP currently, will now suddenly start to
> >> support. Assuming it's ok if the implementation doesn't tell us about the
> >> backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the
> >> backing device, assuming it's ok will just fall into your first complain.
> > 
> > Sorry, this thread has been going on so long that I forgot your goal for
> > this series. :/
> > 
> > Specifically, I had forgotten that you're removing the ->bmap pointer,
> > which means that filesystems don't have any particular way to signal
> > "Yes on FIEMAP, no on FIBMAP".  Somehow I had thought that you were
> > merely creating a generic_file_bmap() that would call FIEMAP and ripping
> > out all the adhoc bmap implementations.
> 
> Just a reminder here, you should set FIEMAP_FLAG_SYNC when mapping FIBMAP
> to FIEMAP so that the data on that file is flushed to disk before returning,
> since the block mapping may not be assigned yet or may be unstable, which
> could lead to an unbootable system if used for LILO.


> 
> Cheers, Andreas
> 
> >>> Existing filesystems can be left unchanged, in which case neither
> >>> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are
> >>> meaningless, the same as they are today.  Reporting fe_device is entirely
> >>> optional.
> >>> 
> >>> Userspace programs will now be able to tell which device the file data
> >>> lives on, which has been sort-of requested for years, if the filesystem
> >>> chooses to start exporting that information.
> >>> 
> >>> Your FIBMAP-via-FIEMAP backend can do something like:
> >>> 
> >>> /* FIBMAP only returns results for the same block device backing the fs. */
> >>> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device)
> >>> 	return 0;
> >>> 
> >>> /* Can't tell what is the backing device, bail out. */
> >>> if (fe->fe_flags & EXTENT_DEV_COOKIE)
> >>> 	return 0;
> >>> 
> >> 
> >> Ok, the first conditional, is ok, the second one is not making sense to me.
> >> Looks like you are basically using it to flag the filesystem can't tell
> >> exactly which device the current extent is, let's say for example, distributed
> >> filesystems, where the physical extent can actually be on a different machine.
> >> But I can't say for sure, can you give me more details about what you are trying
> >> to achieve here?
> > 
> > You've understood me correctly. :)
> > 
> >>> /*
> >>> * Either fe_device matches the backing device or the implementation
> >>> * doesn't tell us about the backing device, so assume it's ok.
> >>> */
> >>> <return FIBMAP results>
> >>> 
> >> 
> >> Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE
> >> you mentioned.
> > 
> > I think you've understood it about as well as I can explain it.  Maybe
> > Andreas will have more to say about the lustre replica id, but OTOH it's
> > gone and so there's no user of it, so we could just drop it until lustre
> > comes back.
> > 
> 
> >>> So that's how I'd solve a longstanding design problem of FIEMAP and then
> >>> take advantage of that solution to remedy my objections to the proposed
> >>> "Use FIEMAP for FIBMAP" series.  It doesn't require a FIEMAP_FLAG
> >>> behavior flag that userspace knows about but isn't allowed to pass in.
> >>> 
> >> 
> >>>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive
> >>>> than the device id in fiemap_extent. I don't see much advantage in
> >>>> adding the device id instead of using the flag.
> >>>> 
> >>>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via
> >>>> userspace, so, it would require a check to make sure it didn't come from
> >>>> userspace if ioctl_fiemap() was used.
> >>>> 
> >>>> I think there are 2 other possibilities which can be used to fix this.
> >>>> 
> >>>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or,
> >>>> - If the device id is a must for you, maybe add the device id into
> >>>>  fiemap_extent_info instead of fiemap_extent.
> >>> 
> >>> That won't work with btrfs, which can store file extents on multiple
> >>> different physical devices.
> >>> 
> >>>>  So we don't mess with a UAPI exported data structure and still
> >>>>  provides a way to the filesystems to provide which device the mapped
> >>>>  extent is in.
> >>>> 
> >>>> What you think?
> >>>> 
> >>>> Cheers
> >>>> 
> >>>> 
> >>>>> 
> >>>>> --D
> >>>>> 
> >>>>>>> 
> >>>>>>>> +
> >>>>>>>> +	return error;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> /**
> >>>>>>>>  *	bmap	- find a block number in a file
> >>>>>>>>  *	@inode:  inode owning the block number being requested
> >>>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput);
> >>>>>>>>  */
> >>>>>>>> int bmap(struct inode *inode, sector_t *block)
> >>>>>>>> {
> >>>>>>>> -	if (!inode->i_mapping->a_ops->bmap)
> >>>>>>>> +	if (inode->i_op->fiemap)
> >>>>>>>> +		return bmap_fiemap(inode, block);
> >>>>>>>> +	else if (inode->i_mapping->a_ops->bmap)
> >>>>>>>> +		*block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
> >>>>>>>> +						       *block);
> >>>>>>>> +	else
> >>>>>>>> 		return -EINVAL;
> >>>>>>> 
> >>>>>>> Waitaminute.  btrfs currently supports fiemap but not bmap, and now
> >>>>>>> suddenly it will support this legacy interface they've never supported
> >>>>>>> before.  Are they on board with this?
> >>>>>>> 
> >>>>>>> --D
> >>>>>>> 
> >>>>>>>> 
> >>>>>>>> -	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> >>>>>>>> 	return 0;
> >>>>>>>> }
> >>>>>>>> EXPORT_SYMBOL(bmap);
> >>>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c
> >>>>>>>> index 6086978fe01e..bfa59df332bf 100644
> >>>>>>>> --- a/fs/ioctl.c
> >>>>>>>> +++ b/fs/ioctl.c
> >>>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >>>>>>>> 	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>>>> }
> >>>>>>>> 
> >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical,
> >>>>>>>> +			    u64 phys, u64 len, u32 flags)
> >>>>>>>> +{
> >>>>>>>> +	struct fiemap_extent *extent = fieinfo->fi_extents_start;
> >>>>>>>> +
> >>>>>>>> +	/* only count the extents */
> >>>>>>>> +	if (fieinfo->fi_extents_max == 0) {
> >>>>>>>> +		fieinfo->fi_extents_mapped++;
> >>>>>>>> +		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>>>> +	}
> >>>>>>>> +
> >>>>>>>> +	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
> >>>>>>>> +		return 1;
> >>>>>>>> +
> >>>>>>>> +	if (flags & SET_UNKNOWN_FLAGS)
> >>>>>>>> +		flags |= FIEMAP_EXTENT_UNKNOWN;
> >>>>>>>> +	if (flags & SET_NO_UNMOUNTED_IO_FLAGS)
> >>>>>>>> +		flags |= FIEMAP_EXTENT_ENCODED;
> >>>>>>>> +	if (flags & SET_NOT_ALIGNED_FLAGS)
> >>>>>>>> +		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
> >>>>>>>> +
> >>>>>>>> +	extent->fe_logical = logical;
> >>>>>>>> +	extent->fe_physical = phys;
> >>>>>>>> +	extent->fe_length = len;
> >>>>>>>> +	extent->fe_flags = flags;
> >>>>>>>> +
> >>>>>>>> +	fieinfo->fi_extents_mapped++;
> >>>>>>>> +
> >>>>>>>> +	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
> >>>>>>>> +		return 1;
> >>>>>>>> +	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
> >>>>>>>> +}
> >>>>>>>> /**
> >>>>>>>>  * fiemap_fill_next_extent - Fiemap helper function
> >>>>>>>>  * @fieinfo:	Fiemap context passed into ->fiemap
> >>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>>>>>>> index 7a434979201c..28bb523d532a 100644
> >>>>>>>> --- a/include/linux/fs.h
> >>>>>>>> +++ b/include/linux/fs.h
> >>>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info {
> >>>>>>>> 	fiemap_fill_cb	fi_cb;
> >>>>>>>> };
> >>>>>>>> 
> >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical,
> >>>>>>>> +			      u64 phys, u64 len, u32 flags);
> >>>>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> >>>>>>>> 			    u64 phys, u64 len, u32 flags);
> >>>>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> >>>>>>>> --
> >>>>>>>> 2.17.2
> >>>>>>>> 
> >>>>>> 
> >>>>>> --
> >>>>>> Carlos
> >>>> 
> >>>> --
> >>>> Carlos
> >> 
> >> --
> >> Carlos
> 
> 
> Cheers, Andreas
> 
> 
> 
> 
>

Carlos Maiolino Feb. 8, 2019, 10:36 a.m. UTC | #25

On Fri, Feb 08, 2019 at 09:46:12AM +0100, Christoph Hellwig wrote:
> On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote:
> > Do we really need to be this way, about reserving a single flag for Lustre,
> > which will likely also be useful for other filesystems?  It's not like
> > Lustre is some closed-source binary module for which we need to make life
> > difficult, it is used by many thousands of the largest computers at labs
> > and universities and companies around the world.  We are working to clean
> > up the code outside the staging tree and resubmit it.  Not reserving a flag
> > just means we will continue to use random values in Lustre before it can
> > be merged, which will make life harder when we try to merge again.
> 
> No, it is available in source, but otherwise just as bad.  And we generally
> only define APIs for in-kernel usage.
> 
> If we can come up with a good API for in-kernel filesystems we can do
> that, otherwise hell no.  And staging for that matter qualifies as out
> of tree.
> 
> That being said I'm really worried about these FIEMAP extensions as
> userspace has no business poking into details of the placement (vs
> just the layout).
> 
I tend to say that identifying on which device an extent is is better than
simply saying 'it maps to physical blocks X-Z, but it's your problem to identify
which device X-Z belongs to'.

> But all that belongs into a separate dicussion instead of dragging down
> this series where it does not belong at all.

Agreed, but now I'm on a kind of dead-end :P

Darrick's concerns are valid, regarding letting currently unsupported
filesystems to suddenly allow FIBMAP calls, but on the other hand, his proposed
solution, which is also valid, requires a new discussion/patchset to discuss an
improvement of the FIEMAP infra-structure, and 'fix' the problem mentioned.
Using a flag to identify FIBMAP calls has been rejected. So, I'd accept
suggestions on how to move this patch forward, without requiring the
improvements suggested by Darrick, and, without using a flag to tag FIBMAP
calls, as suggested by me, I'm kind of running out of ideas by now :(

Cheers

Andreas Dilger Feb. 8, 2019, 9:03 p.m. UTC | #26

On Feb 8, 2019, at 3:36 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote:
> 
> On Fri, Feb 08, 2019 at 09:46:12AM +0100, Christoph Hellwig wrote:
>> On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote:
>>> Do we really need to be this way, about reserving a single flag for Lustre,
>>> which will likely also be useful for other filesystems?  It's not like
>>> Lustre is some closed-source binary module for which we need to make life
>>> difficult, it is used by many thousands of the largest computers at labs
>>> and universities and companies around the world.  We are working to clean
>>> up the code outside the staging tree and resubmit it.  Not reserving a flag
>>> just means we will continue to use random values in Lustre before it can
>>> be merged, which will make life harder when we try to merge again.
>> 
>> No, it is available in source, but otherwise just as bad.  And we generally
>> only define APIs for in-kernel usage.
>> 
>> If we can come up with a good API for in-kernel filesystems we can do
>> that, otherwise hell no.  And staging for that matter qualifies as out
>> of tree.
>> 
>> That being said I'm really worried about these FIEMAP extensions as
>> userspace has no business poking into details of the placement (vs
>> just the layout).
> 
> I tend to say that identifying on which device an extent is is better than
> simply saying 'it maps to physical blocks X-Z, but it's your problem to identify
> which device X-Z belongs to'.
> 
>> But all that belongs into a separate dicussion instead of dragging down
>> this series where it does not belong at all.
> 
> Agreed, but now I'm on a kind of dead-end :P
> 
> Darrick's concerns are valid, regarding letting currently unsupported
> filesystems to suddenly allow FIBMAP calls,

I don't think there is a huge danger of people suddenly moving to use LILO
on f2fs or Btrfs with new kernels.  In most cases, it _would_ just work,
but the FIBMAP->FIEMAP layer needs to check for FIEMAP_FLAG_NOT_ALIGNED,
FIEMAP_FLAG_ENCODED, and FIEMAP_FLAG_DEVICE flags that would make this
unsuitable for booting.

> but on the other hand, his proposed
> solution, which is also valid, requires a new discussion/patchset to discuss an
> improvement of the FIEMAP infra-structure, and 'fix' the problem mentioned.
> Using a flag to identify FIBMAP calls has been rejected. So, I'd accept
> suggestions on how to move this patch forward, without requiring the
> improvements suggested by Darrick, and, without using a flag to tag FIBMAP
> calls, as suggested by me, I'm kind of running out of ideas by now :(

I think Darrick was against a flag like "FIEMAP_FLAG_FIBMAP" because it could
be specified from userspace, and it is a bit ugly and has no other value than
preventing FIBMAP from working on filesystems that don't support it today.

As Christoph mentioned, such a flag could be OK as long as it is masked from
userspace in the top-level ioctl_fiemap() handler (though to be honest, there
is no benefit for a userspace app to set this flag, it would just increase the
chance the ioctl(FIEMAP) call will fail).

     #define FIEMAP_FLAG_FIBMAP 0x80000000

Filesystems that don't want FIBMAP to work at all should return -ENOTTY from
their ->fiemap() handler.

That said, there is *still* a need for fe_device checking in ioctl_fibmap(),
because for filesystems that allow both FIEMAP and FIBMAP (i.e. the most common
ones like ext4 and XFS) there may still be reasons for FIBMAP to fail for some
files if they are unsuitable (e.g. data stored on multiple devices). That isn't
something that the filesystems should be checking themselves.

Cheers, Andreas

Christoph Hellwig Feb. 11, 2019, 12:57 p.m. UTC | #27

On Fri, Feb 08, 2019 at 09:43:52AM +0100, Christoph Hellwig wrote:
> Agreed.  Please don't change the FIEMAP uapi.  If we need to check
> for a request coming from bmap just defined an internal FIEMAP flag
> as the last available flag in the flags word, and reject it when
> it comes from userspace in fiemap.

Based on the new thread you started it seems like this go lost.  I think
we are 99% done with the bmap through fiemap series, and it is just
missing this internal flag to make progress.  Let's finish this off
before starting another big project in the area.

Carlos Maiolino Feb. 11, 2019, 4:21 p.m. UTC | #28

On Mon, Feb 11, 2019 at 01:57:08PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 08, 2019 at 09:43:52AM +0100, Christoph Hellwig wrote:
> > Agreed.  Please don't change the FIEMAP uapi.  If we need to check
> > for a request coming from bmap just defined an internal FIEMAP flag
> > as the last available flag in the flags word, and reject it when
> > it comes from userspace in fiemap.
> 
> Based on the new thread you started it seems like this go lost.  I think
> we are 99% done with the bmap through fiemap series, and it is just
> missing this internal flag to make progress.  Let's finish this off
> before starting another big project in the area.

I'm more than happy in see your reply, I'd love to finish this, but I got stuck
in how to make some filesystems deny a FIBMAP call once they do support FIEMAP
but not FIBMAP.

I have this patch almost ready to go anyway. Do you agree in keep this flag in
fi_flags field? Or maybe some other place, dunno, maybe a new fi_private field.

>

Christoph Hellwig Feb. 11, 2019, 4:48 p.m. UTC | #29

On Mon, Feb 11, 2019 at 05:21:40PM +0100, Carlos Maiolino wrote:
> I'm more than happy in see your reply, I'd love to finish this, but I got stuck
> in how to make some filesystems deny a FIBMAP call once they do support FIEMAP
> but not FIBMAP.
> 
> I have this patch almost ready to go anyway. Do you agree in keep this flag in
> fi_flags field? Or maybe some other place, dunno, maybe a new fi_private field.

Have it in fi_flags as a purely in-kernel flag, and then deny the
bmap calls through fiemap based on it.

[09/10,V2] Use FIEMAP for FIBMAP calls

Commit Message

Comments

Patch