Message ID | 20181205091728.29903-10-cmaiolino@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | New ->fiemap infrastructure and ->bmap removal | expand |
On Wed, Dec 05, 2018 at 10:17:27AM +0100, Carlos Maiolino wrote: > Enables the usage of FIEMAP ioctl infrastructure to handle FIBMAP calls. > From now on, ->bmap() methods can start to be removed from filesystems > which already provides ->fiemap(). > > This adds a new helper - bmap_fiemap() - which is used to fill in the > fiemap request, call into the underlying filesystem and check the flags > set in the extent requested. > > Add a new fiemap fill extent callback to handl the in-kernel only > fiemap_extent structure used for FIBMAP. > > V2: > - Now based on the updated fiemap_extent_info, > - move the fiemap call itself to a new helper > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> > --- > fs/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++-- > fs/ioctl.c | 32 ++++++++++++++++++++++++++++++++ > include/linux/fs.h | 2 ++ > 3 files changed, 74 insertions(+), 2 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index db681d310465..f07cc183ddbd 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1578,6 +1578,40 @@ void iput(struct inode *inode) > } > EXPORT_SYMBOL(iput); > > +static int bmap_fiemap(struct inode *inode, sector_t *block) > +{ > + struct fiemap_extent_info fieinfo = { 0, }; > + struct fiemap_extent fextent; > + u64 start = *block << inode->i_blkbits; > + int error = -EINVAL; > + > + fextent.fe_logical = 0; > + fextent.fe_physical = 0; > + fieinfo.fi_extents_max = 1; > + fieinfo.fi_extents_mapped = 0; > + fieinfo.fi_extents_start = &fextent; > + fieinfo.fi_start = start; > + fieinfo.fi_len = 1 << inode->i_blkbits; > + fieinfo.fi_flags = 0; > + fieinfo.fi_cb = fiemap_fill_kernel_extent; > + > + error = inode->i_op->fiemap(inode, &fieinfo); > + > + if (error) > + return error; > + > + if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN | > + FIEMAP_EXTENT_ENCODED | > + FIEMAP_EXTENT_DATA_INLINE | > + FIEMAP_EXTENT_UNWRITTEN)) > + return -EINVAL; AFAICT, three of the filesystems that support COW writes (xfs, ocfs2, and btrfs) do not return bmap results for files with shared blocks. This check here should include FIEMAP_EXTENT_SHARED since external overwrites of a COW file block are bad news on btrfs (and ocfs2 and xfs). > + > + *block = (fextent.fe_physical + > + (start - fextent.fe_logical)) >> inode->i_blkbits; Hmmm, so there's nothing here checking that the physical device fiemap reports is the same device that was passed into the mount. This is trivially true for most of the filesystems that implement bmap and fiemap, but definitely not true for xfs or btrfs. I would bet most userspace callers of bmap (since it's an ext2-era ioctl) make that assumption and don't even know how to find the device. On xfs, the bmap implementation won't return any results for realtime files, but it looks as though we suddenly will start doing that here, because in the new bmap implementation we will use fiemap, and fiemap reports extents without providing any context about which device they're on, and that context-less extent gets passed back to bmap_fiemap. In any case, I think a better solution to the multi-device problem is to start returning device information via struct fiemap_extent, at least inside the kernel. Use one of the reserved fields to declare a new '__u32 fe_device' field in struct fiemap_extent which can be the dev_t device number, and then you can check that against inode->i_sb->s_bdev to avoid returning results for the non-primary device of a multi-device filesystem. > + > + return error; > +} > + > /** > * bmap - find a block number in a file > * @inode: inode owning the block number being requested > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > */ > int bmap(struct inode *inode, sector_t *block) > { > - if (!inode->i_mapping->a_ops->bmap) > + if (inode->i_op->fiemap) > + return bmap_fiemap(inode, block); > + else if (inode->i_mapping->a_ops->bmap) > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > + *block); > + else > return -EINVAL; Waitaminute. btrfs currently supports fiemap but not bmap, and now suddenly it will support this legacy interface they've never supported before. Are they on board with this? --D > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > return 0; > } > EXPORT_SYMBOL(bmap); > diff --git a/fs/ioctl.c b/fs/ioctl.c > index 6086978fe01e..bfa59df332bf 100644 > --- a/fs/ioctl.c > +++ b/fs/ioctl.c > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > } > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > + u64 phys, u64 len, u32 flags) > +{ > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > + > + /* only count the extents */ > + if (fieinfo->fi_extents_max == 0) { > + fieinfo->fi_extents_mapped++; > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > + } > + > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > + return 1; > + > + if (flags & SET_UNKNOWN_FLAGS) > + flags |= FIEMAP_EXTENT_UNKNOWN; > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > + flags |= FIEMAP_EXTENT_ENCODED; > + if (flags & SET_NOT_ALIGNED_FLAGS) > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > + > + extent->fe_logical = logical; > + extent->fe_physical = phys; > + extent->fe_length = len; > + extent->fe_flags = flags; > + > + fieinfo->fi_extents_mapped++; > + > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > + return 1; > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > +} > /** > * fiemap_fill_next_extent - Fiemap helper function > * @fieinfo: Fiemap context passed into ->fiemap > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 7a434979201c..28bb523d532a 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > fiemap_fill_cb fi_cb; > }; > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > + u64 phys, u64 len, u32 flags); > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > u64 phys, u64 len, u32 flags); > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > -- > 2.17.2 >
On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote: > On Wed, Dec 05, 2018 at 10:17:27AM +0100, Carlos Maiolino wrote: > > Enables the usage of FIEMAP ioctl infrastructure to handle FIBMAP calls. > > From now on, ->bmap() methods can start to be removed from filesystems > > which already provides ->fiemap(). > > > > This adds a new helper - bmap_fiemap() - which is used to fill in the > > fiemap request, call into the underlying filesystem and check the flags > > set in the extent requested. > > > > Add a new fiemap fill extent callback to handl the in-kernel only > > fiemap_extent structure used for FIBMAP. > > > > V2: > > - Now based on the updated fiemap_extent_info, > > - move the fiemap call itself to a new helper > > > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> > > --- > > fs/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++-- > > fs/ioctl.c | 32 ++++++++++++++++++++++++++++++++ > > include/linux/fs.h | 2 ++ > > 3 files changed, 74 insertions(+), 2 deletions(-) > > > > diff --git a/fs/inode.c b/fs/inode.c > > index db681d310465..f07cc183ddbd 100644 > > --- a/fs/inode.c > > +++ b/fs/inode.c > > @@ -1578,6 +1578,40 @@ void iput(struct inode *inode) > > } > > EXPORT_SYMBOL(iput); > > > > +static int bmap_fiemap(struct inode *inode, sector_t *block) > > +{ > > + struct fiemap_extent_info fieinfo = { 0, }; > > + struct fiemap_extent fextent; > > + u64 start = *block << inode->i_blkbits; > > + int error = -EINVAL; > > + > > + fextent.fe_logical = 0; > > + fextent.fe_physical = 0; > > + fieinfo.fi_extents_max = 1; > > + fieinfo.fi_extents_mapped = 0; > > + fieinfo.fi_extents_start = &fextent; > > + fieinfo.fi_start = start; > > + fieinfo.fi_len = 1 << inode->i_blkbits; > > + fieinfo.fi_flags = 0; > > + fieinfo.fi_cb = fiemap_fill_kernel_extent; > > + > > + error = inode->i_op->fiemap(inode, &fieinfo); > > + > > + if (error) > > + return error; > > + > > + if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN | > > + FIEMAP_EXTENT_ENCODED | > > + FIEMAP_EXTENT_DATA_INLINE | > > + FIEMAP_EXTENT_UNWRITTEN)) > > + return -EINVAL; > > AFAICT, three of the filesystems that support COW writes (xfs, ocfs2, > and btrfs) do not return bmap results for files with shared blocks. > This check here should include FIEMAP_EXTENT_SHARED since external > overwrites of a COW file block are bad news on btrfs (and ocfs2 and > xfs). Yes, it does need to check for FIEMAP_EXTENT_SHARED too, I had it on my plans but I forgot to add it when setting up the flags. Thanks for reminding me. > > > + > > + *block = (fextent.fe_physical + > > + (start - fextent.fe_logical)) >> inode->i_blkbits; > > Hmmm, so there's nothing here checking that the physical device fiemap > reports is the same device that was passed into the mount. This is > trivially true for most of the filesystems that implement bmap and > fiemap, but definitely not true for xfs or btrfs. I would bet most > userspace callers of bmap (since it's an ext2-era ioctl) make that > assumption and don't even know how to find the device. > > On xfs, the bmap implementation won't return any results for realtime > files, but it looks as though we suddenly will start doing that here, > because in the new bmap implementation we will use fiemap, and fiemap > reports extents without providing any context about which device they're > on, and that context-less extent gets passed back to bmap_fiemap. > > In any case, I think a better solution to the multi-device problem is to > start returning device information via struct fiemap_extent, at least > inside the kernel. Use one of the reserved fields to declare a new > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > device number, and then you can check that against inode->i_sb->s_bdev > to avoid returning results for the non-primary device of a multi-device > filesystem. Yes, you are right, I haven't thought about multi-dev filesystems. I checked btrfs code and it doesn't even support fibmap, exactly because of this problem, I wonder though, why it does support FIEMAP then, maybe because the fiemap idea isn't provide a way to userspace do IO directly to the device?! I'm not sure if crossing dev information is enough though, I did a quick read of btrfs code, and the assumption that the block/extent location won't change on time, could lead to a time bomb in the future. I wonder if it wouldn't maybe be better to add a flag, to identify the usage type, and the filesystem itself would define if it should return anything, or not. Like, for example, passing in fieinfo->fi_flags, something like FIEMAP_FIBMAP, and check inside the filesystem for it. From my shallow understanding of btrfs, looks like the location of the data, can be moved inside the same device, so, even if the devices are the same as you suggested, there is no guarantee the offset will be the same. Cheers. > > > + > > + return error; > > +} > > + > > /** > > * bmap - find a block number in a file > > * @inode: inode owning the block number being requested > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > */ > > int bmap(struct inode *inode, sector_t *block) > > { > > - if (!inode->i_mapping->a_ops->bmap) > > + if (inode->i_op->fiemap) > > + return bmap_fiemap(inode, block); > > + else if (inode->i_mapping->a_ops->bmap) > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > + *block); > > + else > > return -EINVAL; > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > suddenly it will support this legacy interface they've never supported > before. Are they on board with this? > > --D > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > return 0; > > } > > EXPORT_SYMBOL(bmap); > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > index 6086978fe01e..bfa59df332bf 100644 > > --- a/fs/ioctl.c > > +++ b/fs/ioctl.c > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > } > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > + u64 phys, u64 len, u32 flags) > > +{ > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > + > > + /* only count the extents */ > > + if (fieinfo->fi_extents_max == 0) { > > + fieinfo->fi_extents_mapped++; > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > + } > > + > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > + return 1; > > + > > + if (flags & SET_UNKNOWN_FLAGS) > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > + flags |= FIEMAP_EXTENT_ENCODED; > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > + > > + extent->fe_logical = logical; > > + extent->fe_physical = phys; > > + extent->fe_length = len; > > + extent->fe_flags = flags; > > + > > + fieinfo->fi_extents_mapped++; > > + > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > + return 1; > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > +} > > /** > > * fiemap_fill_next_extent - Fiemap helper function > > * @fieinfo: Fiemap context passed into ->fiemap > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 7a434979201c..28bb523d532a 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > fiemap_fill_cb fi_cb; > > }; > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > + u64 phys, u64 len, u32 flags); > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > u64 phys, u64 len, u32 flags); > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > -- > > 2.17.2 > >
On Dec 7, 2018, at 2:09 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote: > On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote: >> >>> + >>> + *block = (fextent.fe_physical + >>> + (start - fextent.fe_logical)) >> inode->i_blkbits; >> >> Hmmm, so there's nothing here checking that the physical device fiemap >> reports is the same device that was passed into the mount. This is >> trivially true for most of the filesystems that implement bmap and >> fiemap, but definitely not true for xfs or btrfs. I would bet most >> userspace callers of bmap (since it's an ext2-era ioctl) make that >> assumption and don't even know how to find the device. >> >> On xfs, the bmap implementation won't return any results for realtime >> files, but it looks as though we suddenly will start doing that here, >> because in the new bmap implementation we will use fiemap, and fiemap >> reports extents without providing any context about which device they're >> on, and that context-less extent gets passed back to bmap_fiemap. >> >> In any case, I think a better solution to the multi-device problem is to >> start returning device information via struct fiemap_extent, at least >> inside the kernel. Use one of the reserved fields to declare a new >> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t >> device number, and then you can check that against inode->i_sb->s_bdev >> to avoid returning results for the non-primary device of a multi-device >> filesystem. We're using fe_device = fe_reserved[0] to return the device number for Lustre. For Lustre, the "device number" is just a server index, since the server's block device number is irrelevant on the client. For local filesystems, it should return the 32-bit st_rdev device number to distinguish devices. I have patches for e2fsprogs filefrag to print the fe_device field. > Yes, you are right, I haven't thought about multi-dev filesystems. I checked > btrfs code and it doesn't even support fibmap, exactly because of this problem, > I wonder though, why it does support FIEMAP then, maybe because the fiemap idea > isn't provide a way to userspace do IO directly to the device?! > > I'm not sure if crossing dev information is enough though, I did a quick read of > btrfs code, and the assumption that the block/extent location won't change on > time, could lead to a time bomb in the future. I wonder if it wouldn't maybe be > better to add a flag, to identify the usage type, and the filesystem itself > would define if it should return anything, or not. Like, for example, passing in > fieinfo->fi_flags, something like FIEMAP_FIBMAP, and check inside the filesystem > for it. The FIEMAP_EXTENT_ENCODED flag is meant to be returned when the extent cannot be read directly from the block device. For FIBMAP, it should return an error if ENCODED is set, since this file is not suitable for directly booting a kernel (LILO is the only user of FIBMAP that I'm aware of). The filefrag utility prefers to use FIEMAP for efficiency, and only falls back to FIBMAP if FIEMAP fails. > From my shallow understanding of btrfs, looks like the location of the data, can > be moved inside the same device, so, even if the devices are the same as you > suggested, there is no guarantee the offset will be the same. On a related note, btrfs also supports compressed extents, which isn't handled by the current FIEMAP ioctl properly. There was a patch proposed ages ago to add FIEMAP_EXTENT_DATA_COMPRESSED, but didn't _quite_ make it over the finish line, https://www.spinics.net/lists/xfs/msg24629.html has the last discussion. It added EXTENT_DATA_COMPRESSED and used fe_reserved64[0] as fe_phys_length to return the on-disk extent size, while fe_length would rename to fe_logi_length (with a compat macro) to still represented the logical extent length. #define FIEMAP_EXTENT_DATA_COMPRESSED 0x00000040 /* Data compressed by fs. * Sets EXTENT_DATA_ENCODED */ - __u64 fe_reserved64[2]; + __u64 fe_phys_length; /* physical length in bytes, undefined if + * DATA_COMPRESSED not set */ There was some discussion on whether there should be a second flag like FIEMAP_EXTENT_PHYS_LENGTH that is set when the fe_phys_length field is valid, independent of whether the data is compressed or not. Since you are reworking the FIEMAP code anyway, would you be interested to revive that patch series? Cheers, Andreas
> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > + u64 phys, u64 len, u32 flags) Any reason this function isn't in inode.c next to the caller and marked static? Otherwise looks fine except for the additional sanity checking pointed out by Darrick. > + /* only count the extents */ > + if (fieinfo->fi_extents_max == 0) { > + fieinfo->fi_extents_mapped++; > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; Maybe do a 'goto out' here? > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; And reuse this return. Bonus points for using a good old if here: if (flags & FIEMAP_EXTENT_LAST) return 1; return 0;
Hi, Sorry for the long delay Darrick. > > + fextent.fe_logical = 0; > > + fextent.fe_physical = 0; > > + fieinfo.fi_extents_max = 1; > > + fieinfo.fi_extents_mapped = 0; > > + fieinfo.fi_extents_start = &fextent; > > + fieinfo.fi_start = start; > > + fieinfo.fi_len = 1 << inode->i_blkbits; > > + fieinfo.fi_flags = 0; > > + fieinfo.fi_cb = fiemap_fill_kernel_extent; > > + > > + error = inode->i_op->fiemap(inode, &fieinfo); > > + > > + if (error) > > + return error; > > + > > + if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN | > > + FIEMAP_EXTENT_ENCODED | > > + FIEMAP_EXTENT_DATA_INLINE | > > + FIEMAP_EXTENT_UNWRITTEN)) > > + return -EINVAL; > > AFAICT, three of the filesystems that support COW writes (xfs, ocfs2, > and btrfs) do not return bmap results for files with shared blocks. > This check here should include FIEMAP_EXTENT_SHARED since external > overwrites of a COW file block are bad news on btrfs (and ocfs2 and > xfs). ok, np > > > + > > + *block = (fextent.fe_physical + > > + (start - fextent.fe_logical)) >> inode->i_blkbits; > > Hmmm, so there's nothing here checking that the physical device fiemap > reports is the same device that was passed into the mount. This is > trivially true for most of the filesystems that implement bmap and > fiemap, but definitely not true for xfs or btrfs. I would bet most > userspace callers of bmap (since it's an ext2-era ioctl) make that > assumption and don't even know how to find the device. Makes sense. > > On xfs, the bmap implementation won't return any results for realtime > files, but it looks as though we suddenly will start doing that here, > because in the new bmap implementation we will use fiemap, and fiemap > reports extents without providing any context about which device they're > on, and that context-less extent gets passed back to bmap_fiemap. > > In any case, I think a better solution to the multi-device problem is to > start returning device information via struct fiemap_extent, at least > inside the kernel. Use one of the reserved fields to declare a new > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > device number, and then you can check that against inode->i_sb->s_bdev > to avoid returning results for the non-primary device of a multi-device > filesystem. I agree we should address it here, but I don't think fiemap_extent is the right place for it, it is linked to the UAPI, and changing it is usually not a good idea. I think I got your idea anyway, but, what if, instead returning the bdev in fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do with such information? > > > + > > + return error; > > +} > > + > > /** > > * bmap - find a block number in a file > > * @inode: inode owning the block number being requested > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > */ > > int bmap(struct inode *inode, sector_t *block) > > { > > - if (!inode->i_mapping->a_ops->bmap) > > + if (inode->i_op->fiemap) > > + return bmap_fiemap(inode, block); > > + else if (inode->i_mapping->a_ops->bmap) > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > + *block); > > + else > > return -EINVAL; > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > suddenly it will support this legacy interface they've never supported > before. Are they on board with this? > > --D > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > return 0; > > } > > EXPORT_SYMBOL(bmap); > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > index 6086978fe01e..bfa59df332bf 100644 > > --- a/fs/ioctl.c > > +++ b/fs/ioctl.c > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > } > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > + u64 phys, u64 len, u32 flags) > > +{ > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > + > > + /* only count the extents */ > > + if (fieinfo->fi_extents_max == 0) { > > + fieinfo->fi_extents_mapped++; > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > + } > > + > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > + return 1; > > + > > + if (flags & SET_UNKNOWN_FLAGS) > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > + flags |= FIEMAP_EXTENT_ENCODED; > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > + > > + extent->fe_logical = logical; > > + extent->fe_physical = phys; > > + extent->fe_length = len; > > + extent->fe_flags = flags; > > + > > + fieinfo->fi_extents_mapped++; > > + > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > + return 1; > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > +} > > /** > > * fiemap_fill_next_extent - Fiemap helper function > > * @fieinfo: Fiemap context passed into ->fiemap > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 7a434979201c..28bb523d532a 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > fiemap_fill_cb fi_cb; > > }; > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > + u64 phys, u64 len, u32 flags); > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > u64 phys, u64 len, u32 flags); > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > -- > > 2.17.2 > >
On Mon, Feb 04, 2019 at 04:11:47PM +0100, Carlos Maiolino wrote: > Hi, Sorry for the long delay Darrick. > > > > + fextent.fe_logical = 0; > > > + fextent.fe_physical = 0; > > > + fieinfo.fi_extents_max = 1; > > > + fieinfo.fi_extents_mapped = 0; > > > + fieinfo.fi_extents_start = &fextent; > > > + fieinfo.fi_start = start; > > > + fieinfo.fi_len = 1 << inode->i_blkbits; > > > + fieinfo.fi_flags = 0; > > > + fieinfo.fi_cb = fiemap_fill_kernel_extent; > > > + > > > + error = inode->i_op->fiemap(inode, &fieinfo); > > > + > > > + if (error) > > > + return error; > > > + > > > + if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN | > > > + FIEMAP_EXTENT_ENCODED | > > > + FIEMAP_EXTENT_DATA_INLINE | > > > + FIEMAP_EXTENT_UNWRITTEN)) > > > + return -EINVAL; > > > > AFAICT, three of the filesystems that support COW writes (xfs, ocfs2, > > and btrfs) do not return bmap results for files with shared blocks. > > This check here should include FIEMAP_EXTENT_SHARED since external > > overwrites of a COW file block are bad news on btrfs (and ocfs2 and > > xfs). > > ok, np > > > > > > + > > > + *block = (fextent.fe_physical + > > > + (start - fextent.fe_logical)) >> inode->i_blkbits; > > > > Hmmm, so there's nothing here checking that the physical device fiemap > > reports is the same device that was passed into the mount. This is > > trivially true for most of the filesystems that implement bmap and > > fiemap, but definitely not true for xfs or btrfs. I would bet most > > userspace callers of bmap (since it's an ext2-era ioctl) make that > > assumption and don't even know how to find the device. > > Makes sense. > > > > > On xfs, the bmap implementation won't return any results for realtime > > files, but it looks as though we suddenly will start doing that here, > > because in the new bmap implementation we will use fiemap, and fiemap > > reports extents without providing any context about which device they're > > on, and that context-less extent gets passed back to bmap_fiemap. > > > > In any case, I think a better solution to the multi-device problem is to > > start returning device information via struct fiemap_extent, at least > > inside the kernel. Use one of the reserved fields to declare a new > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > device number, and then you can check that against inode->i_sb->s_bdev > > to avoid returning results for the non-primary device of a multi-device > > filesystem. > > I agree we should address it here, but I don't think fiemap_extent is the right > place for it, it is linked to the UAPI, and changing it is usually not a good > idea. Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields into some sort of dev_t/per-device cookie should be fine. Userspace shouldn't be expecting any meaning in reserved areas. > I think I got your idea anyway, but, what if, instead returning the bdev in > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > with such information? I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. --D > > > > > + > > > + return error; > > > +} > > > + > > > /** > > > * bmap - find a block number in a file > > > * @inode: inode owning the block number being requested > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > */ > > > int bmap(struct inode *inode, sector_t *block) > > > { > > > - if (!inode->i_mapping->a_ops->bmap) > > > + if (inode->i_op->fiemap) > > > + return bmap_fiemap(inode, block); > > > + else if (inode->i_mapping->a_ops->bmap) > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > + *block); > > > + else > > > return -EINVAL; > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > suddenly it will support this legacy interface they've never supported > > before. Are they on board with this? > > > > --D > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > return 0; > > > } > > > EXPORT_SYMBOL(bmap); > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > index 6086978fe01e..bfa59df332bf 100644 > > > --- a/fs/ioctl.c > > > +++ b/fs/ioctl.c > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > } > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > + u64 phys, u64 len, u32 flags) > > > +{ > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > + > > > + /* only count the extents */ > > > + if (fieinfo->fi_extents_max == 0) { > > > + fieinfo->fi_extents_mapped++; > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > + } > > > + > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > + return 1; > > > + > > > + if (flags & SET_UNKNOWN_FLAGS) > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > + > > > + extent->fe_logical = logical; > > > + extent->fe_physical = phys; > > > + extent->fe_length = len; > > > + extent->fe_flags = flags; > > > + > > > + fieinfo->fi_extents_mapped++; > > > + > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > + return 1; > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > +} > > > /** > > > * fiemap_fill_next_extent - Fiemap helper function > > > * @fieinfo: Fiemap context passed into ->fiemap > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > index 7a434979201c..28bb523d532a 100644 > > > --- a/include/linux/fs.h > > > +++ b/include/linux/fs.h > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > fiemap_fill_cb fi_cb; > > > }; > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > + u64 phys, u64 len, u32 flags); > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > u64 phys, u64 len, u32 flags); > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > -- > > > 2.17.2 > > > > > -- > Carlos
Hi Christoph. On Mon, Jan 14, 2019 at 05:56:17PM +0100, Christoph Hellwig wrote: > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > + u64 phys, u64 len, u32 flags) > > Any reason this function isn't in inode.c next to the caller and marked > static? > No reason other than to keep it close to its peer fiemap_fill_user_extent(), I honestly do prefer to keep both together than in separated files. But, I'm up to move it to fs/inode.c if required. > Otherwise looks fine except for the additional sanity checking pointed > out by Darrick. Working on that. > > > + /* only count the extents */ > > + if (fieinfo->fi_extents_max == 0) { > > + fieinfo->fi_extents_mapped++; > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > Maybe do a 'goto out' here? > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > And reuse this return. Bonus points for using a good old > if here: > > if (flags & FIEMAP_EXTENT_LAST) > return 1; > return 0; Ok, will be in the new version, thanks for the review :)
On Tue, Feb 05, 2019 at 10:56:01AM +0100, Carlos Maiolino wrote: > > Any reason this function isn't in inode.c next to the caller and marked > > static? > > > > No reason other than to keep it close to its peer fiemap_fill_user_extent(), I > honestly do prefer to keep both together than in separated files. But, I'm up > to move it to fs/inode.c if required. After your series fiemap_fill_user_extent should be static and close to it's caller, so with the kernel one in inode.c everything should be neat and symmetric.
On Tue, Feb 05, 2019 at 07:25:18PM +0100, Christoph Hellwig wrote: > On Tue, Feb 05, 2019 at 10:56:01AM +0100, Carlos Maiolino wrote: > > > Any reason this function isn't in inode.c next to the caller and marked > > > static? > > > > > > > No reason other than to keep it close to its peer fiemap_fill_user_extent(), I > > honestly do prefer to keep both together than in separated files. But, I'm up > > to move it to fs/inode.c if required. > > After your series fiemap_fill_user_extent should be static and close > to it's caller, so with the kernel one in inode.c everything should > be neat and symmetric. You are right, I didn't pay attention to that, thanks for the heads up, I'll fix it on the next version
> > > In any case, I think a better solution to the multi-device problem is to > > > start returning device information via struct fiemap_extent, at least > > > inside the kernel. Use one of the reserved fields to declare a new > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > > device number, and then you can check that against inode->i_sb->s_bdev > > > to avoid returning results for the non-primary device of a multi-device > > > filesystem. > > > > I agree we should address it here, but I don't think fiemap_extent is the right > > place for it, it is linked to the UAPI, and changing it is usually not a good > > idea. > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields > into some sort of dev_t/per-device cookie should be fine. Userspace > shouldn't be expecting any meaning in reserved areas. > > > I think I got your idea anyway, but, what if, instead returning the bdev in > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > > with such information? > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. Ok, may I ask why not? My apologies if I am wrong, but, per my understanding, there is nothing today, which tells userspace which device belongs the extent map reported by FIEMAP. If it belongs to the RT device in XFS, or whatever disk in a raid in BTRFS, we simply do not provide such information. So, the goal is to provide a way to tell the filesystem if a FIEMAP or a FIBMAP has been requested, so the current behavior of both ioctls won't change. Enabling filesystems to return device information into fiemap_extent requires modification of all filesystems to provide such information, which will not have any use other than matching the mounted device to the device where the extent is. A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive than the device id in fiemap_extent. I don't see much advantage in adding the device id instead of using the flag. A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via userspace, so, it would require a check to make sure it didn't come from userspace if ioctl_fiemap() was used. I think there are 2 other possibilities which can be used to fix this. - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, - If the device id is a must for you, maybe add the device id into fiemap_extent_info instead of fiemap_extent. So we don't mess with a UAPI exported data structure and still provides a way to the filesystems to provide which device the mapped extent is in. What you think? Cheers > > --D > > > > > > > > + > > > > + return error; > > > > +} > > > > + > > > > /** > > > > * bmap - find a block number in a file > > > > * @inode: inode owning the block number being requested > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > > */ > > > > int bmap(struct inode *inode, sector_t *block) > > > > { > > > > - if (!inode->i_mapping->a_ops->bmap) > > > > + if (inode->i_op->fiemap) > > > > + return bmap_fiemap(inode, block); > > > > + else if (inode->i_mapping->a_ops->bmap) > > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > > + *block); > > > > + else > > > > return -EINVAL; > > > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > > suddenly it will support this legacy interface they've never supported > > > before. Are they on board with this? > > > > > > --D > > > > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > > return 0; > > > > } > > > > EXPORT_SYMBOL(bmap); > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > > index 6086978fe01e..bfa59df332bf 100644 > > > > --- a/fs/ioctl.c > > > > +++ b/fs/ioctl.c > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > } > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > + u64 phys, u64 len, u32 flags) > > > > +{ > > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > > + > > > > + /* only count the extents */ > > > > + if (fieinfo->fi_extents_max == 0) { > > > > + fieinfo->fi_extents_mapped++; > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > + } > > > > + > > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > > + return 1; > > > > + > > > > + if (flags & SET_UNKNOWN_FLAGS) > > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > > + > > > > + extent->fe_logical = logical; > > > > + extent->fe_physical = phys; > > > > + extent->fe_length = len; > > > > + extent->fe_flags = flags; > > > > + > > > > + fieinfo->fi_extents_mapped++; > > > > + > > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > > + return 1; > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > +} > > > > /** > > > > * fiemap_fill_next_extent - Fiemap helper function > > > > * @fieinfo: Fiemap context passed into ->fiemap > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > index 7a434979201c..28bb523d532a 100644 > > > > --- a/include/linux/fs.h > > > > +++ b/include/linux/fs.h > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > > fiemap_fill_cb fi_cb; > > > > }; > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > > + u64 phys, u64 len, u32 flags); > > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > > u64 phys, u64 len, u32 flags); > > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > > -- > > > > 2.17.2 > > > > > > > > -- > > Carlos
On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: > > > > In any case, I think a better solution to the multi-device problem is to > > > > start returning device information via struct fiemap_extent, at least > > > > inside the kernel. Use one of the reserved fields to declare a new > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > > > device number, and then you can check that against inode->i_sb->s_bdev > > > > to avoid returning results for the non-primary device of a multi-device > > > > filesystem. > > > > > > I agree we should address it here, but I don't think fiemap_extent is the right > > > place for it, it is linked to the UAPI, and changing it is usually not a good > > > idea. > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields > > into some sort of dev_t/per-device cookie should be fine. Userspace > > shouldn't be expecting any meaning in reserved areas. > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > > > with such information? > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. > > Ok, may I ask why not? I think it's a bad idea to add a flag to FIEMAP to change its behavior to suit an older and even crappier legacy interface (i.e. FIBMAP). FIBMAP is architecturally broken in that we can't /ever/ provide the context of "which device does this map to?" FIEMAP is architecturally deficient as well, but its ioctl structure definition is flexible enough that we can report "which device does this map to". I want to enhance FIEMAP to deal with multi-device filesystems correctly, and as much as I want to kill FIBMAP, I can't because of zipl and *lilo. > My apologies if I am wrong, but, per my understanding, there is > nothing today, which tells userspace which device belongs the extent > map reported by FIEMAP. Right... > If it belongs to the RT device in XFS, or whatever disk in a raid in > BTRFS, we simply do not provide such information. Right... > So, the goal is to provide a way to tell the filesystem if a FIEMAP or > a FIBMAP has been requested, so the current behavior of both ioctls > won't change. ...but from my point of view, the FIEMAP behavior *ought* to change to be more expressive. Once that's done, we can use the more expressive FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. The whole point of having fe_reserved* fields in struct fiemap_extent is so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can start returning data in a reserved field. New userspace programs that know about the flag can start reading information from the new field if they see the flag, and old userspace programs don't know about the flag and won't be any worse off. > Enabling filesystems to return device information into fiemap_extent > requires modification of all filesystems to provide such information, > which will not have any use other than matching the mounted device to > the device where the extent is. Perhaps it would help for me to present a more concrete proposal: --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 @@ -22,7 +22,19 @@ struct fiemap_extent { __u64 fe_length; /* length in bytes for this extent */ __u64 fe_reserved64[2]; __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ - __u32 fe_reserved[3]; + + /* + * Underlying device that this extent is stored on. + * + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is + * set, this field is a 32-bit cookie that can be used to distinguish + * between backing devices but has no intrinsic meaning. If neither + * EXTENT_DEV flag is set, this field is meaningless. Only one of the + * EXTENT_DEV flags may be set at any time. + */ + __u32 fe_device; + __u32 fe_reserved[2]; }; struct fiemap { @@ -66,5 +78,14 @@ struct fiemap { * merged for efficiency. */ #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other * files. */ +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t + * structure containing the + * major and minor numbers + * of a block device. */ +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit + * cookie that can be used + * to distinguish physical + * devices but otherwise + * has no meaning. */ #endif /* _LINUX_FIEMAP_H */ Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start encoding fe_device = new_encode_dev(xfs_get_device_for_file()). Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE and encode the replica number in fe_device. Existing filesystems can be left unchanged, in which case neither EXTENT_DEV flag is set in fe_flags and the bits in fe_device are meaningless, the same as they are today. Reporting fe_device is entirely optional. Userspace programs will now be able to tell which device the file data lives on, which has been sort-of requested for years, if the filesystem chooses to start exporting that information. Your FIBMAP-via-FIEMAP backend can do something like: /* FIBMAP only returns results for the same block device backing the fs. */ if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) return 0; /* Can't tell what is the backing device, bail out. */ if (fe->fe_flags & EXTENT_DEV_COOKIE) return 0; /* * Either fe_device matches the backing device or the implementation * doesn't tell us about the backing device, so assume it's ok. */ <return FIBMAP results> So that's how I'd solve a longstanding design problem of FIEMAP and then take advantage of that solution to remedy my objections to the proposed "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG behavior flag that userspace knows about but isn't allowed to pass in. > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > than the device id in fiemap_extent. I don't see much advantage in > adding the device id instead of using the flag. > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > userspace, so, it would require a check to make sure it didn't come from > userspace if ioctl_fiemap() was used. > > I think there are 2 other possibilities which can be used to fix this. > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > - If the device id is a must for you, maybe add the device id into > fiemap_extent_info instead of fiemap_extent. That won't work with btrfs, which can store file extents on multiple different physical devices. > So we don't mess with a UAPI exported data structure and still > provides a way to the filesystems to provide which device the mapped > extent is in. > > What you think? > > Cheers > > > > > > --D > > > > > > > > > > > + > > > > > + return error; > > > > > +} > > > > > + > > > > > /** > > > > > * bmap - find a block number in a file > > > > > * @inode: inode owning the block number being requested > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > > > */ > > > > > int bmap(struct inode *inode, sector_t *block) > > > > > { > > > > > - if (!inode->i_mapping->a_ops->bmap) > > > > > + if (inode->i_op->fiemap) > > > > > + return bmap_fiemap(inode, block); > > > > > + else if (inode->i_mapping->a_ops->bmap) > > > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > > > + *block); > > > > > + else > > > > > return -EINVAL; > > > > > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > > > suddenly it will support this legacy interface they've never supported > > > > before. Are they on board with this? > > > > > > > > --D > > > > > > > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > > > return 0; > > > > > } > > > > > EXPORT_SYMBOL(bmap); > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > > > index 6086978fe01e..bfa59df332bf 100644 > > > > > --- a/fs/ioctl.c > > > > > +++ b/fs/ioctl.c > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > } > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > + u64 phys, u64 len, u32 flags) > > > > > +{ > > > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > > > + > > > > > + /* only count the extents */ > > > > > + if (fieinfo->fi_extents_max == 0) { > > > > > + fieinfo->fi_extents_mapped++; > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > + } > > > > > + > > > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > > > + return 1; > > > > > + > > > > > + if (flags & SET_UNKNOWN_FLAGS) > > > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > > > + > > > > > + extent->fe_logical = logical; > > > > > + extent->fe_physical = phys; > > > > > + extent->fe_length = len; > > > > > + extent->fe_flags = flags; > > > > > + > > > > > + fieinfo->fi_extents_mapped++; > > > > > + > > > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > > > + return 1; > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > +} > > > > > /** > > > > > * fiemap_fill_next_extent - Fiemap helper function > > > > > * @fieinfo: Fiemap context passed into ->fiemap > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > > index 7a434979201c..28bb523d532a 100644 > > > > > --- a/include/linux/fs.h > > > > > +++ b/include/linux/fs.h > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > > > fiemap_fill_cb fi_cb; > > > > > }; > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > > > + u64 phys, u64 len, u32 flags); > > > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > > > u64 phys, u64 len, u32 flags); > > > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > > > -- > > > > > 2.17.2 > > > > > > > > > > > -- > > > Carlos > > -- > Carlos
On Feb 6, 2019, at 6:37 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote: >>> On Wed, Dec 05, 2018 at 09:36:50AM -0800, Darrick J. Wong wrote: >>>> In any case, I think a better solution to the multi-device problem is to >>>> start returning device information via struct fiemap_extent, at least >>>> inside the kernel. Use one of the reserved fields to declare a new >>>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t >>>> device number, and then you can check that against inode->i_sb->s_bdev >>>> to avoid returning results for the non-primary device of a multi-device >>>> filesystem. >>> >>> I agree we should address it here, but I don't think fiemap_extent is the right >>> place for it, it is linked to the UAPI, and changing it is usually not a good >>> idea. >> >> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields >> into some sort of dev_t/per-device cookie should be fine. Userspace >> shouldn't be expecting any meaning in reserved areas. We are already using the __u32 fiemap_extent::fe_reserved[0] as fe_device for Lustre, to return the server index to userspace for filefrag with suitable patches. That is needed because a single file may be striped across multiple servers, and could instead return the dev_t for local multi-device filesystems. >>> I think I got your idea anyway, but, what if, instead returning the bdev in >>> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to >>> idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do >>> with such information? >> >> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. > > Ok, may I ask why not? > > My apologies if I am wrong, but, per my understanding, there is nothing today, > which tells userspace which device belongs the extent map reported by FIEMAP. > If it belongs to the RT device in XFS, or whatever disk in a raid in BTRFS, we > simply do not provide such information. So, the goal is to provide a way to tell > the filesystem if a FIEMAP or a FIBMAP has been requested, so the current > behavior of both ioctls won't change. > > Enabling filesystems to return device information into fiemap_extent requires > modification of all filesystems to provide such information, which will not have > any use other than matching the mounted device to the device where the extent > is. Filling in the fe_device field is not harmful for existing filesystems, since it has virtually zero cost (not more than zeroing the field to avoid leaking kernel data) and older userspace tools would just ignore it. What would be better than just filling in the fe_device field would be to also add: #define FIEMAP_EXTENT_DEVICE 0x2000 to indicate that fe_device contains a valid value. That tells userspace that the filesystem is filling in the field, and allows compatibility with older kernels and allows incremental addition for filesystems that can handle this (XFS, BtrFS). We haven't added the FIEMAP_EXTENT_DEVICE flag for Lustre, but it would make sense to do so. > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive than the > device id in fiemap_extent. I don't see much advantage in adding the device id > instead of using the flag. We also have for Lustre: #define FIEMAP_FLAG_DEVICE_ORDER 0x40000000 which requests that the kernel FIEMAP return the extents for each block device first rather than in file logical block order. That avoids interleaving the extents across all of the devices in e.g. 1MB chunks (think RAID-0) which would force the maximum returned extent size to 1MB even though there are much larger contiguous extents allocated on each device. Instead, DEVICE_ORDER returns all of the extents for device 0 first, then device 1 next, etc. This shows if the on-disk allocation is good or bad, and also fills in the fe_device field. > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > userspace, so, it would require a check to make sure it didn't come from > userspace if ioctl_fiemap() was used. Are you talking about a FIEMAP_FLAG_FIBMAP flag, or about returning the fe_device field? I think that passing a flag like FIEMAP_FLAG_DEVICE_ORDER from userspace is fine in this case, because it has a concrete meaning and is not just an internal flag. > I think there are 2 other possibilities which can be used to fix this. > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > - If the device id is a must for you, maybe add the device id into > fiemap_extent_info instead of fiemap_extent. So we don't mess with a UAPI > exported data structure and still provides a way to the filesystems to provide > which device the mapped extent is in. No, that would mean all of the change, without making it more useful to userspace. Also, if with only a device per fiemap_extent_info then it won't handle filesystems that may allocate a single file on multiple devices, such as BtrFS and Lustre. Cheers, Andreas >>>> >>>>> + >>>>> + return error; >>>>> +} >>>>> + >>>>> /** >>>>> * bmap - find a block number in a file >>>>> * @inode: inode owning the block number being requested >>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); >>>>> */ >>>>> int bmap(struct inode *inode, sector_t *block) >>>>> { >>>>> - if (!inode->i_mapping->a_ops->bmap) >>>>> + if (inode->i_op->fiemap) >>>>> + return bmap_fiemap(inode, block); >>>>> + else if (inode->i_mapping->a_ops->bmap) >>>>> + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, >>>>> + *block); >>>>> + else >>>>> return -EINVAL; >>>> >>>> Waitaminute. btrfs currently supports fiemap but not bmap, and now >>>> suddenly it will support this legacy interface they've never supported >>>> before. Are they on board with this? >>>> >>>> --D >>>> >>>>> >>>>> - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); >>>>> return 0; >>>>> } >>>>> EXPORT_SYMBOL(bmap); >>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c >>>>> index 6086978fe01e..bfa59df332bf 100644 >>>>> --- a/fs/ioctl.c >>>>> +++ b/fs/ioctl.c >>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, >>>>> return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>> } >>>>> >>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, >>>>> + u64 phys, u64 len, u32 flags) >>>>> +{ >>>>> + struct fiemap_extent *extent = fieinfo->fi_extents_start; >>>>> + >>>>> + /* only count the extents */ >>>>> + if (fieinfo->fi_extents_max == 0) { >>>>> + fieinfo->fi_extents_mapped++; >>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>> + } >>>>> + >>>>> + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) >>>>> + return 1; >>>>> + >>>>> + if (flags & SET_UNKNOWN_FLAGS) >>>>> + flags |= FIEMAP_EXTENT_UNKNOWN; >>>>> + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) >>>>> + flags |= FIEMAP_EXTENT_ENCODED; >>>>> + if (flags & SET_NOT_ALIGNED_FLAGS) >>>>> + flags |= FIEMAP_EXTENT_NOT_ALIGNED; >>>>> + >>>>> + extent->fe_logical = logical; >>>>> + extent->fe_physical = phys; >>>>> + extent->fe_length = len; >>>>> + extent->fe_flags = flags; >>>>> + >>>>> + fieinfo->fi_extents_mapped++; >>>>> + >>>>> + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) >>>>> + return 1; >>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>> +} >>>>> /** >>>>> * fiemap_fill_next_extent - Fiemap helper function >>>>> * @fieinfo: Fiemap context passed into ->fiemap >>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h >>>>> index 7a434979201c..28bb523d532a 100644 >>>>> --- a/include/linux/fs.h >>>>> +++ b/include/linux/fs.h >>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { >>>>> fiemap_fill_cb fi_cb; >>>>> }; >>>>> >>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, >>>>> + u64 phys, u64 len, u32 flags); >>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, >>>>> u64 phys, u64 len, u32 flags); >>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); >>>>> -- >>>>> 2.17.2 >>>>> >>> >>> -- >>> Carlos > > -- > Carlos Cheers, Andreas
On Feb 6, 2019, at 1:44 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: >>>>> In any case, I think a better solution to the multi-device problem is to >>>>> start returning device information via struct fiemap_extent, at least >>>>> inside the kernel. Use one of the reserved fields to declare a new >>>>> '__u32 fe_device' field in struct fiemap_extent which can be the dev_t >>>>> device number, and then you can check that against inode->i_sb->s_bdev >>>>> to avoid returning results for the non-primary device of a multi-device >>>>> filesystem. >>>> >>>> I agree we should address it here, but I don't think fiemap_extent is the right >>>> place for it, it is linked to the UAPI, and changing it is usually not a good >>>> idea. >>> >>> Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields >>> into some sort of dev_t/per-device cookie should be fine. Userspace >>> shouldn't be expecting any meaning in reserved areas. >>> >>>> I think I got your idea anyway, but, what if, instead returning the bdev in >>>> fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to >>>> idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do >>>> with such information? >>> >>> I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. >> >> Ok, may I ask why not? > > I think it's a bad idea to add a flag to FIEMAP to change its behavior > to suit an older and even crappier legacy interface (i.e. FIBMAP). > > FIBMAP is architecturally broken in that we can't /ever/ provide the > context of "which device does this map to?" > > FIEMAP is architecturally deficient as well, but its ioctl structure > definition is flexible enough that we can report "which device does this > map to". > > I want to enhance FIEMAP to deal with multi-device filesystems > correctly, and as much as I want to kill FIBMAP, I can't because of zipl > and *lilo. > >> My apologies if I am wrong, but, per my understanding, there is >> nothing today, which tells userspace which device belongs the extent >> map reported by FIEMAP. > > Right... > >> If it belongs to the RT device in XFS, or whatever disk in a raid in >> BTRFS, we simply do not provide such information. > > Right... > >> So, the goal is to provide a way to tell the filesystem if a FIEMAP or >> a FIBMAP has been requested, so the current behavior of both ioctls >> won't change. > > ...but from my point of view, the FIEMAP behavior *ought* to change to > be more expressive. Once that's done, we can use the more expressive > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > > The whole point of having fe_reserved* fields in struct fiemap_extent is > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > start returning data in a reserved field. New userspace programs that > know about the flag can start reading information from the new field if > they see the flag, and old userspace programs don't know about the flag > and won't be any worse off. Exactly correct. >> Enabling filesystems to return device information into fiemap_extent >> requires modification of all filesystems to provide such information, >> which will not have any use other than matching the mounted device to >> the device where the extent is. > > Perhaps it would help for me to present a more concrete proposal: > > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > @@ -22,7 +22,19 @@ struct fiemap_extent { > __u64 fe_length; /* length in bytes for this extent */ > __u64 fe_reserved64[2]; > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > - __u32 fe_reserved[3]; > + > + /* > + * Underlying device that this extent is stored on. > + * > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > + * set, this field is a 32-bit cookie that can be used to distinguish > + * between backing devices but has no intrinsic meaning. If neither > + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > + * EXTENT_DEV flags may be set at any time. > + */ > + __u32 fe_device; > + __u32 fe_reserved[2]; > }; > > struct fiemap { > @@ -66,5 +78,14 @@ struct fiemap { > * merged for efficiency. */ > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > * files. */ > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > + * structure containing the > + * major and minor numbers > + * of a block device. */ > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > + * cookie that can be used > + * to distinguish physical > + * devices but otherwise > + * has no meaning. */ > > #endif /* _LINUX_FIEMAP_H */ > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > encoding: > > fe_device = new_encode_dev(xfs_get_device_for_file()); > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > and encode the replica number in fe_device. > > Existing filesystems can be left unchanged, in which case neither > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > meaningless, the same as they are today. Reporting fe_device is entirely > optional. I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since it allows userspace to distinguish between an actual dev_t a unique-but- locally-meaninless identifier that is needed for network filesystems. Cheers, Andreas > Userspace programs will now be able to tell which device the file data > lives on, which has been sort-of requested for years, if the filesystem > chooses to start exporting that information. > > Your FIBMAP-via-FIEMAP backend can do something like: > > /* FIBMAP only returns results for the same block device backing the fs. */ > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > return 0; > > /* Can't tell what is the backing device, bail out. */ > if (fe->fe_flags & EXTENT_DEV_COOKIE) > return 0; > > /* > * Either fe_device matches the backing device or the implementation > * doesn't tell us about the backing device, so assume it's ok. > */ > <return FIBMAP results> > > So that's how I'd solve a longstanding design problem of FIEMAP and then > take advantage of that solution to remedy my objections to the proposed > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > behavior flag that userspace knows about but isn't allowed to pass in. > >> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive >> than the device id in fiemap_extent. I don't see much advantage in >> adding the device id instead of using the flag. >> >> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via >> userspace, so, it would require a check to make sure it didn't come from >> userspace if ioctl_fiemap() was used. >> >> I think there are 2 other possibilities which can be used to fix this. >> >> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, >> - If the device id is a must for you, maybe add the device id into >> fiemap_extent_info instead of fiemap_extent. > > That won't work with btrfs, which can store file extents on multiple > different physical devices. > >> So we don't mess with a UAPI exported data structure and still >> provides a way to the filesystems to provide which device the mapped >> extent is in. >> >> What you think? >> >> Cheers >> >> >>> >>> --D >>> >>>>> >>>>>> + >>>>>> + return error; >>>>>> +} >>>>>> + >>>>>> /** >>>>>> * bmap - find a block number in a file >>>>>> * @inode: inode owning the block number being requested >>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); >>>>>> */ >>>>>> int bmap(struct inode *inode, sector_t *block) >>>>>> { >>>>>> - if (!inode->i_mapping->a_ops->bmap) >>>>>> + if (inode->i_op->fiemap) >>>>>> + return bmap_fiemap(inode, block); >>>>>> + else if (inode->i_mapping->a_ops->bmap) >>>>>> + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, >>>>>> + *block); >>>>>> + else >>>>>> return -EINVAL; >>>>> >>>>> Waitaminute. btrfs currently supports fiemap but not bmap, and now >>>>> suddenly it will support this legacy interface they've never supported >>>>> before. Are they on board with this? >>>>> >>>>> --D >>>>> >>>>>> >>>>>> - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); >>>>>> return 0; >>>>>> } >>>>>> EXPORT_SYMBOL(bmap); >>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c >>>>>> index 6086978fe01e..bfa59df332bf 100644 >>>>>> --- a/fs/ioctl.c >>>>>> +++ b/fs/ioctl.c >>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, >>>>>> return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>> } >>>>>> >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, >>>>>> + u64 phys, u64 len, u32 flags) >>>>>> +{ >>>>>> + struct fiemap_extent *extent = fieinfo->fi_extents_start; >>>>>> + >>>>>> + /* only count the extents */ >>>>>> + if (fieinfo->fi_extents_max == 0) { >>>>>> + fieinfo->fi_extents_mapped++; >>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>> + } >>>>>> + >>>>>> + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) >>>>>> + return 1; >>>>>> + >>>>>> + if (flags & SET_UNKNOWN_FLAGS) >>>>>> + flags |= FIEMAP_EXTENT_UNKNOWN; >>>>>> + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) >>>>>> + flags |= FIEMAP_EXTENT_ENCODED; >>>>>> + if (flags & SET_NOT_ALIGNED_FLAGS) >>>>>> + flags |= FIEMAP_EXTENT_NOT_ALIGNED; >>>>>> + >>>>>> + extent->fe_logical = logical; >>>>>> + extent->fe_physical = phys; >>>>>> + extent->fe_length = len; >>>>>> + extent->fe_flags = flags; >>>>>> + >>>>>> + fieinfo->fi_extents_mapped++; >>>>>> + >>>>>> + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) >>>>>> + return 1; >>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>> +} >>>>>> /** >>>>>> * fiemap_fill_next_extent - Fiemap helper function >>>>>> * @fieinfo: Fiemap context passed into ->fiemap >>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h >>>>>> index 7a434979201c..28bb523d532a 100644 >>>>>> --- a/include/linux/fs.h >>>>>> +++ b/include/linux/fs.h >>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { >>>>>> fiemap_fill_cb fi_cb; >>>>>> }; >>>>>> >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, >>>>>> + u64 phys, u64 len, u32 flags); >>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, >>>>>> u64 phys, u64 len, u32 flags); >>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); >>>>>> -- >>>>>> 2.17.2 >>>>>> >>>> >>>> -- >>>> Carlos >> >> -- >> Carlos Cheers, Andreas
> >> If it belongs to the RT device in XFS, or whatever disk in a raid in > >> BTRFS, we simply do not provide such information. > > > > Right... > > > >> So, the goal is to provide a way to tell the filesystem if a FIEMAP or > >> a FIBMAP has been requested, so the current behavior of both ioctls > >> won't change. > > > > ...but from my point of view, the FIEMAP behavior *ought* to change to > > be more expressive. Once that's done, we can use the more expressive > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > > > > The whole point of having fe_reserved* fields in struct fiemap_extent is > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > > start returning data in a reserved field. New userspace programs that > > know about the flag can start reading information from the new field if > > they see the flag, and old userspace programs don't know about the flag > > and won't be any worse off. > Btw, I am not saying I don't like the idea, I like it. What I was trying to do was to avoid touching UAPI in this patchset. But... I'll try to implement your idea here, send it to the list and raise my shields. Thanks for the help Andreas/Darrick. > Exactly correct. > > >> Enabling filesystems to return device information into fiemap_extent > >> requires modification of all filesystems to provide such information, > >> which will not have any use other than matching the mounted device to > >> the device where the extent is. > > > > Perhaps it would help for me to present a more concrete proposal: > > > > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > > @@ -22,7 +22,19 @@ struct fiemap_extent { > > __u64 fe_length; /* length in bytes for this extent */ > > __u64 fe_reserved64[2]; > > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > > - __u32 fe_reserved[3]; > > + > > + /* > > + * Underlying device that this extent is stored on. > > + * > > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > > + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > > + * set, this field is a 32-bit cookie that can be used to distinguish > > + * between backing devices but has no intrinsic meaning. If neither > > + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > > + * EXTENT_DEV flags may be set at any time. > > + */ > > + __u32 fe_device; > > + __u32 fe_reserved[2]; > > }; > > > > struct fiemap { > > @@ -66,5 +78,14 @@ struct fiemap { > > * merged for efficiency. */ > > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > > * files. */ > > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > > + * structure containing the > > + * major and minor numbers > > + * of a block device. */ > > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > > + * cookie that can be used > > + * to distinguish physical > > + * devices but otherwise > > + * has no meaning. */ > > > > #endif /* _LINUX_FIEMAP_H */ > > > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > > encoding: > > > > fe_device = new_encode_dev(xfs_get_device_for_file()); > > > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > > and encode the replica number in fe_device. > > > > Existing filesystems can be left unchanged, in which case neither > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > > meaningless, the same as they are today. Reporting fe_device is entirely > > optional. > > I like this better than my plain "FIEMAP_EXTENT_DEVICE" proposal, since it > allows userspace to distinguish between an actual dev_t a unique-but- > locally-meaninless identifier that is needed for network filesystems. > > Cheers, Andreas > > > Userspace programs will now be able to tell which device the file data > > lives on, which has been sort-of requested for years, if the filesystem > > chooses to start exporting that information. > > > > Your FIBMAP-via-FIEMAP backend can do something like: > > > > /* FIBMAP only returns results for the same block device backing the fs. */ > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > > return 0; > > > > /* Can't tell what is the backing device, bail out. */ > > if (fe->fe_flags & EXTENT_DEV_COOKIE) > > return 0; > > > > /* > > * Either fe_device matches the backing device or the implementation > > * doesn't tell us about the backing device, so assume it's ok. > > */ > > <return FIBMAP results> > > > > So that's how I'd solve a longstanding design problem of FIEMAP and then > > take advantage of that solution to remedy my objections to the proposed > > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > > behavior flag that userspace knows about but isn't allowed to pass in. > > > >> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > >> than the device id in fiemap_extent. I don't see much advantage in > >> adding the device id instead of using the flag. > >> > >> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > >> userspace, so, it would require a check to make sure it didn't come from > >> userspace if ioctl_fiemap() was used. > >> > >> I think there are 2 other possibilities which can be used to fix this. > >> > >> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > >> - If the device id is a must for you, maybe add the device id into > >> fiemap_extent_info instead of fiemap_extent. > > > > That won't work with btrfs, which can store file extents on multiple > > different physical devices. > > > >> So we don't mess with a UAPI exported data structure and still > >> provides a way to the filesystems to provide which device the mapped > >> extent is in. > >> > >> What you think? > >> > >> Cheers > >> > >> > >>> > >>> --D > >>> > >>>>> > >>>>>> + > >>>>>> + return error; > >>>>>> +} > >>>>>> + > >>>>>> /** > >>>>>> * bmap - find a block number in a file > >>>>>> * @inode: inode owning the block number being requested > >>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > >>>>>> */ > >>>>>> int bmap(struct inode *inode, sector_t *block) > >>>>>> { > >>>>>> - if (!inode->i_mapping->a_ops->bmap) > >>>>>> + if (inode->i_op->fiemap) > >>>>>> + return bmap_fiemap(inode, block); > >>>>>> + else if (inode->i_mapping->a_ops->bmap) > >>>>>> + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > >>>>>> + *block); > >>>>>> + else > >>>>>> return -EINVAL; > >>>>> > >>>>> Waitaminute. btrfs currently supports fiemap but not bmap, and now > >>>>> suddenly it will support this legacy interface they've never supported > >>>>> before. Are they on board with this? > >>>>> > >>>>> --D > >>>>> > >>>>>> > >>>>>> - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > >>>>>> return 0; > >>>>>> } > >>>>>> EXPORT_SYMBOL(bmap); > >>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c > >>>>>> index 6086978fe01e..bfa59df332bf 100644 > >>>>>> --- a/fs/ioctl.c > >>>>>> +++ b/fs/ioctl.c > >>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > >>>>>> return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > >>>>>> } > >>>>>> > >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > >>>>>> + u64 phys, u64 len, u32 flags) > >>>>>> +{ > >>>>>> + struct fiemap_extent *extent = fieinfo->fi_extents_start; > >>>>>> + > >>>>>> + /* only count the extents */ > >>>>>> + if (fieinfo->fi_extents_max == 0) { > >>>>>> + fieinfo->fi_extents_mapped++; > >>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > >>>>>> + } > >>>>>> + > >>>>>> + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > >>>>>> + return 1; > >>>>>> + > >>>>>> + if (flags & SET_UNKNOWN_FLAGS) > >>>>>> + flags |= FIEMAP_EXTENT_UNKNOWN; > >>>>>> + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > >>>>>> + flags |= FIEMAP_EXTENT_ENCODED; > >>>>>> + if (flags & SET_NOT_ALIGNED_FLAGS) > >>>>>> + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > >>>>>> + > >>>>>> + extent->fe_logical = logical; > >>>>>> + extent->fe_physical = phys; > >>>>>> + extent->fe_length = len; > >>>>>> + extent->fe_flags = flags; > >>>>>> + > >>>>>> + fieinfo->fi_extents_mapped++; > >>>>>> + > >>>>>> + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > >>>>>> + return 1; > >>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > >>>>>> +} > >>>>>> /** > >>>>>> * fiemap_fill_next_extent - Fiemap helper function > >>>>>> * @fieinfo: Fiemap context passed into ->fiemap > >>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h > >>>>>> index 7a434979201c..28bb523d532a 100644 > >>>>>> --- a/include/linux/fs.h > >>>>>> +++ b/include/linux/fs.h > >>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > >>>>>> fiemap_fill_cb fi_cb; > >>>>>> }; > >>>>>> > >>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > >>>>>> + u64 phys, u64 len, u32 flags); > >>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > >>>>>> u64 phys, u64 len, u32 flags); > >>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > >>>>>> -- > >>>>>> 2.17.2 > >>>>>> > >>>> > >>>> -- > >>>> Carlos > >> > >> -- > >> Carlos > > > Cheers, Andreas > > > > >
On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: > > > > > In any case, I think a better solution to the multi-device problem is to > > > > > start returning device information via struct fiemap_extent, at least > > > > > inside the kernel. Use one of the reserved fields to declare a new > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > > > > device number, and then you can check that against inode->i_sb->s_bdev > > > > > to avoid returning results for the non-primary device of a multi-device > > > > > filesystem. > > > > > > > > I agree we should address it here, but I don't think fiemap_extent is the right > > > > place for it, it is linked to the UAPI, and changing it is usually not a good > > > > idea. > > > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields > > > into some sort of dev_t/per-device cookie should be fine. Userspace > > > shouldn't be expecting any meaning in reserved areas. > > > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > > > > with such information? > > > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. > > > > Ok, may I ask why not? > > I think it's a bad idea to add a flag to FIEMAP to change its behavior > to suit an older and even crappier legacy interface (i.e. FIBMAP). > > FIBMAP is architecturally broken in that we can't /ever/ provide the > context of "which device does this map to?" > > FIEMAP is architecturally deficient as well, but its ioctl structure > definition is flexible enough that we can report "which device does this > map to". > > I want to enhance FIEMAP to deal with multi-device filesystems > correctly, and as much as I want to kill FIBMAP, I can't because of zipl > and *lilo. > > > My apologies if I am wrong, but, per my understanding, there is > > nothing today, which tells userspace which device belongs the extent > > map reported by FIEMAP. > > Right... > > > If it belongs to the RT device in XFS, or whatever disk in a raid in > > BTRFS, we simply do not provide such information. > > Right... > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or > > a FIBMAP has been requested, so the current behavior of both ioctls > > won't change. > > ...but from my point of view, the FIEMAP behavior *ought* to change to > be more expressive. Once that's done, we can use the more expressive > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > > The whole point of having fe_reserved* fields in struct fiemap_extent is > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > start returning data in a reserved field. New userspace programs that > know about the flag can start reading information from the new field if > they see the flag, and old userspace programs don't know about the flag > and won't be any worse off. > > > Enabling filesystems to return device information into fiemap_extent > > requires modification of all filesystems to provide such information, > > which will not have any use other than matching the mounted device to > > the device where the extent is. > > Perhaps it would help for me to present a more concrete proposal: > > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > @@ -22,7 +22,19 @@ struct fiemap_extent { > __u64 fe_length; /* length in bytes for this extent */ > __u64 fe_reserved64[2]; > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > - __u32 fe_reserved[3]; > + > + /* > + * Underlying device that this extent is stored on. > + * > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > + * set, this field is a 32-bit cookie that can be used to distinguish > + * between backing devices but has no intrinsic meaning. If neither > + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > + * EXTENT_DEV flags may be set at any time. > + */ > + __u32 fe_device; > + __u32 fe_reserved[2]; > }; > > struct fiemap { > @@ -66,5 +78,14 @@ struct fiemap { > * merged for efficiency. */ > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > * files. */ > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > + * structure containing the > + * major and minor numbers > + * of a block device. */ > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > + * cookie that can be used > + * to distinguish physical > + * devices but otherwise > + * has no meaning. */ > > #endif /* _LINUX_FIEMAP_H */ > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > encoding fe_device = new_encode_dev(xfs_get_device_for_file()). > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > and encode the replica number in fe_device. > All of this makes sense, but I'm struggling to understand what you mean by replica number here, and why it justify a second flag. > Existing filesystems can be left unchanged, in which case neither > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > meaningless, the same as they are today. Reporting fe_device is entirely > optional. > > Userspace programs will now be able to tell which device the file data > lives on, which has been sort-of requested for years, if the filesystem > chooses to start exporting that information. > > Your FIBMAP-via-FIEMAP backend can do something like: > > /* FIBMAP only returns results for the same block device backing the fs. */ > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > return 0; > > /* Can't tell what is the backing device, bail out. */ > if (fe->fe_flags & EXTENT_DEV_COOKIE) > return 0; > Ok, the first conditional, is ok, the second one is not making sense to me. Looks like you are basically using it to flag the filesystem can't tell exactly which device the current extent is, let's say for example, distributed filesystems, where the physical extent can actually be on a different machine. But I can't say for sure, can you give me more details about what you are trying to achieve here? > /* > * Either fe_device matches the backing device or the implementation > * doesn't tell us about the backing device, so assume it's ok. > */ > <return FIBMAP results> > This actually looks to contradict what you have been complaining, about some filesystems which doesn't support FIBMAP currently, will now suddenly start to support. Assuming it's ok if the implementation doesn't tell us about the backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the backing device, assuming it's ok will just fall into your first complain. Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE you mentioned. > So that's how I'd solve a longstanding design problem of FIEMAP and then > take advantage of that solution to remedy my objections to the proposed > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > behavior flag that userspace knows about but isn't allowed to pass in. > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > > than the device id in fiemap_extent. I don't see much advantage in > > adding the device id instead of using the flag. > > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > > userspace, so, it would require a check to make sure it didn't come from > > userspace if ioctl_fiemap() was used. > > > > I think there are 2 other possibilities which can be used to fix this. > > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > > - If the device id is a must for you, maybe add the device id into > > fiemap_extent_info instead of fiemap_extent. > > That won't work with btrfs, which can store file extents on multiple > different physical devices. > > > So we don't mess with a UAPI exported data structure and still > > provides a way to the filesystems to provide which device the mapped > > extent is in. > > > > What you think? > > > > Cheers > > > > > > > > > > --D > > > > > > > > > > > > > > + > > > > > > + return error; > > > > > > +} > > > > > > + > > > > > > /** > > > > > > * bmap - find a block number in a file > > > > > > * @inode: inode owning the block number being requested > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > > > > */ > > > > > > int bmap(struct inode *inode, sector_t *block) > > > > > > { > > > > > > - if (!inode->i_mapping->a_ops->bmap) > > > > > > + if (inode->i_op->fiemap) > > > > > > + return bmap_fiemap(inode, block); > > > > > > + else if (inode->i_mapping->a_ops->bmap) > > > > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > > > > + *block); > > > > > > + else > > > > > > return -EINVAL; > > > > > > > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > > > > suddenly it will support this legacy interface they've never supported > > > > > before. Are they on board with this? > > > > > > > > > > --D > > > > > > > > > > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > > > > return 0; > > > > > > } > > > > > > EXPORT_SYMBOL(bmap); > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > > > > index 6086978fe01e..bfa59df332bf 100644 > > > > > > --- a/fs/ioctl.c > > > > > > +++ b/fs/ioctl.c > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > } > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > + u64 phys, u64 len, u32 flags) > > > > > > +{ > > > > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > > > > + > > > > > > + /* only count the extents */ > > > > > > + if (fieinfo->fi_extents_max == 0) { > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > + } > > > > > > + > > > > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > > > > + return 1; > > > > > > + > > > > > > + if (flags & SET_UNKNOWN_FLAGS) > > > > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > > > > + > > > > > > + extent->fe_logical = logical; > > > > > > + extent->fe_physical = phys; > > > > > > + extent->fe_length = len; > > > > > > + extent->fe_flags = flags; > > > > > > + > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > + > > > > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > > > > + return 1; > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > +} > > > > > > /** > > > > > > * fiemap_fill_next_extent - Fiemap helper function > > > > > > * @fieinfo: Fiemap context passed into ->fiemap > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > > > index 7a434979201c..28bb523d532a 100644 > > > > > > --- a/include/linux/fs.h > > > > > > +++ b/include/linux/fs.h > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > > > > fiemap_fill_cb fi_cb; > > > > > > }; > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > + u64 phys, u64 len, u32 flags); > > > > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > u64 phys, u64 len, u32 flags); > > > > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > > > > -- > > > > > > 2.17.2 > > > > > > > > > > > > > > -- > > > > Carlos > > > > -- > > Carlos
Apologies, I forgot to mention another thing.. On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: > > > > > In any case, I think a better solution to the multi-device problem is to > > > > > start returning device information via struct fiemap_extent, at least > > > > > inside the kernel. Use one of the reserved fields to declare a new > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > > > > device number, and then you can check that against inode->i_sb->s_bdev > > > > > to avoid returning results for the non-primary device of a multi-device > > > > > filesystem. > > > > > > > > I agree we should address it here, but I don't think fiemap_extent is the right > > > > place for it, it is linked to the UAPI, and changing it is usually not a good > > > > idea. > > > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields > > > into some sort of dev_t/per-device cookie should be fine. Userspace > > > shouldn't be expecting any meaning in reserved areas. > > > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > > > > with such information? > > > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. > > > > Ok, may I ask why not? > > I think it's a bad idea to add a flag to FIEMAP to change its behavior > to suit an older and even crappier legacy interface (i.e. FIBMAP). > > FIBMAP is architecturally broken in that we can't /ever/ provide the > context of "which device does this map to?" > > FIEMAP is architecturally deficient as well, but its ioctl structure > definition is flexible enough that we can report "which device does this > map to". > > I want to enhance FIEMAP to deal with multi-device filesystems > correctly, and as much as I want to kill FIBMAP, I can't because of zipl > and *lilo. > > > My apologies if I am wrong, but, per my understanding, there is > > nothing today, which tells userspace which device belongs the extent > > map reported by FIEMAP. > > Right... > > > If it belongs to the RT device in XFS, or whatever disk in a raid in > > BTRFS, we simply do not provide such information. > > Right... > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or > > a FIBMAP has been requested, so the current behavior of both ioctls > > won't change. > > ...but from my point of view, the FIEMAP behavior *ought* to change to > be more expressive. Once that's done, we can use the more expressive > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > > The whole point of having fe_reserved* fields in struct fiemap_extent is > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > start returning data in a reserved field. New userspace programs that > know about the flag can start reading information from the new field if > they see the flag, and old userspace programs don't know about the flag > and won't be any worse off. > > > Enabling filesystems to return device information into fiemap_extent > > requires modification of all filesystems to provide such information, > > which will not have any use other than matching the mounted device to > > the device where the extent is. > > Perhaps it would help for me to present a more concrete proposal: > > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > @@ -22,7 +22,19 @@ struct fiemap_extent { > __u64 fe_length; /* length in bytes for this extent */ > __u64 fe_reserved64[2]; > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > - __u32 fe_reserved[3]; > + > + /* > + * Underlying device that this extent is stored on. > + * > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > + * set, this field is a 32-bit cookie that can be used to distinguish > + * between backing devices but has no intrinsic meaning. If neither > + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > + * EXTENT_DEV flags may be set at any time. > + */ > + __u32 fe_device; > + __u32 fe_reserved[2]; > }; > > struct fiemap { > @@ -66,5 +78,14 @@ struct fiemap { > * merged for efficiency. */ > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > * files. */ > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > + * structure containing the > + * major and minor numbers > + * of a block device. */ > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > + * cookie that can be used > + * to distinguish physical > + * devices but otherwise > + * has no meaning. */ > > #endif /* _LINUX_FIEMAP_H */ > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > encoding fe_device = new_encode_dev(xfs_get_device_for_file()). Here, I believe you are forgetting that filesystems do not touch fiemap_extent directly. We call fiemap_fell_next_extent() helper to fill each extent found by fiemap. So, in either way, we'd need to modify fiemap_fill_next_extent() and the callbacks being used to accommodate this new field or create a new helper to modify the device which doesn't sound reasonable. So, either way, we will end up needing to modify all filesystems. So, although I really like the idea of improving the FIEMAP interface, I'm starting to consider another patchset for it. I think it requires an interface change big enough to fit in this patchset, which actually has a different purpose. Or, maybe, address this at the end of this patchset, leaving different interface changes in different patchsets, instead of making many changes all at once, mixed together. > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > and encode the replica number in fe_device. > > Existing filesystems can be left unchanged, in which case neither > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > meaningless, the same as they are today. Reporting fe_device is entirely > optional. > > Userspace programs will now be able to tell which device the file data > lives on, which has been sort-of requested for years, if the filesystem > chooses to start exporting that information. > > Your FIBMAP-via-FIEMAP backend can do something like: > > /* FIBMAP only returns results for the same block device backing the fs. */ > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > return 0; > > /* Can't tell what is the backing device, bail out. */ > if (fe->fe_flags & EXTENT_DEV_COOKIE) > return 0; > > /* > * Either fe_device matches the backing device or the implementation > * doesn't tell us about the backing device, so assume it's ok. > */ > <return FIBMAP results> > > So that's how I'd solve a longstanding design problem of FIEMAP and then > take advantage of that solution to remedy my objections to the proposed > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > behavior flag that userspace knows about but isn't allowed to pass in. > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > > than the device id in fiemap_extent. I don't see much advantage in > > adding the device id instead of using the flag. > > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > > userspace, so, it would require a check to make sure it didn't come from > > userspace if ioctl_fiemap() was used. > > > > I think there are 2 other possibilities which can be used to fix this. > > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > > - If the device id is a must for you, maybe add the device id into > > fiemap_extent_info instead of fiemap_extent. > > That won't work with btrfs, which can store file extents on multiple > different physical devices. > > > So we don't mess with a UAPI exported data structure and still > > provides a way to the filesystems to provide which device the mapped > > extent is in. > > > > What you think? > > > > Cheers > > > > > > > > > > --D > > > > > > > > > > > > > > + > > > > > > + return error; > > > > > > +} > > > > > > + > > > > > > /** > > > > > > * bmap - find a block number in a file > > > > > > * @inode: inode owning the block number being requested > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > > > > */ > > > > > > int bmap(struct inode *inode, sector_t *block) > > > > > > { > > > > > > - if (!inode->i_mapping->a_ops->bmap) > > > > > > + if (inode->i_op->fiemap) > > > > > > + return bmap_fiemap(inode, block); > > > > > > + else if (inode->i_mapping->a_ops->bmap) > > > > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > > > > + *block); > > > > > > + else > > > > > > return -EINVAL; > > > > > > > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > > > > suddenly it will support this legacy interface they've never supported > > > > > before. Are they on board with this? > > > > > > > > > > --D > > > > > > > > > > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > > > > return 0; > > > > > > } > > > > > > EXPORT_SYMBOL(bmap); > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > > > > index 6086978fe01e..bfa59df332bf 100644 > > > > > > --- a/fs/ioctl.c > > > > > > +++ b/fs/ioctl.c > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > } > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > + u64 phys, u64 len, u32 flags) > > > > > > +{ > > > > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > > > > + > > > > > > + /* only count the extents */ > > > > > > + if (fieinfo->fi_extents_max == 0) { > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > + } > > > > > > + > > > > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > > > > + return 1; > > > > > > + > > > > > > + if (flags & SET_UNKNOWN_FLAGS) > > > > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > > > > + > > > > > > + extent->fe_logical = logical; > > > > > > + extent->fe_physical = phys; > > > > > > + extent->fe_length = len; > > > > > > + extent->fe_flags = flags; > > > > > > + > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > + > > > > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > > > > + return 1; > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > +} > > > > > > /** > > > > > > * fiemap_fill_next_extent - Fiemap helper function > > > > > > * @fieinfo: Fiemap context passed into ->fiemap > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > > > index 7a434979201c..28bb523d532a 100644 > > > > > > --- a/include/linux/fs.h > > > > > > +++ b/include/linux/fs.h > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > > > > fiemap_fill_cb fi_cb; > > > > > > }; > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > + u64 phys, u64 len, u32 flags); > > > > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > u64 phys, u64 len, u32 flags); > > > > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > > > > -- > > > > > > 2.17.2 > > > > > > > > > > > > > > -- > > > > Carlos > > > > -- > > Carlos
On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote: > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: > > > > > > In any case, I think a better solution to the multi-device problem is to > > > > > > start returning device information via struct fiemap_extent, at least > > > > > > inside the kernel. Use one of the reserved fields to declare a new > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > > > > > device number, and then you can check that against inode->i_sb->s_bdev > > > > > > to avoid returning results for the non-primary device of a multi-device > > > > > > filesystem. > > > > > > > > > > I agree we should address it here, but I don't think fiemap_extent is the right > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good > > > > > idea. > > > > > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields > > > > into some sort of dev_t/per-device cookie should be fine. Userspace > > > > shouldn't be expecting any meaning in reserved areas. > > > > > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > > > > > with such information? > > > > > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. > > > > > > Ok, may I ask why not? > > > > I think it's a bad idea to add a flag to FIEMAP to change its behavior > > to suit an older and even crappier legacy interface (i.e. FIBMAP). > > > > FIBMAP is architecturally broken in that we can't /ever/ provide the > > context of "which device does this map to?" > > > > FIEMAP is architecturally deficient as well, but its ioctl structure > > definition is flexible enough that we can report "which device does this > > map to". > > > > I want to enhance FIEMAP to deal with multi-device filesystems > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl > > and *lilo. > > > > > My apologies if I am wrong, but, per my understanding, there is > > > nothing today, which tells userspace which device belongs the extent > > > map reported by FIEMAP. > > > > Right... > > > > > If it belongs to the RT device in XFS, or whatever disk in a raid in > > > BTRFS, we simply do not provide such information. > > > > Right... > > > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or > > > a FIBMAP has been requested, so the current behavior of both ioctls > > > won't change. > > > > ...but from my point of view, the FIEMAP behavior *ought* to change to > > be more expressive. Once that's done, we can use the more expressive > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > > > > The whole point of having fe_reserved* fields in struct fiemap_extent is > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > > start returning data in a reserved field. New userspace programs that > > know about the flag can start reading information from the new field if > > they see the flag, and old userspace programs don't know about the flag > > and won't be any worse off. > > > > > Enabling filesystems to return device information into fiemap_extent > > > requires modification of all filesystems to provide such information, > > > which will not have any use other than matching the mounted device to > > > the device where the extent is. > > > > Perhaps it would help for me to present a more concrete proposal: > > > > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > > @@ -22,7 +22,19 @@ struct fiemap_extent { > > __u64 fe_length; /* length in bytes for this extent */ > > __u64 fe_reserved64[2]; > > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > > - __u32 fe_reserved[3]; > > + > > + /* > > + * Underlying device that this extent is stored on. > > + * > > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > > + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > > + * set, this field is a 32-bit cookie that can be used to distinguish > > + * between backing devices but has no intrinsic meaning. If neither > > + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > > + * EXTENT_DEV flags may be set at any time. > > + */ > > + __u32 fe_device; > > + __u32 fe_reserved[2]; > > }; > > > > struct fiemap { > > @@ -66,5 +78,14 @@ struct fiemap { > > * merged for efficiency. */ > > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > > * files. */ > > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > > + * structure containing the > > + * major and minor numbers > > + * of a block device. */ > > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > > + * cookie that can be used > > + * to distinguish physical > > + * devices but otherwise > > + * has no meaning. */ > > > > #endif /* _LINUX_FIEMAP_H */ > > > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()). > > > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > > and encode the replica number in fe_device. > > > > All of this makes sense, but I'm struggling to understand what you mean by > replica number here, and why it justify a second flag. I left in the "device cookie" thing in the proposal to accomodate a request from the Lustre folks to be able to report which replica is storing a particular extent map. Apparently the replica id is simply a 32-bit number that isn't inherently useful, hence the vagueness around what "cookie" really means... ...oh, right, lustre fell out of drivers/staging/. You could probably leave it out then. > > Existing filesystems can be left unchanged, in which case neither > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > > meaningless, the same as they are today. Reporting fe_device is entirely > > optional. > > > > Userspace programs will now be able to tell which device the file data > > lives on, which has been sort-of requested for years, if the filesystem > > chooses to start exporting that information. > > > > Your FIBMAP-via-FIEMAP backend can do something like: > > > > /* FIBMAP only returns results for the same block device backing the fs. */ > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > > return 0; > > > > /* Can't tell what is the backing device, bail out. */ > > if (fe->fe_flags & EXTENT_DEV_COOKIE) > > return 0; > > > > Ok, the first conditional, is ok, the second one is not making sense to me. > Looks like you are basically using it to flag the filesystem can't tell > exactly which device the current extent is, let's say for example, distributed > filesystems, where the physical extent can actually be on a different machine. > But I can't say for sure, can you give me more details about what you are trying > to achieve here? You've understood me correctly. :) > > /* > > * Either fe_device matches the backing device or the implementation > > * doesn't tell us about the backing device, so assume it's ok. > > */ > > <return FIBMAP results> > > > > This actually looks to contradict what you have been complaining, about some > filesystems which doesn't support FIBMAP currently, will now suddenly start to > support. Assuming it's ok if the implementation doesn't tell us about the > backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the > backing device, assuming it's ok will just fall into your first complain. Sorry, this thread has been going on so long that I forgot your goal for this series. :/ Specifically, I had forgotten that you're removing the ->bmap pointer, which means that filesystems don't have any particular way to signal "Yes on FIEMAP, no on FIBMAP". Somehow I had thought that you were merely creating a generic_file_bmap() that would call FIEMAP and ripping out all the adhoc bmap implementations. Hmm, how many filesystems support FIEMAP and not FIBMAP? btrfs, nilfs2, and overlayfs. Also bad_inode.c...? Hmm, how many filesystems support FIBMAP and not FIEMAP? adfs, affs, befs, bfs, efs, exofs, fat, freevxfs, fuse(?), nfs, hfsplus, isofs, jfs, minixfs, ntfs, qnx[46], reiserfs, sysv, udf, and ufs. > Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE > you mentioned. I think you've understood it about as well as I can explain it. Maybe Andreas will have more to say about the lustre replica id, but OTOH it's gone and so there's no user of it, so we could just drop it until lustre comes back. > > So that's how I'd solve a longstanding design problem of FIEMAP and then > > take advantage of that solution to remedy my objections to the proposed > > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > > behavior flag that userspace knows about but isn't allowed to pass in. > > > > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > > > than the device id in fiemap_extent. I don't see much advantage in > > > adding the device id instead of using the flag. > > > > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > > > userspace, so, it would require a check to make sure it didn't come from > > > userspace if ioctl_fiemap() was used. > > > > > > I think there are 2 other possibilities which can be used to fix this. > > > > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > > > - If the device id is a must for you, maybe add the device id into > > > fiemap_extent_info instead of fiemap_extent. > > > > That won't work with btrfs, which can store file extents on multiple > > different physical devices. > > > > > So we don't mess with a UAPI exported data structure and still > > > provides a way to the filesystems to provide which device the mapped > > > extent is in. > > > > > > What you think? > > > > > > Cheers > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > + > > > > > > > + return error; > > > > > > > +} > > > > > > > + > > > > > > > /** > > > > > > > * bmap - find a block number in a file > > > > > > > * @inode: inode owning the block number being requested > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > > > > > */ > > > > > > > int bmap(struct inode *inode, sector_t *block) > > > > > > > { > > > > > > > - if (!inode->i_mapping->a_ops->bmap) > > > > > > > + if (inode->i_op->fiemap) > > > > > > > + return bmap_fiemap(inode, block); > > > > > > > + else if (inode->i_mapping->a_ops->bmap) > > > > > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > > > > > + *block); > > > > > > > + else > > > > > > > return -EINVAL; > > > > > > > > > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > > > > > suddenly it will support this legacy interface they've never supported > > > > > > before. Are they on board with this? > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > > > > > return 0; > > > > > > > } > > > > > > > EXPORT_SYMBOL(bmap); > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > > > > > index 6086978fe01e..bfa59df332bf 100644 > > > > > > > --- a/fs/ioctl.c > > > > > > > +++ b/fs/ioctl.c > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > } > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > + u64 phys, u64 len, u32 flags) > > > > > > > +{ > > > > > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > > > > > + > > > > > > > + /* only count the extents */ > > > > > > > + if (fieinfo->fi_extents_max == 0) { > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > + } > > > > > > > + > > > > > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > > > > > + return 1; > > > > > > > + > > > > > > > + if (flags & SET_UNKNOWN_FLAGS) > > > > > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > > > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > > > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > > > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > > > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > > > > > + > > > > > > > + extent->fe_logical = logical; > > > > > > > + extent->fe_physical = phys; > > > > > > > + extent->fe_length = len; > > > > > > > + extent->fe_flags = flags; > > > > > > > + > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > + > > > > > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > > > > > + return 1; > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > +} > > > > > > > /** > > > > > > > * fiemap_fill_next_extent - Fiemap helper function > > > > > > > * @fieinfo: Fiemap context passed into ->fiemap > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > > > > index 7a434979201c..28bb523d532a 100644 > > > > > > > --- a/include/linux/fs.h > > > > > > > +++ b/include/linux/fs.h > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > > > > > fiemap_fill_cb fi_cb; > > > > > > > }; > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > + u64 phys, u64 len, u32 flags); > > > > > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > u64 phys, u64 len, u32 flags); > > > > > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > > > > > -- > > > > > > > 2.17.2 > > > > > > > > > > > > > > > > > -- > > > > > Carlos > > > > > > -- > > > Carlos > > -- > Carlos
On Thu, Feb 07, 2019 at 01:36:41PM +0100, Carlos Maiolino wrote: > Apologies, I forgot to mention another thing.. > > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: > > > > > > In any case, I think a better solution to the multi-device problem is to > > > > > > start returning device information via struct fiemap_extent, at least > > > > > > inside the kernel. Use one of the reserved fields to declare a new > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > > > > > device number, and then you can check that against inode->i_sb->s_bdev > > > > > > to avoid returning results for the non-primary device of a multi-device > > > > > > filesystem. > > > > > > > > > > I agree we should address it here, but I don't think fiemap_extent is the right > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good > > > > > idea. > > > > > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields > > > > into some sort of dev_t/per-device cookie should be fine. Userspace > > > > shouldn't be expecting any meaning in reserved areas. > > > > > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > > > > > with such information? > > > > > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. > > > > > > Ok, may I ask why not? > > > > I think it's a bad idea to add a flag to FIEMAP to change its behavior > > to suit an older and even crappier legacy interface (i.e. FIBMAP). > > > > FIBMAP is architecturally broken in that we can't /ever/ provide the > > context of "which device does this map to?" > > > > FIEMAP is architecturally deficient as well, but its ioctl structure > > definition is flexible enough that we can report "which device does this > > map to". > > > > I want to enhance FIEMAP to deal with multi-device filesystems > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl > > and *lilo. > > > > > My apologies if I am wrong, but, per my understanding, there is > > > nothing today, which tells userspace which device belongs the extent > > > map reported by FIEMAP. > > > > Right... > > > > > If it belongs to the RT device in XFS, or whatever disk in a raid in > > > BTRFS, we simply do not provide such information. > > > > Right... > > > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or > > > a FIBMAP has been requested, so the current behavior of both ioctls > > > won't change. > > > > ...but from my point of view, the FIEMAP behavior *ought* to change to > > be more expressive. Once that's done, we can use the more expressive > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > > > > The whole point of having fe_reserved* fields in struct fiemap_extent is > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > > start returning data in a reserved field. New userspace programs that > > know about the flag can start reading information from the new field if > > they see the flag, and old userspace programs don't know about the flag > > and won't be any worse off. > > > > > Enabling filesystems to return device information into fiemap_extent > > > requires modification of all filesystems to provide such information, > > > which will not have any use other than matching the mounted device to > > > the device where the extent is. > > > > Perhaps it would help for me to present a more concrete proposal: > > > > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > > @@ -22,7 +22,19 @@ struct fiemap_extent { > > __u64 fe_length; /* length in bytes for this extent */ > > __u64 fe_reserved64[2]; > > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > > - __u32 fe_reserved[3]; > > + > > + /* > > + * Underlying device that this extent is stored on. > > + * > > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > > + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > > + * set, this field is a 32-bit cookie that can be used to distinguish > > + * between backing devices but has no intrinsic meaning. If neither > > + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > > + * EXTENT_DEV flags may be set at any time. > > + */ > > + __u32 fe_device; > > + __u32 fe_reserved[2]; > > }; > > > > struct fiemap { > > @@ -66,5 +78,14 @@ struct fiemap { > > * merged for efficiency. */ > > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > > * files. */ > > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > > + * structure containing the > > + * major and minor numbers > > + * of a block device. */ > > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > > + * cookie that can be used > > + * to distinguish physical > > + * devices but otherwise > > + * has no meaning. */ > > > > #endif /* _LINUX_FIEMAP_H */ > > > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()). > > Here, I believe you are forgetting that filesystems do not touch fiemap_extent > directly. We call fiemap_fell_next_extent() helper to fill each extent found by > fiemap. So, in either way, we'd need to modify fiemap_fill_next_extent() and the > callbacks being used to accommodate this new field or create a new helper to > modify the device which doesn't sound reasonable. So, either way, we will end up > needing to modify all filesystems. Yep. Drat. I guess you could add a bdev parameter to fiemap_fill_next_extent, and we'd use that to encode fe_device. If the fs passes NULL then we just get it from the superblock or something. > So, although I really like the idea of improving the FIEMAP interface, I'm > starting to consider another patchset for it. I think it requires an interface > change big enough to fit in this patchset, which actually has a different > purpose. Or, maybe, address this at the end of this patchset, leaving different > interface changes in different patchsets, instead of making many changes all at > once, mixed together. <nod> I think you're right, fiemap upgrades as one series and then fibmap-via-fiemap as the second one. --D > > > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > > and encode the replica number in fe_device. > > > > Existing filesystems can be left unchanged, in which case neither > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > > meaningless, the same as they are today. Reporting fe_device is entirely > > optional. > > > > Userspace programs will now be able to tell which device the file data > > lives on, which has been sort-of requested for years, if the filesystem > > chooses to start exporting that information. > > > > Your FIBMAP-via-FIEMAP backend can do something like: > > > > /* FIBMAP only returns results for the same block device backing the fs. */ > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > > return 0; > > > > /* Can't tell what is the backing device, bail out. */ > > if (fe->fe_flags & EXTENT_DEV_COOKIE) > > return 0; > > > > /* > > * Either fe_device matches the backing device or the implementation > > * doesn't tell us about the backing device, so assume it's ok. > > */ > > <return FIBMAP results> > > > > So that's how I'd solve a longstanding design problem of FIEMAP and then > > take advantage of that solution to remedy my objections to the proposed > > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > > behavior flag that userspace knows about but isn't allowed to pass in. > > > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > > > than the device id in fiemap_extent. I don't see much advantage in > > > adding the device id instead of using the flag. > > > > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > > > userspace, so, it would require a check to make sure it didn't come from > > > userspace if ioctl_fiemap() was used. > > > > > > I think there are 2 other possibilities which can be used to fix this. > > > > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > > > - If the device id is a must for you, maybe add the device id into > > > fiemap_extent_info instead of fiemap_extent. > > > > That won't work with btrfs, which can store file extents on multiple > > different physical devices. > > > > > So we don't mess with a UAPI exported data structure and still > > > provides a way to the filesystems to provide which device the mapped > > > extent is in. > > > > > > What you think? > > > > > > Cheers > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > + > > > > > > > + return error; > > > > > > > +} > > > > > > > + > > > > > > > /** > > > > > > > * bmap - find a block number in a file > > > > > > > * @inode: inode owning the block number being requested > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > > > > > */ > > > > > > > int bmap(struct inode *inode, sector_t *block) > > > > > > > { > > > > > > > - if (!inode->i_mapping->a_ops->bmap) > > > > > > > + if (inode->i_op->fiemap) > > > > > > > + return bmap_fiemap(inode, block); > > > > > > > + else if (inode->i_mapping->a_ops->bmap) > > > > > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > > > > > + *block); > > > > > > > + else > > > > > > > return -EINVAL; > > > > > > > > > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > > > > > suddenly it will support this legacy interface they've never supported > > > > > > before. Are they on board with this? > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > > > > > return 0; > > > > > > > } > > > > > > > EXPORT_SYMBOL(bmap); > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > > > > > index 6086978fe01e..bfa59df332bf 100644 > > > > > > > --- a/fs/ioctl.c > > > > > > > +++ b/fs/ioctl.c > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > } > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > + u64 phys, u64 len, u32 flags) > > > > > > > +{ > > > > > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > > > > > + > > > > > > > + /* only count the extents */ > > > > > > > + if (fieinfo->fi_extents_max == 0) { > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > + } > > > > > > > + > > > > > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > > > > > + return 1; > > > > > > > + > > > > > > > + if (flags & SET_UNKNOWN_FLAGS) > > > > > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > > > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > > > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > > > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > > > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > > > > > + > > > > > > > + extent->fe_logical = logical; > > > > > > > + extent->fe_physical = phys; > > > > > > > + extent->fe_length = len; > > > > > > > + extent->fe_flags = flags; > > > > > > > + > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > + > > > > > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > > > > > + return 1; > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > +} > > > > > > > /** > > > > > > > * fiemap_fill_next_extent - Fiemap helper function > > > > > > > * @fieinfo: Fiemap context passed into ->fiemap > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > > > > index 7a434979201c..28bb523d532a 100644 > > > > > > > --- a/include/linux/fs.h > > > > > > > +++ b/include/linux/fs.h > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > > > > > fiemap_fill_cb fi_cb; > > > > > > > }; > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > + u64 phys, u64 len, u32 flags); > > > > > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > u64 phys, u64 len, u32 flags); > > > > > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > > > > > -- > > > > > > > 2.17.2 > > > > > > > > > > > > > > > > > -- > > > > > Carlos > > > > > > -- > > > Carlos > > -- > Carlos
On Feb 7, 2019, at 10:02 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote: >> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: >>> >>> ...but from my point of view, the FIEMAP behavior *ought* to change to >>> be more expressive. Once that's done, we can use the more expressive >>> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. >>> >>> The whole point of having fe_reserved* fields in struct fiemap_extent is >>> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can >>> start returning data in a reserved field. New userspace programs that >>> know about the flag can start reading information from the new field if >>> they see the flag, and old userspace programs don't know about the flag >>> and won't be any worse off. >>> >>> Perhaps it would help for me to present a more concrete proposal: >>> >>> --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 >>> +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 >>> @@ -22,7 +22,19 @@ struct fiemap_extent { >>> __u64 fe_length; /* length in bytes for this extent */ >>> __u64 fe_reserved64[2]; >>> __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ >>> - __u32 fe_reserved[3]; >>> + >>> + /* >>> + * Underlying device that this extent is stored on. >>> + * >>> + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the >>> + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is >>> + * set, this field is a 32-bit cookie that can be used to distinguish >>> + * between backing devices but has no intrinsic meaning. If neither >>> + * EXTENT_DEV flag is set, this field is meaningless. Only one of the >>> + * EXTENT_DEV flags may be set at any time. >>> + */ >>> + __u32 fe_device; >>> + __u32 fe_reserved[2]; >>> }; >>> >>> struct fiemap { >>> @@ -66,5 +78,14 @@ struct fiemap { >>> * merged for efficiency. */ >>> #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other >>> * files. */ >>> +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t >>> + * structure containing the >>> + * major and minor numbers >>> + * of a block device. */ >>> +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit >>> + * cookie that can be used >>> + * to distinguish physical >>> + * devices but otherwise >>> + * has no meaning. */ >>> >>> #endif /* _LINUX_FIEMAP_H */ >>> >>> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start >>> encoding fe_device = new_encode_dev(xfs_get_device_for_file()). >>> >>> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE >>> and encode the replica number in fe_device. >>> >> >> All of this makes sense, but I'm struggling to understand what you mean by >> replica number here, and why it justify a second flag. > > I left in the "device cookie" thing in the proposal to accomodate a > request from the Lustre folks to be able to report which replica is > storing a particular extent map. Apparently the replica id is simply a > 32-bit number that isn't inherently useful, hence the vagueness around > what "cookie" really means... > > ...oh, right, lustre fell out of drivers/staging/. You could probably > leave it out then. Do we really need to be this way, about reserving a single flag for Lustre, which will likely also be useful for other filesystems? It's not like Lustre is some closed-source binary module for which we need to make life difficult, it is used by many thousands of the largest computers at labs and universities and companies around the world. We are working to clean up the code outside the staging tree and resubmit it. Not reserving a flag just means we will continue to use random values in Lustre before it can be merged, which will make life harder when we try to merge again. In the case of Lustre, the proposed DEV_COOKIE would indicate fe_device is the integer index number of the server on which each extent of the file is located (Darrick's "replica number" term is not really correct). The server index is familiar to all Lustre users, so having filefrag print out the device number of "0000" or "0009" is totally clear to them. For pNFS or Ceph or other network filesystems (if they implement filefrag) it could be an index or some other number (e.g. the IP address of the server or low bits of the UUID or whatever). Reading back in the archives of the original FIEMAP discussion, it seems BtrFS would prefer to use DEV_COOKIE instead of DEV_T, because it uses internal RAID encoding and not plain block devices, but I'm not familiar with the details there. Alternately, or in addition to, a DEV_COOKIE flag which indicates that the same fe_device field is "not a device", it would be possible to add: #define FIEMAP_NO_DIRECT 0x40000000 and/or: #define FIEMAP_EXTENT_NET 0x80000000 /* Data stored remotely. * Sets NO_DIRECT flag */ returned by the filesystem that indicates the extent blocks are not local to the node, so FIBMAP should return an error (-EOPNOTSUP or -EREMOTE or whatever) because the file can't be booted from. In that case, we could return FIEMAP_EXTENT_DEVICE to indicate the fe_device field is valid, and return FIEMAP_EXTENT_NET to indicate the values in fe_device are not local block devices, just filesystem-specific values to distinguish devices. However, I'm open to both _DEV_COOKIE and _NET flag if that is preferred, since I think the two are somewhat complementary. >> This actually looks to contradict what you have been complaining, about some >> filesystems which doesn't support FIBMAP currently, will now suddenly start to >> support. Assuming it's ok if the implementation doesn't tell us about the >> backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the >> backing device, assuming it's ok will just fall into your first complain. > > Sorry, this thread has been going on so long that I forgot your goal for > this series. :/ > > Specifically, I had forgotten that you're removing the ->bmap pointer, > which means that filesystems don't have any particular way to signal > "Yes on FIEMAP, no on FIBMAP". Somehow I had thought that you were > merely creating a generic_file_bmap() that would call FIEMAP and ripping > out all the adhoc bmap implementations. Just a reminder here, you should set FIEMAP_FLAG_SYNC when mapping FIBMAP to FIEMAP so that the data on that file is flushed to disk before returning, since the block mapping may not be assigned yet or may be unstable, which could lead to an unbootable system if used for LILO. Cheers, Andreas >>> Existing filesystems can be left unchanged, in which case neither >>> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are >>> meaningless, the same as they are today. Reporting fe_device is entirely >>> optional. >>> >>> Userspace programs will now be able to tell which device the file data >>> lives on, which has been sort-of requested for years, if the filesystem >>> chooses to start exporting that information. >>> >>> Your FIBMAP-via-FIEMAP backend can do something like: >>> >>> /* FIBMAP only returns results for the same block device backing the fs. */ >>> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) >>> return 0; >>> >>> /* Can't tell what is the backing device, bail out. */ >>> if (fe->fe_flags & EXTENT_DEV_COOKIE) >>> return 0; >>> >> >> Ok, the first conditional, is ok, the second one is not making sense to me. >> Looks like you are basically using it to flag the filesystem can't tell >> exactly which device the current extent is, let's say for example, distributed >> filesystems, where the physical extent can actually be on a different machine. >> But I can't say for sure, can you give me more details about what you are trying >> to achieve here? > > You've understood me correctly. :) > >>> /* >>> * Either fe_device matches the backing device or the implementation >>> * doesn't tell us about the backing device, so assume it's ok. >>> */ >>> <return FIBMAP results> >>> >> >> Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE >> you mentioned. > > I think you've understood it about as well as I can explain it. Maybe > Andreas will have more to say about the lustre replica id, but OTOH it's > gone and so there's no user of it, so we could just drop it until lustre > comes back. > >>> So that's how I'd solve a longstanding design problem of FIEMAP and then >>> take advantage of that solution to remedy my objections to the proposed >>> "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG >>> behavior flag that userspace knows about but isn't allowed to pass in. >>> >> >>>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive >>>> than the device id in fiemap_extent. I don't see much advantage in >>>> adding the device id instead of using the flag. >>>> >>>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via >>>> userspace, so, it would require a check to make sure it didn't come from >>>> userspace if ioctl_fiemap() was used. >>>> >>>> I think there are 2 other possibilities which can be used to fix this. >>>> >>>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, >>>> - If the device id is a must for you, maybe add the device id into >>>> fiemap_extent_info instead of fiemap_extent. >>> >>> That won't work with btrfs, which can store file extents on multiple >>> different physical devices. >>> >>>> So we don't mess with a UAPI exported data structure and still >>>> provides a way to the filesystems to provide which device the mapped >>>> extent is in. >>>> >>>> What you think? >>>> >>>> Cheers >>>> >>>> >>>>> >>>>> --D >>>>> >>>>>>> >>>>>>>> + >>>>>>>> + return error; >>>>>>>> +} >>>>>>>> + >>>>>>>> /** >>>>>>>> * bmap - find a block number in a file >>>>>>>> * @inode: inode owning the block number being requested >>>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); >>>>>>>> */ >>>>>>>> int bmap(struct inode *inode, sector_t *block) >>>>>>>> { >>>>>>>> - if (!inode->i_mapping->a_ops->bmap) >>>>>>>> + if (inode->i_op->fiemap) >>>>>>>> + return bmap_fiemap(inode, block); >>>>>>>> + else if (inode->i_mapping->a_ops->bmap) >>>>>>>> + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, >>>>>>>> + *block); >>>>>>>> + else >>>>>>>> return -EINVAL; >>>>>>> >>>>>>> Waitaminute. btrfs currently supports fiemap but not bmap, and now >>>>>>> suddenly it will support this legacy interface they've never supported >>>>>>> before. Are they on board with this? >>>>>>> >>>>>>> --D >>>>>>> >>>>>>>> >>>>>>>> - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); >>>>>>>> return 0; >>>>>>>> } >>>>>>>> EXPORT_SYMBOL(bmap); >>>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c >>>>>>>> index 6086978fe01e..bfa59df332bf 100644 >>>>>>>> --- a/fs/ioctl.c >>>>>>>> +++ b/fs/ioctl.c >>>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, >>>>>>>> return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>>>> } >>>>>>>> >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, >>>>>>>> + u64 phys, u64 len, u32 flags) >>>>>>>> +{ >>>>>>>> + struct fiemap_extent *extent = fieinfo->fi_extents_start; >>>>>>>> + >>>>>>>> + /* only count the extents */ >>>>>>>> + if (fieinfo->fi_extents_max == 0) { >>>>>>>> + fieinfo->fi_extents_mapped++; >>>>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>>>> + } >>>>>>>> + >>>>>>>> + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) >>>>>>>> + return 1; >>>>>>>> + >>>>>>>> + if (flags & SET_UNKNOWN_FLAGS) >>>>>>>> + flags |= FIEMAP_EXTENT_UNKNOWN; >>>>>>>> + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) >>>>>>>> + flags |= FIEMAP_EXTENT_ENCODED; >>>>>>>> + if (flags & SET_NOT_ALIGNED_FLAGS) >>>>>>>> + flags |= FIEMAP_EXTENT_NOT_ALIGNED; >>>>>>>> + >>>>>>>> + extent->fe_logical = logical; >>>>>>>> + extent->fe_physical = phys; >>>>>>>> + extent->fe_length = len; >>>>>>>> + extent->fe_flags = flags; >>>>>>>> + >>>>>>>> + fieinfo->fi_extents_mapped++; >>>>>>>> + >>>>>>>> + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) >>>>>>>> + return 1; >>>>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; >>>>>>>> +} >>>>>>>> /** >>>>>>>> * fiemap_fill_next_extent - Fiemap helper function >>>>>>>> * @fieinfo: Fiemap context passed into ->fiemap >>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h >>>>>>>> index 7a434979201c..28bb523d532a 100644 >>>>>>>> --- a/include/linux/fs.h >>>>>>>> +++ b/include/linux/fs.h >>>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { >>>>>>>> fiemap_fill_cb fi_cb; >>>>>>>> }; >>>>>>>> >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, >>>>>>>> + u64 phys, u64 len, u32 flags); >>>>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, >>>>>>>> u64 phys, u64 len, u32 flags); >>>>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); >>>>>>>> -- >>>>>>>> 2.17.2 >>>>>>>> >>>>>> >>>>>> -- >>>>>> Carlos >>>> >>>> -- >>>> Carlos >> >> -- >> Carlos Cheers, Andreas
On Thu, Feb 07, 2019 at 10:52:33AM +0100, Carlos Maiolino wrote: > Btw, I am not saying I don't like the idea, I like it. What I was trying to do > was to avoid touching UAPI in this patchset. But... I'll try to implement your > idea here, send it to the list and raise my shields. Agreed. Please don't change the FIEMAP uapi. If we need to check for a request coming from bmap just defined an internal FIEMAP flag as the last available flag in the flags word, and reject it when it comes from userspace in fiemap.
On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote: > Do we really need to be this way, about reserving a single flag for Lustre, > which will likely also be useful for other filesystems? It's not like > Lustre is some closed-source binary module for which we need to make life > difficult, it is used by many thousands of the largest computers at labs > and universities and companies around the world. We are working to clean > up the code outside the staging tree and resubmit it. Not reserving a flag > just means we will continue to use random values in Lustre before it can > be merged, which will make life harder when we try to merge again. No, it is available in source, but otherwise just as bad. And we generally only define APIs for in-kernel usage. If we can come up with a good API for in-kernel filesystems we can do that, otherwise hell no. And staging for that matter qualifies as out of tree. That being said I'm really worried about these FIEMAP extensions as userspace has no business poking into details of the placement (vs just the layout). But all that belongs into a separate dicussion instead of dragging down this series where it does not belong at all.
On Thu, Feb 07, 2019 at 10:16:55AM -0800, Darrick J. Wong wrote: > On Thu, Feb 07, 2019 at 01:36:41PM +0100, Carlos Maiolino wrote: > > Apologies, I forgot to mention another thing.. > > > > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: > > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: > > > > > > > In any case, I think a better solution to the multi-device problem is to > > > > > > > start returning device information via struct fiemap_extent, at least > > > > > > > inside the kernel. Use one of the reserved fields to declare a new > > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > > > > > > device number, and then you can check that against inode->i_sb->s_bdev > > > > > > > to avoid returning results for the non-primary device of a multi-device > > > > > > > filesystem. > > > > > > > > > > > > I agree we should address it here, but I don't think fiemap_extent is the right > > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good > > > > > > idea. > > > > > > > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields > > > > > into some sort of dev_t/per-device cookie should be fine. Userspace > > > > > shouldn't be expecting any meaning in reserved areas. > > > > > > > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in > > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > > > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > > > > > > with such information? > > > > > > > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. > > > > > > > > Ok, may I ask why not? > > > > > > I think it's a bad idea to add a flag to FIEMAP to change its behavior > > > to suit an older and even crappier legacy interface (i.e. FIBMAP). > > > > > > FIBMAP is architecturally broken in that we can't /ever/ provide the > > > context of "which device does this map to?" > > > > > > FIEMAP is architecturally deficient as well, but its ioctl structure > > > definition is flexible enough that we can report "which device does this > > > map to". > > > > > > I want to enhance FIEMAP to deal with multi-device filesystems > > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl > > > and *lilo. > > > > > > > My apologies if I am wrong, but, per my understanding, there is > > > > nothing today, which tells userspace which device belongs the extent > > > > map reported by FIEMAP. > > > > > > Right... > > > > > > > If it belongs to the RT device in XFS, or whatever disk in a raid in > > > > BTRFS, we simply do not provide such information. > > > > > > Right... > > > > > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or > > > > a FIBMAP has been requested, so the current behavior of both ioctls > > > > won't change. > > > > > > ...but from my point of view, the FIEMAP behavior *ought* to change to > > > be more expressive. Once that's done, we can use the more expressive > > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > > > > > > The whole point of having fe_reserved* fields in struct fiemap_extent is > > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > > > start returning data in a reserved field. New userspace programs that > > > know about the flag can start reading information from the new field if > > > they see the flag, and old userspace programs don't know about the flag > > > and won't be any worse off. > > > > > > > Enabling filesystems to return device information into fiemap_extent > > > > requires modification of all filesystems to provide such information, > > > > which will not have any use other than matching the mounted device to > > > > the device where the extent is. > > > > > > Perhaps it would help for me to present a more concrete proposal: > > > > > > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > > > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > > > @@ -22,7 +22,19 @@ struct fiemap_extent { > > > __u64 fe_length; /* length in bytes for this extent */ > > > __u64 fe_reserved64[2]; > > > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > > > - __u32 fe_reserved[3]; > > > + > > > + /* > > > + * Underlying device that this extent is stored on. > > > + * > > > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > > > + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > > > + * set, this field is a 32-bit cookie that can be used to distinguish > > > + * between backing devices but has no intrinsic meaning. If neither > > > + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > > > + * EXTENT_DEV flags may be set at any time. > > > + */ > > > + __u32 fe_device; > > > + __u32 fe_reserved[2]; > > > }; > > > > > > struct fiemap { > > > @@ -66,5 +78,14 @@ struct fiemap { > > > * merged for efficiency. */ > > > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > > > * files. */ > > > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > > > + * structure containing the > > > + * major and minor numbers > > > + * of a block device. */ > > > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > > > + * cookie that can be used > > > + * to distinguish physical > > > + * devices but otherwise > > > + * has no meaning. */ > > > > > > #endif /* _LINUX_FIEMAP_H */ > > > > > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()). > > > > Here, I believe you are forgetting that filesystems do not touch fiemap_extent > > directly. We call fiemap_fell_next_extent() helper to fill each extent found by > > fiemap. So, in either way, we'd need to modify fiemap_fill_next_extent() and the > > callbacks being used to accommodate this new field or create a new helper to > > modify the device which doesn't sound reasonable. So, either way, we will end up > > needing to modify all filesystems. > > Yep. Drat. I guess you could add a bdev parameter to > fiemap_fill_next_extent, and we'd use that to encode fe_device. If the > fs passes NULL then we just get it from the superblock or something. > > > So, although I really like the idea of improving the FIEMAP interface, I'm > > starting to consider another patchset for it. I think it requires an interface > > change big enough to fit in this patchset, which actually has a different > > purpose. Or, maybe, address this at the end of this patchset, leaving different > > interface changes in different patchsets, instead of making many changes all at > > once, mixed together. > > <nod> I think you're right, fiemap upgrades as one series and then > fibmap-via-fiemap as the second one. > Ok, fair enough, looks like we have an agreement :P I'll work on this direction now, and set aside this patchset while we can improve FIEMAP to return the device id, and then rebase this patchset on top of that. Thanks for the review > --D > > > > > > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > > > and encode the replica number in fe_device. > > > > > > Existing filesystems can be left unchanged, in which case neither > > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > > > meaningless, the same as they are today. Reporting fe_device is entirely > > > optional. > > > > > > Userspace programs will now be able to tell which device the file data > > > lives on, which has been sort-of requested for years, if the filesystem > > > chooses to start exporting that information. > > > > > > Your FIBMAP-via-FIEMAP backend can do something like: > > > > > > /* FIBMAP only returns results for the same block device backing the fs. */ > > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > > > return 0; > > > > > > /* Can't tell what is the backing device, bail out. */ > > > if (fe->fe_flags & EXTENT_DEV_COOKIE) > > > return 0; > > > > > > /* > > > * Either fe_device matches the backing device or the implementation > > > * doesn't tell us about the backing device, so assume it's ok. > > > */ > > > <return FIBMAP results> > > > > > > So that's how I'd solve a longstanding design problem of FIEMAP and then > > > take advantage of that solution to remedy my objections to the proposed > > > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > > > behavior flag that userspace knows about but isn't allowed to pass in. > > > > > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > > > > than the device id in fiemap_extent. I don't see much advantage in > > > > adding the device id instead of using the flag. > > > > > > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > > > > userspace, so, it would require a check to make sure it didn't come from > > > > userspace if ioctl_fiemap() was used. > > > > > > > > I think there are 2 other possibilities which can be used to fix this. > > > > > > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > > > > - If the device id is a must for you, maybe add the device id into > > > > fiemap_extent_info instead of fiemap_extent. > > > > > > That won't work with btrfs, which can store file extents on multiple > > > different physical devices. > > > > > > > So we don't mess with a UAPI exported data structure and still > > > > provides a way to the filesystems to provide which device the mapped > > > > extent is in. > > > > > > > > What you think? > > > > > > > > Cheers > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > + > > > > > > > > + return error; > > > > > > > > +} > > > > > > > > + > > > > > > > > /** > > > > > > > > * bmap - find a block number in a file > > > > > > > > * @inode: inode owning the block number being requested > > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > > > > > > */ > > > > > > > > int bmap(struct inode *inode, sector_t *block) > > > > > > > > { > > > > > > > > - if (!inode->i_mapping->a_ops->bmap) > > > > > > > > + if (inode->i_op->fiemap) > > > > > > > > + return bmap_fiemap(inode, block); > > > > > > > > + else if (inode->i_mapping->a_ops->bmap) > > > > > > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > > > > > > + *block); > > > > > > > > + else > > > > > > > > return -EINVAL; > > > > > > > > > > > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > > > > > > suddenly it will support this legacy interface they've never supported > > > > > > > before. Are they on board with this? > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > > > > > > return 0; > > > > > > > > } > > > > > > > > EXPORT_SYMBOL(bmap); > > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > > > > > > index 6086978fe01e..bfa59df332bf 100644 > > > > > > > > --- a/fs/ioctl.c > > > > > > > > +++ b/fs/ioctl.c > > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > > } > > > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > > + u64 phys, u64 len, u32 flags) > > > > > > > > +{ > > > > > > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > > > > > > + > > > > > > > > + /* only count the extents */ > > > > > > > > + if (fieinfo->fi_extents_max == 0) { > > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > > + } > > > > > > > > + > > > > > > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > > > > > > + return 1; > > > > > > > > + > > > > > > > > + if (flags & SET_UNKNOWN_FLAGS) > > > > > > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > > > > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > > > > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > > > > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > > > > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > > > > > > + > > > > > > > > + extent->fe_logical = logical; > > > > > > > > + extent->fe_physical = phys; > > > > > > > > + extent->fe_length = len; > > > > > > > > + extent->fe_flags = flags; > > > > > > > > + > > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > > + > > > > > > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > > > > > > + return 1; > > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > > +} > > > > > > > > /** > > > > > > > > * fiemap_fill_next_extent - Fiemap helper function > > > > > > > > * @fieinfo: Fiemap context passed into ->fiemap > > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > > > > > index 7a434979201c..28bb523d532a 100644 > > > > > > > > --- a/include/linux/fs.h > > > > > > > > +++ b/include/linux/fs.h > > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > > > > > > fiemap_fill_cb fi_cb; > > > > > > > > }; > > > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > > + u64 phys, u64 len, u32 flags); > > > > > > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > > u64 phys, u64 len, u32 flags); > > > > > > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > > > > > > -- > > > > > > > > 2.17.2 > > > > > > > > > > > > > > > > > > > > -- > > > > > > Carlos > > > > > > > > -- > > > > Carlos > > > > -- > > Carlos
On Thu, Feb 07, 2019 at 09:02:10AM -0800, Darrick J. Wong wrote: > On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote: > > On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: > > > On Wed, Feb 06, 2019 at 02:37:53PM +0100, Carlos Maiolino wrote: > > > > > > > In any case, I think a better solution to the multi-device problem is to > > > > > > > start returning device information via struct fiemap_extent, at least > > > > > > > inside the kernel. Use one of the reserved fields to declare a new > > > > > > > '__u32 fe_device' field in struct fiemap_extent which can be the dev_t > > > > > > > device number, and then you can check that against inode->i_sb->s_bdev > > > > > > > to avoid returning results for the non-primary device of a multi-device > > > > > > > filesystem. > > > > > > > > > > > > I agree we should address it here, but I don't think fiemap_extent is the right > > > > > > place for it, it is linked to the UAPI, and changing it is usually not a good > > > > > > idea. > > > > > > > > > > Adding a FIEMAP_EXTENT flag or two to turn one of the fe_reserved fields > > > > > into some sort of dev_t/per-device cookie should be fine. Userspace > > > > > shouldn't be expecting any meaning in reserved areas. > > > > > > > > > > > I think I got your idea anyway, but, what if, instead returning the bdev in > > > > > > fiemap_extent, we instead, send a flag (via fi_flags) to the filesystem, to > > > > > > idenfify a FIBMAP or a FIEMAP call, and let the filesystem decide what to do > > > > > > with such information? > > > > > > > > > > I don't like the idea of adding a FIEMAP_FLAG to distinguish callers. > > > > > > > > Ok, may I ask why not? > > > > > > I think it's a bad idea to add a flag to FIEMAP to change its behavior > > > to suit an older and even crappier legacy interface (i.e. FIBMAP). > > > > > > FIBMAP is architecturally broken in that we can't /ever/ provide the > > > context of "which device does this map to?" > > > > > > FIEMAP is architecturally deficient as well, but its ioctl structure > > > definition is flexible enough that we can report "which device does this > > > map to". > > > > > > I want to enhance FIEMAP to deal with multi-device filesystems > > > correctly, and as much as I want to kill FIBMAP, I can't because of zipl > > > and *lilo. > > > > > > > My apologies if I am wrong, but, per my understanding, there is > > > > nothing today, which tells userspace which device belongs the extent > > > > map reported by FIEMAP. > > > > > > Right... > > > > > > > If it belongs to the RT device in XFS, or whatever disk in a raid in > > > > BTRFS, we simply do not provide such information. > > > > > > Right... > > > > > > > So, the goal is to provide a way to tell the filesystem if a FIEMAP or > > > > a FIBMAP has been requested, so the current behavior of both ioctls > > > > won't change. > > > > > > ...but from my point of view, the FIEMAP behavior *ought* to change to > > > be more expressive. Once that's done, we can use the more expressive > > > FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > > > > > > The whole point of having fe_reserved* fields in struct fiemap_extent is > > > so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > > > start returning data in a reserved field. New userspace programs that > > > know about the flag can start reading information from the new field if > > > they see the flag, and old userspace programs don't know about the flag > > > and won't be any worse off. > > > > > > > Enabling filesystems to return device information into fiemap_extent > > > > requires modification of all filesystems to provide such information, > > > > which will not have any use other than matching the mounted device to > > > > the device where the extent is. > > > > > > Perhaps it would help for me to present a more concrete proposal: > > > > > > --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > > > +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > > > @@ -22,7 +22,19 @@ struct fiemap_extent { > > > __u64 fe_length; /* length in bytes for this extent */ > > > __u64 fe_reserved64[2]; > > > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > > > - __u32 fe_reserved[3]; > > > + > > > + /* > > > + * Underlying device that this extent is stored on. > > > + * > > > + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > > > + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > > > + * set, this field is a 32-bit cookie that can be used to distinguish > > > + * between backing devices but has no intrinsic meaning. If neither > > > + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > > > + * EXTENT_DEV flags may be set at any time. > > > + */ > > > + __u32 fe_device; > > > + __u32 fe_reserved[2]; > > > }; > > > > > > struct fiemap { > > > @@ -66,5 +78,14 @@ struct fiemap { > > > * merged for efficiency. */ > > > #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > > > * files. */ > > > +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > > > + * structure containing the > > > + * major and minor numbers > > > + * of a block device. */ > > > +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > > > + * cookie that can be used > > > + * to distinguish physical > > > + * devices but otherwise > > > + * has no meaning. */ > > > > > > #endif /* _LINUX_FIEMAP_H */ > > > > > > Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > > > encoding fe_device = new_encode_dev(xfs_get_device_for_file()). > > > > > > Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > > > and encode the replica number in fe_device. > > > > > > > All of this makes sense, but I'm struggling to understand what you mean by > > replica number here, and why it justify a second flag. > > I left in the "device cookie" thing in the proposal to accomodate a > request from the Lustre folks to be able to report which replica is > storing a particular extent map. Apparently the replica id is simply a > 32-bit number that isn't inherently useful, hence the vagueness around > what "cookie" really means... > > ...oh, right, lustre fell out of drivers/staging/. You could probably > leave it out then. > > > > Existing filesystems can be left unchanged, in which case neither > > > EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > > > meaningless, the same as they are today. Reporting fe_device is entirely > > > optional. > > > > > > Userspace programs will now be able to tell which device the file data > > > lives on, which has been sort-of requested for years, if the filesystem > > > chooses to start exporting that information. > > > > > > Your FIBMAP-via-FIEMAP backend can do something like: > > > > > > /* FIBMAP only returns results for the same block device backing the fs. */ > > > if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > > > return 0; > > > > > > /* Can't tell what is the backing device, bail out. */ > > > if (fe->fe_flags & EXTENT_DEV_COOKIE) > > > return 0; > > > > > > > Ok, the first conditional, is ok, the second one is not making sense to me. > > Looks like you are basically using it to flag the filesystem can't tell > > exactly which device the current extent is, let's say for example, distributed > > filesystems, where the physical extent can actually be on a different machine. > > But I can't say for sure, can you give me more details about what you are trying > > to achieve here? > > You've understood me correctly. :) > > > > /* > > > * Either fe_device matches the backing device or the implementation > > > * doesn't tell us about the backing device, so assume it's ok. > > > */ > > > <return FIBMAP results> > > > > > > > This actually looks to contradict what you have been complaining, about some > > filesystems which doesn't support FIBMAP currently, will now suddenly start to > > support. Assuming it's ok if the implementation doesn't tell us about the > > backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the > > backing device, assuming it's ok will just fall into your first complain. > > Sorry, this thread has been going on so long that I forgot your goal for > this series. :/ > > Specifically, I had forgotten that you're removing the ->bmap pointer, > which means that filesystems don't have any particular way to signal > "Yes on FIEMAP, no on FIBMAP". Somehow I had thought that you were > merely creating a generic_file_bmap() that would call FIEMAP and ripping > out all the adhoc bmap implementations. > > Hmm, how many filesystems support FIEMAP and not FIBMAP? > > btrfs, nilfs2, and overlayfs. Also bad_inode.c...? > > Hmm, how many filesystems support FIBMAP and not FIEMAP? > > adfs, affs, befs, bfs, efs, exofs, fat, freevxfs, fuse(?), nfs, hfsplus, > isofs, jfs, minixfs, ntfs, qnx[46], reiserfs, sysv, udf, and ufs. Eh, that's why we should keep: if (inode->i_op->fiemap) return bmap_fiemap(inode, block); else if (..a_ops->bmap) ->a_ops->bmap(...) else return -EINVAL; > > > Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE > > you mentioned. > > I think you've understood it about as well as I can explain it. Maybe > Andreas will have more to say about the lustre replica id, but OTOH it's > gone and so there's no user of it, so we could just drop it until lustre > comes back. > Ok, thanks for confirming. > > > So that's how I'd solve a longstanding design problem of FIEMAP and then > > > take advantage of that solution to remedy my objections to the proposed > > > "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > > > behavior flag that userspace knows about but isn't allowed to pass in. > > > > > > > > > A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > > > > than the device id in fiemap_extent. I don't see much advantage in > > > > adding the device id instead of using the flag. > > > > > > > > A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > > > > userspace, so, it would require a check to make sure it didn't come from > > > > userspace if ioctl_fiemap() was used. > > > > > > > > I think there are 2 other possibilities which can be used to fix this. > > > > > > > > - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > > > > - If the device id is a must for you, maybe add the device id into > > > > fiemap_extent_info instead of fiemap_extent. > > > > > > That won't work with btrfs, which can store file extents on multiple > > > different physical devices. > > > > > > > So we don't mess with a UAPI exported data structure and still > > > > provides a way to the filesystems to provide which device the mapped > > > > extent is in. > > > > > > > > What you think? > > > > > > > > Cheers > > > > > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > + > > > > > > > > + return error; > > > > > > > > +} > > > > > > > > + > > > > > > > > /** > > > > > > > > * bmap - find a block number in a file > > > > > > > > * @inode: inode owning the block number being requested > > > > > > > > @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > > > > > > > > */ > > > > > > > > int bmap(struct inode *inode, sector_t *block) > > > > > > > > { > > > > > > > > - if (!inode->i_mapping->a_ops->bmap) > > > > > > > > + if (inode->i_op->fiemap) > > > > > > > > + return bmap_fiemap(inode, block); > > > > > > > > + else if (inode->i_mapping->a_ops->bmap) > > > > > > > > + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > > > > > > > > + *block); > > > > > > > > + else > > > > > > > > return -EINVAL; > > > > > > > > > > > > > > Waitaminute. btrfs currently supports fiemap but not bmap, and now > > > > > > > suddenly it will support this legacy interface they've never supported > > > > > > > before. Are they on board with this? > > > > > > > > > > > > > > --D > > > > > > > > > > > > > > > > > > > > > > > - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > > > > > > > > return 0; > > > > > > > > } > > > > > > > > EXPORT_SYMBOL(bmap); > > > > > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > > > > > > > index 6086978fe01e..bfa59df332bf 100644 > > > > > > > > --- a/fs/ioctl.c > > > > > > > > +++ b/fs/ioctl.c > > > > > > > > @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > > return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > > } > > > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > > > > > > > > + u64 phys, u64 len, u32 flags) > > > > > > > > +{ > > > > > > > > + struct fiemap_extent *extent = fieinfo->fi_extents_start; > > > > > > > > + > > > > > > > > + /* only count the extents */ > > > > > > > > + if (fieinfo->fi_extents_max == 0) { > > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > > + } > > > > > > > > + > > > > > > > > + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > > > > > > > > + return 1; > > > > > > > > + > > > > > > > > + if (flags & SET_UNKNOWN_FLAGS) > > > > > > > > + flags |= FIEMAP_EXTENT_UNKNOWN; > > > > > > > > + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > > > > > > > > + flags |= FIEMAP_EXTENT_ENCODED; > > > > > > > > + if (flags & SET_NOT_ALIGNED_FLAGS) > > > > > > > > + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > > > > > > > > + > > > > > > > > + extent->fe_logical = logical; > > > > > > > > + extent->fe_physical = phys; > > > > > > > > + extent->fe_length = len; > > > > > > > > + extent->fe_flags = flags; > > > > > > > > + > > > > > > > > + fieinfo->fi_extents_mapped++; > > > > > > > > + > > > > > > > > + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > > > > > > > > + return 1; > > > > > > > > + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > > > > > > > > +} > > > > > > > > /** > > > > > > > > * fiemap_fill_next_extent - Fiemap helper function > > > > > > > > * @fieinfo: Fiemap context passed into ->fiemap > > > > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > > > > > > index 7a434979201c..28bb523d532a 100644 > > > > > > > > --- a/include/linux/fs.h > > > > > > > > +++ b/include/linux/fs.h > > > > > > > > @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > > > > > > > > fiemap_fill_cb fi_cb; > > > > > > > > }; > > > > > > > > > > > > > > > > +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > > + u64 phys, u64 len, u32 flags); > > > > > > > > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > > > > > > > > u64 phys, u64 len, u32 flags); > > > > > > > > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > > > > > > > > -- > > > > > > > > 2.17.2 > > > > > > > > > > > > > > > > > > > > -- > > > > > > Carlos > > > > > > > > -- > > > > Carlos > > > > -- > > Carlos
On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote: > On Feb 7, 2019, at 10:02 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Thu, Feb 07, 2019 at 12:59:54PM +0100, Carlos Maiolino wrote: > >> On Wed, Feb 06, 2019 at 12:44:31PM -0800, Darrick J. Wong wrote: > >>> > >>> ...but from my point of view, the FIEMAP behavior *ought* to change to > >>> be more expressive. Once that's done, we can use the more expressive > >>> FIEMAP output to solve the problem of FIBMAP vs. multi-disk filesystems. > >>> > >>> The whole point of having fe_reserved* fields in struct fiemap_extent is > >>> so that we can add a new FIEMAP_EXTENT_ flag so that the filesystem can > >>> start returning data in a reserved field. New userspace programs that > >>> know about the flag can start reading information from the new field if > >>> they see the flag, and old userspace programs don't know about the flag > >>> and won't be any worse off. > >>> > >>> Perhaps it would help for me to present a more concrete proposal: > >>> > >>> --- a/include/uapi/linux/fiemap.h 2019-01-18 10:53:44.000000000 -0800 > >>> +++ b/include/uapi/linux/fiemap.h 2019-02-06 12:25:52.813935941 -0800 > >>> @@ -22,7 +22,19 @@ struct fiemap_extent { > >>> __u64 fe_length; /* length in bytes for this extent */ > >>> __u64 fe_reserved64[2]; > >>> __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > >>> - __u32 fe_reserved[3]; > >>> + > >>> + /* > >>> + * Underlying device that this extent is stored on. > >>> + * > >>> + * If FIEMAP_EXTENT_DEV_T is set, this field is a dev_t containing the > >>> + * major and minor numbers of a device. If FIEMAP_EXTENT_DEV_COOKIE is > >>> + * set, this field is a 32-bit cookie that can be used to distinguish > >>> + * between backing devices but has no intrinsic meaning. If neither > >>> + * EXTENT_DEV flag is set, this field is meaningless. Only one of the > >>> + * EXTENT_DEV flags may be set at any time. > >>> + */ > >>> + __u32 fe_device; > >>> + __u32 fe_reserved[2]; > >>> }; > >>> > >>> struct fiemap { > >>> @@ -66,5 +78,14 @@ struct fiemap { > >>> * merged for efficiency. */ > >>> #define FIEMAP_EXTENT_SHARED 0x00002000 /* Space shared with other > >>> * files. */ > >>> +#define FIEMAP_EXTENT_DEV_T 0x00004000 /* fe_device is a dev_t > >>> + * structure containing the > >>> + * major and minor numbers > >>> + * of a block device. */ > >>> +#define FIEMAP_EXTENT_DEV_COOKIE 0x00008000 /* fe_device is a 32-bit > >>> + * cookie that can be used > >>> + * to distinguish physical > >>> + * devices but otherwise > >>> + * has no meaning. */ > >>> > >>> #endif /* _LINUX_FIEMAP_H */ > >>> > >>> Under this scheme, XFS can set FIEMAP_EXTENT_DEV_T in fe_flags and start > >>> encoding fe_device = new_encode_dev(xfs_get_device_for_file()). > >>> > >>> Some clustered filesystem or whatever could set FIEMAP_EXTENT_DEV_COOKIE > >>> and encode the replica number in fe_device. > >>> > >> > >> All of this makes sense, but I'm struggling to understand what you mean by > >> replica number here, and why it justify a second flag. > > > > I left in the "device cookie" thing in the proposal to accomodate a > > request from the Lustre folks to be able to report which replica is > > storing a particular extent map. Apparently the replica id is simply a > > 32-bit number that isn't inherently useful, hence the vagueness around > > what "cookie" really means... > > > > ...oh, right, lustre fell out of drivers/staging/. You could probably > > leave it out then. > > Do we really need to be this way, about reserving a single flag for Lustre, > which will likely also be useful for other filesystems? It's not like > Lustre is some closed-source binary module for which we need to make life > difficult, it is used by many thousands of the largest computers at labs > and universities and companies around the world. We are working to clean > up the code outside the staging tree and resubmit it. Not reserving a flag > just means we will continue to use random values in Lustre before it can > be merged, which will make life harder when we try to merge again. > Agreed, it's a flag that may benefit different filesystems, not only lustre. > > In the case of Lustre, the proposed DEV_COOKIE would indicate fe_device is > the integer index number of the server on which each extent of the file is > located (Darrick's "replica number" term is not really correct). The server > index is familiar to all Lustre users, so having filefrag print out the > device number of "0000" or "0009" is totally clear to them. > Thanks for the info. /me doesn't know LustreFS. > For pNFS or Ceph or other network filesystems (if they implement filefrag) > it could be an index or some other number (e.g. the IP address of the server > or low bits of the UUID or whatever). Reading back in the archives of the > original FIEMAP discussion, it seems BtrFS would prefer to use DEV_COOKIE > instead of DEV_T, because it uses internal RAID encoding and not plain block > devices, but I'm not familiar with the details there. > > > Alternately, or in addition to, a DEV_COOKIE flag which indicates that the > same fe_device field is "not a device", it would be possible to add: > > #define FIEMAP_NO_DIRECT 0x40000000 > > and/or: > > #define FIEMAP_EXTENT_NET 0x80000000 /* Data stored remotely. > * Sets NO_DIRECT flag */ > > returned by the filesystem that indicates the extent blocks are not local > to the node, so FIBMAP should return an error (-EOPNOTSUP or -EREMOTE or > whatever) because the file can't be booted from. In that case, we could > return FIEMAP_EXTENT_DEVICE to indicate the fe_device field is valid, and > return FIEMAP_EXTENT_NET to indicate the values in fe_device are not local > block devices, just filesystem-specific values to distinguish devices. > > However, I'm open to both _DEV_COOKIE and _NET flag if that is preferred, > since I think the two are somewhat complementary. > I'd rather avoid going down this far in the rabbit hole. Once we have fe_device field and the basic flags, would be relatively easy to propose new flags, but now, new flags should be discussed on a patch proposal I believe. Discussing which flags should/shouldn't be added here, will be pointless. Let me work on the patchset to update FIEMAP, and then we can discuss such thing there. I'll Cc you on the patches if you want to. Cheers > >> This actually looks to contradict what you have been complaining, about some > >> filesystems which doesn't support FIBMAP currently, will now suddenly start to > >> support. Assuming it's ok if the implementation doesn't tell us about the > >> backing device, will simply make FIBMAP work. Let's say BTRFS doesn't report the > >> backing device, assuming it's ok will just fall into your first complain. > > > > Sorry, this thread has been going on so long that I forgot your goal for > > this series. :/ > > > > Specifically, I had forgotten that you're removing the ->bmap pointer, > > which means that filesystems don't have any particular way to signal > > "Yes on FIEMAP, no on FIBMAP". Somehow I had thought that you were > > merely creating a generic_file_bmap() that would call FIEMAP and ripping > > out all the adhoc bmap implementations. > > Just a reminder here, you should set FIEMAP_FLAG_SYNC when mapping FIBMAP > to FIEMAP so that the data on that file is flushed to disk before returning, > since the block mapping may not be assigned yet or may be unstable, which > could lead to an unbootable system if used for LILO. > > Cheers, Andreas > > >>> Existing filesystems can be left unchanged, in which case neither > >>> EXTENT_DEV flag is set in fe_flags and the bits in fe_device are > >>> meaningless, the same as they are today. Reporting fe_device is entirely > >>> optional. > >>> > >>> Userspace programs will now be able to tell which device the file data > >>> lives on, which has been sort-of requested for years, if the filesystem > >>> chooses to start exporting that information. > >>> > >>> Your FIBMAP-via-FIEMAP backend can do something like: > >>> > >>> /* FIBMAP only returns results for the same block device backing the fs. */ > >>> if ((fe->fe_flags & EXTENT_DEV_T) && fe->fe_device != inode->i_sb->sb_device) > >>> return 0; > >>> > >>> /* Can't tell what is the backing device, bail out. */ > >>> if (fe->fe_flags & EXTENT_DEV_COOKIE) > >>> return 0; > >>> > >> > >> Ok, the first conditional, is ok, the second one is not making sense to me. > >> Looks like you are basically using it to flag the filesystem can't tell > >> exactly which device the current extent is, let's say for example, distributed > >> filesystems, where the physical extent can actually be on a different machine. > >> But I can't say for sure, can you give me more details about what you are trying > >> to achieve here? > > > > You've understood me correctly. :) > > > >>> /* > >>> * Either fe_device matches the backing device or the implementation > >>> * doesn't tell us about the backing device, so assume it's ok. > >>> */ > >>> <return FIBMAP results> > >>> > >> > >> Anyway, I think I need to understand more your usage idea for EXTENT_DEV_COOKIE > >> you mentioned. > > > > I think you've understood it about as well as I can explain it. Maybe > > Andreas will have more to say about the lustre replica id, but OTOH it's > > gone and so there's no user of it, so we could just drop it until lustre > > comes back. > > > > >>> So that's how I'd solve a longstanding design problem of FIEMAP and then > >>> take advantage of that solution to remedy my objections to the proposed > >>> "Use FIEMAP for FIBMAP" series. It doesn't require a FIEMAP_FLAG > >>> behavior flag that userspace knows about but isn't allowed to pass in. > >>> > >> > >>>> A FIEMAP_FLAG will also require FS changes, but IMHO, less intrusive > >>>> than the device id in fiemap_extent. I don't see much advantage in > >>>> adding the device id instead of using the flag. > >>>> > >>>> A problem I see using a new FIEMAP_FLAG, is it 'could' be also passed via > >>>> userspace, so, it would require a check to make sure it didn't come from > >>>> userspace if ioctl_fiemap() was used. > >>>> > >>>> I think there are 2 other possibilities which can be used to fix this. > >>>> > >>>> - Use a boolean inside fiemap_extent_info to identify a fibmap call, or, > >>>> - If the device id is a must for you, maybe add the device id into > >>>> fiemap_extent_info instead of fiemap_extent. > >>> > >>> That won't work with btrfs, which can store file extents on multiple > >>> different physical devices. > >>> > >>>> So we don't mess with a UAPI exported data structure and still > >>>> provides a way to the filesystems to provide which device the mapped > >>>> extent is in. > >>>> > >>>> What you think? > >>>> > >>>> Cheers > >>>> > >>>> > >>>>> > >>>>> --D > >>>>> > >>>>>>> > >>>>>>>> + > >>>>>>>> + return error; > >>>>>>>> +} > >>>>>>>> + > >>>>>>>> /** > >>>>>>>> * bmap - find a block number in a file > >>>>>>>> * @inode: inode owning the block number being requested > >>>>>>>> @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); > >>>>>>>> */ > >>>>>>>> int bmap(struct inode *inode, sector_t *block) > >>>>>>>> { > >>>>>>>> - if (!inode->i_mapping->a_ops->bmap) > >>>>>>>> + if (inode->i_op->fiemap) > >>>>>>>> + return bmap_fiemap(inode, block); > >>>>>>>> + else if (inode->i_mapping->a_ops->bmap) > >>>>>>>> + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, > >>>>>>>> + *block); > >>>>>>>> + else > >>>>>>>> return -EINVAL; > >>>>>>> > >>>>>>> Waitaminute. btrfs currently supports fiemap but not bmap, and now > >>>>>>> suddenly it will support this legacy interface they've never supported > >>>>>>> before. Are they on board with this? > >>>>>>> > >>>>>>> --D > >>>>>>> > >>>>>>>> > >>>>>>>> - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); > >>>>>>>> return 0; > >>>>>>>> } > >>>>>>>> EXPORT_SYMBOL(bmap); > >>>>>>>> diff --git a/fs/ioctl.c b/fs/ioctl.c > >>>>>>>> index 6086978fe01e..bfa59df332bf 100644 > >>>>>>>> --- a/fs/ioctl.c > >>>>>>>> +++ b/fs/ioctl.c > >>>>>>>> @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, > >>>>>>>> return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > >>>>>>>> } > >>>>>>>> > >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, > >>>>>>>> + u64 phys, u64 len, u32 flags) > >>>>>>>> +{ > >>>>>>>> + struct fiemap_extent *extent = fieinfo->fi_extents_start; > >>>>>>>> + > >>>>>>>> + /* only count the extents */ > >>>>>>>> + if (fieinfo->fi_extents_max == 0) { > >>>>>>>> + fieinfo->fi_extents_mapped++; > >>>>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > >>>>>>>> + } > >>>>>>>> + > >>>>>>>> + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) > >>>>>>>> + return 1; > >>>>>>>> + > >>>>>>>> + if (flags & SET_UNKNOWN_FLAGS) > >>>>>>>> + flags |= FIEMAP_EXTENT_UNKNOWN; > >>>>>>>> + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) > >>>>>>>> + flags |= FIEMAP_EXTENT_ENCODED; > >>>>>>>> + if (flags & SET_NOT_ALIGNED_FLAGS) > >>>>>>>> + flags |= FIEMAP_EXTENT_NOT_ALIGNED; > >>>>>>>> + > >>>>>>>> + extent->fe_logical = logical; > >>>>>>>> + extent->fe_physical = phys; > >>>>>>>> + extent->fe_length = len; > >>>>>>>> + extent->fe_flags = flags; > >>>>>>>> + > >>>>>>>> + fieinfo->fi_extents_mapped++; > >>>>>>>> + > >>>>>>>> + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) > >>>>>>>> + return 1; > >>>>>>>> + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; > >>>>>>>> +} > >>>>>>>> /** > >>>>>>>> * fiemap_fill_next_extent - Fiemap helper function > >>>>>>>> * @fieinfo: Fiemap context passed into ->fiemap > >>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h > >>>>>>>> index 7a434979201c..28bb523d532a 100644 > >>>>>>>> --- a/include/linux/fs.h > >>>>>>>> +++ b/include/linux/fs.h > >>>>>>>> @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { > >>>>>>>> fiemap_fill_cb fi_cb; > >>>>>>>> }; > >>>>>>>> > >>>>>>>> +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, > >>>>>>>> + u64 phys, u64 len, u32 flags); > >>>>>>>> int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, > >>>>>>>> u64 phys, u64 len, u32 flags); > >>>>>>>> int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); > >>>>>>>> -- > >>>>>>>> 2.17.2 > >>>>>>>> > >>>>>> > >>>>>> -- > >>>>>> Carlos > >>>> > >>>> -- > >>>> Carlos > >> > >> -- > >> Carlos > > > Cheers, Andreas > > > > >
On Fri, Feb 08, 2019 at 09:46:12AM +0100, Christoph Hellwig wrote: > On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote: > > Do we really need to be this way, about reserving a single flag for Lustre, > > which will likely also be useful for other filesystems? It's not like > > Lustre is some closed-source binary module for which we need to make life > > difficult, it is used by many thousands of the largest computers at labs > > and universities and companies around the world. We are working to clean > > up the code outside the staging tree and resubmit it. Not reserving a flag > > just means we will continue to use random values in Lustre before it can > > be merged, which will make life harder when we try to merge again. > > No, it is available in source, but otherwise just as bad. And we generally > only define APIs for in-kernel usage. > > If we can come up with a good API for in-kernel filesystems we can do > that, otherwise hell no. And staging for that matter qualifies as out > of tree. > > That being said I'm really worried about these FIEMAP extensions as > userspace has no business poking into details of the placement (vs > just the layout). > I tend to say that identifying on which device an extent is is better than simply saying 'it maps to physical blocks X-Z, but it's your problem to identify which device X-Z belongs to'. > But all that belongs into a separate dicussion instead of dragging down > this series where it does not belong at all. Agreed, but now I'm on a kind of dead-end :P Darrick's concerns are valid, regarding letting currently unsupported filesystems to suddenly allow FIBMAP calls, but on the other hand, his proposed solution, which is also valid, requires a new discussion/patchset to discuss an improvement of the FIEMAP infra-structure, and 'fix' the problem mentioned. Using a flag to identify FIBMAP calls has been rejected. So, I'd accept suggestions on how to move this patch forward, without requiring the improvements suggested by Darrick, and, without using a flag to tag FIBMAP calls, as suggested by me, I'm kind of running out of ideas by now :( Cheers
On Feb 8, 2019, at 3:36 AM, Carlos Maiolino <cmaiolino@redhat.com> wrote: > > On Fri, Feb 08, 2019 at 09:46:12AM +0100, Christoph Hellwig wrote: >> On Thu, Feb 07, 2019 at 02:25:01PM -0700, Andreas Dilger wrote: >>> Do we really need to be this way, about reserving a single flag for Lustre, >>> which will likely also be useful for other filesystems? It's not like >>> Lustre is some closed-source binary module for which we need to make life >>> difficult, it is used by many thousands of the largest computers at labs >>> and universities and companies around the world. We are working to clean >>> up the code outside the staging tree and resubmit it. Not reserving a flag >>> just means we will continue to use random values in Lustre before it can >>> be merged, which will make life harder when we try to merge again. >> >> No, it is available in source, but otherwise just as bad. And we generally >> only define APIs for in-kernel usage. >> >> If we can come up with a good API for in-kernel filesystems we can do >> that, otherwise hell no. And staging for that matter qualifies as out >> of tree. >> >> That being said I'm really worried about these FIEMAP extensions as >> userspace has no business poking into details of the placement (vs >> just the layout). > > I tend to say that identifying on which device an extent is is better than > simply saying 'it maps to physical blocks X-Z, but it's your problem to identify > which device X-Z belongs to'. > >> But all that belongs into a separate dicussion instead of dragging down >> this series where it does not belong at all. > > Agreed, but now I'm on a kind of dead-end :P > > Darrick's concerns are valid, regarding letting currently unsupported > filesystems to suddenly allow FIBMAP calls, I don't think there is a huge danger of people suddenly moving to use LILO on f2fs or Btrfs with new kernels. In most cases, it _would_ just work, but the FIBMAP->FIEMAP layer needs to check for FIEMAP_FLAG_NOT_ALIGNED, FIEMAP_FLAG_ENCODED, and FIEMAP_FLAG_DEVICE flags that would make this unsuitable for booting. > but on the other hand, his proposed > solution, which is also valid, requires a new discussion/patchset to discuss an > improvement of the FIEMAP infra-structure, and 'fix' the problem mentioned. > Using a flag to identify FIBMAP calls has been rejected. So, I'd accept > suggestions on how to move this patch forward, without requiring the > improvements suggested by Darrick, and, without using a flag to tag FIBMAP > calls, as suggested by me, I'm kind of running out of ideas by now :( I think Darrick was against a flag like "FIEMAP_FLAG_FIBMAP" because it could be specified from userspace, and it is a bit ugly and has no other value than preventing FIBMAP from working on filesystems that don't support it today. As Christoph mentioned, such a flag could be OK as long as it is masked from userspace in the top-level ioctl_fiemap() handler (though to be honest, there is no benefit for a userspace app to set this flag, it would just increase the chance the ioctl(FIEMAP) call will fail). #define FIEMAP_FLAG_FIBMAP 0x80000000 Filesystems that don't want FIBMAP to work at all should return -ENOTTY from their ->fiemap() handler. That said, there is *still* a need for fe_device checking in ioctl_fibmap(), because for filesystems that allow both FIEMAP and FIBMAP (i.e. the most common ones like ext4 and XFS) there may still be reasons for FIBMAP to fail for some files if they are unsuitable (e.g. data stored on multiple devices). That isn't something that the filesystems should be checking themselves. Cheers, Andreas
On Fri, Feb 08, 2019 at 09:43:52AM +0100, Christoph Hellwig wrote: > Agreed. Please don't change the FIEMAP uapi. If we need to check > for a request coming from bmap just defined an internal FIEMAP flag > as the last available flag in the flags word, and reject it when > it comes from userspace in fiemap. Based on the new thread you started it seems like this go lost. I think we are 99% done with the bmap through fiemap series, and it is just missing this internal flag to make progress. Let's finish this off before starting another big project in the area.
On Mon, Feb 11, 2019 at 01:57:08PM +0100, Christoph Hellwig wrote: > On Fri, Feb 08, 2019 at 09:43:52AM +0100, Christoph Hellwig wrote: > > Agreed. Please don't change the FIEMAP uapi. If we need to check > > for a request coming from bmap just defined an internal FIEMAP flag > > as the last available flag in the flags word, and reject it when > > it comes from userspace in fiemap. > > Based on the new thread you started it seems like this go lost. I think > we are 99% done with the bmap through fiemap series, and it is just > missing this internal flag to make progress. Let's finish this off > before starting another big project in the area. I'm more than happy in see your reply, I'd love to finish this, but I got stuck in how to make some filesystems deny a FIBMAP call once they do support FIEMAP but not FIBMAP. I have this patch almost ready to go anyway. Do you agree in keep this flag in fi_flags field? Or maybe some other place, dunno, maybe a new fi_private field. >
On Mon, Feb 11, 2019 at 05:21:40PM +0100, Carlos Maiolino wrote: > I'm more than happy in see your reply, I'd love to finish this, but I got stuck > in how to make some filesystems deny a FIBMAP call once they do support FIEMAP > but not FIBMAP. > > I have this patch almost ready to go anyway. Do you agree in keep this flag in > fi_flags field? Or maybe some other place, dunno, maybe a new fi_private field. Have it in fi_flags as a purely in-kernel flag, and then deny the bmap calls through fiemap based on it.
diff --git a/fs/inode.c b/fs/inode.c index db681d310465..f07cc183ddbd 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1578,6 +1578,40 @@ void iput(struct inode *inode) } EXPORT_SYMBOL(iput); +static int bmap_fiemap(struct inode *inode, sector_t *block) +{ + struct fiemap_extent_info fieinfo = { 0, }; + struct fiemap_extent fextent; + u64 start = *block << inode->i_blkbits; + int error = -EINVAL; + + fextent.fe_logical = 0; + fextent.fe_physical = 0; + fieinfo.fi_extents_max = 1; + fieinfo.fi_extents_mapped = 0; + fieinfo.fi_extents_start = &fextent; + fieinfo.fi_start = start; + fieinfo.fi_len = 1 << inode->i_blkbits; + fieinfo.fi_flags = 0; + fieinfo.fi_cb = fiemap_fill_kernel_extent; + + error = inode->i_op->fiemap(inode, &fieinfo); + + if (error) + return error; + + if (fieinfo.fi_flags & (FIEMAP_EXTENT_UNKNOWN | + FIEMAP_EXTENT_ENCODED | + FIEMAP_EXTENT_DATA_INLINE | + FIEMAP_EXTENT_UNWRITTEN)) + return -EINVAL; + + *block = (fextent.fe_physical + + (start - fextent.fe_logical)) >> inode->i_blkbits; + + return error; +} + /** * bmap - find a block number in a file * @inode: inode owning the block number being requested @@ -1594,10 +1628,14 @@ EXPORT_SYMBOL(iput); */ int bmap(struct inode *inode, sector_t *block) { - if (!inode->i_mapping->a_ops->bmap) + if (inode->i_op->fiemap) + return bmap_fiemap(inode, block); + else if (inode->i_mapping->a_ops->bmap) + *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, + *block); + else return -EINVAL; - *block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block); return 0; } EXPORT_SYMBOL(bmap); diff --git a/fs/ioctl.c b/fs/ioctl.c index 6086978fe01e..bfa59df332bf 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -116,6 +116,38 @@ int fiemap_fill_user_extent(struct fiemap_extent_info *fieinfo, u64 logical, return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; } +int fiemap_fill_kernel_extent(struct fiemap_extent_info *fieinfo, u64 logical, + u64 phys, u64 len, u32 flags) +{ + struct fiemap_extent *extent = fieinfo->fi_extents_start; + + /* only count the extents */ + if (fieinfo->fi_extents_max == 0) { + fieinfo->fi_extents_mapped++; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; + } + + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) + return 1; + + if (flags & SET_UNKNOWN_FLAGS) + flags |= FIEMAP_EXTENT_UNKNOWN; + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) + flags |= FIEMAP_EXTENT_ENCODED; + if (flags & SET_NOT_ALIGNED_FLAGS) + flags |= FIEMAP_EXTENT_NOT_ALIGNED; + + extent->fe_logical = logical; + extent->fe_physical = phys; + extent->fe_length = len; + extent->fe_flags = flags; + + fieinfo->fi_extents_mapped++; + + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) + return 1; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; +} /** * fiemap_fill_next_extent - Fiemap helper function * @fieinfo: Fiemap context passed into ->fiemap diff --git a/include/linux/fs.h b/include/linux/fs.h index 7a434979201c..28bb523d532a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1711,6 +1711,8 @@ struct fiemap_extent_info { fiemap_fill_cb fi_cb; }; +int fiemap_fill_kernel_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags); int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, u64 phys, u64 len, u32 flags); int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
Enables the usage of FIEMAP ioctl infrastructure to handle FIBMAP calls. From now on, ->bmap() methods can start to be removed from filesystems which already provides ->fiemap(). This adds a new helper - bmap_fiemap() - which is used to fill in the fiemap request, call into the underlying filesystem and check the flags set in the extent requested. Add a new fiemap fill extent callback to handl the in-kernel only fiemap_extent structure used for FIBMAP. V2: - Now based on the updated fiemap_extent_info, - move the fiemap call itself to a new helper Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> --- fs/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++-- fs/ioctl.c | 32 ++++++++++++++++++++++++++++++++ include/linux/fs.h | 2 ++ 3 files changed, 74 insertions(+), 2 deletions(-)