diff mbox series

[RFC-PATCH] xfs: do not update sunit/swidth in the superblock to match those provided during mount

Message ID 1574359699-10191-1-git-send-email-alex@zadara.com (mailing list archive)
State New, archived
Headers show
Series [RFC-PATCH] xfs: do not update sunit/swidth in the superblock to match those provided during mount | expand

Commit Message

Alex Lyakas Nov. 21, 2019, 6:08 p.m. UTC
We are hitting the following issue: if XFS is mounted with sunit/swidth different from those
specified during mkfs, then xfs_repair reports false corruption and eventually segfaults.

Example:

# mkfs
mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda

#mount with a different sunit/swidth:
mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs

#umount
umount /mnt/xfs

#xfs_repair
xfs_repair -n /dev/vda
# reports false corruption and eventually segfaults[1]

The root cause seems to be that repair/xfs_repair.c::calc_mkfs() calculates the location of first inode chunk based on the current superblock sunit:
	/*
	 * ditto the location of the first inode chunks in the fs ('/')
	 */
	if (xfs_sb_version_hasdalign(&mp->m_sb) && do_inoalign)  {
		first_prealloc_ino = XFS_OFFBNO_TO_AGINO(mp, roundup(fino_bno,
					mp->m_sb.sb_unit), 0);
...

and then compares to the value in the superblock:

	/*
	 * now the first 3 inodes in the system
	 */
	if (mp->m_sb.sb_rootino != first_prealloc_ino)  {
		do_warn(
_("sb root inode value %" PRIu64 " %sinconsistent with calculated value %u\n"),
			mp->m_sb.sb_rootino,
			(mp->m_sb.sb_rootino == NULLFSINO ? "(NULLFSINO) ":""),
			first_prealloc_ino);

		if (!no_modify)
			do_warn(
		_("resetting superblock root inode pointer to %u\n"),
				first_prealloc_ino);
		else
			do_warn(
		_("would reset superblock root inode pointer to %u\n"),
				first_prealloc_ino);

		/*
		 * just set the value -- safe since the superblock
		 * doesn't get flushed out if no_modify is set
		 */
		mp->m_sb.sb_rootino = first_prealloc_ino;
	}

and sets the "correct" value into mp->m_sb.sb_rootino.

And from there xfs_repair uses the wrong value, leading to false corruption reports.

Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different.
The superblock values are not used during runtime.

With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.

However, I am not sure whether this is the proper approach. Otherwise, should we not allow specifying different sunit/swidth during mount?

[1]
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
sb root inode value 128 inconsistent with calculated value 96
would reset superblock root inode pointer to 96
sb realtime bitmap inode 129 inconsistent with calculated value 97
would reset superblock realtime bitmap ino pointer to 97
sb realtime summary inode 130 inconsistent with calculated value 98
would reset superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - 16:09:57: scanning filesystem freespace - 16 of 16 allocation groups done
root inode chunk not found
avl_insert: Warning! duplicate range [96,160]
add_inode - duplicate inode range
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 16:09:57: scanning agi unlinked lists - 16 of 16 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 15
        - agno = 0
inode 129 not rt bitmap
bad .. entry in directory inode 128, points to self, would clear inode number
inode 129 not rt bitmap
would fix bad flags.
...
Segmentation fault (core dumped)

Signed-off-by: Alex Lyakas <alex@zadara.com>
---
 fs/xfs/xfs_mount.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

Comments

Brian Foster Nov. 22, 2019, 3:43 p.m. UTC | #1
On Thu, Nov 21, 2019 at 08:08:19PM +0200, Alex Lyakas wrote:
> We are hitting the following issue: if XFS is mounted with sunit/swidth different from those
> specified during mkfs, then xfs_repair reports false corruption and eventually segfaults.
> 
> Example:
> 
> # mkfs
> mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
> 
> #mount with a different sunit/swidth:
> mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
> 

FYI, I couldn't reproduce this at first because sparse inodes is enabled
by default and that introduces more strict inode alignment requirements.
I'm assuming that sparse inodes is disabled in your example, but it
would be more helpful if you included the exact configuration and mkfs
output in such reports.

> #umount
> umount /mnt/xfs
> 
...
> 
> Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different.
> The superblock values are not used during runtime.
> 

I'm not really sure what the right answer is here. On one hand, this
patch seems fundamentally reasonable to me. I find it kind of odd for
mount options to override and persist configuration set in the
superblock like this. OTOH, this changes a historical behavior which may
(or may not) cause disruption for users. I also think it's somewhat
unfortunate to change kernel mount option behavior to accommodate
repair, but I think the mount option behavior being odd argument stands
on its own regardless.

What is your actual use case for changing the stripe unit/width at mount
time like this?

> With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> 
> However, I am not sure whether this is the proper approach. Otherwise, should we not allow specifying different sunit/swidth during mount?
> 
...
> 
> Signed-off-by: Alex Lyakas <alex@zadara.com>
> ---
>  fs/xfs/xfs_mount.c | 18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index ba5b6f3..e8263b4 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -399,19 +399,13 @@
>  		}
>  
>  		/*
> -		 * Update superblock with new values
> -		 * and log changes
> +		 * If sunit/swidth specified during mount do not match
> +		 * those in the superblock, use the mount-specified values,
> +		 * but do not update the superblock.
> +		 * Otherwise, xfs_repair reports false corruption.
> +		 * Here, only verify that superblock supports data alignment.
>  		 */
> -		if (xfs_sb_version_hasdalign(sbp)) {
> -			if (sbp->sb_unit != mp->m_dalign) {
> -				sbp->sb_unit = mp->m_dalign;
> -				mp->m_update_sb = true;
> -			}
> -			if (sbp->sb_width != mp->m_swidth) {
> -				sbp->sb_width = mp->m_swidth;
> -				mp->m_update_sb = true;
> -			}
> -		} else {
> +		if (!xfs_sb_version_hasdalign(sbp)) {

Would this change xfs_info behavior on a filesystem mounted with
different runtime fields from the superblock? I haven't tested it, but
it looks like we pull the fields from the superblock.

Brian

>  			xfs_warn(mp,
>  	"cannot change alignment: superblock does not support data alignment");
>  			return -EINVAL;
> -- 
> 1.9.1
>
Alex Lyakas Nov. 24, 2019, 9:13 a.m. UTC | #2
Hi Brian,

Thank you for your response.

On Fri, Nov 22, 2019 at 5:43 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Nov 21, 2019 at 08:08:19PM +0200, Alex Lyakas wrote:
> > We are hitting the following issue: if XFS is mounted with sunit/swidth different from those
> > specified during mkfs, then xfs_repair reports false corruption and eventually segfaults.
> >
> > Example:
> >
> > # mkfs
> > mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
> >
> > #mount with a different sunit/swidth:
> > mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
> >
>
> FYI, I couldn't reproduce this at first because sparse inodes is enabled
> by default and that introduces more strict inode alignment requirements.
> I'm assuming that sparse inodes is disabled in your example, but it
> would be more helpful if you included the exact configuration and mkfs
> output in such reports.
Providing more details about configuration:

kernel: 4.14.99
xfsprogs: 4.9.0+nmu1ubuntu2
content of /etc/zadara/xfs.protofile is captured in [1]
mkfs output is captured in [2]

>
> > #umount
> > umount /mnt/xfs
> >
> ...
> >
> > Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different.
> > The superblock values are not used during runtime.
> >
>
> I'm not really sure what the right answer is here. On one hand, this
> patch seems fundamentally reasonable to me. I find it kind of odd for
> mount options to override and persist configuration set in the
> superblock like this. OTOH, this changes a historical behavior which may
> (or may not) cause disruption for users. I also think it's somewhat
> unfortunate to change kernel mount option behavior to accommodate
> repair,
This is very critical for us to have a working repair in production. I
presume the same is true for most users.

> but I think the mount option behavior being odd argument stands
> on its own regardless.
>
> What is your actual use case for changing the stripe unit/width at mount
> time like this?
Our use case is like this: during mkfs the overall system does not
know yet the exact sunit/swidth to be used. Also, the underlying
storage can change its sunit/swidth alignment as part of some storage
migration scenarios. During mount we already know the proper
sunit/swidth. But the problem is that in order to specify sunit/swidth
during mount, XFS superblock must be marked as "supporting data
alignment", i.e., XFS_SB_VERSION_DALIGNBIT has to be set. Otherwise,
XFS refuses to mount and says:

            xfs_warn(mp,
    "cannot change alignment: superblock does not support data alignment");
            return -EINVAL;

In order for the superblock to be marked like this, *some*
sunit/swidth need to be specified during mkfs.

>
> > With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> >
> > However, I am not sure whether this is the proper approach. Otherwise, should we not allow specifying different sunit/swidth during mount?
> >
> ...
> >
> > Signed-off-by: Alex Lyakas <alex@zadara.com>
> > ---
> >  fs/xfs/xfs_mount.c | 18 ++++++------------
> >  1 file changed, 6 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index ba5b6f3..e8263b4 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -399,19 +399,13 @@
> >               }
> >
> >               /*
> > -              * Update superblock with new values
> > -              * and log changes
> > +              * If sunit/swidth specified during mount do not match
> > +              * those in the superblock, use the mount-specified values,
> > +              * but do not update the superblock.
> > +              * Otherwise, xfs_repair reports false corruption.
> > +              * Here, only verify that superblock supports data alignment.
> >                */
> > -             if (xfs_sb_version_hasdalign(sbp)) {
> > -                     if (sbp->sb_unit != mp->m_dalign) {
> > -                             sbp->sb_unit = mp->m_dalign;
> > -                             mp->m_update_sb = true;
> > -                     }
> > -                     if (sbp->sb_width != mp->m_swidth) {
> > -                             sbp->sb_width = mp->m_swidth;
> > -                             mp->m_update_sb = true;
> > -                     }
> > -             } else {
> > +             if (!xfs_sb_version_hasdalign(sbp)) {
>
> Would this change xfs_info behavior on a filesystem mounted with
> different runtime fields from the superblock? I haven't tested it, but
> it looks like we pull the fields from the superblock.
xfs_info uses XFS_IOC_FSGEOMETRY to get the values, and this pulls the
values from the run-time copy of the superblock:

int
xfs_fs_geometry(
    xfs_mount_t        *mp,
    xfs_fsop_geom_t        *geo,
    int            new_version)
...
    if (new_version >= 2) {
        geo->sunit = mp->m_sb.sb_unit;
        geo->swidth = mp->m_sb.sb_width;
    }

So if during mount we have updated the superblock, we will pull the
updated values. If we do not update (as the proposed patch), we will
report the values stored in the superblock. Perhaps we need to update
the geomtery ioctl to report runtime values?

Thanks,
Alex.

>
> Brian
>
> >                       xfs_warn(mp,
> >       "cannot change alignment: superblock does not support data alignment");
> >                       return -EINVAL;
> > --
> > 1.9.1
> >
>

[1]
root@vc-20-01-48-dev:~# cat /etc/zadara/xfs.protofile
dummy                   : bootfilename, not used, backward compatibility
0 0                             : numbers of blocks and inodes, not
used, backward compatibility
d--777 0 0              : set 777 perms for the root dir
$
$

[2]
root@vc-20-01-48-dev:~# mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d
sunit=64,swidth=64 -l sunit=32 /dev/vda
meta-data=/dev/vda               isize=512    agcount=16, agsize=163832 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0,
rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=2621312, imaxpct=25
         =                       sunit=8      swidth=8 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=4 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Darrick J. Wong Nov. 24, 2019, 4:40 p.m. UTC | #3
On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote:
> Hi Brian,
> 
> Thank you for your response.
> 
> On Fri, Nov 22, 2019 at 5:43 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Thu, Nov 21, 2019 at 08:08:19PM +0200, Alex Lyakas wrote:
> > > We are hitting the following issue: if XFS is mounted with sunit/swidth different from those
> > > specified during mkfs, then xfs_repair reports false corruption and eventually segfaults.
> > >
> > > Example:
> > >
> > > # mkfs
> > > mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
> > >
> > > #mount with a different sunit/swidth:
> > > mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
> > >
> >
> > FYI, I couldn't reproduce this at first because sparse inodes is enabled
> > by default and that introduces more strict inode alignment requirements.
> > I'm assuming that sparse inodes is disabled in your example, but it
> > would be more helpful if you included the exact configuration and mkfs
> > output in such reports.
> Providing more details about configuration:
> 
> kernel: 4.14.99
> xfsprogs: 4.9.0+nmu1ubuntu2
> content of /etc/zadara/xfs.protofile is captured in [1]
> mkfs output is captured in [2]
> 
> >
> > > #umount
> > > umount /mnt/xfs
> > >
> > ...
> > >
> > > Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different.
> > > The superblock values are not used during runtime.
> > >
> >
> > I'm not really sure what the right answer is here. On one hand, this
> > patch seems fundamentally reasonable to me. I find it kind of odd for
> > mount options to override and persist configuration set in the
> > superblock like this. OTOH, this changes a historical behavior which may
> > (or may not) cause disruption for users. I also think it's somewhat
> > unfortunate to change kernel mount option behavior to accommodate
> > repair,
> This is very critical for us to have a working repair in production. I
> presume the same is true for most users.
> 
> > but I think the mount option behavior being odd argument stands
> > on its own regardless.
> >
> > What is your actual use case for changing the stripe unit/width at mount
> > time like this?
> Our use case is like this: during mkfs the overall system does not
> know yet the exact sunit/swidth to be used. Also, the underlying
> storage can change its sunit/swidth alignment as part of some storage
> migration scenarios. During mount we already know the proper
> sunit/swidth. But the problem is that in order to specify sunit/swidth
> during mount, XFS superblock must be marked as "supporting data
> alignment", i.e., XFS_SB_VERSION_DALIGNBIT has to be set. Otherwise,
> XFS refuses to mount and says:
> 
>             xfs_warn(mp,
>     "cannot change alignment: superblock does not support data alignment");
>             return -EINVAL;
> 
> In order for the superblock to be marked like this, *some*
> sunit/swidth need to be specified during mkfs.
> 
> >
> > > With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> > >
> > > However, I am not sure whether this is the proper approach.
> > > Otherwise, should we not allow specifying different sunit/swidth
> > > during mount?

I propose a (somewhat) different solution to this problem:

Port to libxfs the code that determines where mkfs/repair expect the
root inode.  Whenever we want to update the geometry information in the
superblock from mount options, we can test the new ones to see if that
would cause sb_rootino to change.  If there's no change, we update
everything like we do now.  If it would change, either we run with those
parameters incore only (which I think is possible for su/sw?) or refuse
them (because corruption is bad).

This way we don't lose the su/sw updating behavior we have now, and we
also gain the ability to shut down an entire class of accidental sb
geometry corruptions.

--D

> > >
> > ...
> > >
> > > Signed-off-by: Alex Lyakas <alex@zadara.com>
> > > ---
> > >  fs/xfs/xfs_mount.c | 18 ++++++------------
> > >  1 file changed, 6 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index ba5b6f3..e8263b4 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > > @@ -399,19 +399,13 @@
> > >               }
> > >
> > >               /*
> > > -              * Update superblock with new values
> > > -              * and log changes
> > > +              * If sunit/swidth specified during mount do not match
> > > +              * those in the superblock, use the mount-specified values,
> > > +              * but do not update the superblock.
> > > +              * Otherwise, xfs_repair reports false corruption.
> > > +              * Here, only verify that superblock supports data alignment.
> > >                */
> > > -             if (xfs_sb_version_hasdalign(sbp)) {
> > > -                     if (sbp->sb_unit != mp->m_dalign) {
> > > -                             sbp->sb_unit = mp->m_dalign;
> > > -                             mp->m_update_sb = true;
> > > -                     }
> > > -                     if (sbp->sb_width != mp->m_swidth) {
> > > -                             sbp->sb_width = mp->m_swidth;
> > > -                             mp->m_update_sb = true;
> > > -                     }
> > > -             } else {
> > > +             if (!xfs_sb_version_hasdalign(sbp)) {
> >
> > Would this change xfs_info behavior on a filesystem mounted with
> > different runtime fields from the superblock? I haven't tested it, but
> > it looks like we pull the fields from the superblock.
> xfs_info uses XFS_IOC_FSGEOMETRY to get the values, and this pulls the
> values from the run-time copy of the superblock:
> 
> int
> xfs_fs_geometry(
>     xfs_mount_t        *mp,
>     xfs_fsop_geom_t        *geo,
>     int            new_version)
> ...
>     if (new_version >= 2) {
>         geo->sunit = mp->m_sb.sb_unit;
>         geo->swidth = mp->m_sb.sb_width;
>     }
> 
> So if during mount we have updated the superblock, we will pull the
> updated values. If we do not update (as the proposed patch), we will
> report the values stored in the superblock. Perhaps we need to update
> the geomtery ioctl to report runtime values?
> 
> Thanks,
> Alex.
> 
> >
> > Brian
> >
> > >                       xfs_warn(mp,
> > >       "cannot change alignment: superblock does not support data alignment");
> > >                       return -EINVAL;
> > > --
> > > 1.9.1
> > >
> >
> 
> [1]
> root@vc-20-01-48-dev:~# cat /etc/zadara/xfs.protofile
> dummy                   : bootfilename, not used, backward compatibility
> 0 0                             : numbers of blocks and inodes, not
> used, backward compatibility
> d--777 0 0              : set 777 perms for the root dir
> $
> $
> 
> [2]
> root@vc-20-01-48-dev:~# mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d
> sunit=64,swidth=64 -l sunit=32 /dev/vda
> meta-data=/dev/vda               isize=512    agcount=16, agsize=163832 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0,
> rmapbt=0, reflink=0
> data     =                       bsize=4096   blocks=2621312, imaxpct=25
>          =                       sunit=8      swidth=8 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=4 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
Eric Sandeen Nov. 24, 2019, 5:38 p.m. UTC | #4
On 11/24/19 10:40 AM, Darrick J. Wong wrote:
> On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote:

...

>>>> With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
>>>>
>>>> However, I am not sure whether this is the proper approach.
>>>> Otherwise, should we not allow specifying different sunit/swidth
>>>> during mount?
> 
> I propose a (somewhat) different solution to this problem:
> 
> Port to libxfs the code that determines where mkfs/repair expect the
> root inode.  Whenever we want to update the geometry information in the
> superblock from mount options, we can test the new ones to see if that
> would cause sb_rootino to change.  If there's no change, we update
> everything like we do now.  If it would change, either we run with those
> parameters incore only (which I think is possible for su/sw?) or refuse
> them (because corruption is bad).
> 
> This way we don't lose the su/sw updating behavior we have now, and we
> also gain the ability to shut down an entire class of accidental sb
> geometry corruptions.

I also wonder if we should be putting so much weight on the root inode
location in repair, or if we could get away with other consistency checks
to be sure it's legit, since we've always been able to move the
"expected" Location.
Brian Foster Nov. 25, 2019, 1:07 p.m. UTC | #5
On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote:
> Hi Brian,
> 
> Thank you for your response.
> 
> On Fri, Nov 22, 2019 at 5:43 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Thu, Nov 21, 2019 at 08:08:19PM +0200, Alex Lyakas wrote:
> > > We are hitting the following issue: if XFS is mounted with sunit/swidth different from those
> > > specified during mkfs, then xfs_repair reports false corruption and eventually segfaults.
> > >
> > > Example:
> > >
> > > # mkfs
> > > mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
> > >
> > > #mount with a different sunit/swidth:
> > > mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
> > >
> >
> > FYI, I couldn't reproduce this at first because sparse inodes is enabled
> > by default and that introduces more strict inode alignment requirements.
> > I'm assuming that sparse inodes is disabled in your example, but it
> > would be more helpful if you included the exact configuration and mkfs
> > output in such reports.
> Providing more details about configuration:
> 
> kernel: 4.14.99
> xfsprogs: 4.9.0+nmu1ubuntu2
> content of /etc/zadara/xfs.protofile is captured in [1]
> mkfs output is captured in [2]
> 
> >
> > > #umount
> > > umount /mnt/xfs
> > >
> > ...
> > >
> > > Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different.
> > > The superblock values are not used during runtime.
> > >
> >
> > I'm not really sure what the right answer is here. On one hand, this
> > patch seems fundamentally reasonable to me. I find it kind of odd for
> > mount options to override and persist configuration set in the
> > superblock like this. OTOH, this changes a historical behavior which may
> > (or may not) cause disruption for users. I also think it's somewhat
> > unfortunate to change kernel mount option behavior to accommodate
> > repair,
> This is very critical for us to have a working repair in production. I
> presume the same is true for most users.
> 

Of course.

> > but I think the mount option behavior being odd argument stands
> > on its own regardless.
> >
> > What is your actual use case for changing the stripe unit/width at mount
> > time like this?
> Our use case is like this: during mkfs the overall system does not
> know yet the exact sunit/swidth to be used. Also, the underlying
> storage can change its sunit/swidth alignment as part of some storage
> migration scenarios. During mount we already know the proper
> sunit/swidth. But the problem is that in order to specify sunit/swidth
> during mount, XFS superblock must be marked as "supporting data
> alignment", i.e., XFS_SB_VERSION_DALIGNBIT has to be set. Otherwise,
> XFS refuses to mount and says:
> 
>             xfs_warn(mp,
>     "cannot change alignment: superblock does not support data alignment");
>             return -EINVAL;
> 
> In order for the superblock to be marked like this, *some*
> sunit/swidth need to be specified during mkfs.
> 

What do you mean by "the overall system doesn't know the stripe unit
values" at mkfs time? Shouldn't these values originate from the
underlying storage? Is there some additional configuration at play in
this particular use case?

The migration use case certainly sounds valid, though my understanding
is that if an existing stripe configuration is in place, it's usually
best for the new storage configuration to take that into consideration
(and use some multiple of stripe unit or something, for example).

> >
> > > With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> > >
> > > However, I am not sure whether this is the proper approach. Otherwise, should we not allow specifying different sunit/swidth during mount?
> > >
> > ...
> > >
> > > Signed-off-by: Alex Lyakas <alex@zadara.com>
> > > ---
> > >  fs/xfs/xfs_mount.c | 18 ++++++------------
> > >  1 file changed, 6 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index ba5b6f3..e8263b4 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > > @@ -399,19 +399,13 @@
> > >               }
> > >
> > >               /*
> > > -              * Update superblock with new values
> > > -              * and log changes
> > > +              * If sunit/swidth specified during mount do not match
> > > +              * those in the superblock, use the mount-specified values,
> > > +              * but do not update the superblock.
> > > +              * Otherwise, xfs_repair reports false corruption.
> > > +              * Here, only verify that superblock supports data alignment.
> > >                */
> > > -             if (xfs_sb_version_hasdalign(sbp)) {
> > > -                     if (sbp->sb_unit != mp->m_dalign) {
> > > -                             sbp->sb_unit = mp->m_dalign;
> > > -                             mp->m_update_sb = true;
> > > -                     }
> > > -                     if (sbp->sb_width != mp->m_swidth) {
> > > -                             sbp->sb_width = mp->m_swidth;
> > > -                             mp->m_update_sb = true;
> > > -                     }
> > > -             } else {
> > > +             if (!xfs_sb_version_hasdalign(sbp)) {
> >
> > Would this change xfs_info behavior on a filesystem mounted with
> > different runtime fields from the superblock? I haven't tested it, but
> > it looks like we pull the fields from the superblock.
> xfs_info uses XFS_IOC_FSGEOMETRY to get the values, and this pulls the
> values from the run-time copy of the superblock:
> 
> int
> xfs_fs_geometry(
>     xfs_mount_t        *mp,
>     xfs_fsop_geom_t        *geo,
>     int            new_version)
> ...
>     if (new_version >= 2) {
>         geo->sunit = mp->m_sb.sb_unit;
>         geo->swidth = mp->m_sb.sb_width;
>     }
> 
> So if during mount we have updated the superblock, we will pull the
> updated values. If we do not update (as the proposed patch), we will
> report the values stored in the superblock. Perhaps we need to update
> the geomtery ioctl to report runtime values?
> 

Ok, thanks. I'd argue that we should return the runtime results if we
took the proposed approach, but we should probably settle on a solution
first..

Brian

> Thanks,
> Alex.
> 
> >
> > Brian
> >
> > >                       xfs_warn(mp,
> > >       "cannot change alignment: superblock does not support data alignment");
> > >                       return -EINVAL;
> > > --
> > > 1.9.1
> > >
> >
> 
> [1]
> root@vc-20-01-48-dev:~# cat /etc/zadara/xfs.protofile
> dummy                   : bootfilename, not used, backward compatibility
> 0 0                             : numbers of blocks and inodes, not
> used, backward compatibility
> d--777 0 0              : set 777 perms for the root dir
> $
> $
> 
> [2]
> root@vc-20-01-48-dev:~# mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d
> sunit=64,swidth=64 -l sunit=32 /dev/vda
> meta-data=/dev/vda               isize=512    agcount=16, agsize=163832 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0,
> rmapbt=0, reflink=0
> data     =                       bsize=4096   blocks=2621312, imaxpct=25
>          =                       sunit=8      swidth=8 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=4 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
Brian Foster Nov. 25, 2019, 1:07 p.m. UTC | #6
On Sun, Nov 24, 2019 at 11:38:53AM -0600, Eric Sandeen wrote:
> On 11/24/19 10:40 AM, Darrick J. Wong wrote:
> > On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote:
> 
> ...
> 
> >>>> With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> >>>>
> >>>> However, I am not sure whether this is the proper approach.
> >>>> Otherwise, should we not allow specifying different sunit/swidth
> >>>> during mount?
> > 
> > I propose a (somewhat) different solution to this problem:
> > 
> > Port to libxfs the code that determines where mkfs/repair expect the
> > root inode.  Whenever we want to update the geometry information in the
> > superblock from mount options, we can test the new ones to see if that
> > would cause sb_rootino to change.  If there's no change, we update
> > everything like we do now.  If it would change, either we run with those
> > parameters incore only (which I think is possible for su/sw?) or refuse
> > them (because corruption is bad).
> > 
> > This way we don't lose the su/sw updating behavior we have now, and we
> > also gain the ability to shut down an entire class of accidental sb
> > geometry corruptions.
> 

Indeed, I was thinking about something similar with regard to
validation. ISTM that we either need some form of runtime validation...

> I also wonder if we should be putting so much weight on the root inode
> location in repair, or if we could get away with other consistency checks
> to be sure it's legit, since we've always been able to move the
> "expected" Location.
> 

... or to fix xfs_repair. ;) Fixing the latter seems ideal to me, but
I'm not sure how involved that is compared to a runtime fix. Clearly the
existing repair check is not a sufficient corruption check on its own.
Perhaps we could validate the inode pointed to by the superblock in
general and if that survives, verify it looks like a root directory..?
The unexpected location thing could still be a (i.e. bad alignment)
warning, but that's probably a separate topic.

I'm not opposed to changing runtime behavior even with a repair fix,
fwiw. I wonder if conditionally updating the superblock is the right
behavior as it might be either too subtle for users or too disruptive if
some appliance out there happens to use a mount cycle to update su/sw.
Failing the mount seems preferable, but raises similar questions wrt to
changing behavior. Yes, it is corruption otherwise, but unless I'm
missing something it seems like a pretty rare corner case (e.g. how many
people change alignment like this? of those that do, how many ever run
xfs_repair?). To me, the ideal behavior is for mount options to always
dictate runtime behavior and for a separate admin tool or script to make
persistent changes (with associated validation) to the superblock.

Brian
Alex Lyakas Nov. 26, 2019, 8:49 a.m. UTC | #7
Hi Brian,

On Mon, Nov 25, 2019 at 3:07 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote:
> > Hi Brian,
> >
> > Thank you for your response.
> >
> > On Fri, Nov 22, 2019 at 5:43 PM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Thu, Nov 21, 2019 at 08:08:19PM +0200, Alex Lyakas wrote:
> > > > We are hitting the following issue: if XFS is mounted with sunit/swidth different from those
> > > > specified during mkfs, then xfs_repair reports false corruption and eventually segfaults.
> > > >
> > > > Example:
> > > >
> > > > # mkfs
> > > > mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
> > > >
> > > > #mount with a different sunit/swidth:
> > > > mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
> > > >
> > >
> > > FYI, I couldn't reproduce this at first because sparse inodes is enabled
> > > by default and that introduces more strict inode alignment requirements.
> > > I'm assuming that sparse inodes is disabled in your example, but it
> > > would be more helpful if you included the exact configuration and mkfs
> > > output in such reports.
> > Providing more details about configuration:
> >
> > kernel: 4.14.99
> > xfsprogs: 4.9.0+nmu1ubuntu2
> > content of /etc/zadara/xfs.protofile is captured in [1]
> > mkfs output is captured in [2]
> >
> > >
> > > > #umount
> > > > umount /mnt/xfs
> > > >
> > > ...
> > > >
> > > > Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different.
> > > > The superblock values are not used during runtime.
> > > >
> > >
> > > I'm not really sure what the right answer is here. On one hand, this
> > > patch seems fundamentally reasonable to me. I find it kind of odd for
> > > mount options to override and persist configuration set in the
> > > superblock like this. OTOH, this changes a historical behavior which may
> > > (or may not) cause disruption for users. I also think it's somewhat
> > > unfortunate to change kernel mount option behavior to accommodate
> > > repair,
> > This is very critical for us to have a working repair in production. I
> > presume the same is true for most users.
> >
>
> Of course.
>
> > > but I think the mount option behavior being odd argument stands
> > > on its own regardless.
> > >
> > > What is your actual use case for changing the stripe unit/width at mount
> > > time like this?
> > Our use case is like this: during mkfs the overall system does not
> > know yet the exact sunit/swidth to be used. Also, the underlying
> > storage can change its sunit/swidth alignment as part of some storage
> > migration scenarios. During mount we already know the proper
> > sunit/swidth. But the problem is that in order to specify sunit/swidth
> > during mount, XFS superblock must be marked as "supporting data
> > alignment", i.e., XFS_SB_VERSION_DALIGNBIT has to be set. Otherwise,
> > XFS refuses to mount and says:
> >
> >             xfs_warn(mp,
> >     "cannot change alignment: superblock does not support data alignment");
> >             return -EINVAL;
> >
> > In order for the superblock to be marked like this, *some*
> > sunit/swidth need to be specified during mkfs.
> >
>
> What do you mean by "the overall system doesn't know the stripe unit
> values" at mkfs time? Shouldn't these values originate from the
> underlying storage? Is there some additional configuration at play in
> this particular use case?
I checked the code, and it turns out that the only reason for
specifying sunit/swidth during mkfs is to mark the superblock as
"supporting data alignment". The real sunit/swidth is always
determined during mount, based on the underlying storage. The
significant property of the storage is the deduplication granularity;
we saw that we get best deduplication results when XFS sunit/swidth
are aligned to deduplication granularity. During storage migrations,
an XFS share can be migrated to a storage with different deduplication
parameters; hence we need to be able to update the runtime
sunit/swidth.

Thanks,
Alex.


>
> The migration use case certainly sounds valid, though my understanding
> is that if an existing stripe configuration is in place, it's usually
> best for the new storage configuration to take that into consideration
> (and use some multiple of stripe unit or something, for example).
>
> > >
> > > > With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> > > >
> > > > However, I am not sure whether this is the proper approach. Otherwise, should we not allow specifying different sunit/swidth during mount?
> > > >
> > > ...
> > > >
> > > > Signed-off-by: Alex Lyakas <alex@zadara.com>
> > > > ---
> > > >  fs/xfs/xfs_mount.c | 18 ++++++------------
> > > >  1 file changed, 6 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > > index ba5b6f3..e8263b4 100644
> > > > --- a/fs/xfs/xfs_mount.c
> > > > +++ b/fs/xfs/xfs_mount.c
> > > > @@ -399,19 +399,13 @@
> > > >               }
> > > >
> > > >               /*
> > > > -              * Update superblock with new values
> > > > -              * and log changes
> > > > +              * If sunit/swidth specified during mount do not match
> > > > +              * those in the superblock, use the mount-specified values,
> > > > +              * but do not update the superblock.
> > > > +              * Otherwise, xfs_repair reports false corruption.
> > > > +              * Here, only verify that superblock supports data alignment.
> > > >                */
> > > > -             if (xfs_sb_version_hasdalign(sbp)) {
> > > > -                     if (sbp->sb_unit != mp->m_dalign) {
> > > > -                             sbp->sb_unit = mp->m_dalign;
> > > > -                             mp->m_update_sb = true;
> > > > -                     }
> > > > -                     if (sbp->sb_width != mp->m_swidth) {
> > > > -                             sbp->sb_width = mp->m_swidth;
> > > > -                             mp->m_update_sb = true;
> > > > -                     }
> > > > -             } else {
> > > > +             if (!xfs_sb_version_hasdalign(sbp)) {
> > >
> > > Would this change xfs_info behavior on a filesystem mounted with
> > > different runtime fields from the superblock? I haven't tested it, but
> > > it looks like we pull the fields from the superblock.
> > xfs_info uses XFS_IOC_FSGEOMETRY to get the values, and this pulls the
> > values from the run-time copy of the superblock:
> >
> > int
> > xfs_fs_geometry(
> >     xfs_mount_t        *mp,
> >     xfs_fsop_geom_t        *geo,
> >     int            new_version)
> > ...
> >     if (new_version >= 2) {
> >         geo->sunit = mp->m_sb.sb_unit;
> >         geo->swidth = mp->m_sb.sb_width;
> >     }
> >
> > So if during mount we have updated the superblock, we will pull the
> > updated values. If we do not update (as the proposed patch), we will
> > report the values stored in the superblock. Perhaps we need to update
> > the geomtery ioctl to report runtime values?
> >
>
> Ok, thanks. I'd argue that we should return the runtime results if we
> took the proposed approach, but we should probably settle on a solution
> first..
>
> Brian
>
> > Thanks,
> > Alex.
> >
> > >
> > > Brian
> > >
> > > >                       xfs_warn(mp,
> > > >       "cannot change alignment: superblock does not support data alignment");
> > > >                       return -EINVAL;
> > > > --
> > > > 1.9.1
> > > >
> > >
> >
> > [1]
> > root@vc-20-01-48-dev:~# cat /etc/zadara/xfs.protofile
> > dummy                   : bootfilename, not used, backward compatibility
> > 0 0                             : numbers of blocks and inodes, not
> > used, backward compatibility
> > d--777 0 0              : set 777 perms for the root dir
> > $
> > $
> >
> > [2]
> > root@vc-20-01-48-dev:~# mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d
> > sunit=64,swidth=64 -l sunit=32 /dev/vda
> > meta-data=/dev/vda               isize=512    agcount=16, agsize=163832 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=0,
> > rmapbt=0, reflink=0
> > data     =                       bsize=4096   blocks=2621312, imaxpct=25
> >          =                       sunit=8      swidth=8 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal log           bsize=4096   blocks=2560, version=2
> >          =                       sectsz=512   sunit=4 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
>
Alex Lyakas Nov. 26, 2019, 8:50 a.m. UTC | #8
On Mon, Nov 25, 2019 at 3:07 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Sun, Nov 24, 2019 at 11:38:53AM -0600, Eric Sandeen wrote:
> > On 11/24/19 10:40 AM, Darrick J. Wong wrote:
> > > On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote:
> >
> > ...
> >
> > >>>> With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> > >>>>
> > >>>> However, I am not sure whether this is the proper approach.
> > >>>> Otherwise, should we not allow specifying different sunit/swidth
> > >>>> during mount?
> > >
> > > I propose a (somewhat) different solution to this problem:
> > >
> > > Port to libxfs the code that determines where mkfs/repair expect the
> > > root inode.  Whenever we want to update the geometry information in the
> > > superblock from mount options, we can test the new ones to see if that
> > > would cause sb_rootino to change.  If there's no change, we update
> > > everything like we do now.  If it would change, either we run with those
> > > parameters incore only (which I think is possible for su/sw?) or refuse
> > > them (because corruption is bad).
> > >
> > > This way we don't lose the su/sw updating behavior we have now, and we
> > > also gain the ability to shut down an entire class of accidental sb
> > > geometry corruptions.
> >
>
> Indeed, I was thinking about something similar with regard to
> validation. ISTM that we either need some form of runtime validation...
>
> > I also wonder if we should be putting so much weight on the root inode
> > location in repair, or if we could get away with other consistency checks
> > to be sure it's legit, since we've always been able to move the
> > "expected" Location.
> >
>
> ... or to fix xfs_repair. ;) Fixing the latter seems ideal to me, but
> I'm not sure how involved that is compared to a runtime fix. Clearly the
> existing repair check is not a sufficient corruption check on its own.
> Perhaps we could validate the inode pointed to by the superblock in
> general and if that survives, verify it looks like a root directory..?
> The unexpected location thing could still be a (i.e. bad alignment)
> warning, but that's probably a separate topic.
>
> I'm not opposed to changing runtime behavior even with a repair fix,
> fwiw. I wonder if conditionally updating the superblock is the right
> behavior as it might be either too subtle for users or too disruptive if
> some appliance out there happens to use a mount cycle to update su/sw.
> Failing the mount seems preferable, but raises similar questions wrt to
> changing behavior. Yes, it is corruption otherwise, but unless I'm
> missing something it seems like a pretty rare corner case (e.g. how many
> people change alignment like this? of those that do, how many ever run
> xfs_repair?).

>To me, the ideal behavior is for mount options to always
> dictate runtime behavior and for a separate admin tool or script to make
> persistent changes (with associated validation) to the superblock.
This sounds inline with the proposed patch.

>
> Brian
>
Brian Foster Nov. 26, 2019, 11:54 a.m. UTC | #9
On Tue, Nov 26, 2019 at 10:49:22AM +0200, Alex Lyakas wrote:
> Hi Brian,
> 
> On Mon, Nov 25, 2019 at 3:07 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote:
> > > Hi Brian,
> > >
> > > Thank you for your response.
> > >
> > > On Fri, Nov 22, 2019 at 5:43 PM Brian Foster <bfoster@redhat.com> wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 08:08:19PM +0200, Alex Lyakas wrote:
> > > > > We are hitting the following issue: if XFS is mounted with sunit/swidth different from those
> > > > > specified during mkfs, then xfs_repair reports false corruption and eventually segfaults.
> > > > >
> > > > > Example:
> > > > >
> > > > > # mkfs
> > > > > mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
> > > > >
> > > > > #mount with a different sunit/swidth:
> > > > > mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
> > > > >
> > > >
> > > > FYI, I couldn't reproduce this at first because sparse inodes is enabled
> > > > by default and that introduces more strict inode alignment requirements.
> > > > I'm assuming that sparse inodes is disabled in your example, but it
> > > > would be more helpful if you included the exact configuration and mkfs
> > > > output in such reports.
> > > Providing more details about configuration:
> > >
> > > kernel: 4.14.99
> > > xfsprogs: 4.9.0+nmu1ubuntu2
> > > content of /etc/zadara/xfs.protofile is captured in [1]
> > > mkfs output is captured in [2]
> > >
> > > >
> > > > > #umount
> > > > > umount /mnt/xfs
> > > > >
> > > > ...
> > > > >
> > > > > Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different.
> > > > > The superblock values are not used during runtime.
> > > > >
> > > >
> > > > I'm not really sure what the right answer is here. On one hand, this
> > > > patch seems fundamentally reasonable to me. I find it kind of odd for
> > > > mount options to override and persist configuration set in the
> > > > superblock like this. OTOH, this changes a historical behavior which may
> > > > (or may not) cause disruption for users. I also think it's somewhat
> > > > unfortunate to change kernel mount option behavior to accommodate
> > > > repair,
> > > This is very critical for us to have a working repair in production. I
> > > presume the same is true for most users.
> > >
> >
> > Of course.
> >
> > > > but I think the mount option behavior being odd argument stands
> > > > on its own regardless.
> > > >
> > > > What is your actual use case for changing the stripe unit/width at mount
> > > > time like this?
> > > Our use case is like this: during mkfs the overall system does not
> > > know yet the exact sunit/swidth to be used. Also, the underlying
> > > storage can change its sunit/swidth alignment as part of some storage
> > > migration scenarios. During mount we already know the proper
> > > sunit/swidth. But the problem is that in order to specify sunit/swidth
> > > during mount, XFS superblock must be marked as "supporting data
> > > alignment", i.e., XFS_SB_VERSION_DALIGNBIT has to be set. Otherwise,
> > > XFS refuses to mount and says:
> > >
> > >             xfs_warn(mp,
> > >     "cannot change alignment: superblock does not support data alignment");
> > >             return -EINVAL;
> > >
> > > In order for the superblock to be marked like this, *some*
> > > sunit/swidth need to be specified during mkfs.
> > >
> >
> > What do you mean by "the overall system doesn't know the stripe unit
> > values" at mkfs time? Shouldn't these values originate from the
> > underlying storage? Is there some additional configuration at play in
> > this particular use case?
> I checked the code, and it turns out that the only reason for
> specifying sunit/swidth during mkfs is to mark the superblock as
> "supporting data alignment". The real sunit/swidth is always
> determined during mount, based on the underlying storage. The
> significant property of the storage is the deduplication granularity;
> we saw that we get best deduplication results when XFS sunit/swidth
> are aligned to deduplication granularity. During storage migrations,
> an XFS share can be migrated to a storage with different deduplication
> parameters; hence we need to be able to update the runtime
> sunit/swidth.
> 

Ok, but that doesn't really answer my question of why can't you set it
at mkfs time? Presumably you have the same access to the storage at mkfs
time as at mount time? I can't really speak to how your dedup stuff
accounts for dynamic changes in relation to existing data that was
placed based an old stripe unit, etc., so I'll just assume that works
for your purposes in the migration case.

Note that Dave chimed in with some historical context on IRC:

...
14:31 <dchinner_> keep in mind that the sunit/swidth mount options were only added to work around a CXFS client mount bug that caused the superblock values on disk to get zeroed
14:31 <sandeen> hah, that rings a bell
14:32 <sandeen> still, it's been there for /years/ now
14:32 <dchinner_> it was never intended as a mechanism to actually change the layout
14:32 <djwong> aha, there's the context i was missing :)
14:32 <dchinner_> it turned out to be kinda useful for the crazy md raid restripers
14:32 <dchinner_> but, really, the use it's being put to here is just .... not supportable
14:34 <sandeen> usecase aside, it seems that if the kernel code allows you to rewrite it, xfs_repair shouldn't choke on the results  
14:34 <dchinner_> yes, you can do it, but you get to keep all the broken bits when your crazy storage changes alignments and layouts dynamically.
14:34 <dchinner_> the kernel doesn't validate it properly
14:34 <sandeen> ok so something  else to fix
14:35 <dchinner_> so if we are going to say "you can change it" then the kernel must validate that it is compatible with the existing on-disk layout
14:35 <dchinner_> i.e. root inode location, ag header location, inode cluster size and alignment, etc
...

My takeaway from that is the behavior of updating the superblock from
the mount options could probably be nuked off (as your patch does), but
I'd suggest to get feedback from Eric and Darrick to see whether they
agree or would prefer to maintain existing behavior with proper
validation...

Brian

> Thanks,
> Alex.
> 
> 
> >
> > The migration use case certainly sounds valid, though my understanding
> > is that if an existing stripe configuration is in place, it's usually
> > best for the new storage configuration to take that into consideration
> > (and use some multiple of stripe unit or something, for example).
> >
> > > >
> > > > > With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> > > > >
> > > > > However, I am not sure whether this is the proper approach. Otherwise, should we not allow specifying different sunit/swidth during mount?
> > > > >
> > > > ...
> > > > >
> > > > > Signed-off-by: Alex Lyakas <alex@zadara.com>
> > > > > ---
> > > > >  fs/xfs/xfs_mount.c | 18 ++++++------------
> > > > >  1 file changed, 6 insertions(+), 12 deletions(-)
> > > > >
> > > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > > > index ba5b6f3..e8263b4 100644
> > > > > --- a/fs/xfs/xfs_mount.c
> > > > > +++ b/fs/xfs/xfs_mount.c
> > > > > @@ -399,19 +399,13 @@
> > > > >               }
> > > > >
> > > > >               /*
> > > > > -              * Update superblock with new values
> > > > > -              * and log changes
> > > > > +              * If sunit/swidth specified during mount do not match
> > > > > +              * those in the superblock, use the mount-specified values,
> > > > > +              * but do not update the superblock.
> > > > > +              * Otherwise, xfs_repair reports false corruption.
> > > > > +              * Here, only verify that superblock supports data alignment.
> > > > >                */
> > > > > -             if (xfs_sb_version_hasdalign(sbp)) {
> > > > > -                     if (sbp->sb_unit != mp->m_dalign) {
> > > > > -                             sbp->sb_unit = mp->m_dalign;
> > > > > -                             mp->m_update_sb = true;
> > > > > -                     }
> > > > > -                     if (sbp->sb_width != mp->m_swidth) {
> > > > > -                             sbp->sb_width = mp->m_swidth;
> > > > > -                             mp->m_update_sb = true;
> > > > > -                     }
> > > > > -             } else {
> > > > > +             if (!xfs_sb_version_hasdalign(sbp)) {
> > > >
> > > > Would this change xfs_info behavior on a filesystem mounted with
> > > > different runtime fields from the superblock? I haven't tested it, but
> > > > it looks like we pull the fields from the superblock.
> > > xfs_info uses XFS_IOC_FSGEOMETRY to get the values, and this pulls the
> > > values from the run-time copy of the superblock:
> > >
> > > int
> > > xfs_fs_geometry(
> > >     xfs_mount_t        *mp,
> > >     xfs_fsop_geom_t        *geo,
> > >     int            new_version)
> > > ...
> > >     if (new_version >= 2) {
> > >         geo->sunit = mp->m_sb.sb_unit;
> > >         geo->swidth = mp->m_sb.sb_width;
> > >     }
> > >
> > > So if during mount we have updated the superblock, we will pull the
> > > updated values. If we do not update (as the proposed patch), we will
> > > report the values stored in the superblock. Perhaps we need to update
> > > the geomtery ioctl to report runtime values?
> > >
> >
> > Ok, thanks. I'd argue that we should return the runtime results if we
> > took the proposed approach, but we should probably settle on a solution
> > first..
> >
> > Brian
> >
> > > Thanks,
> > > Alex.
> > >
> > > >
> > > > Brian
> > > >
> > > > >                       xfs_warn(mp,
> > > > >       "cannot change alignment: superblock does not support data alignment");
> > > > >                       return -EINVAL;
> > > > > --
> > > > > 1.9.1
> > > > >
> > > >
> > >
> > > [1]
> > > root@vc-20-01-48-dev:~# cat /etc/zadara/xfs.protofile
> > > dummy                   : bootfilename, not used, backward compatibility
> > > 0 0                             : numbers of blocks and inodes, not
> > > used, backward compatibility
> > > d--777 0 0              : set 777 perms for the root dir
> > > $
> > > $
> > >
> > > [2]
> > > root@vc-20-01-48-dev:~# mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d
> > > sunit=64,swidth=64 -l sunit=32 /dev/vda
> > > meta-data=/dev/vda               isize=512    agcount=16, agsize=163832 blks
> > >          =                       sectsz=512   attr=2, projid32bit=1
> > >          =                       crc=1        finobt=1, sparse=0,
> > > rmapbt=0, reflink=0
> > > data     =                       bsize=4096   blocks=2621312, imaxpct=25
> > >          =                       sunit=8      swidth=8 blks
> > > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > > log      =internal log           bsize=4096   blocks=2560, version=2
> > >          =                       sectsz=512   sunit=4 blks, lazy-count=1
> > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > >
> >
>
Alex Lyakas Nov. 26, 2019, 1:37 p.m. UTC | #10
Hi Brian,

Thank you for your response.

On Tue, Nov 26, 2019 at 1:54 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Tue, Nov 26, 2019 at 10:49:22AM +0200, Alex Lyakas wrote:
> > Hi Brian,
> >
> > On Mon, Nov 25, 2019 at 3:07 PM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote:
> > > > Hi Brian,
> > > >
> > > > Thank you for your response.
> > > >
> > > > On Fri, Nov 22, 2019 at 5:43 PM Brian Foster <bfoster@redhat.com> wrote:
> > > > >
> > > > > On Thu, Nov 21, 2019 at 08:08:19PM +0200, Alex Lyakas wrote:
> > > > > > We are hitting the following issue: if XFS is mounted with sunit/swidth different from those
> > > > > > specified during mkfs, then xfs_repair reports false corruption and eventually segfaults.
> > > > > >
> > > > > > Example:
> > > > > >
> > > > > > # mkfs
> > > > > > mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
> > > > > >
> > > > > > #mount with a different sunit/swidth:
> > > > > > mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
> > > > > >
> > > > >
> > > > > FYI, I couldn't reproduce this at first because sparse inodes is enabled
> > > > > by default and that introduces more strict inode alignment requirements.
> > > > > I'm assuming that sparse inodes is disabled in your example, but it
> > > > > would be more helpful if you included the exact configuration and mkfs
> > > > > output in such reports.
> > > > Providing more details about configuration:
> > > >
> > > > kernel: 4.14.99
> > > > xfsprogs: 4.9.0+nmu1ubuntu2
> > > > content of /etc/zadara/xfs.protofile is captured in [1]
> > > > mkfs output is captured in [2]
> > > >
> > > > >
> > > > > > #umount
> > > > > > umount /mnt/xfs
> > > > > >
> > > > > ...
> > > > > >
> > > > > > Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different.
> > > > > > The superblock values are not used during runtime.
> > > > > >
> > > > >
> > > > > I'm not really sure what the right answer is here. On one hand, this
> > > > > patch seems fundamentally reasonable to me. I find it kind of odd for
> > > > > mount options to override and persist configuration set in the
> > > > > superblock like this. OTOH, this changes a historical behavior which may
> > > > > (or may not) cause disruption for users. I also think it's somewhat
> > > > > unfortunate to change kernel mount option behavior to accommodate
> > > > > repair,
> > > > This is very critical for us to have a working repair in production. I
> > > > presume the same is true for most users.
> > > >
> > >
> > > Of course.
> > >
> > > > > but I think the mount option behavior being odd argument stands
> > > > > on its own regardless.
> > > > >
> > > > > What is your actual use case for changing the stripe unit/width at mount
> > > > > time like this?
> > > > Our use case is like this: during mkfs the overall system does not
> > > > know yet the exact sunit/swidth to be used. Also, the underlying
> > > > storage can change its sunit/swidth alignment as part of some storage
> > > > migration scenarios. During mount we already know the proper
> > > > sunit/swidth. But the problem is that in order to specify sunit/swidth
> > > > during mount, XFS superblock must be marked as "supporting data
> > > > alignment", i.e., XFS_SB_VERSION_DALIGNBIT has to be set. Otherwise,
> > > > XFS refuses to mount and says:
> > > >
> > > >             xfs_warn(mp,
> > > >     "cannot change alignment: superblock does not support data alignment");
> > > >             return -EINVAL;
> > > >
> > > > In order for the superblock to be marked like this, *some*
> > > > sunit/swidth need to be specified during mkfs.
> > > >
> > >
> > > What do you mean by "the overall system doesn't know the stripe unit
> > > values" at mkfs time? Shouldn't these values originate from the
> > > underlying storage? Is there some additional configuration at play in
> > > this particular use case?
> > I checked the code, and it turns out that the only reason for
> > specifying sunit/swidth during mkfs is to mark the superblock as
> > "supporting data alignment". The real sunit/swidth is always
> > determined during mount, based on the underlying storage. The
> > significant property of the storage is the deduplication granularity;
> > we saw that we get best deduplication results when XFS sunit/swidth
> > are aligned to deduplication granularity. During storage migrations,
> > an XFS share can be migrated to a storage with different deduplication
> > parameters; hence we need to be able to update the runtime
> > sunit/swidth.
> >
>
> Ok, but that doesn't really answer my question of why can't you set it
> at mkfs time? Presumably you have the same access to the storage at mkfs
> time as at mount time?
Yes, we have the access and we can set the proper sunit/swidth during
mkfs. But our thinking was that during mkfs we want only to mark the
superblock as "supporting data alignment", which is what specifying
sunit/swidth during mkfs does. During mount, we look at the underlying
storage config, and pass the proper sunit/swidth. The underlying
storage could be different from one that existed at mkfs time. I hope
this explains our situation better.

>I can't really speak to how your dedup stuff
> accounts for dynamic changes in relation to existing data that was
> placed based an old stripe unit, etc., so I'll just assume that works
> for your purposes in the migration case.
>
> Note that Dave chimed in with some historical context on IRC:
>
> ...
> 14:31 <dchinner_> keep in mind that the sunit/swidth mount options were only added to work around a CXFS client mount bug that caused the superblock values on disk to get zeroed
> 14:31 <sandeen> hah, that rings a bell
> 14:32 <sandeen> still, it's been there for /years/ now
> 14:32 <dchinner_> it was never intended as a mechanism to actually change the layout
> 14:32 <djwong> aha, there's the context i was missing :)
> 14:32 <dchinner_> it turned out to be kinda useful for the crazy md raid restripers
> 14:32 <dchinner_> but, really, the use it's being put to here is just .... not supportable
> 14:34 <sandeen> usecase aside, it seems that if the kernel code allows you to rewrite it, xfs_repair shouldn't choke on the results
> 14:34 <dchinner_> yes, you can do it, but you get to keep all the broken bits when your crazy storage changes alignments and layouts dynamically.
> 14:34 <dchinner_> the kernel doesn't validate it properly
> 14:34 <sandeen> ok so something  else to fix
> 14:35 <dchinner_> so if we are going to say "you can change it" then the kernel must validate that it is compatible with the existing on-disk layout
> 14:35 <dchinner_> i.e. root inode location, ag header location, inode cluster size and alignment, etc
> ...
>
Fascinating!

From our perspective, when the underlying storage alignment changes,
we are ok for all already-allocated extents to keep their existing
alignment. We only want the newly-allocated extents to use the
newly-specified sunit/swidth. My understanding is that such filesystem
is not "broken bits", as everything is consistent, except different
extents may have different alignments.

> My takeaway from that is the behavior of updating the superblock from
> the mount options could probably be nuked off (as your patch does), but
> I'd suggest to get feedback from Eric and Darrick to see whether they
> agree or would prefer to maintain existing behavior with proper
> validation...

Let's wait for their additional feedback, then.

Thanks,
Alex.


>
> Brian
>
> > Thanks,
> > Alex.
> >
> >
> > >
> > > The migration use case certainly sounds valid, though my understanding
> > > is that if an existing stripe configuration is in place, it's usually
> > > best for the new storage configuration to take that into consideration
> > > (and use some multiple of stripe unit or something, for example).
> > >
> > > > >
> > > > > > With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> > > > > >
> > > > > > However, I am not sure whether this is the proper approach. Otherwise, should we not allow specifying different sunit/swidth during mount?
> > > > > >
> > > > > ...
> > > > > >
> > > > > > Signed-off-by: Alex Lyakas <alex@zadara.com>
> > > > > > ---
> > > > > >  fs/xfs/xfs_mount.c | 18 ++++++------------
> > > > > >  1 file changed, 6 insertions(+), 12 deletions(-)
> > > > > >
> > > > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > > > > index ba5b6f3..e8263b4 100644
> > > > > > --- a/fs/xfs/xfs_mount.c
> > > > > > +++ b/fs/xfs/xfs_mount.c
> > > > > > @@ -399,19 +399,13 @@
> > > > > >               }
> > > > > >
> > > > > >               /*
> > > > > > -              * Update superblock with new values
> > > > > > -              * and log changes
> > > > > > +              * If sunit/swidth specified during mount do not match
> > > > > > +              * those in the superblock, use the mount-specified values,
> > > > > > +              * but do not update the superblock.
> > > > > > +              * Otherwise, xfs_repair reports false corruption.
> > > > > > +              * Here, only verify that superblock supports data alignment.
> > > > > >                */
> > > > > > -             if (xfs_sb_version_hasdalign(sbp)) {
> > > > > > -                     if (sbp->sb_unit != mp->m_dalign) {
> > > > > > -                             sbp->sb_unit = mp->m_dalign;
> > > > > > -                             mp->m_update_sb = true;
> > > > > > -                     }
> > > > > > -                     if (sbp->sb_width != mp->m_swidth) {
> > > > > > -                             sbp->sb_width = mp->m_swidth;
> > > > > > -                             mp->m_update_sb = true;
> > > > > > -                     }
> > > > > > -             } else {
> > > > > > +             if (!xfs_sb_version_hasdalign(sbp)) {
> > > > >
> > > > > Would this change xfs_info behavior on a filesystem mounted with
> > > > > different runtime fields from the superblock? I haven't tested it, but
> > > > > it looks like we pull the fields from the superblock.
> > > > xfs_info uses XFS_IOC_FSGEOMETRY to get the values, and this pulls the
> > > > values from the run-time copy of the superblock:
> > > >
> > > > int
> > > > xfs_fs_geometry(
> > > >     xfs_mount_t        *mp,
> > > >     xfs_fsop_geom_t        *geo,
> > > >     int            new_version)
> > > > ...
> > > >     if (new_version >= 2) {
> > > >         geo->sunit = mp->m_sb.sb_unit;
> > > >         geo->swidth = mp->m_sb.sb_width;
> > > >     }
> > > >
> > > > So if during mount we have updated the superblock, we will pull the
> > > > updated values. If we do not update (as the proposed patch), we will
> > > > report the values stored in the superblock. Perhaps we need to update
> > > > the geomtery ioctl to report runtime values?
> > > >
> > >
> > > Ok, thanks. I'd argue that we should return the runtime results if we
> > > took the proposed approach, but we should probably settle on a solution
> > > first..
> > >
> > > Brian
> > >
> > > > Thanks,
> > > > Alex.
> > > >
> > > > >
> > > > > Brian
> > > > >
> > > > > >                       xfs_warn(mp,
> > > > > >       "cannot change alignment: superblock does not support data alignment");
> > > > > >                       return -EINVAL;
> > > > > > --
> > > > > > 1.9.1
> > > > > >
> > > > >
> > > >
> > > > [1]
> > > > root@vc-20-01-48-dev:~# cat /etc/zadara/xfs.protofile
> > > > dummy                   : bootfilename, not used, backward compatibility
> > > > 0 0                             : numbers of blocks and inodes, not
> > > > used, backward compatibility
> > > > d--777 0 0              : set 777 perms for the root dir
> > > > $
> > > > $
> > > >
> > > > [2]
> > > > root@vc-20-01-48-dev:~# mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d
> > > > sunit=64,swidth=64 -l sunit=32 /dev/vda
> > > > meta-data=/dev/vda               isize=512    agcount=16, agsize=163832 blks
> > > >          =                       sectsz=512   attr=2, projid32bit=1
> > > >          =                       crc=1        finobt=1, sparse=0,
> > > > rmapbt=0, reflink=0
> > > > data     =                       bsize=4096   blocks=2621312, imaxpct=25
> > > >          =                       sunit=8      swidth=8 blks
> > > > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > > > log      =internal log           bsize=4096   blocks=2560, version=2
> > > >          =                       sectsz=512   sunit=4 blks, lazy-count=1
> > > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > > >
> > >
> >
>
Eric Sandeen Nov. 26, 2019, 4:53 p.m. UTC | #11
On 11/26/19 7:37 AM, Alex Lyakas wrote:
> Hi Brian,
> 
> Thank you for your response.
> 
> On Tue, Nov 26, 2019 at 1:54 PM Brian Foster <bfoster@redhat.com> wrote:

...

> Fascinating!
> 
> From our perspective, when the underlying storage alignment changes,
> we are ok for all already-allocated extents to keep their existing
> alignment. We only want the newly-allocated extents to use the
> newly-specified sunit/swidth. My understanding is that such filesystem
> is not "broken bits", as everything is consistent, except different
> extents may have different alignments.
> 
>> My takeaway from that is the behavior of updating the superblock from
>> the mount options could probably be nuked off (as your patch does), but
>> I'd suggest to get feedback from Eric and Darrick to see whether they
>> agree or would prefer to maintain existing behavior with proper
>> validation...
> 
> Let's wait for their additional feedback, then.

What I'm trying to figure out is whether stripe geometry changes which alter
the expected root inode location are inherently problematic, and should be
rejected during mount processing, or if the only issue is that xfs_repair
assumes the geometry used in calculating root inode location will never change.

If the former, then we'd reject the mounts you're trying to do which have
"broken" xfs_repair's operation.

If the latter, then we could try to teach xfs_repair a different algorithm
for validating that the root inode in the superblock is correctly set.

-Eric

> Thanks,
> Alex.
Christoph Hellwig Nov. 27, 2019, 2:19 p.m. UTC | #12
Can we all take a little step back and think about the implications
of the original patch from Alex?  Because I think there is very little.
And updated sunit/swidth is just a little performance optimization,
and anyone who really cares about changing that after the fact can
trivially add those to fstab.

So I think something like his original patch plus a message during
mount that the new values are not persisted should be perfectly fine.

We can still make xfs_repair smarter about guessing the root inode
of course.
Brian Foster Nov. 27, 2019, 3:19 p.m. UTC | #13
On Wed, Nov 27, 2019 at 06:19:29AM -0800, Christoph Hellwig wrote:
> Can we all take a little step back and think about the implications
> of the original patch from Alex?  Because I think there is very little.
> And updated sunit/swidth is just a little performance optimization,
> and anyone who really cares about changing that after the fact can
> trivially add those to fstab.
> 
> So I think something like his original patch plus a message during
> mount that the new values are not persisted should be perfectly fine.
> 

I agree, FWIW. I've no issues with the original patch provided we fix up
the xfs_info behavior. I think the "historical behavior" argument is
reasonable, but at the same time how many people rely on the historical
behavior of updating the superblock? It's not like this changes the
mount api or anything. A user would just have to continue using the same
mount options to persist behavior.

Eric pointed out offline that we do refer to using the mount options to
"reset" su/sw in at least one place (repair?), but I don't see why we
couldn't rephrase that and/or provide a repair/admin script that updates
on-disk values. Still just my .02. :)

Brian

> We can still make xfs_repair smarter about guessing the root inode
> of course.
>
Dave Chinner Nov. 30, 2019, 8:28 p.m. UTC | #14
On Wed, Nov 27, 2019 at 06:19:29AM -0800, Christoph Hellwig wrote:
> Can we all take a little step back and think about the implications
> of the original patch from Alex?  Because I think there is very little.
> And updated sunit/swidth is just a little performance optimization,
> and anyone who really cares about changing that after the fact can
> trivially add those to fstab.
> 
> So I think something like his original patch plus a message during
> mount that the new values are not persisted should be perfectly fine.

Well, the original purpose of the mount options was to persist a new
sunit/swidth to the superblock...

Let's ignore the fact that it was a result of a CXFS client mount
bug trashing the existing sunit/swidth values, and instead focus on
the fact we've been telling people for years that you "only need to
set these once after a RAID reshape" and so we have a lot of users
out there expecting it to persist the new values...

I don't think we can just redefine the documented and expected
behaviour of a mount option like this.

With that in mind, the xfs(5) man page explicitly states this:

	The sunit and swidth parameters specified must be compatible
	with the existing filesystem alignment characteristics.  In
	general,  that  means  the  only  valid changes to sunit are
	increasing it by a power-of-2 multiple. Valid swidth values
	are any integer multiple of a valid sunit value.

Note the comment about changes to sunit? What is being done here -
halving the sunit from 64 to 32 blocks is invalid, documented as
invalid, but the kernel does not enforce this. We should fix the
kernel code to enforce the alignment rules that the mount option
is documented to require.

If we want to change the alignment characteristics after mkfs, then
use su=1,sw=1 as the initial values, then the first mount can use
the options to change it to whatever is present after mkfs has run.

Filesystems on storage that has dynamically changeable geometry
probably shouldn't be using fixed physical alignment in the first
place, though...

Cheers,

Dave.
Alex Lyakas Dec. 1, 2019, 9 a.m. UTC | #15
Hi Dave,

Thank you for your response.

On Sat, Nov 30, 2019 at 10:28 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Nov 27, 2019 at 06:19:29AM -0800, Christoph Hellwig wrote:
> > Can we all take a little step back and think about the implications
> > of the original patch from Alex?  Because I think there is very little.
> > And updated sunit/swidth is just a little performance optimization,
> > and anyone who really cares about changing that after the fact can
> > trivially add those to fstab.
> >
> > So I think something like his original patch plus a message during
> > mount that the new values are not persisted should be perfectly fine.
>
> Well, the original purpose of the mount options was to persist a new
> sunit/swidth to the superblock...
>
> Let's ignore the fact that it was a result of a CXFS client mount
> bug trashing the existing sunit/swidth values, and instead focus on
> the fact we've been telling people for years that you "only need to
> set these once after a RAID reshape" and so we have a lot of users
> out there expecting it to persist the new values...
>
> I don't think we can just redefine the documented and expected
> behaviour of a mount option like this.
>
> With that in mind, the xfs(5) man page explicitly states this:
>
>         The sunit and swidth parameters specified must be compatible
>         with the existing filesystem alignment characteristics.  In
>         general,  that  means  the  only  valid changes to sunit are
>         increasing it by a power-of-2 multiple. Valid swidth values
>         are any integer multiple of a valid sunit value.
>
> Note the comment about changes to sunit? What is being done here -
> halving the sunit from 64 to 32 blocks is invalid, documented as
> invalid, but the kernel does not enforce this. We should fix the
> kernel code to enforce the alignment rules that the mount option
> is documented to require.
>
> If we want to change the alignment characteristics after mkfs, then
> use su=1,sw=1 as the initial values, then the first mount can use
> the options to change it to whatever is present after mkfs has run.

If I understand your response correctly:
- some sunit/swidth changes during mount are legal and some aren't
- the legal changes should be persisted in the superblock

What about the repair? Even if user performs a legal change, it still
breaks the repairability of the file system.

For now, we made a local change to not persist sunit/swidth updates in
the superblock. Because we must have a working repair, and our kernel
(4.14 stable) allows any sunit/swidth changes.

We can definitely adhere to the recommended behavior of setting
sunit/swidth=1 during mkfs, provided the repair still works after
mounting with different sunit/swidth.

Thanks,
Alex.




>
> Filesystems on storage that has dynamically changeable geometry
> probably shouldn't be using fixed physical alignment in the first
> place, though...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Dave Chinner Dec. 1, 2019, 9:57 p.m. UTC | #16
On Sun, Dec 01, 2019 at 11:00:32AM +0200, Alex Lyakas wrote:
> Hi Dave,
> 
> Thank you for your response.
> 
> On Sat, Nov 30, 2019 at 10:28 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Nov 27, 2019 at 06:19:29AM -0800, Christoph Hellwig wrote:
> > > Can we all take a little step back and think about the implications
> > > of the original patch from Alex?  Because I think there is very little.
> > > And updated sunit/swidth is just a little performance optimization,
> > > and anyone who really cares about changing that after the fact can
> > > trivially add those to fstab.
> > >
> > > So I think something like his original patch plus a message during
> > > mount that the new values are not persisted should be perfectly fine.
> >
> > Well, the original purpose of the mount options was to persist a new
> > sunit/swidth to the superblock...
> >
> > Let's ignore the fact that it was a result of a CXFS client mount
> > bug trashing the existing sunit/swidth values, and instead focus on
> > the fact we've been telling people for years that you "only need to
> > set these once after a RAID reshape" and so we have a lot of users
> > out there expecting it to persist the new values...
> >
> > I don't think we can just redefine the documented and expected
> > behaviour of a mount option like this.
> >
> > With that in mind, the xfs(5) man page explicitly states this:
> >
> >         The sunit and swidth parameters specified must be compatible
> >         with the existing filesystem alignment characteristics.  In
> >         general,  that  means  the  only  valid changes to sunit are
> >         increasing it by a power-of-2 multiple. Valid swidth values
> >         are any integer multiple of a valid sunit value.
> >
> > Note the comment about changes to sunit? What is being done here -
> > halving the sunit from 64 to 32 blocks is invalid, documented as
> > invalid, but the kernel does not enforce this. We should fix the
> > kernel code to enforce the alignment rules that the mount option
> > is documented to require.
> >
> > If we want to change the alignment characteristics after mkfs, then
> > use su=1,sw=1 as the initial values, then the first mount can use
> > the options to change it to whatever is present after mkfs has run.
> 
> If I understand your response correctly:
> - some sunit/swidth changes during mount are legal and some aren't
> - the legal changes should be persisted in the superblock

Yup.

> What about the repair? Even if user performs a legal change, it still
> breaks the repairability of the file system.

It is not a legal ichange if it moves the root inode to a new
location. IOWs, if the alignment mods will result in the root inode
changing location, then it should be rejected by the kernel.

Anyway, we need more details about your test environment, because
the example you gave:

| # mkfs
| mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
| 
| #mount with a different sunit/swidth:
| mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
| 
| #umount
| umount /mnt/xfs
| 
| #xfs_repair
| xfs_repair -n /dev/vda
| # reports false corruption and eventually segfaults[1]

Does not reproduce the reported failure on my workstation running
v5.3.0 kernel and a v5.0.0 xfsprogs:

$ sudo mkfs.xfs -f -d file,name=1t.img,size=1t,sunit=64,swidth=64 -l sunit=32
meta-data=1t.img                 isize=512    agcount=32, agsize=8388608 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=268435456, imaxpct=5
         =                       sunit=8      swidth=8 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=4 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o loop -o sunit=32,swidth=32 1t.img /mnt/1t
$ sudo xfs_info /mnt/1t
....
data     =                       bsize=4096   blocks=268435456, imaxpct=5
         =                       sunit=4      swidth=4 blks
....
$ sudo umount /mnt/1t
$ sudo xfs_repair -f 1t.img 
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - 08:42:14: scanning filesystem freespace - 32 of 32 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
....
Phase 7 - verify and correct link counts...
        - 08:42:18: verify and correct link counts - 32 of 32 allocation groups done
done
$ echo $?
0
$ sudo mount -o loop 1t.img /mnt/1t
$ sudo xfs_info /mnt/1t
....
data     =                       bsize=4096   blocks=268435456, imaxpct=5
         =                       sunit=4      swidth=4 blks
....
$

So reducing the sunit doesn't necessarily change the root inode
location, and so in some cases reducing the sunit doesn't change
the root inode location, either.

> For now, we made a local change to not persist sunit/swidth updates in
> the superblock. Because we must have a working repair, and our kernel
> (4.14 stable) allows any sunit/swidth changes.

From the above, it's not clear where the problem lies - it may be
that there's a bug in repair we've fixed since whatever version you
are using....

> We can definitely adhere to the recommended behavior of setting
> sunit/swidth=1 during mkfs, provided the repair still works after
> mounting with different sunit/swidth.

... hence I'd suggest that more investigation needs to be done
before you do anything permanent...

Cheers,

Dave.
Dave Chinner Dec. 1, 2019, 11:29 p.m. UTC | #17
On Sun, Dec 01, 2019 at 11:00:32AM +0200, Alex Lyakas wrote:
> We can definitely adhere to the recommended behavior of setting
> sunit/swidth=1 during mkfs, provided the repair still works after
> mounting with different sunit/swidth.

FWIW, sunit/sw=1fsb doesn't work reliably either, if you grow the
sunit by a large enough value that it changes the alignment of the
root inode....

Essentially, there is no magic bullet here. We need to ensure that
the kernel doesn't do something that breaks assumptions about root
inode location, and we need to fix xfs_repair not to trash the root
inode without first checking that the current root inode pointer is
invalid...

Cheers,

Dave.
Alex Lyakas Dec. 2, 2019, 8:07 a.m. UTC | #18
Hi Dave,

On Sun, Dec 1, 2019 at 11:57 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Dec 01, 2019 at 11:00:32AM +0200, Alex Lyakas wrote:
> > Hi Dave,
> >
> > Thank you for your response.
> >
> > On Sat, Nov 30, 2019 at 10:28 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Nov 27, 2019 at 06:19:29AM -0800, Christoph Hellwig wrote:
> > > > Can we all take a little step back and think about the implications
> > > > of the original patch from Alex?  Because I think there is very little.
> > > > And updated sunit/swidth is just a little performance optimization,
> > > > and anyone who really cares about changing that after the fact can
> > > > trivially add those to fstab.
> > > >
> > > > So I think something like his original patch plus a message during
> > > > mount that the new values are not persisted should be perfectly fine.
> > >
> > > Well, the original purpose of the mount options was to persist a new
> > > sunit/swidth to the superblock...
> > >
> > > Let's ignore the fact that it was a result of a CXFS client mount
> > > bug trashing the existing sunit/swidth values, and instead focus on
> > > the fact we've been telling people for years that you "only need to
> > > set these once after a RAID reshape" and so we have a lot of users
> > > out there expecting it to persist the new values...
> > >
> > > I don't think we can just redefine the documented and expected
> > > behaviour of a mount option like this.
> > >
> > > With that in mind, the xfs(5) man page explicitly states this:
> > >
> > >         The sunit and swidth parameters specified must be compatible
> > >         with the existing filesystem alignment characteristics.  In
> > >         general,  that  means  the  only  valid changes to sunit are
> > >         increasing it by a power-of-2 multiple. Valid swidth values
> > >         are any integer multiple of a valid sunit value.
> > >
> > > Note the comment about changes to sunit? What is being done here -
> > > halving the sunit from 64 to 32 blocks is invalid, documented as
> > > invalid, but the kernel does not enforce this. We should fix the
> > > kernel code to enforce the alignment rules that the mount option
> > > is documented to require.
> > >
> > > If we want to change the alignment characteristics after mkfs, then
> > > use su=1,sw=1 as the initial values, then the first mount can use
> > > the options to change it to whatever is present after mkfs has run.
> >
> > If I understand your response correctly:
> > - some sunit/swidth changes during mount are legal and some aren't
> > - the legal changes should be persisted in the superblock
>
> Yup.
>
> > What about the repair? Even if user performs a legal change, it still
> > breaks the repairability of the file system.
>
> It is not a legal ichange if it moves the root inode to a new
> location. IOWs, if the alignment mods will result in the root inode
> changing location, then it should be rejected by the kernel.
>
> Anyway, we need more details about your test environment, because
> the example you gave:
>
> | # mkfs
> | mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
> |
> | #mount with a different sunit/swidth:
> | mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
> |
> | #umount
> | umount /mnt/xfs
> |
> | #xfs_repair
> | xfs_repair -n /dev/vda
> | # reports false corruption and eventually segfaults[1]
>
> Does not reproduce the reported failure on my workstation running
> v5.3.0 kernel and a v5.0.0 xfsprogs:
>
> $ sudo mkfs.xfs -f -d file,name=1t.img,size=1t,sunit=64,swidth=64 -l sunit=32
> meta-data=1t.img                 isize=512    agcount=32, agsize=8388608 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=268435456, imaxpct=5
>          =                       sunit=8      swidth=8 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=4 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

I believe in one of earlier emails in this thread, Brian mentioned
that this is related to "sparse inodes" being enabled:

"I couldn't reproduce this at first because sparse inodes is enabled
by default and that introduces more strict inode alignment requirements."

Can you please try mkfs'ing with spare inodes disabled?

Thanks,
Alex.


> $ sudo mount -o loop -o sunit=32,swidth=32 1t.img /mnt/1t
> $ sudo xfs_info /mnt/1t
> ....
> data     =                       bsize=4096   blocks=268435456, imaxpct=5
>          =                       sunit=4      swidth=4 blks
> ....
> $ sudo umount /mnt/1t
> $ sudo xfs_repair -f 1t.img
> Phase 1 - find and verify superblock...
>         - reporting progress in intervals of 15 minutes
> Phase 2 - using internal log
>         - zero log...
>         - scan filesystem freespace and inode maps...
>         - 08:42:14: scanning filesystem freespace - 32 of 32 allocation groups done
>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan and clear agi unlinked lists...
> ....
> Phase 7 - verify and correct link counts...
>         - 08:42:18: verify and correct link counts - 32 of 32 allocation groups done
> done
> $ echo $?
> 0
> $ sudo mount -o loop 1t.img /mnt/1t
> $ sudo xfs_info /mnt/1t
> ....
> data     =                       bsize=4096   blocks=268435456, imaxpct=5
>          =                       sunit=4      swidth=4 blks
> ....
> $
>
> So reducing the sunit doesn't necessarily change the root inode
> location, and so in some cases reducing the sunit doesn't change
> the root inode location, either.
>
> > For now, we made a local change to not persist sunit/swidth updates in
> > the superblock. Because we must have a working repair, and our kernel
> > (4.14 stable) allows any sunit/swidth changes.
>
> From the above, it's not clear where the problem lies - it may be
> that there's a bug in repair we've fixed since whatever version you
> are using....
>
> > We can definitely adhere to the recommended behavior of setting
> > sunit/swidth=1 during mkfs, provided the repair still works after
> > mounting with different sunit/swidth.
>
> ... hence I'd suggest that more investigation needs to be done
> before you do anything permanent...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
diff mbox series

Patch

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index ba5b6f3..e8263b4 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -399,19 +399,13 @@ 
 		}
 
 		/*
-		 * Update superblock with new values
-		 * and log changes
+		 * If sunit/swidth specified during mount do not match
+		 * those in the superblock, use the mount-specified values,
+		 * but do not update the superblock.
+		 * Otherwise, xfs_repair reports false corruption.
+		 * Here, only verify that superblock supports data alignment.
 		 */
-		if (xfs_sb_version_hasdalign(sbp)) {
-			if (sbp->sb_unit != mp->m_dalign) {
-				sbp->sb_unit = mp->m_dalign;
-				mp->m_update_sb = true;
-			}
-			if (sbp->sb_width != mp->m_swidth) {
-				sbp->sb_width = mp->m_swidth;
-				mp->m_update_sb = true;
-			}
-		} else {
+		if (!xfs_sb_version_hasdalign(sbp)) {
 			xfs_warn(mp,
 	"cannot change alignment: superblock does not support data alignment");
 			return -EINVAL;