UFS s_maxbytes bogosity

Message ID 20170604220602.GN6365@ZenIV.linux.org.uk (mailing list archive)
State New, archived

Commit Message

Al Viro June 4, 2017, 10:06 p.m. UTC
On Sun, Jun 04, 2017 at 10:58:38PM +0100, Al Viro wrote:

> u64 ufs_max_bytes(struct super_block *sb)
> {
> 	struct ufs_sb_private_info *uspi = UFS_SB(sb)->s_uspi;
> 	int bits = uspi->s_apbshift;
> 	u64 res;
> 
> 	if (bits > 21)
> 		return MAX_LFS_FILESIZE;
> 
> 	res = UFS_NDADDR + (1LL << bits) + (1LL << (2*bits)) +
> 		(1LL << (3*bits));
> 
> 	if (res >= (MAX_LFS_FILESIZE >> uspi->s_bshift))
> 		return MAX_LFS_FILESIZE;
> 
> 	return res << uspi->s_bshift;
> }
> 
> ought to calculate the right thing for modern UFS variants; I would
> leave anything other than UFS_MOUNT_UFSTYPE_44BSD and
> UFS_MOUNT_UFSTYPE_UFS2 alone.
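
A worked example of what that computes, assuming a UFS2 filesystem with
32K blocks and 4K fragments (an illustration, not a case from the
thread): UFS2 block pointers are 64-bit, so a block holds 32768/8 = 4096
of them, giving s_apbshift = 12 and s_bshift = 15.  Then

	res = UFS_NDADDR + 2^12 + 2^24 + 2^36	/* in filesystem blocks */

is dominated by the triple-indirect term, and the tree-imposed limit
comes out around 2^(36 + 15) = 2^51 bytes, i.e. ~2Pb.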

Now that I'm hopefully sufficiently awake...  Folks, could you try this:

Comments

Linus Torvalds June 4, 2017, 11:26 p.m. UTC | #1
On Sun, Jun 4, 2017 at 3:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Now that I'm hopefully sufficiently awake...  Folks, could you try this:

Ok, that looks believable, but complex. So it does make me wonder if
it's worth it, particularly considering that we don't really have a
maintainer, and it took people this long to even notice that huge
glaring 2GB limitation.

In fact, once we raise it past the 2GB limit, most of the s_maxbytes
reasons go away - we will already be passing around values that have
the high bit set in "int", and one of the main reasons for s_maxbytes
was to limit overflow damage in filesystem code that passed "int"
around where it shouldn't.

So assuming we trust UFS doesn't do that (and considering that it uses
the default VFS helpers for reading etc, it's presumably all good), we
might as well just use the MAX_LFS_FILESIZE define.
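
For reference, the 4.12-era definition in include/linux/fs.h (quoted
from memory here, so treat it as a sketch):

#if BITS_PER_LONG==32
#define MAX_LFS_FILESIZE	(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
#elif BITS_PER_LONG==64
#define MAX_LFS_FILESIZE	((loff_t)0x7fffffffffffffffLL)
#endif

The 32-bit value keeps the maximum page index within a signed long; the
64-bit one is simply LLONG_MAX.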

It's not as if we need to get s_maxbytes exactly right. All we
*really* care about is to get the LFS case ok for code that is limited
to 31 bits, and to not overflow the page index when we use the page
cache (which MAX_LFS_FILESIZE does already).

Past that, any extra precision can avoid a few unnecessary calls down
to the filesystem (ie not bother to do extra readpage calls for cases
we know aren't relevant), but it shouldn't be a big deal.

             Linus
Al Viro June 5, 2017, 12:11 a.m. UTC | #2
On Sun, Jun 04, 2017 at 04:26:51PM -0700, Linus Torvalds wrote:

> So assuming we trust UFS doesn't do that (and considering that it uses
> the default VFS helpers for reading etc, it's presumably all good), we
> might as well just use the MAX_LFS_FILESIZE define.

Umm...  Might start hitting this (in ufs_block_to_path()):
                ufs_warning(inode->i_sb, "ufs_block_to_path", "block > big");
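
For context, a paraphrase of that function from memory (ptrs here is
uspi->s_apb, ptrs_bits is uspi->s_apbshift; details may be off):
ufs_block_to_path() peels the logical block number against the direct,
single-, double- and triple-indirect ranges, and anything past the last
range earns the warning:

	if (i_block < UFS_NDADDR) {
		/* direct block */
	} else if ((i_block -= UFS_NDADDR) < ptrs) {
		/* one level of indirection */
	} else if ((i_block -= ptrs) < (1ULL << (ptrs_bits * 2))) {
		/* two levels */
	} else if (((i_block -= (1ULL << (ptrs_bits * 2))) >> (ptrs_bits * 2)) < ptrs) {
		/* three levels */
	} else {
		ufs_warning(inode->i_sb, "ufs_block_to_path", "block > big");
	}

so ->s_maxbytes has to stay within the triple-indirect range - which is
exactly what ufs_max_bytes() computes.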

> It's not as if we need to get s_maxbytes exactly right. All we
> *really* care about is to get the LFS case ok for code that is limited
> to 31 bits, and to not overflow the page index when we use the page
> cache (which MAX_LFS_FILESIZE does already).
> 
> Past that, any extra precision can avoid a few unnecessary calls down
> to the filesystem (ie not bother to do extra readpage calls for cases
> we know aren't relevant), but it shouldn't be a big deal.

For UFS2 - yes; for UFS1 we have another hard limit I'd missed.  i_blocks
is in half-kilobyte units there and it's 32bit on-disk.  So for UFS1 I'd
cap it at 1Tb (the real limit is ~2Tb - 2Mb, but accurate calculation
is a bit of a bother).  Come to think of it, the minimal block size for
UFS1 is 4K with pointers-per-block >= 1024.  So the tree-imposed limit is
no lower than 1024^3*4096, i.e. greater than the i_blocks limit, and we
could make ->s_maxbytes an unconditional 1Tb for UFS1.
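
Spelling out the two bounds being compared there (my arithmetic, as a
sketch):

	/* UFS1 on-disk i_blocks: a 32-bit count of 512-byte units;
	 * indirect blocks count too, presumably hence the ~2Tb - 2Mb */
	2^32 * 512 = 2^41 bytes = 2Tb

	/* smallest UFS1 tree: 4K blocks with 32-bit pointers,
	 * i.e. 4096/4 = 1024 pointers per indirect block */
	1024^3 * 4096 = 2^42 bytes = 4Tb

so the i_blocks limit binds first for any UFS1 geometry, and a flat 1Tb
cap is always safe.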
Linus Torvalds June 5, 2017, 3 a.m. UTC | #3
On Sun, Jun 4, 2017 at 5:11 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> For UFS2 - yes; for UFS1 we have another hard limit I'd missed.  i_blocks
> is in half-kilobyte units there and it's 32bit on-disk.  So for UFS1 I'd
> cap it at 1Tb (the real limit is ~2Tb - 2Mb, but accurate calculation
> is a bit of a bother).  Come to think of it, the minimal block size for
> UFS1 is 4K with pointers-per-block >= 1024.  So the tree-imposed limit is
> no lower than 1024^3*4096, i.e. greater than the i_blocks limit, and we
> could make ->s_maxbytes an unconditional 1Tb for UFS1.

The nblocks limit (and the 32-bit block numbers) might not limit a
sparse file, so I think the tree-imposed limit might be the final true
limit even on UFS1, no?

                      Linus
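
A hypothetical illustration of that point (not from the thread; the file
name is made up):

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("sparse", O_WRONLY | O_CREAT, 0644);

	/* One byte written ~3Tb into the file: if the file were fully
	 * allocated this would overflow UFS1's 32-bit i_blocks, but
	 * only a handful of data and indirect blocks actually get
	 * allocated, so i_blocks stays tiny and only the tree bounds
	 * the offset. */
	pwrite(fd, "x", 1, (off_t)3 << 40);
	return close(fd);
}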
Al Viro June 5, 2017, 3:49 a.m. UTC | #4
On Sun, Jun 04, 2017 at 08:00:26PM -0700, Linus Torvalds wrote:
> On Sun, Jun 4, 2017 at 5:11 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > For UFS2 - yes; for UFS1 we have another hard limit I'd missed.  i_blocks
> > is in half-kilobyte units there and it's 32bit on-disk.  So for UFS1 I'd
> > cap it at 1Tb (the real limit is ~2Tb - 2Mb, but accurate calculation
> > is a bit of a bother).  Come to think of it, the minimal block size for
> > UFS1 is 4K with pointers-per-block >= 1024.  So the tree-imposed limit is
> > no lower than 1024^3*4096, i.e. greater than the i_blocks limit, and we
> > could make ->s_maxbytes an unconditional 1Tb for UFS1.
> 
> The nblocks limit (and the 32-bit block numbers) might not limit a
> sparse file, so I think the tree-imposed limit might be the final true
> limit even on UFS1, no?

Yes, but... where would we notice that overflow on allocation?  Matter of fact...
<looks>
<starts swearing>
commit 8f45c33decf62e1aaaa9411aae8fef6a38f95845
Author: Jan Kara <jack@suse.cz>
Date:   Thu May 20 16:00:36 2010 +0200

    ufs: Remove dead quota code
    
    UFS quota is non-functional at least since 2.6.12 because dq_op was set
    to NULL. Since the filesystem exists mainly to allow cooperation with Solaris
    and quota format isn't standard, just remove the dead code.
    
    CC: Evgeniy Dushistov <dushistov@mail.ru>
    Signed-off-by: Jan Kara <jack@suse.cz>
<swears some more>

Jan?  You do realize that "non-functional" or not, dquot_alloc_block() updates
->i_bytes/->i_blocks, right?  And removing that should've had the updates put
back explicitly...

Grrr...  OK, I'll put together a patch fixing that idiocy.  As it is, rw mounts
of ufs leave on-disk ->i_blocks not updated and, AFAICS, that can lead to disk
space leaks on inode deletion.
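
A sketch of the kind of fixup implied here, assuming the generic i_bytes
helpers from fs/stat.c (the eventual patch may well look different):

	/* wherever the deleted quota calls accounted for `count`
	 * fragments, the ->i_blocks/->i_bytes update now has to be
	 * done explicitly: */
	inode_add_bytes(inode, count << uspi->s_fshift);

	/* ... and the matching update on the free path: */
	inode_sub_bytes(inode, count << uspi->s_fshift);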
Al Viro June 7, 2017, 11:48 p.m. UTC | #5
On Mon, Jun 05, 2017 at 04:49:36AM +0100, Al Viro wrote:

> Grrr...  OK, I'll put together a patch fixing that idiocy.  As it is, rw mounts
> of ufs leave on-disk ->i_blocks not updated and, AFAICS, that can lead to disk
> space leaks on inode deletion.

Turns out it's even worse than that, and some of the breakage in there is my fault.
I'll fix what I can, but for now I would suggest
	* making UFS_FS_WRITE depend on BROKEN; it has always been experimental,
but since 2010 it has been actively buggering filesystems (the aforementioned
bug of Jan's + a bug of mine in the 2015 series).  I'll fix that stuff, but I'd
expect it to take a month or so, including serious testing.  I have set things
up for such testing (which has promptly caught all kinds of fun), but we are at
-rc4 now, so it'll probably be next cycle's fodder, with interesting backporting
afterwards.
	* for ->s_maxbytes, let's go with the tree-imposed limit.  The proper way
of dealing with ->i_blocks overflows is to return -ENOSPC when an attempt to
grab a block would've caused such an overflow.
	* If Evgeniy can resurface and take part in that fun, I'd be happy to help.
If not, well... guess I'll take the thing over until somebody willing to adopt it
comes along.
Richard Narron June 8, 2017, 12:35 a.m. UTC | #6
On Thu, 8 Jun 2017, Al Viro wrote:

> On Mon, Jun 05, 2017 at 04:49:36AM +0100, Al Viro wrote:
>
>> Grrr...  OK, I'll put together a patch fixing that idiocy.  As it is, rw mounts
>> of ufs leave on-disk ->i_blocks not updated and, AFAICS, that can lead to disk
>> space leaks on inode deletion.
>
> Turns out it's even worse than that, and some of the breakage in there is my fault.
> I'll fix what I can, but for now I would suggest
> 	* making UFS_FS_WRITE depend on BROKEN; it has always been experimental,
> but since 2010 it has been actively buggering filesystems (the aforementioned
> bug of Jan's + a bug of mine in the 2015 series).  I'll fix that stuff, but I'd
> expect it to take a month or so, including serious testing.  I have set things
> up for such testing (which has promptly caught all kinds of fun), but we are at
> -rc4 now, so it'll probably be next cycle's fodder, with interesting backporting
> afterwards.
> 	* for ->s_maxbytes, let's go with the tree-imposed limit.  The proper way
> of dealing with ->i_blocks overflows is to return -ENOSPC when an attempt to
> grab a block would've caused such an overflow.
> 	* If Evgeniy can resurface and take part in that fun, I'd be happy to help.
> If not, well... guess I'll take the thing over until somebody willing to adopt it
> comes along.

I am willing to test.  I just turned on UFS_FS_WRITE for the very first 
time running 4.12-rc4 and was able to copy a file of more than 2GB from 
one r/o FreeBSD subpartition to another r/w FreeBSD subpartition.

So it is already looking pretty good.

I did notice an ->i_blocks bug where it was set to zero on the output file 
instead of the actual block count.  This showed up when I popped over to
FreeBSD 11.0 and did an fsck on the r/w subpartition.
Al Viro June 8, 2017, 2:20 a.m. UTC | #7
On Wed, Jun 07, 2017 at 05:35:31PM -0700, Richard Narron wrote:

> I am willing to test.  I just turned on UFS_FS_WRITE for the very first time
> running 4.12-rc4 and was able to copy a file of more than 2GB from one r/o
> FreeBSD subpartition to another r/w FreeBSD subpartition.
> 
> So it is already looking pretty good.

The nasty cases are around short files, especially short files with holes.
Linear writes as done by cp(1) will do nothing worse than bogus i_blocks
(and possibly mangled counters in cylinder groups).  Random write access
to short files, OTOH, steps into a lot more codepaths...

As for ->i_blocks, it triggers this:

root@kvm1:/mnt# df .; mkdir a; rmdir a; df .
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/loop0        507420  4504    462340   1% /mnt
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/loop0        507420  4536    462308   1% /mnt

Note the 32Kb (== one block on that ufs2) leaked here.
Every iteration will leak another one.  Similar for long
symlinks...
Al Viro June 8, 2017, 10:15 p.m. UTC | #8
On Thu, Jun 08, 2017 at 03:20:57AM +0100, Al Viro wrote:
> On Wed, Jun 07, 2017 at 05:35:31PM -0700, Richard Narron wrote:
> 
> > I am willing to test.  I just turned on UFS_FS_WRITE for the very first time
> > running 4.12-rc4 and was able to copy a file of more than 2GB from one r/o
> > FreeBSD subpartition to another r/w FreeBSD subpartition.
> > 
> > So it is already looking pretty good.
> 
> The nasty cases are around short files, especially short files with holes.
> Linear writes as done by cp(1) will do nothing worse than bogus i_blocks
> (and possibly mangled counters in cylinder groups).  Random write access
> to short files, OTOH, steps into a lot more codepaths...
> 
> As for ->i_blocks, it triggers this:
> 
> root@kvm1:/mnt# df .; mkdir a; rmdir a; df .
> Filesystem     1K-blocks  Used Available Use% Mounted on
> /dev/loop0        507420  4504    462340   1% /mnt
> Filesystem     1K-blocks  Used Available Use% Mounted on
> /dev/loop0        507420  4536    462308   1% /mnt
> 
> Note the 32Kb (== one block on that ufs2) leaked here.
> Every iteration will leak another one.  Similar for long
> symlinks...

Spot the bogosity:

static inline int _ubh_isblockset_(struct ufs_sb_private_info * uspi,
        struct ufs_buffer_head * ubh, unsigned begin, unsigned block)
{
        switch (uspi->s_fpb) {
        case 8:
                return (*ubh_get_addr (ubh, begin + block) == 0xff);
        case 4:
                return (*ubh_get_addr (ubh, begin + (block >> 1)) == (0x0f << ((block & 0x01) << 2)));
        case 2:
                return (*ubh_get_addr (ubh, begin + (block >> 2)) == (0x03 << ((block & 0x03) << 1)));
        case 1:
                return (*ubh_get_addr (ubh, begin + (block >> 3)) == (0x01 << (block & 0x07)));
        }
        return 0;
}

with

static inline void _ubh_setblock_(struct ufs_sb_private_info * uspi,
        struct ufs_buffer_head * ubh, unsigned begin, unsigned block)
{
        switch (uspi->s_fpb) {
        case 8:
                *ubh_get_addr(ubh, begin + block) = 0xff;
                return;
        case 4:
                *ubh_get_addr(ubh, begin + (block >> 1)) |= (0x0f << ((block & 0x01) << 2));
                return;
        case 2:
                *ubh_get_addr(ubh, begin + (block >> 2)) |= (0x03 << ((block & 0x03) << 1));
                return;
        case 1:
                *ubh_get_addr(ubh, begin + (block >> 3)) |= (0x01 << ((block & 0x07)));
                return;
        }
}

The only saving grace is that UFS defaults to 8 fragments per block...
Linus Torvalds June 8, 2017, 10:36 p.m. UTC | #9
On Thu, Jun 8, 2017 at 3:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Spot the bogosity:

Heh.. No masking on the testing side.

> The only saving grace is that UFS defaults to 8 fragments per block...

Yeah, that case just works, because it's the whole byte.

                Linus
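
For comparison, a masked version of the test - a sketch of the obvious
fix, not necessarily the patch that went in:

static inline int _ubh_isblockset_(struct ufs_sb_private_info * uspi,
        struct ufs_buffer_head * ubh, unsigned begin, unsigned block)
{
        u8 mask;

        switch (uspi->s_fpb) {
        case 8:
                return (*ubh_get_addr (ubh, begin + block) == 0xff);
        case 4:
                mask = 0x0f << ((block & 0x01) << 2);
                return (*ubh_get_addr (ubh, begin + (block >> 1)) & mask) == mask;
        case 2:
                mask = 0x03 << ((block & 0x03) << 1);
                return (*ubh_get_addr (ubh, begin + (block >> 2)) & mask) == mask;
        case 1:
                mask = 0x01 << (block & 0x07);
                return (*ubh_get_addr (ubh, begin + (block >> 3)) & mask) == mask;
        }
        return 0;
}

This way the sub-block cases test only their own bits, matching what
_ubh_setblock_() ORs in.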
Richard Narron June 9, 2017, 12:11 a.m. UTC | #10
On Thu, 8 Jun 2017, Al Viro wrote:

> The nasty cases are around short files, especially short files with holes.
> Linear writes as done by cp(1) will do nothing worse than bogus i_blocks
> (and possibly mangled counters in cylinder groups).  Random write access
> to short files, OTOH, steps into a lot more codepaths...
>
> As for ->i_blocks, it triggers this:
>
> root@kvm1:/mnt# df .; mkdir a; rmdir a; df .
> Filesystem     1K-blocks  Used Available Use% Mounted on
> /dev/loop0        507420  4504    462340   1% /mnt
> Filesystem     1K-blocks  Used Available Use% Mounted on
> /dev/loop0        507420  4536    462308   1% /mnt
>
> Note the 32Kb (== one block on that ufs2) leaked here.
> Every iteration will leak another one.  Similar for long
> symlinks...
>
Test results don't look pretty on FreeBSD.  (I will also test OpenBSD and 
NetBSD.)

FreeBSD:
   #newfs ada0s2d
Linux:
   #mount -t ufs -o rw,ufstype=ufs2 /dev/sda18 /fbsdd
   #cd /fbsdd
   #df .
   Filesystem     1K-blocks  Used Available Use% Mounted on
   /dev/sda18      20307184     8  18682632   1% /fbsdd
   #mkdir a
   #rmdir a
   #df .
   Filesystem     1K-blocks  Used Available Use% Mounted on
   /dev/sda18      20307184    40  18682600   1% /fbsdd
   #umount /dev/sda18
FreeBSD:
   # fsck -n /dev/ada0s2d
   ** /dev/ada0s2d (NO WRITE)
   ** Last Mounted on /diskd
   ** Phase 1 - Check Blocks and Sizes
   ** Phase 2 - Check Pathnames
   ** Phase 3 - Check Connectivity
   ** Phase 4 - Check Reference Counts
   ** Phase 5 - Check Cyl groups

   FREE BLK COUNT(S) WRONG IN SUPERBLK
   SALVAGE? no

   SUMMARY INFORMATION BAD
   SALVAGE? no

   BLK(S) MISSING IN BIT MAPS
   SALVAGE? no

   2 files, 2 used, 5076786 free (26 frags, 634595 blocks, 0.0%
   fragmentation)

Patch

diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 131b2b77c818..2bab1491a5d4 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -746,6 +746,24 @@ static void ufs_put_super(struct super_block *sb)
 	return;
 }
 
+static u64 ufs_max_bytes(struct super_block *sb)
+{
+	struct ufs_sb_private_info *uspi = UFS_SB(sb)->s_uspi;
+	int bits = uspi->s_apbshift;
+	u64 res;
+
+	if (bits > 21)
+		return MAX_LFS_FILESIZE;
+
+	res = UFS_NDADDR + (1LL << bits) + (1LL << (2*bits)) +
+		(1LL << (3*bits));
+
+	if (res >= (MAX_LFS_FILESIZE >> uspi->s_bshift))
+		return MAX_LFS_FILESIZE;
+
+	return res << uspi->s_bshift;
+}
+
 static int ufs_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct ufs_sb_info * sbi;
@@ -823,6 +841,7 @@ static int ufs_fill_super(struct super_block *sb, void *data, int silent)
 		uspi->s_fshift = 9;
 		uspi->s_sbsize = super_block_size = 1536;
 		uspi->s_sbbase = 0;
+		sb->s_maxbytes = ufs_max_bytes(sb);
 		flags |= UFS_DE_44BSD | UFS_UID_44BSD | UFS_ST_44BSD | UFS_CG_44BSD;
 		break;
 	case UFS_MOUNT_UFSTYPE_UFS2:
@@ -833,6 +852,7 @@  static int ufs_fill_super(struct super_block *sb, void *data, int silent)
 		uspi->s_fshift = 9;
 		uspi->s_sbsize = super_block_size = 1536;
 		uspi->s_sbbase =  0;
+		sb->s_maxbytes = ufs_max_bytes(sb);
 		flags |= UFS_TYPE_UFS2 | UFS_DE_44BSD | UFS_UID_44BSD | UFS_ST_44BSD | UFS_CG_44BSD;
 		break;