Message ID | 20170604220602.GN6365@ZenIV.linux.org.uk (mailing list archive)
---|---
State | New, archived
On Sun, Jun 4, 2017 at 3:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Now that I'm hopefully sufficiently awake... Folks, could you try this:

Ok, that looks believable, but complex. So it does make me wonder if it's
worth it, particularly considering that we don't really have a maintainer,
and it took people this long to even notice that huge glaring 2GB
limitation.

In fact, once we raise it past the 2GB limit, most of the s_maxbytes
reasons go away - we will already be passing around values that have the
high bit set in "int", and one of the main reasons for s_maxbytes was to
limit overflow damage in filesystem code that passed "int" around where
they shouldn't.

So assuming we trust UFS doesn't do that (and considering that it uses
the default VFS helpers for reading etc, it's presumably all good), we
might as well just use the MAX_LFS_FILESIZE define.

It's not as if we need to get s_maxbytes exactly right. All we *really*
care about is to get the LFS case ok for code that is limited to 31 bits,
and to not overflow the page index when we use the page cache (which
MAX_LFS_FILESIZE does already).

Past that, any extra precision can avoid a few unnecessary calls down to
the filesystem (ie not bother to do extra readpage calls for cases we know
aren't relevant), but it shouldn't be a big deal.

                Linus
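For context, MAX_LFS_FILESIZE is tied to the width of the page-cache index,
which is why it can be treated as the "don't overflow the page index" bound.
The definition in include/linux/fs.h around this time looked roughly like the
following (quoted from memory, so treat it as illustrative rather than
authoritative):

#if BITS_PER_LONG == 32
/* keep the byte offset, once shifted down by PAGE_SHIFT, within pgoff_t */
#define MAX_LFS_FILESIZE	(((loff_t)PAGE_SIZE << (BITS_PER_LONG - 1)) - 1)
#elif BITS_PER_LONG == 64
/* on 64-bit, loff_t itself is the only limit */
#define MAX_LFS_FILESIZE	((loff_t)0x7fffffffffffffffLL)
#endif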
On Sun, Jun 04, 2017 at 04:26:51PM -0700, Linus Torvalds wrote:

> So assuming we trust UFS doesn't do that (and considering that it uses
> the default VFS helpers for reading etc, it's presumably all good), we
> might as well just use the MAX_LFS_FILESIZE define.

Umm...  Might start hitting this (in ufs_block_to_path()):
		ufs_warning(inode->i_sb, "ufs_block_to_path", "block > big");

> It's not as if we need to get s_maxbytes exactly right. All we
> *really* care about is to get the LFS case ok for code that is limited
> to 31 bits, and to not overflow the page index when we use the page
> cache (which MAX_LFS_FILESIZE does already).
>
> Past that, any extra precision can avoid a few unnecessary calls down
> to the filesystem (ie not bother to do extra readpage calls for cases
> we know aren't relevant), but it shouldn't be a big deal.

For UFS2 - yes, for UFS1 we have another hard limit I'd missed.  i_blocks
is in half-kilobyte units there and it's 32bit on-disk.  So for UFS1 I'd
cap it with 1Tb (the real limit is ~ 2Tb - 2Mb, but accurate calculation
is a bit of a bother).  Come to think of that, the minimal block size for
UFS1 is 4K with pointers-per-block >= 1024.  So the tree-imposed limit is
no lower than 1024^3*4096, i.e. greater than that cap, and we could make
->s_maxbytes an unconditional 1Tb for UFS1.
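A quick back-of-the-envelope check of the two limits being compared here
(plain userspace arithmetic, just to make the numbers concrete):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* UFS1 keeps i_blocks as a 32-bit on-disk counter of 512-byte units. */
	uint64_t i_blocks_cap = (uint64_t)UINT32_MAX * 512;	/* just under 2 TiB */

	/* The smallest UFS1 block size is 4K, i.e. at least 1024 block
	 * pointers per indirect block, so the triple-indirect tree covers
	 * at least 1024^3 blocks of 4K each. */
	uint64_t tree_min = (uint64_t)1024 * 1024 * 1024 * 4096;	/* 4 TiB */

	printf("i_blocks cap : %llu bytes\n", (unsigned long long)i_blocks_cap);
	printf("tree minimum : %llu bytes\n", (unsigned long long)tree_min);
	/* The tree minimum exceeds the i_blocks cap, so on UFS1 the i_blocks
	 * counter is what binds, and a 1Tb s_maxbytes stays under both. */
	return 0;
}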
On Sun, Jun 4, 2017 at 5:11 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> For UFS2 - yes, for UFS1 we have another hard limit I'd missed.  i_blocks
> is in half-kilobyte units there and it's 32bit on-disk.  So for UFS1 I'd
> cap it with 1Tb (the real limit is ~ 2Tb - 2Mb, but accurate calculation
> is a bit of a bother).  Come to think of that, the minimal block size for
> UFS1 is 4K with pointers-per-block >= 1024.  So tree-imposed limit is
> no lower than 1024^3*4096, i.e. greater than that and we could make
> ->s_maxbytes unconditional 1Tb for UFS1.

The nblocks limit (and the 32-bit block numbers) might not limit a
sparse file, so I think the tree-imposed limit might be the final true
limit even on UFS1, no?

                Linus
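The sparse-file point can be seen with plain POSIX code: a file with holes
has a size unrelated to the number of blocks it actually consumes, so the
i_blocks counter only constrains allocated blocks, not st_size.  (Illustrative
only; the file name is made up, and whether the lseek/write succeeds on a
given filesystem depends on the very s_maxbytes being discussed.  Build with
-D_FILE_OFFSET_BITS=64 on 32-bit.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd = open("sparse.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	struct stat st;

	if (fd < 0)
		return 1;
	/* Seek to 3 TiB, past UFS1's ~2 TiB i_blocks-implied limit but under
	 * the >= 4 TiB tree limit, and write a single byte. */
	if (lseek(fd, (off_t)3 << 40, SEEK_SET) < 0 || write(fd, "x", 1) != 1)
		return 1;
	fstat(fd, &st);
	printf("st_size=%lld st_blocks=%lld\n",
	       (long long)st.st_size, (long long)st.st_blocks);
	close(fd);
	return 0;
}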
On Sun, Jun 04, 2017 at 08:00:26PM -0700, Linus Torvalds wrote:
> On Sun, Jun 4, 2017 at 5:11 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > For UFS2 - yes, for UFS1 we have another hard limit I'd missed.  i_blocks
> > is in half-kilobyte units there and it's 32bit on-disk.  So for UFS1 I'd
> > cap it with 1Tb (the real limit is ~ 2Tb - 2Mb, but accurate calculation
> > is a bit of a bother).  Come to think of that, the minimal block size for
> > UFS1 is 4K with pointers-per-block >= 1024.  So tree-imposed limit is
> > no lower than 1024^3*4096, i.e. greater than that and we could make
> > ->s_maxbytes unconditional 1Tb for UFS1.
>
> The nblocks limit (and the 32-bit block numbers) might not limit a
> sparse file, so I think the tree-imposed limit might be the final true
> limit even on UFS1, no?

Yes, but... where would we notice that overflow on allocation?  Matter of
fact... <looks> <starts swearing>

commit 8f45c33decf62e1aaaa9411aae8fef6a38f95845
Author: Jan Kara <jack@suse.cz>
Date:   Thu May 20 16:00:36 2010 +0200

    ufs: Remove dead quota code

    UFS quota is non-functional at least since 2.6.12 because dq_op was set
    to NULL. Since the filesystem exists mainly to allow cooperation with
    Solaris and quota format isn't standard, just remove the dead code.

    CC: Evgeniy Dushistov <dushistov@mail.ru>
    Signed-off-by: Jan Kara <jack@suse.cz>

<swears some more>

Jan?  You do realize that "non-functional" or not, dquot_alloc_block()
updates ->i_bytes/->i_blocks, right?  And removing that should've had the
updates put back explicitly...

Grrr...  OK, I'll put together a patch fixing that idiocy.  As it is, on
rw mounts of ufs the on-disk ->i_blocks is not updated and, AFAICS, that
can lead to disk space leaks on inode deletion.
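A minimal sketch of the kind of explicit accounting that has to come back
once the dquot calls are gone.  The helper names here are hypothetical, but
inode_add_bytes()/inode_sub_bytes() are the VFS primitives the quota code
used underneath to maintain ->i_bytes/->i_blocks:

/* Hypothetical kernel-context helpers, not the actual fix: every place the
 * allocator grabs or releases fragments would account them explicitly. */
static inline void ufs_inode_add_frags(struct inode *inode, unsigned count,
				       struct ufs_sb_private_info *uspi)
{
	inode_add_bytes(inode, (loff_t)count << uspi->s_fshift);
}

static inline void ufs_inode_sub_frags(struct inode *inode, unsigned count,
				       struct ufs_sb_private_info *uspi)
{
	inode_sub_bytes(inode, (loff_t)count << uspi->s_fshift);
}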
On Mon, Jun 05, 2017 at 04:49:36AM +0100, Al Viro wrote:

> Grrr...  OK, I'll put together a patch fixing that idiocy.  As it is, rw mounts
> of ufs on-disk ->i_blocks not updated and, AFAICS, can lead to disk space leaks
> on inode deletion.

Turns out it's even worse than that, and some of the breakage in there is
my fault.  I'll fix what I can, but for now I would suggest
	* making UFS_FS_WRITE depend on BROKEN; it had always been
experimental, but since 2010 it had been actively buggering filesystems
(the aforementioned bug of Jan's + a bug of mine in the 2015 series).
I'll fix that stuff, but I'd expect it to take a month or so, including
serious testing.  I have set things up for such testing (which has
promptly caught all kinds of fun), but we are at -rc4 now, so it'll
probably be next cycle's fodder, with interesting backporting afterwards.
	* for ->s_maxbytes, let's go with the tree-imposed limit.  The
proper way of dealing with ->i_blocks overflows is to return -ENOSPC when
an attempt to grab a block would've caused such an overflow (see the
sketch below).
	* If Evgeniy can resurface and take part in that fun, I'd be happy
to help.  If not, well... guess I'll take the thing over until somebody
willing to adopt it comes along.
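A minimal sketch of that -ENOSPC check, with hypothetical naming and
placement (kernel-context code; UFS1 stores i_blocks as a 32-bit count of
512-byte units on disk):

/* Hypothetical helper, not the eventual patch: refuse an allocation of
 * 'frags' fragments if it would wrap the on-disk 32-bit ->i_blocks. */
static int ufs1_i_blocks_would_overflow(struct inode *inode, unsigned frags,
					struct ufs_sb_private_info *uspi)
{
	u64 sectors = (u64)frags << (uspi->s_fshift - 9);	/* 512-byte units */

	if (inode->i_blocks + sectors > U32_MAX)
		return -ENOSPC;
	return 0;
}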
On Thu, 8 Jun 2017, Al Viro wrote:

> On Mon, Jun 05, 2017 at 04:49:36AM +0100, Al Viro wrote:
>
>> Grrr... OK, I'll put together a patch fixing that idiocy.  As it is, rw mounts
>> of ufs on-disk ->i_blocks not updated and, AFAICS, can lead to disk space leaks
>> on inode deletion.
>
> Turns out it's even worse than that, and some of the breakage in there is my fault.
> I'll fix what I can, but for now I would suggest
>	* making UFS_FS_WRITE depend on BROKEN; it had always been experimental,
> but since 2010 it had been actively buggering filesystems (aforementioned Jan's
> bug + a bug of mine in 2015 series).  I'll fix that stuff, but I'd expect
> it to take a month or so, including serious testing.  I have set the things up
> for such testing (which has promptly caught all kinds of fun), but we are at -rc4
> now, so it'll probably be the next cycle fodder, with interesting backporting
> afterwards.
>	* for ->s_maxbytes, let's go with tree-imposed limit.  Proper way of
> dealing with ->i_blocks overflows is -ENOSPC when attempt to grab a block would've
> caused such an overflow.
>	* If Evgeniy can resurface and take part in that fun, I'd be happy to help.
> If not, well... guess I'll take the thing over until somebody willing to adopt it
> comes along.

I am willing to test.  I just turned on UFS_FS_WRITE for the very first
time running 4.12-rc4 and was able to copy a file of more than 2GB from
one r/o FreeBSD subpartition to another r/w FreeBSD subpartition.

So it is already looking pretty good.

I did notice an ->i_blocks bug where it was set to zero on the output
file instead of the actual block count.  This showed up when I popped
over to FreeBSD 11.0 and did an fsck on the r/w subpartition.
On Wed, Jun 07, 2017 at 05:35:31PM -0700, Richard Narron wrote:

> I am willing to test. I just turned on UFS_FS_WRITE for the very first time
> running 4.12-rc4 and was able to copy a file of more than 2GB from one r/o
> FreeBSD subpartition to another r/w FreeBSD subpartition.
>
> So it is already looking pretty good.

The nasty cases are around short files, especially short files with holes.
Linear writes as done by cp(1) will do nothing worse than bogus i_blocks
(and possibly mangled counters in cylinder groups).  Random write access
to short files, OTOH, steps into a lot more codepaths...

As for ->i_blocks, it triggers this:

root@kvm1:/mnt# df .; mkdir a; rmdir a; df .
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/loop0        507420  4504    462340   1% /mnt
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/loop0        507420  4536    462308   1% /mnt

Note the 32Kb (== one block on that ufs2) leaked here.  Every iteration
will leak another one.  Similar for long symlinks...
On Thu, Jun 08, 2017 at 03:20:57AM +0100, Al Viro wrote:
> On Wed, Jun 07, 2017 at 05:35:31PM -0700, Richard Narron wrote:
>
> > I am willing to test. I just turned on UFS_FS_WRITE for the very first time
> > running 4.12-rc4 and was able to copy a file of more than 2GB from one r/o
> > FreeBSD subpartition to another r/w FreeBSD subpartition.
> >
> > So it is already looking pretty good.
>
> The nasty cases are around short files, especially short files with holes.
> Linear writes as done by cp(1) will do nothing worse than bogus i_blocks
> (and possibly mangled counters in cylinder groups).  Random write access
> to short files, OTOH, steps into a lot more codepaths...
>
> As for ->i_blocks, it triggers this:
>
> root@kvm1:/mnt# df .; mkdir a; rmdir a; df .
> Filesystem     1K-blocks  Used Available Use% Mounted on
> /dev/loop0        507420  4504    462340   1% /mnt
> Filesystem     1K-blocks  Used Available Use% Mounted on
> /dev/loop0        507420  4536    462308   1% /mnt
>
> Note the 32Kb (== one block on that ufs2) leaked here.
> Every iteration will leak another one.  Similar for long
> symlinks...

Spot the bogosity:

static inline int _ubh_isblockset_(struct ufs_sb_private_info * uspi,
	struct ufs_buffer_head * ubh, unsigned begin, unsigned block)
{
	switch (uspi->s_fpb) {
	case 8:
		return (*ubh_get_addr (ubh, begin + block) == 0xff);
	case 4:
		return (*ubh_get_addr (ubh, begin + (block >> 1)) == (0x0f << ((block & 0x01) << 2)));
	case 2:
		return (*ubh_get_addr (ubh, begin + (block >> 2)) == (0x03 << ((block & 0x03) << 1)));
	case 1:
		return (*ubh_get_addr (ubh, begin + (block >> 3)) == (0x01 << (block & 0x07)));
	}
	return 0;
}

with

static inline void _ubh_setblock_(struct ufs_sb_private_info * uspi,
	struct ufs_buffer_head * ubh, unsigned begin, unsigned block)
{
	switch (uspi->s_fpb) {
	case 8:
		*ubh_get_addr(ubh, begin + block) = 0xff;
		return;
	case 4:
		*ubh_get_addr(ubh, begin + (block >> 1)) |= (0x0f << ((block & 0x01) << 2));
		return;
	case 2:
		*ubh_get_addr(ubh, begin + (block >> 2)) |= (0x03 << ((block & 0x03) << 1));
		return;
	case 1:
		*ubh_get_addr(ubh, begin + (block >> 3)) |= (0x01 << ((block & 0x07)));
		return;
	}
}

The only saving grace is that UFS defaults to 8 fragments per block...
On Thu, Jun 8, 2017 at 3:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Spot the bogosity:

Heh.. No masking on the testing side.

> The only saving grace is that UFS defaults to 8 fragments per block...

Yeah, that case just works, because it's the whole byte.

                Linus
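For the sub-byte cases the obvious repair is to mask the byte before
comparing, along these lines (a sketch of the test side only, not
necessarily the patch that eventually went in):

static inline int _ubh_isblockset_(struct ufs_sb_private_info * uspi,
	struct ufs_buffer_head * ubh, unsigned begin, unsigned block)
{
	u8 mask;

	switch (uspi->s_fpb) {
	case 8:
		return (*ubh_get_addr (ubh, begin + block) == 0xff);
	case 4:
		/* a block is "set" only if all of its fragment bits are set */
		mask = 0x0f << ((block & 0x01) << 2);
		return (*ubh_get_addr (ubh, begin + (block >> 1)) & mask) == mask;
	case 2:
		mask = 0x03 << ((block & 0x03) << 1);
		return (*ubh_get_addr (ubh, begin + (block >> 2)) & mask) == mask;
	case 1:
		mask = 0x01 << (block & 0x07);
		return (*ubh_get_addr (ubh, begin + (block >> 3)) & mask) == mask;
	}
	return 0;
}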
On Thu, 8 Jun 2017, Al Viro wrote:

> The nasty cases are around short files, especially short files with holes.
> Linear writes as done by cp(1) will do nothing worse than bogus i_blocks
> (and possibly mangled counters in cylinder groups).  Random write access
> to short files, OTOH, steps into a lot more codepaths...
>
> As for ->i_blocks, it triggers this:
>
> root@kvm1:/mnt# df .; mkdir a; rmdir a; df .
> Filesystem     1K-blocks  Used Available Use% Mounted on
> /dev/loop0        507420  4504    462340   1% /mnt
> Filesystem     1K-blocks  Used Available Use% Mounted on
> /dev/loop0        507420  4536    462308   1% /mnt
>
> Note the 32Kb (== one block on that ufs2) leaked here.
> Every iteration will leak another one.  Similar for long
> symlinks...

Test results don't look pretty on FreeBSD.  (I will also test OpenBSD and
NetBSD.)

FreeBSD:
  #newfs ada0s2d

Linux:
  #mount -t ufs -o rw,ufstype=ufs2 /dev/sda18 /fbsdd
  cd /fbsdd
  #df .
  Filesystem     1K-blocks  Used Available Use% Mounted on
  /dev/sda18      20307184     8  18682632   1% /fbsdd
  #mkdir a
  #rmdir a
  #df .
  Filesystem     1K-blocks  Used Available Use% Mounted on
  /dev/sda18      20307184    40  18682600   1% /fbsdd
  #umount /dev/sda18

FreeBSD:
  # fsck -n /dev/ada0s2d
  ** /dev/ada0s2d (NO WRITE)
  ** Last Mounted on /diskd
  ** Phase 1 - Check Blocks and Sizes
  ** Phase 2 - Check Pathnames
  ** Phase 3 - Check Connectivity
  ** Phase 4 - Check Reference Counts
  ** Phase 5 - Check Cyl groups
  FREE BLK COUNT(S) WRONG IN SUPERBLK
  SALVAGE? no

  SUMMARY INFORMATION BAD
  SALVAGE? no

  BLK(S) MISSING IN BIT MAPS
  SALVAGE? no

  2 files, 2 used, 5076786 free (26 frags, 634595 blocks, 0.0% fragmentation)
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 131b2b77c818..2bab1491a5d4 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -746,6 +746,24 @@ static void ufs_put_super(struct super_block *sb)
 	return;
 }
 
+static u64 ufs_max_bytes(struct super_block *sb)
+{
+	struct ufs_sb_private_info *uspi = UFS_SB(sb)->s_uspi;
+	int bits = uspi->s_apbshift;
+	u64 res;
+
+	if (bits > 21)
+		return MAX_LFS_FILESIZE;
+
+	res = UFS_NDADDR + (1LL << bits) + (1LL << (2*bits)) +
+		(1LL << (3*bits));
+
+	if (res >= (MAX_LFS_FILESIZE >> uspi->s_bshift))
+		return MAX_LFS_FILESIZE;
+
+	return res << uspi->s_bshift;
+}
+
 static int ufs_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct ufs_sb_info * sbi;
@@ -823,6 +841,7 @@ static int ufs_fill_super(struct super_block *sb, void *data, int silent)
 		uspi->s_fshift = 9;
 		uspi->s_sbsize = super_block_size = 1536;
 		uspi->s_sbbase = 0;
+		sb->s_maxbytes = ufs_max_bytes(sb);
 		flags |= UFS_DE_44BSD | UFS_UID_44BSD | UFS_ST_44BSD | UFS_CG_44BSD;
 		break;
 	case UFS_MOUNT_UFSTYPE_UFS2:
@@ -833,6 +852,7 @@ static int ufs_fill_super(struct super_block *sb, void *data, int silent)
 		uspi->s_fshift = 9;
 		uspi->s_sbsize = super_block_size = 1536;
 		uspi->s_sbbase = 0;
+		sb->s_maxbytes = ufs_max_bytes(sb);
 		flags |= UFS_TYPE_UFS2 | UFS_DE_44BSD | UFS_UID_44BSD | UFS_ST_44BSD | UFS_CG_44BSD;
 		break;
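As a sanity check of what ufs_max_bytes() computes, here is the same
arithmetic for a UFS2 filesystem with 32K blocks and 8-byte block pointers
(as in the ufs2 image discussed above), so s_apbshift = 12 and s_bshift = 15.
The parameters are assumed for illustration; the arithmetic is standalone
userspace code:

#include <stdio.h>
#include <stdint.h>

#define UFS_NDADDR	12	/* direct block pointers, as in UFS/FFS */

int main(void)
{
	int apbshift = 12;	/* assumed: 32K block / 8-byte pointers */
	int bshift = 15;	/* assumed: log2(32K) */

	uint64_t blocks = UFS_NDADDR + (1ULL << apbshift) +
			  (1ULL << (2 * apbshift)) + (1ULL << (3 * apbshift));

	/* Dominated by the triple-indirect term: ~2^36 blocks of 32K each,
	 * i.e. roughly 2 PiB, comfortably below MAX_LFS_FILESIZE on 64-bit. */
	printf("tree-imposed limit: %llu bytes (~%llu TiB)\n",
	       (unsigned long long)(blocks << bshift),
	       (unsigned long long)((blocks << bshift) >> 40));
	return 0;
}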