diff mbox

Kernels v4.9+ cause short reads of block devices

Message ID CA+55aFzukw-6m0vtm6jDn4kXyfcUQbDFSwt1OOoa9nwG6toycQ@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Linus Torvalds Aug. 23, 2017, 7:37 p.m. UTC
On Wed, Aug 23, 2017 at 12:15 PM, Doug Nazar <nazard@nazar.ca> wrote:
> The following commits cause short reads of block devices, however writes are
> still allowed.
>
> c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
> d05c5f7ba164 ("vfs,mm: fix return value of read() at s_maxbytes")
>
> When e2fsck sees this, it thinks it's a bad sector and tries to write a
> block of nulls which overwrites the valid data.

Hmm. Block devices shouldn't have issues with s_maxbytes, and I'm
surprised that nobody has seen that before.

> Device is LVM over 2 x RAID-5 on an old 32bit desktop.
>
> RO    RA   SSZ   BSZ   StartSec            Size   Device
> rw  4096   512  4096          0   9748044840960 /dev/Storage/Main

.. and the problem may be as simple as just a missing initialization
of s_maxbytes for blockdev_superblock.

Does the attached trivial one-liner fix things for you?

Al, if it really is this simple, how come nobody even noticed?

Also, I do wonder if that check in do_generic_file_read() should just
unconditionally use MAX_LFS_FILESIZE, since the whole point there is
really about the index wrap-around, not about any underlying
filesystem limits per se.

And that's exactly what MAX_LFS_FILESIZE is - the maximum size that
fits in the page index.

              Linus
fs/block_dev.c | 1 +
 1 file changed, 1 insertion(+)
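
For context, a minimal sketch of the alternative Linus floats above: bounding the
buffered-read loop by the page-cache index limit rather than by the filesystem's
s_maxbytes. The surrounding function and variable names are assumptions for
illustration, not a quote of mm/filemap.c:

	/* Illustrative only: stop the read loop once the offset is past
	 * the largest offset the page cache can index, independent of
	 * the filesystem's own s_maxbytes. */
	if (unlikely(*ppos >= MAX_LFS_FILESIZE))
		break;		/* past the last indexable page */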

Comments

Doug Nazar Aug. 23, 2017, 7:53 p.m. UTC | #1
On 8/23/17 3:37 PM, Linus Torvalds wrote:
> On Wed, Aug 23, 2017 at 12:15 PM, Doug Nazar <nazard@nazar.ca> wrote:
>> The following commits cause short reads of block devices, however writes are
>> still allowed.
>>
>> c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
>> d05c5f7ba164 ("vfs,mm: fix return value of read() at s_maxbytes")
>>
>> When e2fsck sees this, it thinks it's a bad sector and tries to write a
>> block of nulls which overwrites the valid data.
> Hmm. Block devices shouldn't have issues with s_maxbytes, and I'm
> surprised that nobody has seen that before.
>
>> Device is LVM over 2 x RAID-5 on an old 32bit desktop.
>>
>> RO    RA   SSZ   BSZ   StartSec            Size   Device
>> rw  4096   512  4096          0   9748044840960 /dev/Storage/Main
> .. and the problem may be as simple as just a missing initialization
> of s_maxbytes for blockdev_superblock.
>
> Does the attached trivial one-liner fix things for you?
>
> Al, if it really is this simple, how come nobody even noticed?
>
> Also, I do wonder if that check in do_generic_file_read() should just
> unconditionally use MAX_LFS_FILESIZE, since the whole point there is
> really about the index wrap-around, not about any underlying
> filesystem limits per se.
>
> And that's exactly what MAX_LFS_FILESIZE is - the maximum size that
> fits in the page index.

It's compiling now, but I think it's already set to MAX_LFS_FILESIZE.

[  169.095127] ppos=80180006000, s_maxbytes=7ffffffffff, 
magic=0x62646576, type=bdev

Doug
Linus Torvalds Aug. 23, 2017, 8:13 p.m. UTC | #2
On Wed, Aug 23, 2017 at 12:53 PM, Doug Nazar <nazard@nazar.ca> wrote:
>
> It's compiling now, but I think it's already set to MAX_LFS_FILESIZE.
>
> [  169.095127] ppos=80180006000, s_maxbytes=7ffffffffff, magic=0x62646576,
> type=bdev

Oh, right you are - I'm much too used to 64-bit, where
MAX_LFS_FILESIZE is basically infinite, and was just assuming that it
was something like the UFS bug we had not that long ago that was due
to the 32-bit limit.

But yes, on 32-bit, we are limited by the 32-bit index into the page
cache, and we limit the index to 31 bits too, so we have
(PAGE_SIZE << 31) - 1, which is that 7ffffffffff.

And that also explains why people haven't seen it. You do need

 (a) 32-bit environment

 (b) a disk larger than that 8TB in size

The *hard* limit for the page cache on a 32-bit environment should
actually be (PAGE_SIZE << 32)-PAGE_SIZE (that final PAGE_SIZE
subtraction is to make sure we don't generate that page cache with
index -1), so having a disk that is 16TB or larger is not going to
work, but your disk is right in that 8TB-16TB hole that used to work
and was broken by that check.

Anyway, that makes me feel better. I should have looked at your disk
size more, now I at least understand why nobody noticed before.

So just throw away my patch. That's wrong, and garbage.

The *right* patch is likely just this instead:

  -#define MAX_LFS_FILESIZE       (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
  +#define MAX_LFS_FILESIZE       (((loff_t)PAGE_SIZE << BITS_PER_LONG)-PAGE_SIZE)

which should make MAX_LFS_FILESIZE be 0xffffffff000 and your disk size
should be ok.

                      Linus
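
For reference, the arithmetic above works out as follows on a 32-bit kernel with
4 KiB pages. This is a standalone illustration (ordinary userspace C, not kernel
code), with PAGE_SIZE and BITS_PER_LONG hard-coded:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const int64_t page_size = 4096;		/* 4 KiB pages */
	const int bits_per_long = 32;		/* 32-bit kernel */

	/* Old limit: page index capped at 31 bits, minus one byte. */
	int64_t old_max = (page_size << (bits_per_long - 1)) - 1;

	/* Hard limit: a full 32-bit index, minus one page so that page
	 * index -1 can never be generated. */
	int64_t hard_max = (page_size << bits_per_long) - page_size;

	printf("old MAX_LFS_FILESIZE  = %#llx (just under 8 TiB)\n",
	       (unsigned long long)old_max);	/* 0x7ffffffffff */
	printf("hard page-cache limit = %#llx (one page under 16 TiB)\n",
	       (unsigned long long)hard_max);	/* 0xffffffff000 */

	/* Doug's 9748044840960-byte device falls between the two values,
	 * i.e. exactly in the 8TB-16TB hole described above. */
	return 0;
}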
Andreas Dilger Aug. 23, 2017, 9:01 p.m. UTC | #3
On Aug 23, 2017, at 2:13 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Wed, Aug 23, 2017 at 12:53 PM, Doug Nazar <nazard@nazar.ca> wrote:
>> 
>> It's compiling now, but I think it's already set to MAX_LFS_FILESIZE.
>> 
>> [  169.095127] ppos=80180006000, s_maxbytes=7ffffffffff, magic=0x62646576,
>> type=bdev
> 
> Oh, right you are - I'm much too used to 64-bit, where
> MAX_LFS_FILESIZE is basically infinite, and was just assuming that it
> was something like the UFS bug we had not that long ago that was due
> to the 32-bit limit.
> 
> But yes, on 32-bit, we are limited by the 32-bit index into the page
> cache, and we limit the index to 31 bits too, so we have (PAGE_SIZE <<
> 31) -1, which is that 7ffffffffff.
> 
> And that also explains why people haven't seen it. You do need
> 
> (a) 32-bit environment
> 
> (b) a disk larger than that 8TB in size
> 
> The *hard* limit for the page cache on a 32-bit environment should
> actually be (PAGE_SIZE << 32)-PAGE_SIZE (that final PAGE_SIZE
> subtraction is to make sure we don't generate that page cache with
> index -1), so having a disk that is 16TB or larger is not going to
> work, but your disk is right in that 8TB-16TB hole that used to work
> and was broken by that check.
> 
> Anyway, that makes me feel better. I should have looked at your disk
> size more, now I at least understand why nobody noticed before.
> 
> So just throw away my patch. That's wrong, and garbage.
> 
> The *right* patch is likely just this instead:
> 
>  -#define MAX_LFS_FILESIZE       (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
>  +#define MAX_LFS_FILESIZE       (((loff_t)PAGE_SIZE << BITS_PER_LONG)-PAGE_SIZE)
> 
> which should make MAX_LFS_FILESIZE be 0xffffffff000 and your disk size
> should be ok.

Doug,
I noticed while checking for other implications of changing MAX_LFS_FILESIZE
that fs/jfs/super.c is also working around this limit.  If you are going
to submit a patch for this, it also makes sense to fix jfs_fill_super() to
use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:

	/* logical blocks are represented by 40 bits in pxd_t, etc.
	 * and page cache is indexed by long. */
	sb->s_maxbytes = min((u64)sb->s_blocksize << 40,
	                     MAX_LFS_FILESIZE);

It also looks like ocfs2_max_file_offset() is trying to avoid overflowing
the old 31-bit limit, and isn't using MAX_LFS_FILESIZE directly, so it will
now be wrong.  It looks like it could use "bitshift = 32; trim = bytes;",
but Joel or Mark should confirm.

Finally, there is a check in fs/super.c::mount_fs() that is verifying
s_maxbytes is not set too large, but this has been present since 2.6.32
and should probably be removed at this point, or changed to a BUG_ON()
(see commit 42cb56ae2ab for details).

Cheers, Andreas
Doug Nazar Aug. 24, 2017, 10:03 a.m. UTC | #4
On 8/23/17 4:13 PM, Linus Torvalds wrote:

> Oh, right you are - I'm much too used to 64-bit, where
> MAX_LFS_FILESIZE is basically infinite, and was just assuming that it
> was something like the UFS bug we had not that long ago that was due
> to the 32-bit limit.
>
> But yes, on 32-bit, we are limited by the 32-bit index into the page
> cache, and we limit the index to 31 bits too, so we have (PAGE_SIZE <<
> 31) -1, which is that 7ffffffffff.

Yeah, it's an old, organically grown storage server (install images, old
VMs, etc.) that I can't convince myself is worth upgrading.

> The *right* patch is likely just this instead:
>
>    -#define MAX_LFS_FILESIZE       (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
>    +#define MAX_LFS_FILESIZE       (((loff_t)PAGE_SIZE << BITS_PER_LONG)-PAGE_SIZE)
>
> which should make MAX_LFS_FILESIZE be 0xffffffff000 and your disk size
> should be ok.

That solves my issue. I'm curious whether that check should also be in the
write path. If the check had been there too, I wouldn't have ended up with
any corruption.

I'm not sure if anything over 16TB will create/assemble (or if anybody 
else is crazy like me).

Doug
Doug Nazar Aug. 24, 2017, 10:20 a.m. UTC | #5
On 8/23/17 5:01 PM, Andreas Dilger wrote:
> Doug,
> I noticed while checking for other implications of changing MAX_LFS_FILESIZE
> that fs/jfs/super.c is also working around this limit.  If you are going
> to submit a patch for this, it also makes sense to fix jfs_fill_super() to
> use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:
>
> 	/* logical blocks are represented by 40 bits in pxd_t, etc.
> 	 * and page cache is indexed by long. */
> 	sb->s_maxbytes = min((u64)sb->s_blocksize << 40,
> 	                     MAX_LFS_FILESIZE);
>
> It also looks like ocfs2_max_file_offset() is trying to avoid overflowing
> the old 31-bit limit, and isn't using MAX_LFS_FILESIZE directly, so it will
> now be wrong.  It looks like it could use "bitshift = 32; trim = bytes;",
> but Joel or Mark should confirm.
>
> Finally, there is a check in fs/super.c::mount_fs() that is verifying
> s_maxbytes is not set too large, but this has been present since 2.6.32
> and should probably be removed at this point, or changed to a BUG_ON()
> (see commit 42cb56ae2ab for details).

I don't have any issue trying to write patches for those, but I have no
domain knowledge in the area or any way to test them.

From a quick glance, jfs is locked to PSIZE (4096), so it should be ok.
OCFS looks a little complex, and since it's a shared fs, I'm a little hesitant.

The check in fs/super.c, maybe that should be:

sb->s_maxbytes > MAX_LFS_FILESIZE

Actually, I'm a little confused: the comment says unsigned, but loff_t looks
like it's long long.
Maybe cast to u64 and check greater than?

Doug
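
A hedged sketch of the kind of check Doug is asking about for fs/super.c::mount_fs():
comparing s_maxbytes as an unsigned value against the page-cache limit instead of
only checking for a negative value. The exact form and message below are assumptions,
not a committed patch:

	/* Illustrative only: warn if a filesystem sets s_maxbytes past
	 * what the page cache can index.  Comparing as u64 also catches
	 * a negative loff_t value. */
	WARN((u64)sb->s_maxbytes > (u64)MAX_LFS_FILESIZE,
	     "%s set sb->s_maxbytes too large (%llu)\n",
	     type->name, (unsigned long long)sb->s_maxbytes);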
Dave Kleikamp Aug. 24, 2017, 3:22 p.m. UTC | #6
On 08/24/2017 05:20 AM, Doug Nazar wrote:
> On 8/23/17 5:01 PM, Andreas Dilger wrote:
>> Doug,
>> I noticed while checking for other implications of changing
>> MAX_LFS_FILESIZE
>> that fs/jfs/super.c is also working around this limit.  If you are going
>> to submit a patch for this, it also makes sense to fix
>> jfs_fill_super() to
>> use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:
>>
>>     /* logical blocks are represented by 40 bits in pxd_t, etc.
>>      * and page cache is indexed by long. */
>>     sb->s_maxbytes = min((u64)sb->s_blocksize << 40,
>>                          MAX_LFS_FILESIZE);
>>
>> It also looks like ocfs2_max_file_offset() is trying to avoid overflowing
>> the old 31-bit limit, and isn't using MAX_LFS_FILESIZE directly, so it
>> will
>> now be wrong.  It looks like it could use "bitshift = 32; trim = bytes;",
>> but Joel or Mark should confirm.
>>
>> Finally, there is a check in fs/super.c::mount_fs() that is verifying
>> s_maxbytes is not set too large, but this has been present since 2.6.32
>> and should probably be removed at this point, or changed to a BUG_ON()
>> (see commit 42cb56ae2ab for details).
> 
> I don't have any issue trying to write patches for those, but I have no
> domain knowledge in the area or any way to test them.

If you want to wrap the jfs change into this, I will be happy to test it
for you, or I could take care of jfs with a separate patch if you'd prefer.

> 
> From a quick glance, jfs is locked to PSIZE (4096), so it should be ok.
> OCFS looks a little complex, and since it's a shared fs, I'm a little hesitant.
> 
> The check in fs/super.c, maybe that should be:
> 
> sb->s_maxbytes > MAX_LFS_FILESIZE
> 
> Actually, I'm a little confused: the comment says unsigned, but loff_t looks
> like it's long long.
> Maybe cast to u64 and check greater than?
> 
> Doug
>
Linus Torvalds Aug. 27, 2017, 7:47 p.m. UTC | #7
On Wed, Aug 23, 2017 at 2:01 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>
> Doug,
> I noticed while checking for other implications of changing MAX_LFS_FILESIZE
> that fs/jfs/super.c is also working around this limit.

Note to people: I just committed the patch to update MAX_LFS_FILESIZE.

I made it use the simpler (and clearer) calculation of

    ((loff_t)ULONG_MAX << PAGE_SHIFT)

for the 32-bit case, and I did *not* change any other users.

The jfs comment was a bit confusing, and talks about "wraps around" at
8TB, when that actually happens at 16TB. Yes, if you use a signed
number for the index, it does wrap at 8TB, but you really shouldn't
(and the code the jfs comment points to doesn't).

So I didn't touch that.  Nor did I touch:

> it also makes sense to fix jfs_fill_super() to
> use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:
>
>         /* logical blocks are represented by 40 bits in pxd_t, etc.
>          * and page cache is indexed by long. */
>         sb->s_maxbytes = min((u64)sb->s_blocksize << 40,
>                              MAX_LFS_FILESIZE);

which I agree should be modified. The new MAX_LFS_FILESIZE should be
the right size, but the difference now is only one page less one byte.

                Linus
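
For reference, with Linus's simpler calculation the 32-bit definition presumably
ends up along these lines (a sketch based on the description above; the 64-bit
branch is an assumption, not quoted from the commit):

#if BITS_PER_LONG == 32
/* ULONG_MAX page indices of PAGE_SIZE bytes each: one page short of
 * 16 TiB with 4 KiB pages, i.e. 0xffffffff000. */
#define MAX_LFS_FILESIZE	((loff_t)ULONG_MAX << PAGE_SHIFT)
#else
#define MAX_LFS_FILESIZE	((loff_t)LLONG_MAX)
#endif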
Dave Kleikamp Aug. 27, 2017, 7:54 p.m. UTC | #8
On 08/27/2017 02:47 PM, Linus Torvalds wrote:
> On Wed, Aug 23, 2017 at 2:01 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>
>> Doug,
>> I noticed while checking for other implications of changing MAX_LFS_FILESIZE
>> that fs/jfs/super.c is also working around this limit.
> 
> Note to people: I just committed the patch to update MAX_LFS_FILESIZE.
> 
> I made it use the simpler (and clearer) calculation of
> 
>     ((loff_t)ULONG_MAX << PAGE_SHIFT)
> 
> for the 32-bit case, and I did *not* change any other users.
> 
> The jfs comment was a bit confusing, and talks about "wraps around" at
> 8TB, when that actually happens at 16TB. Yes, if you use a signed
> number for the index, it does wrap at 8TB, but you really shouldn't
> (and the code the jfs comment points to doesn't).
> 
> So I didn't touch that.  Nor did I touch:
> 
>> it also makes sense to fix jfs_fill_super() to
>> use MAX_LFS_FILESIZE instead of JFS rolling its own, something like:
>>
>>         /* logical blocks are represented by 40 bits in pxd_t, etc.
>>          * and page cache is indexed by long. */
>>         sb->s_maxbytes = min((u64)sb->s_blocksize << 40,
>>                              MAX_LFS_FILESIZE);
> 
> which I agree should be modified. The new MAX_LFS_FILESIZE should be
> the right size, but the difference now is only one page less one byte.

I'll submit a patch to clean up jfs.

Thanks,
Shaggy

> 
>                 Linus
>
Doug Nazar Aug. 29, 2017, 10:40 a.m. UTC | #9
On 8/27/17 3:47 PM, Linus Torvalds wrote:
> Note to people: I just committed the patch to update MAX_LFS_FILESIZE.
> I made it use the simpler (and clearer) calculation of
>
>      ((loff_t)ULONG_MAX << PAGE_SHIFT)
>
> for the 32-bit case, and I did *not* change any other users.

Not a problem... I got caught trying to find out why 4.13 hangs on a
warm reset. It has really slowed down my testing, having to walk a couple
of floors for every reboot.

Doug
diff mbox

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9941dc8342df..4c3867c5298a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -830,6 +830,7 @@  void __init bdev_cache_init(void)
 	if (IS_ERR(bd_mnt))
 		panic("Cannot create bdev pseudo-fs");
 	blockdev_superblock = bd_mnt->mnt_sb;   /* For writeback */
+	blockdev_superblock->s_maxbytes = MAX_LFS_FILESIZE;
 }
 
 /*