diff mbox

[RFC,-] btrfs: do not write corrupted metadata blocks to disk

Message ID 286C424321A6496789A866288933406A@alyakaslap (mailing list archive)
State Superseded
Headers show

Commit Message

Alex Lyakas Feb. 21, 2016, 3:36 p.m. UTC
csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return error in this case, and also add two additional sanity
checks on the extent buffer header.

We had btrfs metadata corruption, and after looking at the logs we saw
that WARN_ON(found_start != start) has been triggered. We are still 
investigating
which component trashed the cache page which belonged to btrfs. But btrfs
only issued a warning, and as a result, the corrupted metadata block went to 
disk.

I think we should return an error in such case that the extent buffer 
doesn't look alright.
The caller up the chain may BUG_ON on this, for example flush_epd_write_bio 
will,
but it is better than to have a silent metadata corruption on disk.

Note: this patch has been properly tested on 3.18 kernel only.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
---
fs/btrfs/disk-io.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

Comments

Filipe Manana Feb. 22, 2016, 1:05 a.m. UTC | #1
On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <alex@zadarastorage.com> wrote:
> csum_dirty_buffer was issuing a warning in case the extent buffer
> did not look alright, but was still returning success.
> Let's return error in this case, and also add two additional sanity
> checks on the extent buffer header.
>
> We had btrfs metadata corruption, and after looking at the logs we saw
> that WARN_ON(found_start != start) has been triggered. We are still
> investigating

There's a warning for WARN_ON(found_start != start || !PageUptodate(page))

Are you sure it triggered only because of found_start != start and not
because of !PageUptodate(page) (or both)?

> which component trashed the cache page which belonged to btrfs. But btrfs
> only issued a warning, and as a result, the corrupted metadata block went to
> disk.
>
> I think we should return an error in such case that the extent buffer
> doesn't look alright.

I think so too.

> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio
> will,
> but it is better than to have a silent metadata corruption on disk.
>
> Note: this patch has been properly tested on 3.18 kernel only.
>
> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
> ---
> fs/btrfs/disk-io.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 4545e2e..701e706 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info
> *fs_info, struct page *page)
> {
>     u64 start = page_offset(page);
>     u64 found_start;
>     struct extent_buffer *eb;
>
>     eb = (struct extent_buffer *)page->private;
>     if (page != eb->pages[0])
>         return 0;
>     found_start = btrfs_header_bytenr(eb);
>     if (WARN_ON(found_start != start || !PageUptodate(page)))
> -        return 0;
> -    csum_tree_block(fs_info, eb, 0);
> +        return -EUCLEAN;
> +#ifdef CONFIG_BTRFS_ASSERT

A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do assertions.
I would remove this #ifdef ... #endif or do the memcmp calls inside ASSERT().

> +    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
> +                    (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
> +        return -EUCLEAN;
> +    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
> +                    (unsigned long)btrfs_header_chunk_tree_uuid(eb),
> +                    BTRFS_FSID_SIZE)))

This second comparison doesn't seem correct. Second argument to
memcmp_extent_buffer should be fs_info->chunk_tree_uuid, which
shouldn't be the same as the fsid (take a look at utils.c:make_btrfs()
in the tools, both uuids are generated by different calls to
uuid_generate()) - did you make your tests only before adding this
comparison?. Also you should use BTRFS_UUID_SIZE instead of
BTRFS_FSID_SIZE (even if both have the same value).

> +        return -EUCLEAN;
> +#endif
> +    if (csum_tree_block(fs_info, eb, 0))
> +        return -EUCLEAN;

I would just return the real error from csum_tree_block() - currently
it returns 1 for all possible failures instead of its real possible
failures: -ENOMEM or -EINVAL.

Thanks.

>     return 0;
> }
>
> static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
>                  struct extent_buffer *eb)
> {
>     struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>     u8 fsid[BTRFS_UUID_SIZE];
>     int ret = 1;
>
> --
> 1.9.1
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas Feb. 22, 2016, 9:46 a.m. UTC | #2
Thank you, Filipe, for your review.

On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana <fdmanana@kernel.org> wrote:
> On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <alex@zadarastorage.com> wrote:
>> csum_dirty_buffer was issuing a warning in case the extent buffer
>> did not look alright, but was still returning success.
>> Let's return error in this case, and also add two additional sanity
>> checks on the extent buffer header.
>>
>> We had btrfs metadata corruption, and after looking at the logs we saw
>> that WARN_ON(found_start != start) has been triggered. We are still
>> investigating
>
> There's a warning for WARN_ON(found_start != start || !PageUptodate(page))
>
> Are you sure it triggered only because of found_start != start and not
> because of !PageUptodate(page) (or both)?
The problem initially happened on kernel 3.8.13.  In this kernel, the
code looks like this:
         found_start = btrfs_header_bytenr(eb);
         if (found_start != start) {
                 WARN_ON(1);
                 return 0;
         }
         if (!PageUptodate(page)) {
                 WARN_ON(1);
                 return 0;
         }
(You can see it on
http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420)
The WARN_ON that we hit was on the found_start comparison.

>
>> which component trashed the cache page which belonged to btrfs. But btrfs
>> only issued a warning, and as a result, the corrupted metadata block went to
>> disk.
>>
>> I think we should return an error in such case that the extent buffer
>> doesn't look alright.
>
> I think so too.
>
>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio
>> will,
>> but it is better than to have a silent metadata corruption on disk.
>>
>> Note: this patch has been properly tested on 3.18 kernel only.
>>
>> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
>> ---
>> fs/btrfs/disk-io.c | 14 ++++++++++++--
>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 4545e2e..701e706 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info
>> *fs_info, struct page *page)
>> {
>>     u64 start = page_offset(page);
>>     u64 found_start;
>>     struct extent_buffer *eb;
>>
>>     eb = (struct extent_buffer *)page->private;
>>     if (page != eb->pages[0])
>>         return 0;
>>     found_start = btrfs_header_bytenr(eb);
>>     if (WARN_ON(found_start != start || !PageUptodate(page)))
>> -        return 0;
>> -    csum_tree_block(fs_info, eb, 0);
>> +        return -EUCLEAN;
>> +#ifdef CONFIG_BTRFS_ASSERT
>
> A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do assertions.
> I would remove this #ifdef ... #endif or do the memcmp calls inside ASSERT().
Agreed.

>
>> +    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>> +                    (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
>> +        return -EUCLEAN;
>> +    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>> +                    (unsigned long)btrfs_header_chunk_tree_uuid(eb),
>> +                    BTRFS_FSID_SIZE)))
>
> This second comparison doesn't seem correct. Second argument to
> memcmp_extent_buffer should be fs_info->chunk_tree_uuid, which
> shouldn't be the same as the fsid (take a look at utils.c:make_btrfs()
> in the tools, both uuids are generated by different calls to
> uuid_generate()) - did you make your tests only before adding this
> comparison?. Also you should use BTRFS_UUID_SIZE instead of
> BTRFS_FSID_SIZE (even if both have the same value).
Obviously, you are right. In the 3.18-based code that I fixed locally
here, the fix looks like this:

    if (found_start != start) {
        ZBTRFS_WARN(1, "FS[%s]: header_bytenr(eb)(%llu) !=
page->index<<PAGE_CACHE_SHIFT(%llu)", root->fs_info->sb->s_id,
found_start, start);
        return -EUCLEAN;
    }
    if (!PageUptodate(page)) {
        ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu page->index(%llu)
!PageUptodate", root->fs_info->sb->s_id, start, (u64)page->index);
        return -EUCLEAN;
    }
    if (memcmp_extent_buffer(eb, root->fs_info->fsid, (unsigned
long)btrfs_header_fsid(), BTRFS_FSID_SIZE)) {
        u8 hdr_fsid[BTRFS_FSID_SIZE] = {0};

        read_extent_buffer(eb, hdr_fsid, (unsigned
long)btrfs_header_fsid(), BTRFS_FSID_SIZE);
        ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu header->fsid["PRIX128"]
!= fs_info->fsid["PRIX128"]", root->fs_info->sb->s_id, start,
                    PRI_UUID(hdr_fsid), PRI_UUID(root->fs_info->fsid));
        return -EUCLEAN;
    }
    if (memcmp_extent_buffer(eb, root->fs_info->chunk_tree_uuid,
(unsigned long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE)) {
        u8 hdr_ch_uuid[BTRFS_UUID_SIZE] = {0};

        read_extent_buffer(eb, hdr_ch_uuid, (unsigned
long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE);
        ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu
header->chunk_tree_uuid["PRIX128"] !=
fs_info->chunk_tree_uuid["PRIX128"]", root->fs_info->sb->s_id, start,
                    PRI_UUID(hdr_ch_uuid),
PRI_UUID(root->fs_info->chunk_tree_uuid));
        return -EUCLEAN;
    }
    if (csum_tree_block(root, eb, 0) != 0) {
        ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu csum_tree_block()
failure", root->fs_info->sb->s_id, start);
        return -EUCLEAN;
    }

ZBTRFS_WARN, PRIX128/PRI_UUID are some custom macros that I added to
properly test the fix:
#define ZBTRFS_WARN(condition, format, ...) WARN(condition, format,
##__VA_ARGS__)
#define PRIX128
"%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
#define PRI_UUID(uuid)    ((u8*)(uuid))[0],  ((u8*)(uuid))[1],
((u8*)(uuid))[2],  ((u8*)(uuid))[3],    \
                        ((u8*)(uuid))[4],  ((u8*)(uuid))[5],
((u8*)(uuid))[6],  ((u8*)(uuid))[7],    \
                        ((u8*)(uuid))[8],  ((u8*)(uuid))[9],
((u8*)(uuid))[10], ((u8*)(uuid))[11],    \
                        ((u8*)(uuid))[12], ((u8*)(uuid))[13],
((u8*)(uuid))[14], ((u8*)(uuid))[15]


Then I just pulled the latest Linux tree and fixed it there, and
obviously did not check it properly.

>
>> +        return -EUCLEAN;
>> +#endif
>> +    if (csum_tree_block(fs_info, eb, 0))
>> +        return -EUCLEAN;
>
> I would just return the real error from csum_tree_block() - currently
> it returns 1 for all possible failures instead of its real possible
> failures: -ENOMEM or -EINVAL.
Returning a positive errno is a bit risky, no? Somebody might only
check ret<0, for example, and will miss the error (like
flush_epd_write_bio).
That's why I preferred to return a negative error, because I saw that
csum_tree_block returns 1.

Thanks,
Alex.


>
> Thanks.
>
>>     return 0;
>> }
>>
>> static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
>>                  struct extent_buffer *eb)
>> {
>>     struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>     u8 fsid[BTRFS_UUID_SIZE];
>>     int ret = 1;
>>
>> --
>> 1.9.1
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Feb. 22, 2016, 10:28 a.m. UTC | #3
On Mon, Feb 22, 2016 at 9:46 AM, Alex Lyakas <alex@zadarastorage.com> wrote:
> Thank you, Filipe, for your review.
>
> On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana <fdmanana@kernel.org> wrote:
>> On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <alex@zadarastorage.com> wrote:
>>> csum_dirty_buffer was issuing a warning in case the extent buffer
>>> did not look alright, but was still returning success.
>>> Let's return error in this case, and also add two additional sanity
>>> checks on the extent buffer header.
>>>
>>> We had btrfs metadata corruption, and after looking at the logs we saw
>>> that WARN_ON(found_start != start) has been triggered. We are still
>>> investigating
>>
>> There's a warning for WARN_ON(found_start != start || !PageUptodate(page))
>>
>> Are you sure it triggered only because of found_start != start and not
>> because of !PageUptodate(page) (or both)?
> The problem initially happened on kernel 3.8.13.  In this kernel, the
> code looks like this:
>          found_start = btrfs_header_bytenr(eb);
>          if (found_start != start) {
>                  WARN_ON(1);
>                  return 0;
>          }
>          if (!PageUptodate(page)) {
>                  WARN_ON(1);
>                  return 0;
>          }
> (You can see it on
> http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420)
> The WARN_ON that we hit was on the found_start comparison.

Ok, I see now that one of those useless cleanup patches merged both
conditions into a single if some time ago.

>
>>
>>> which component trashed the cache page which belonged to btrfs. But btrfs
>>> only issued a warning, and as a result, the corrupted metadata block went to
>>> disk.
>>>
>>> I think we should return an error in such case that the extent buffer
>>> doesn't look alright.
>>
>> I think so too.
>>
>>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio
>>> will,
>>> but it is better than to have a silent metadata corruption on disk.
>>>
>>> Note: this patch has been properly tested on 3.18 kernel only.
>>>
>>> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
>>> ---
>>> fs/btrfs/disk-io.c | 14 ++++++++++++--
>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 4545e2e..701e706 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info
>>> *fs_info, struct page *page)
>>> {
>>>     u64 start = page_offset(page);
>>>     u64 found_start;
>>>     struct extent_buffer *eb;
>>>
>>>     eb = (struct extent_buffer *)page->private;
>>>     if (page != eb->pages[0])
>>>         return 0;
>>>     found_start = btrfs_header_bytenr(eb);
>>>     if (WARN_ON(found_start != start || !PageUptodate(page)))
>>> -        return 0;
>>> -    csum_tree_block(fs_info, eb, 0);
>>> +        return -EUCLEAN;
>>> +#ifdef CONFIG_BTRFS_ASSERT
>>
>> A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do assertions.
>> I would remove this #ifdef ... #endif or do the memcmp calls inside ASSERT().
> Agreed.
>
>>
>>> +    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>>> +                    (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
>>> +        return -EUCLEAN;
>>> +    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>>> +                    (unsigned long)btrfs_header_chunk_tree_uuid(eb),
>>> +                    BTRFS_FSID_SIZE)))
>>
>> This second comparison doesn't seem correct. Second argument to
>> memcmp_extent_buffer should be fs_info->chunk_tree_uuid, which
>> shouldn't be the same as the fsid (take a look at utils.c:make_btrfs()
>> in the tools, both uuids are generated by different calls to
>> uuid_generate()) - did you make your tests only before adding this
>> comparison?. Also you should use BTRFS_UUID_SIZE instead of
>> BTRFS_FSID_SIZE (even if both have the same value).
> Obviously, you are right. In the 3.18-based code that I fixed locally
> here, the fix looks like this:
>
>     if (found_start != start) {
>         ZBTRFS_WARN(1, "FS[%s]: header_bytenr(eb)(%llu) !=
> page->index<<PAGE_CACHE_SHIFT(%llu)", root->fs_info->sb->s_id,
> found_start, start);
>         return -EUCLEAN;
>     }
>     if (!PageUptodate(page)) {
>         ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu page->index(%llu)
> !PageUptodate", root->fs_info->sb->s_id, start, (u64)page->index);
>         return -EUCLEAN;
>     }
>     if (memcmp_extent_buffer(eb, root->fs_info->fsid, (unsigned
> long)btrfs_header_fsid(), BTRFS_FSID_SIZE)) {
>         u8 hdr_fsid[BTRFS_FSID_SIZE] = {0};
>
>         read_extent_buffer(eb, hdr_fsid, (unsigned
> long)btrfs_header_fsid(), BTRFS_FSID_SIZE);
>         ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu header->fsid["PRIX128"]
> != fs_info->fsid["PRIX128"]", root->fs_info->sb->s_id, start,
>                     PRI_UUID(hdr_fsid), PRI_UUID(root->fs_info->fsid));
>         return -EUCLEAN;
>     }
>     if (memcmp_extent_buffer(eb, root->fs_info->chunk_tree_uuid,
> (unsigned long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE)) {
>         u8 hdr_ch_uuid[BTRFS_UUID_SIZE] = {0};
>
>         read_extent_buffer(eb, hdr_ch_uuid, (unsigned
> long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE);
>         ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu
> header->chunk_tree_uuid["PRIX128"] !=
> fs_info->chunk_tree_uuid["PRIX128"]", root->fs_info->sb->s_id, start,
>                     PRI_UUID(hdr_ch_uuid),
> PRI_UUID(root->fs_info->chunk_tree_uuid));
>         return -EUCLEAN;
>     }
>     if (csum_tree_block(root, eb, 0) != 0) {
>         ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu csum_tree_block()
> failure", root->fs_info->sb->s_id, start);
>         return -EUCLEAN;
>     }
>
> ZBTRFS_WARN, PRIX128/PRI_UUID are some custom macros that I added to
> properly test the fix:
> #define ZBTRFS_WARN(condition, format, ...) WARN(condition, format,
> ##__VA_ARGS__)
> #define PRIX128
> "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
> #define PRI_UUID(uuid)    ((u8*)(uuid))[0],  ((u8*)(uuid))[1],
> ((u8*)(uuid))[2],  ((u8*)(uuid))[3],    \
>                         ((u8*)(uuid))[4],  ((u8*)(uuid))[5],
> ((u8*)(uuid))[6],  ((u8*)(uuid))[7],    \
>                         ((u8*)(uuid))[8],  ((u8*)(uuid))[9],
> ((u8*)(uuid))[10], ((u8*)(uuid))[11],    \
>                         ((u8*)(uuid))[12], ((u8*)(uuid))[13],
> ((u8*)(uuid))[14], ((u8*)(uuid))[15]
>
>
> Then I just pulled the latest Linux tree and fixed it there, and
> obviously did not check it properly.
>
>>
>>> +        return -EUCLEAN;
>>> +#endif
>>> +    if (csum_tree_block(fs_info, eb, 0))
>>> +        return -EUCLEAN;
>>
>> I would just return the real error from csum_tree_block() - currently
>> it returns 1 for all possible failures instead of its real possible
>> failures: -ENOMEM or -EINVAL.
> Returning a positive errno is a bit risky, no? Somebody might only
> check ret<0, for example, and will miss the error (like
> flush_epd_write_bio).
> That's why I preferred to return a negative error, because I saw that
> csum_tree_block returns 1.

Right, what I meant before but didn't phrase it properly, was to make
csum_tree_block() returns a negative errno instead of 1 on failure.
And then just make csum_dirty_buffer() return whatever
csum_tree_block() returns as an error.

thanks

>
> Thanks,
> Alex.
>
>
>>
>> Thanks.
>>
>>>     return 0;
>>> }
>>>
>>> static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
>>>                  struct extent_buffer *eb)
>>> {
>>>     struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>>     u8 fsid[BTRFS_UUID_SIZE];
>>>     int ret = 1;
>>>
>>> --
>>> 1.9.1
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Lyakas March 10, 2016, 11:14 a.m. UTC | #4
Hello Filipe,

I have sent two patches addressing this issue.

When testing, I discovered that log tree blocks can sometimes carry
chunk tree UUID which is all zeros! Does this make sense? You can take
a look at a small debug-tree output demonstrating such phenomenon at
https://drive.google.com/file/d/0B9rmyUifdvMLbHBuSWU5dlVKNWc. Due to
this I did not include the chunk tree UUID check. Hoping very much
that fs UUID should always be valid for all tree blocks))

Thanks,
Alex.



On Mon, Feb 22, 2016 at 12:28 PM, Filipe Manana <fdmanana@kernel.org> wrote:
> On Mon, Feb 22, 2016 at 9:46 AM, Alex Lyakas <alex@zadarastorage.com> wrote:
>> Thank you, Filipe, for your review.
>>
>> On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana <fdmanana@kernel.org> wrote:
>>> On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <alex@zadarastorage.com> wrote:
>>>> csum_dirty_buffer was issuing a warning in case the extent buffer
>>>> did not look alright, but was still returning success.
>>>> Let's return error in this case, and also add two additional sanity
>>>> checks on the extent buffer header.
>>>>
>>>> We had btrfs metadata corruption, and after looking at the logs we saw
>>>> that WARN_ON(found_start != start) has been triggered. We are still
>>>> investigating
>>>
>>> There's a warning for WARN_ON(found_start != start || !PageUptodate(page))
>>>
>>> Are you sure it triggered only because of found_start != start and not
>>> because of !PageUptodate(page) (or both)?
>> The problem initially happened on kernel 3.8.13.  In this kernel, the
>> code looks like this:
>>          found_start = btrfs_header_bytenr(eb);
>>          if (found_start != start) {
>>                  WARN_ON(1);
>>                  return 0;
>>          }
>>          if (!PageUptodate(page)) {
>>                  WARN_ON(1);
>>                  return 0;
>>          }
>> (You can see it on
>> http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420)
>> The WARN_ON that we hit was on the found_start comparison.
>
> Ok, I see now that one of those useless cleanup patches merged both
> conditions into a single if some time ago.
>
>>
>>>
>>>> which component trashed the cache page which belonged to btrfs. But btrfs
>>>> only issued a warning, and as a result, the corrupted metadata block went to
>>>> disk.
>>>>
>>>> I think we should return an error in such case that the extent buffer
>>>> doesn't look alright.
>>>
>>> I think so too.
>>>
>>>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio
>>>> will,
>>>> but it is better than to have a silent metadata corruption on disk.
>>>>
>>>> Note: this patch has been properly tested on 3.18 kernel only.
>>>>
>>>> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
>>>> ---
>>>> fs/btrfs/disk-io.c | 14 ++++++++++++--
>>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>> index 4545e2e..701e706 100644
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info
>>>> *fs_info, struct page *page)
>>>> {
>>>>     u64 start = page_offset(page);
>>>>     u64 found_start;
>>>>     struct extent_buffer *eb;
>>>>
>>>>     eb = (struct extent_buffer *)page->private;
>>>>     if (page != eb->pages[0])
>>>>         return 0;
>>>>     found_start = btrfs_header_bytenr(eb);
>>>>     if (WARN_ON(found_start != start || !PageUptodate(page)))
>>>> -        return 0;
>>>> -    csum_tree_block(fs_info, eb, 0);
>>>> +        return -EUCLEAN;
>>>> +#ifdef CONFIG_BTRFS_ASSERT
>>>
>>> A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do assertions.
>>> I would remove this #ifdef ... #endif or do the memcmp calls inside ASSERT().
>> Agreed.
>>
>>>
>>>> +    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>>>> +                    (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
>>>> +        return -EUCLEAN;
>>>> +    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>>>> +                    (unsigned long)btrfs_header_chunk_tree_uuid(eb),
>>>> +                    BTRFS_FSID_SIZE)))
>>>
>>> This second comparison doesn't seem correct. Second argument to
>>> memcmp_extent_buffer should be fs_info->chunk_tree_uuid, which
>>> shouldn't be the same as the fsid (take a look at utils.c:make_btrfs()
>>> in the tools, both uuids are generated by different calls to
>>> uuid_generate()) - did you make your tests only before adding this
>>> comparison?. Also you should use BTRFS_UUID_SIZE instead of
>>> BTRFS_FSID_SIZE (even if both have the same value).
>> Obviously, you are right. In the 3.18-based code that I fixed locally
>> here, the fix looks like this:
>>
>>     if (found_start != start) {
>>         ZBTRFS_WARN(1, "FS[%s]: header_bytenr(eb)(%llu) !=
>> page->index<<PAGE_CACHE_SHIFT(%llu)", root->fs_info->sb->s_id,
>> found_start, start);
>>         return -EUCLEAN;
>>     }
>>     if (!PageUptodate(page)) {
>>         ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu page->index(%llu)
>> !PageUptodate", root->fs_info->sb->s_id, start, (u64)page->index);
>>         return -EUCLEAN;
>>     }
>>     if (memcmp_extent_buffer(eb, root->fs_info->fsid, (unsigned
>> long)btrfs_header_fsid(), BTRFS_FSID_SIZE)) {
>>         u8 hdr_fsid[BTRFS_FSID_SIZE] = {0};
>>
>>         read_extent_buffer(eb, hdr_fsid, (unsigned
>> long)btrfs_header_fsid(), BTRFS_FSID_SIZE);
>>         ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu header->fsid["PRIX128"]
>> != fs_info->fsid["PRIX128"]", root->fs_info->sb->s_id, start,
>>                     PRI_UUID(hdr_fsid), PRI_UUID(root->fs_info->fsid));
>>         return -EUCLEAN;
>>     }
>>     if (memcmp_extent_buffer(eb, root->fs_info->chunk_tree_uuid,
>> (unsigned long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE)) {
>>         u8 hdr_ch_uuid[BTRFS_UUID_SIZE] = {0};
>>
>>         read_extent_buffer(eb, hdr_ch_uuid, (unsigned
>> long)btrfs_header_chunk_tree_uuid(eb), BTRFS_UUID_SIZE);
>>         ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu
>> header->chunk_tree_uuid["PRIX128"] !=
>> fs_info->chunk_tree_uuid["PRIX128"]", root->fs_info->sb->s_id, start,
>>                     PRI_UUID(hdr_ch_uuid),
>> PRI_UUID(root->fs_info->chunk_tree_uuid));
>>         return -EUCLEAN;
>>     }
>>     if (csum_tree_block(root, eb, 0) != 0) {
>>         ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu csum_tree_block()
>> failure", root->fs_info->sb->s_id, start);
>>         return -EUCLEAN;
>>     }
>>
>> ZBTRFS_WARN, PRIX128/PRI_UUID are some custom macros that I added to
>> properly test the fix:
>> #define ZBTRFS_WARN(condition, format, ...) WARN(condition, format,
>> ##__VA_ARGS__)
>> #define PRIX128
>> "%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X%02X"
>> #define PRI_UUID(uuid)    ((u8*)(uuid))[0],  ((u8*)(uuid))[1],
>> ((u8*)(uuid))[2],  ((u8*)(uuid))[3],    \
>>                         ((u8*)(uuid))[4],  ((u8*)(uuid))[5],
>> ((u8*)(uuid))[6],  ((u8*)(uuid))[7],    \
>>                         ((u8*)(uuid))[8],  ((u8*)(uuid))[9],
>> ((u8*)(uuid))[10], ((u8*)(uuid))[11],    \
>>                         ((u8*)(uuid))[12], ((u8*)(uuid))[13],
>> ((u8*)(uuid))[14], ((u8*)(uuid))[15]
>>
>>
>> Then I just pulled the latest Linux tree and fixed it there, and
>> obviously did not check it properly.
>>
>>>
>>>> +        return -EUCLEAN;
>>>> +#endif
>>>> +    if (csum_tree_block(fs_info, eb, 0))
>>>> +        return -EUCLEAN;
>>>
>>> I would just return the real error from csum_tree_block() - currently
>>> it returns 1 for all possible failures instead of its real possible
>>> failures: -ENOMEM or -EINVAL.
>> Returning a positive errno is a bit risky, no? Somebody might only
>> check ret<0, for example, and will miss the error (like
>> flush_epd_write_bio).
>> That's why I preferred to return a negative error, because I saw that
>> csum_tree_block returns 1.
>
> Right, what I meant before but didn't phrase it properly, was to make
> csum_tree_block() returns a negative errno instead of 1 on failure.
> And then just make csum_dirty_buffer() return whatever
> csum_tree_block() returns as an error.
>
> thanks
>
>>
>> Thanks,
>> Alex.
>>
>>
>>>
>>> Thanks.
>>>
>>>>     return 0;
>>>> }
>>>>
>>>> static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
>>>>                  struct extent_buffer *eb)
>>>> {
>>>>     struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>>>     u8 fsid[BTRFS_UUID_SIZE];
>>>>     int ret = 1;
>>>>
>>>> --
>>>> 1.9.1
>>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4545e2e..701e706 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -508,22 +508,32 @@  static int csum_dirty_buffer(struct btrfs_fs_info 
*fs_info, struct page *page)
{
     u64 start = page_offset(page);
     u64 found_start;
     struct extent_buffer *eb;

     eb = (struct extent_buffer *)page->private;
     if (page != eb->pages[0])
         return 0;
     found_start = btrfs_header_bytenr(eb);
     if (WARN_ON(found_start != start || !PageUptodate(page)))
-        return 0;
-    csum_tree_block(fs_info, eb, 0);
+        return -EUCLEAN;
+#ifdef CONFIG_BTRFS_ASSERT
+    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
+                    (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
+        return -EUCLEAN;
+    if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
+                    (unsigned long)btrfs_header_chunk_tree_uuid(eb),
+                    BTRFS_FSID_SIZE)))
+        return -EUCLEAN;
+#endif
+    if (csum_tree_block(fs_info, eb, 0))
+        return -EUCLEAN;
     return 0;
}

static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
                  struct extent_buffer *eb)
{
     struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
     u8 fsid[BTRFS_UUID_SIZE];
     int ret = 1;