Message ID | 514B346D.1010403@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
> What we really need is the right bits in the right places > to let the administrator know if a device looks like it might > be corrupt & in need of fixing, vs. ignoring it altogether. -- > 2. the current git btrfs-show and btrfs fi show both output > *different* devices for device with UUID > b5dc52bd-21bf-4173-8049-d54d88c82240, and they're both wrong. -- here per Jon experience showing stale btrfs on /dev/sdb (note that good btrfs is on /dev/sdb3) is wrong. More over when we show something like /dev/sdb and /dev/sdb3 both have two different btrfs, its a bit scary. I would vote for ignoring it altogether since we have btrfs-find-root and btrfs-select-super to help find the backup SB for recovery if needed. Thanks. -Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 21, 2013 at 11:25 AM, Eric Sandeen <sandeen@redhat.com> wrote: > On 3/21/13 10:29 AM, Jon Nelson wrote: >> On Thu, Mar 21, 2013 at 10:11 AM, Eric Sandeen <sandeen@redhat.com> wrote: >>> On 3/21/13 10:04 AM, Jon Nelson wrote: >> ... >>>> 2. the current git btrfs-show and btrfs fi show both output >>>> *different* devices for device with UUID >>>> b5dc52bd-21bf-4173-8049-d54d88c82240, and they're both wrong. >>> >>> does blkid output find that uuid anywhere? >>> >>> Since you're working in git, can you maybe do a little bisecting >>> to find out when it changed? Should be a fairly quick test? >> >> blkid does /not/ report that uuid anywhere. >> >> git bisect, if I did it correctly, says: >> >> >> 6eba9002956ac40db87d42fb653a0524dc568810 is the first bad commit >> commit 6eba9002956ac40db87d42fb653a0524dc568810 >> Author: Goffredo Baroncelli <kreijack@inwind.it> >> Date: Tue Sep 4 19:59:26 2012 +0200 >> >> Correct un-initialized fsid variable >> >> :100644 100644 b21a87f827a6250da45f2fb6a1c3a6b651062243 >> 03952051b5e25e0b67f0f910c84d93eb90de8480 M disk-io.c > > Ok, I think this is another case of greedily scanning stale > backup superblocks (did you ever have btrfs on the whole sda > or sdb?) > > btrfs_read_dev_super() currently tries to scan all 3 superblocks > (primary & 2 backups). I'm guessing that you have some stale > backup superblocks on sda and/or sdb. > > Before the above commit, if the first sb didn't look valid, > it'd skip to the 2nd. If the 2nd (stale) one looked OK, > it'd compare its fsid to an uniniitialized variable (fsid) > which would fail (since the "fsid" contents were random.) > Same for the 3rd backup if found, and eventually it'd return > -1 as failure and not report the device. > > After the commit, it'd skip the first invalid sb as well. > But this time, it takes the fsid from the 2nd superblock as > "good" and makes it through the loop thinking that it's found > something valid. Hence the report of a device which you didn't > expect even though the first superblock is indeed wiped out. > > There are some patches floating around to stop this > backup superblock scanning altogether. > > This might fix it for you; it basically returns failure > if any superblock on the device is found to be bad. > > What we really need is the right bits in the right places > to let the administrator know if a device looks like it might > be corrupt & in need of fixing, vs. ignoring it altogether. > > Not sure if this is something we want upstream but you could > test if if you like. I did test and it appears to resolve the issue for me. Thank you!
On 3/22/13 8:59 AM, Jon Nelson wrote: > On Thu, Mar 21, 2013 at 11:25 AM, Eric Sandeen <sandeen@redhat.com> wrote: >> On 3/21/13 10:29 AM, Jon Nelson wrote: >>> On Thu, Mar 21, 2013 at 10:11 AM, Eric Sandeen <sandeen@redhat.com> wrote: >>>> On 3/21/13 10:04 AM, Jon Nelson wrote: >>> ... >>>>> 2. the current git btrfs-show and btrfs fi show both output >>>>> *different* devices for device with UUID >>>>> b5dc52bd-21bf-4173-8049-d54d88c82240, and they're both wrong. >>>> >>>> does blkid output find that uuid anywhere? >>>> >>>> Since you're working in git, can you maybe do a little bisecting >>>> to find out when it changed? Should be a fairly quick test? >>> >>> blkid does /not/ report that uuid anywhere. >>> >>> git bisect, if I did it correctly, says: >>> >>> >>> 6eba9002956ac40db87d42fb653a0524dc568810 is the first bad commit >>> commit 6eba9002956ac40db87d42fb653a0524dc568810 >>> Author: Goffredo Baroncelli <kreijack@inwind.it> >>> Date: Tue Sep 4 19:59:26 2012 +0200 >>> >>> Correct un-initialized fsid variable >>> >>> :100644 100644 b21a87f827a6250da45f2fb6a1c3a6b651062243 >>> 03952051b5e25e0b67f0f910c84d93eb90de8480 M disk-io.c >> >> Ok, I think this is another case of greedily scanning stale >> backup superblocks (did you ever have btrfs on the whole sda >> or sdb?) >> >> btrfs_read_dev_super() currently tries to scan all 3 superblocks >> (primary & 2 backups). I'm guessing that you have some stale >> backup superblocks on sda and/or sdb. >> >> Before the above commit, if the first sb didn't look valid, >> it'd skip to the 2nd. If the 2nd (stale) one looked OK, >> it'd compare its fsid to an uniniitialized variable (fsid) >> which would fail (since the "fsid" contents were random.) >> Same for the 3rd backup if found, and eventually it'd return >> -1 as failure and not report the device. >> >> After the commit, it'd skip the first invalid sb as well. >> But this time, it takes the fsid from the 2nd superblock as >> "good" and makes it through the loop thinking that it's found >> something valid. Hence the report of a device which you didn't >> expect even though the first superblock is indeed wiped out. >> >> There are some patches floating around to stop this >> backup superblock scanning altogether. >> >> This might fix it for you; it basically returns failure >> if any superblock on the device is found to be bad. >> >> What we really need is the right bits in the right places >> to let the administrator know if a device looks like it might >> be corrupt & in need of fixing, vs. ignoring it altogether. >> >> Not sure if this is something we want upstream but you could >> test if if you like. > > I did test and it appears to resolve the issue for me. > Thank you! Thanks. I need to get back to finding the right overall solution here, but have been busy elsewhere. It's on the list ;) Anand is looking at it too and has some patches on the list. -Eric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/disk-io.c b/disk-io.c index a9fd374..3aca16f 100644 --- a/disk-io.c +++ b/disk-io.c @@ -1104,7 +1104,6 @@ struct btrfs_root *open_ctree_fd(int fp, const char *path, u64 sb_bytenr, int btrfs_read_dev_super(int fd, struct btrfs_super_block *sb, u64 sb_bytenr) { u8 fsid[BTRFS_FSID_SIZE]; - int fsid_is_initialized = 0; struct btrfs_super_block buf; int i; int ret; @@ -1130,24 +1129,22 @@ int btrfs_read_dev_super(int fd, struct btrfs_super_block *sb, u64 sb_bytenr) if (ret < sizeof(buf)) break; - if (btrfs_super_bytenr(&buf) != bytenr ) - continue; - /* if magic is NULL, the device was removed */ - if (buf.magic == 0 && i == 0) + /* Bail out on any invalid superblock */ + if (btrfs_super_bytenr(&buf) != bytenr || + buf.magic != cpu_to_le64(BTRFS_MAGIC)) return -1; - if (buf.magic != cpu_to_le64(BTRFS_MAGIC)) - continue; - if (!fsid_is_initialized) { + /* if sb 0 looks valid use its fsid as trusted */ + if (i == 0) memcpy(fsid, buf.fsid, sizeof(fsid)); - fsid_is_initialized = 1; - } else if (memcmp(fsid, buf.fsid, sizeof(fsid))) { + else if (memcmp(fsid, buf.fsid, sizeof(fsid))) { /* * the superblocks (the original one and * its backups) contain data of different - * filesystems -> the super cannot be trusted + * filesystems -> the super cannot be trusted, + * hence the fs cannot be trusted. */ - continue; + return -1; } if (btrfs_super_generation(&buf) > transid) {