
test osd on zfs

Message ID 516F10BE.6020103@llnl.gov (mailing list archive)
State New, archived

Commit Message

Brian Behlendorf April 17, 2013, 9:14 p.m. UTC
On 04/17/2013 01:16 PM, Mark Nelson wrote:
> I'll let Brian talk about the virtues of ZFS,

I think the virtues of ZFS have been discussed at length in various 
other forums.  But in short it brings some nice functionality to the 
table which may be useful to ceph and that's worth exploring.

>>>>
>>>> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
>>>> index c03764f..9f4d63c 100644
>>>> --- a/module/zfs/zpl_xattr.c
>>>> +++ b/module/zfs/zpl_xattr.c
>>>> @@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char
>>>> *name,
>>>> void *value,
>>>>                  goto out;
>>>>          }
>>>>
>>>> +       if (size < i_size_read(xip)) {
>>>> +               error = -ERANGE;
>>>> +               goto out;
>>>> +       }
>>>> +
>>>>          error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE,
>>>> 0, cr);
>>>>   out:
>>>>          if (xip)
>>>> @@ -263,7 +268,10 @@ zpl_xattr_get_sa(struct inode *ip, const char
>>>> *name,
>>>> void *value, size_t size)
>>>>          if (!size)
>>>>                  return (nv_size);
>>>>
>>>> -       memcpy(value, nv_value, MIN(size, nv_size));
>>>>
>>>> +       if (size < nv_size)
>>>> +               return (-ERANGE);
>>>
>>> Note, that zpl_xattr_get_sa() is called by __zpl_xattr_get() which can
>>> also be called by zpl_xattr_get() to test for xattr existence. So it
>>> needs to make sure that zpl_xattr_set() doesn't fail if getting
>>> -ERANGE.

This shouldn't be a problem.  The zpl_xattr_get() call from 
zpl_xattr_set() passes a NULL value and zero size, which prevents it 
from hitting the ERANGE error; it will instead return the xattr size, as 
expected.
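
For illustration only (not part of the original message), here is a minimal 
userspace sketch of the probe-then-read convention described above; the file 
path and xattr name are hypothetical. A zero-size getxattr() call reports just 
the attribute length, and an undersized buffer fails with ERANGE instead of 
being silently truncated, which is the behaviour the patch enforces:

/*
 * Illustrative sketch only: probe the xattr size with a zero-size call,
 * then read it with a buffer of exactly that size.  Path and attribute
 * name are made up for the example.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

int
main(void)
{
	const char *path = "/tank/osd0/testfile";	/* hypothetical file */
	const char *name = "user.ceph.example";	/* hypothetical xattr */
	ssize_t len;
	char *buf;

	/* Probe: value == NULL, size == 0 asks only for the length. */
	len = getxattr(path, name, NULL, 0);
	if (len < 0) {
		perror("getxattr(probe)");
		return (1);
	}

	buf = malloc(len);
	if (buf == NULL)
		return (1);

	/* Any buffer smaller than len would fail with errno == ERANGE. */
	if (getxattr(path, name, buf, len) < 0) {
		perror(errno == ERANGE ? "buffer too small" : "getxattr");
		free(buf);
		return (1);
	}

	printf("read %zd byte xattr\n", len);
	free(buf);
	return (0);
}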

>>>
>>>> +
>>>> +       memcpy(value, nv_value, size);
>>>>
>>>>          return (MIN(size, nv_size));
>>>
>>> No need for MIN() here.

Thanks for catching that.

I've opened a pull request on GitHub with the updated fix and kicked it 
off for automated testing.  It would be nice to verify that this resolves 
the crash.

https://github.com/zfsonlinux/zfs/pull/1409

diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
index c03764f..42a06ad 100644
--- a/module/zfs/zpl_xattr.c
+++ b/module/zfs/zpl_xattr.c
@@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char *name, void
                goto out;
        }

+       if (size < i_size_read(xip)) {
+               error = -ERANGE;
+               goto out;
+       }
+
        error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
 out:
        if (xip)
@@ -263,9 +268,12 @@ zpl_xattr_get_sa(struct inode *ip, const char *name, void *

        if (!size)
                return (nv_size);

-       memcpy(value, nv_value, MIN(size, nv_size));
+       if (size < nv_size)
+               return (-ERANGE);
+
+       memcpy(value, nv_value, size);

-       return (MIN(size, nv_size));
+       return (size);
 }

 static int

Comments

Henry Chang April 18, 2013, 2:20 a.m. UTC | #1
Sorry to go off topic, but I am wondering: if we use ZFS as the underlying
filesystem for the Ceph OSD and let the OSD filestore do sync writes, do we
still need the OSD journal?

2013/4/18 Brian Behlendorf <behlendorf1@llnl.gov>:
> <snip>
Stefan Priebe - Profihost AG April 18, 2013, 5:56 a.m. UTC | #2
On 17.04.2013 at 23:14, Brian Behlendorf <behlendorf1@llnl.gov> wrote:

> On 04/17/2013 01:16 PM, Mark Nelson wrote:
>> I'll let Brian talk about the virtues of ZFS,
> 
> I think the virtues of ZFS have been discussed at length in various other forums.  But in short it brings some nice functionality to the table which may be useful to ceph and that's worth exploring.
Sure, I know about the advantages of ZFS.

I was just thinking about how Ceph can benefit, and right now I have no idea. The OSDs should be single disks, so zpool/raidz features don't matter. Ceph does its own scrubbing and checksumming, and unlike with btrfs, Ceph does not know how to use snapshots with ZFS. That's why I'm asking.

Greets,
Stefan
Sage Weil April 18, 2013, 2:50 p.m. UTC | #3
On Thu, 18 Apr 2013, Stefan Priebe - Profihost AG wrote:
> Am 17.04.2013 um 23:14 schrieb Brian Behlendorf <behlendorf1@llnl.gov>:
> 
> > On 04/17/2013 01:16 PM, Mark Nelson wrote:
> >> I'll let Brian talk about the virtues of ZFS,
> > 
> > I think the virtues of ZFS have been discussed at length in various other forums.  But in short it brings some nice functionality to the table which may be useful to ceph and that's worth exploring.
> Sure, I know about the advantages of ZFS.
> 
> I was just thinking about how Ceph can benefit, and right now I have no 
> idea. The OSDs should be single disks, so zpool/raidz features don't 
> matter. Ceph does its own scrubbing and checksumming, and unlike with 
> btrfs, Ceph does not know how to use snapshots with ZFS. That's why I'm 
> asking.

The main things that come to mind:

- zfs checksumming
- ceph can eventually use zfs snapshots similarly to how it uses btrfs 
  snapshots to create stable checkpoints as journal reference points, 
  allowing parallel (instead of writeahead) journaling
- can use raidz beneath a single ceph-osd for better reliability (e.g.,
  2x replication * raidz instead of 3x replication)

ZFS doesn't have a clone function that we can use to enable efficient 
cephfs/rbd/rados snaps, but maybe this will motivate someone to implement 
one. :)

sage

Alex Elsayed April 18, 2013, 8:07 p.m. UTC | #4
Sage Weil wrote:

<snip>
> The main things that come to mind:
> 
> - zfs checksumming
> - ceph can eventually use zfs snapshots similarly to how it uses btrfs
>   snapshots to create stable checkpoints as journal reference points,
>   allowing parallel (instead of writeahead) journaling
> - can use raidz beneath a single ceph-osd for better reliability (e.g., 2x
>   * raidz instead of 3x replication)
> 
> ZFS doesn't have a clone function that we can use to enable efficient
> cephfs/rbd/rados snaps, but maybe this will motivate someone to implement
> one. :)

Since Btrfs has implemented raid5/6 support (meaning raidz is only a feature 
gain if you want 3x parity, which is unlikely to be useful for an OSD[1]), 
checksumming may be the only real benefit: ZFS supports sha256 (in addition 
to the non-cryptographic fletcher2/fletcher4), whereas btrfs only has crc32c 
at this time.

[1] A raidz3 with 4 disks is basically raid1, at which point you may as well 
use Ceph-level replication. And a 5-or-more-disk OSD strikes me as a 
questionable way to set it up, considering Ceph's strengths.

Jeff Mitchell April 19, 2013, 10:47 a.m. UTC | #5
Alex Elsayed wrote:
> Since Btrfs has implemented raid5/6 support (meaning raidz is only a feature
> gain if you want 3x parity, which is unlikely to be useful for an OSD[1]),
> checksumming may be the only real benefit: ZFS supports sha256 (in addition
> to the non-cryptographic fletcher2/fletcher4), whereas btrfs only has crc32c
> at this time.

Plus (in my real-world experience) *far* better robustness. If Ceph 
could use either and both had feature parity, I'd choose ZFS in a 
heartbeat. I've had too many simple Btrfs filesystems go corrupt, not 
even using any fancy RAID features.

I wasn't aware that Ceph was using btrfs' file-scope clone command. ZFS 
doesn't have that, although in theory with the new capabilities system 
it could be supported in one implementation without requiring an on-disk 
format change.
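
As a point of reference (an illustrative sketch, not something from the 
original thread): the per-file clone primitive being discussed is what is now 
exposed generically as the FICLONE ioctl, which on btrfs was the 
btrfs-specific BTRFS_IOC_CLONE at the time; the paths below are hypothetical.

/*
 * Sketch of the file-scope clone (reflink) primitive: share the source
 * file's extents with the destination instead of copying the data.
 * Paths are hypothetical; requires a filesystem that supports FICLONE.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* FICLONE */

int
main(void)
{
	int src = open("/mnt/btrfs/object.orig", O_RDONLY);
	int dst = open("/mnt/btrfs/object.clone", O_WRONLY | O_CREAT, 0644);

	if (src < 0 || dst < 0) {
		perror("open");
		return (1);
	}

	/* Clone the whole source file into the destination. */
	if (ioctl(dst, FICLONE, src) < 0) {
		perror("ioctl(FICLONE)");
		return (1);
	}

	close(src);
	close(dst);
	return (0);
}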

--Jeff

Patch

diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
index c03764f..42a06ad 100644
--- a/module/zfs/zpl_xattr.c
+++ b/module/zfs/zpl_xattr.c
@@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char *name, void
                 goto out;
         }

+       if (size < i_size_read(xip)) {
+               error = -ERANGE;
+               goto out;
+       }
+
         error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
  out: