test osd on zfs

Message ID 516EF07E.4000909@llnl.gov (mailing list archive)
State New, archived
Commit Message

Brian Behlendorf April 17, 2013, 6:57 p.m. UTC
Here's a patch for the ERANGE error (lightly tested).  Sage's patch 
looks good but only covers one of two code paths for xattrs.  With zfs 
they may either be stored as a system attribute which is usually close 
to the dnode on disk (zfs set xattr=sa pool/dataset).  Or they may be 
stored in their own object which is how it's implemented on Solaris (zfs 
set xattr=on pool/dataset).  The second method is still the default for 
compatibility reasons even though it's slower.  Sage's patch only 
covered the SA case.
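
Why the -ERANGE return matters to a caller can be sketched as follows. This is a minimal userspace model and not the ZFS code: `toy_getxattr`, `read_xattr`, and the 32-byte `stored_value` are hypothetical names made up for illustration. It assumes the usual getxattr buffer contract, where a size of 0 reports the value length and a too-small buffer fails instead of truncating.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical 32-byte attribute value held by the toy filesystem. */
static const char stored_value[] = "0123456789abcdef0123456789abcdef";

/* Toy getter: a model of the getxattr buffer contract, not ZFS code. */
static ssize_t toy_getxattr(void *buf, size_t size)
{
	size_t len = sizeof(stored_value) - 1;

	if (size == 0)
		return (ssize_t)len;	/* size probe: report the length */
	if (size < len)
		return -ERANGE;		/* too small: refuse, never truncate */
	memcpy(buf, stored_value, len);
	return (ssize_t)len;
}

/*
 * Caller side: try a guessed buffer size first; on -ERANGE, probe for the
 * real size and retry. Returns a malloc'd, NUL-terminated copy or NULL.
 */
static char *read_xattr(size_t first_guess, size_t *out_len)
{
	size_t cap = first_guess ? first_guess : 1;
	char *buf = malloc(cap + 1);
	ssize_t n;

	assert(out_len != NULL);
	if (buf == NULL)
		return NULL;

	n = toy_getxattr(buf, cap);
	if (n == -ERANGE) {
		ssize_t need = toy_getxattr(NULL, 0);
		char *bigger;

		if (need <= 0) {
			free(buf);
			return NULL;
		}
		bigger = realloc(buf, (size_t)need + 1);
		if (bigger == NULL) {
			free(buf);
			return NULL;
		}
		buf = bigger;
		n = toy_getxattr(buf, (size_t)need);
	}
	if (n < 0) {
		free(buf);
		return NULL;
	}
	buf[n] = '\0';
	*out_len = (size_t)n;
	return buf;
}
```

If the getter silently truncated instead of returning -ERANGE, a caller written this way would accept the short value as complete and never retry.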

 > Well, looking at the code again it's not going to work, as setxattr is
 > going to fail with ERANGE.

Why?  We support an arbitrary number of maximum-sized xattrs (65536).
What am I missing here?

Incidentally, does anybody know of a good xattr test suite we could add
to our regression tests?

Thanks,
Brian

         if (xip)
@@ -263,7 +268,10 @@ zpl_xattr_get_sa(struct inode *ip, const char *name, void *value, size_t size)
         if (!size)
                 return (nv_size);

-       memcpy(value, nv_value, MIN(size, nv_size));
+       if (size < nv_size)
+               return (-ERANGE);
+
+       memcpy(value, nv_value, size);

         return (MIN(size, nv_size));
  }

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Yehuda Sadeh April 17, 2013, 7:07 p.m. UTC | #1
On Wed, Apr 17, 2013 at 11:57 AM, Brian Behlendorf <behlendorf1@llnl.gov> wrote:
>
> Here's a patch for the ERANGE error (lightly tested).  Sage's patch looks
> good but only covers one of two code paths for xattrs.  With zfs they may
> either be stored as a system attribute which is usually close to the dnode
> on disk (zfs set xattr=sa pool/dataset).  Or they may be stored in their own
> object which is how it's implemented on Solaris (zfs set xattr=on
> pool/dataset).  The second method is still the default for compatibility
> reasons even though it's slower.  Sage's patch only covered the SA case.
>
>
>> Well, looking at the code again it's not going to work, as setxattr is
>> going to fail with ERANGE.
>
> Why?  We support an arbitrary number of maximum-sized xattrs (65536). What
> am I missing here?
>
> Incidentally, does anybody know of a good xattr test suite we could add to
> our regression tests?
>
> Thanks,
> Brian
>
> diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
> index c03764f..9f4d63c 100644
> --- a/module/zfs/zpl_xattr.c
> +++ b/module/zfs/zpl_xattr.c
> @@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char *name, void *value,
>                 goto out;
>         }
>
> +       if (size < i_size_read(xip)) {
> +               error = -ERANGE;
> +               goto out;
> +       }
> +
>         error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
>  out:
>         if (xip)
> @@ -263,7 +268,10 @@ zpl_xattr_get_sa(struct inode *ip, const char *name, void *value, size_t size)
>         if (!size)
>                 return (nv_size);
>
> -       memcpy(value, nv_value, MIN(size, nv_size));
>
> +       if (size < nv_size)
> +               return (-ERANGE);

Note that zpl_xattr_get_sa() is called by __zpl_xattr_get(), which can
also be called by zpl_xattr_get() to test for xattr existence. So the
patch needs to make sure that zpl_xattr_set() doesn't fail when it gets
-ERANGE back.
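
One way to picture this concern is a set path that treats -ERANGE from an existence probe as proof that the attribute exists. The sketch below is a standalone model under that assumption and is not taken from zpl_xattr.c. `probe_to_exists`, `check_set_flags`, and the `TOY_XATTR_*` flags are hypothetical names that only mirror create/replace semantics.

```c
#include <assert.h>
#include <errno.h>

#define TOY_XATTR_CREATE  0x1	/* fail if the attribute already exists */
#define TOY_XATTR_REPLACE 0x2	/* fail if the attribute does not exist */

/*
 * Map an existence probe's return code to: 1 = exists, 0 = absent,
 * negative errno = real failure. -ERANGE only means the caller's buffer
 * was too small, so the attribute must exist.
 */
static int probe_to_exists(long rc)
{
	if (rc >= 0 || rc == -ERANGE)
		return 1;
	if (rc == -ENODATA)
		return 0;
	return (int)rc;
}

/* Decide whether a set may proceed, given the probe result and flags. */
static int check_set_flags(long probe_rc, int flags)
{
	int exists = probe_to_exists(probe_rc);

	if (exists < 0)
		return exists;		/* propagate genuine errors */
	assert(exists == 0 || exists == 1);

	if ((flags & TOY_XATTR_CREATE) && exists)
		return -EEXIST;
	if ((flags & TOY_XATTR_REPLACE) && !exists)
		return -ENODATA;
	return 0;
}
```

With this mapping, a probe that comes back with -ERANGE no longer aborts a plain set or a replace. Only errors other than -ENODATA and -ERANGE are propagated.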

> +
> +       memcpy(value, nv_value, size);
>
>         return (MIN(size, nv_size));

No need for MIN() here.
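
The reasoning can be sketched in isolation: once the size check has passed, size >= nv_size holds, so MIN(size, nv_size) is always nv_size. The helper below is a hypothetical userspace model (`copy_xattr_value` is not a ZFS function). Note that it copies nv_size bytes, which differs from the posted hunk, where memcpy() is given size. That is an assumption made here so that a caller buffer larger than the stored value cannot cause a read past the end of nv_value.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical model of the tail of an SA-style xattr getter. */
static ssize_t copy_xattr_value(void *value, size_t size,
    const void *nv_value, size_t nv_size)
{
	if (size == 0)
		return (ssize_t)nv_size;	/* size probe only */

	if (size < nv_size)
		return -ERANGE;			/* caller's buffer too small */

	/* From here on size >= nv_size, so MIN(size, nv_size) == nv_size. */
	assert(value != NULL && nv_value != NULL);
	memcpy(value, nv_value, nv_size);

	return (ssize_t)nv_size;
}
```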


Yehuda
Stefan Priebe - Profihost AG April 17, 2013, 7:09 p.m. UTC | #2
Sorry to disturb, but what is the reason / advantage of using zfs for ceph?

Greets,
Stefan
Mark Nelson April 17, 2013, 8:16 p.m. UTC | #3
I'll let Brian talk about the virtues of ZFS, but from my perspective 
it's an interesting option as there are a lot of folks banging on it for 
NFS servers and it has some interesting capabilities.  I have no idea 
how well it will work in practice, but if we can show that Ceph can run 
on it at least people can try it out and give us feedback.

Mark

On 04/17/2013 02:09 PM, Stefan Priebe wrote:
> Sorry to disturb, but what is the reason / advantage of using zfs for ceph?
>
> Greets,
> Stefan
Jeff Mitchell April 17, 2013, 8:49 p.m. UTC | #4
> On 04/17/2013 02:09 PM, Stefan Priebe wrote:
>>
>> Sorry to disturb, but what is the reason / advantage of using zfs for
>> ceph?

A few things off the top of my head:

1) Very mature filesystem with full xattr support (this bug
notwithstanding) and copy-on-write snapshots. While the port to Linux
sometimes has rough edges (though in my experience over the past few
years it has generally been very good), the main code from Solaris (and
now the Illumos project) is well-tested and very well regarded. Btrfs has many
of the same features, but in my real-world experience I've had
multiple btrfs filesystems go corrupt with very innocuous usage
patterns and across a variety of kernel versions. The zfsonlinux bugs
don't tend to be data-destructive, once data is written to it.
2) Very intelligent caching; also supports external devices (like
SSDs) for a level 2 cache. This speeds up reads dramatically.
3) Very robust error-checking. There are lots of stories of ZFS
finding bad memory, bad controllers, and bad hard drives because of
its checksumming (which you can optionally turn off for speed). If you
set up the OSDs such that each OSD is based off of a ZFS mirror, you
get these benefits locally. For some people, especially when heavy on
reads (due to the intelligent caching), a solution that knocks the
remote replication level down by one but uses local mirrors for OSDs
may provide good functionality and safety compromises.

--Jeff

Patch

diff --git a/module/zfs/zpl_xattr.c b/module/zfs/zpl_xattr.c
index c03764f..9f4d63c 100644
--- a/module/zfs/zpl_xattr.c
+++ b/module/zfs/zpl_xattr.c
@@ -225,6 +225,11 @@ zpl_xattr_get_dir(struct inode *ip, const char *name, void *value,
                 goto out;
         }

+       if (size < i_size_read(xip)) {
+               error = -ERANGE;
+               goto out;
+       }
+
         error = zpl_read_common(xip, value, size, 0, UIO_SYSSPACE, 0, cr);
  out: