diff mbox

Hadoop and Ceph client/mds view of modification time

Message ID 50ABE602.9090600@inktank.com (mailing list archive)
State New, archived
Headers show

Commit Message

Sam Lang Nov. 20, 2012, 8:20 p.m. UTC
On 11/20/2012 01:44 PM, Noah Watkins wrote:
> This is a description of the clock synchronization issue we are facing
> in Hadoop:
>
> Components of Hadoop use mtime as a versioning mechanism. Here is an
> example where Client B tests the expected 'version' of a file created
> by Client A:
>
>    Client A: create file, write data into file.
>    Client A: expected_mtime <-- lstat(file)
>    Client A: broadcast expected_mtime to client B
>    ...
>    Client B: mtime <-- lstat(file)
>    Client B: test expected_mtime == mtime

Here's a patch that might work to push the setattr out to the mds every 
time (the same as Sage's patch for getattr).  This isn't quite 
writeback, as it waits for the setattr at the server to complete before 
returning, but I think that's actually what you want in this case.  It 
needs to be enabled by setting client setattr writethru = true in the 
config.  Also, I haven't tested that it sends the setattr, just a basic 
test of functionality.

BTW, if its always client B's first stat of the file, you won't need 
Sage's patch.

-sam

attributes to the mds server
  // note: the max amount of "in flight" dirty data is roughly (max - 
target)
  OPTION(fuse_use_invalidate_cb, OPT_BOOL, false) // use fuse 2.8+ 
invalidate callback to keep page cache consistent
  OPTION(fuse_big_writes, OPT_BOOL, true)


>
> Since mtime may be set in Ceph by both client and MDS, inconsistent
> mtime view is possible when clocks are not adequately synchronized.
>
> Here is a test that reproduces the problem. In the following output,
> issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
> set to an hour earlier than the MDS node.
>
> nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
> Tue Nov 20 11:40:28 PST 2012           // MDS TIME
> local time: Tue Nov 20 10:42:47 2012  // Client TIME
> fstat time: Tue Nov 20 11:40:28 2012  // mtime seen after file
> creation (MDS time)
> lstat time: Tue Nov 20 10:42:47 2012  // mtime seen after file write
> (client time)
>
> Here is the code used to produce that output.
>
> #include <errno.h>
> #include <sys/fcntl.h>
> #include <sys/time.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <dirent.h>
> #include <sys/xattr.h>
> #include <stdio.h>
> #include <string.h>
> #include <assert.h>
> #include <cephfs/libcephfs.h>
> #include <time.h>
>
> int main(int argc, char **argv)
> {
>          struct stat st;
>          struct ceph_mount_info *cmount;
>          struct timeval tv;
>
>          /* setup */
>          ceph_create(&cmount, "admin");
>          ceph_conf_read_file(cmount, "/users/nwatkins/Projects/ceph.conf");
>          ceph_mount(cmount, "/");
>
>          /* print local time for reference */
>          gettimeofday(&tv, NULL);
>          printf("local time: %s", ctime(&tv.tv_sec));
>
>          /* create a file */
>          char buf[256];
>          sprintf(buf, "/somefile.%d", getpid());
>          int fd = ceph_open(cmount, buf, O_WRONLY|O_CREAT, 0);
>          assert(fd > 0);
>
>          /* get mtime for this new file */
>          memset(&st, 0, sizeof(st));
>          int ret = ceph_fstat(cmount, fd, &st);
>          assert(ret == 0);
>          printf("fstat time: %s", ctime(&st.st_mtime));
>
>          /* write some data into the file */
>          ret = ceph_write(cmount, fd, buf, sizeof(buf), -1);
>          assert(ret == sizeof(buf));
>          ceph_close(cmount, fd);
>
>          memset(&st, 0, sizeof(st));
>          ret = ceph_lstat(cmount, buf, &st);
>          assert(ret == 0);
>          printf("lstat time: %s", ctime(&st.st_mtime));
>
>          ceph_shutdown(cmount);
>          return 0;
> }
>
> Note that this output is currently using the short patch from
> http://marc.info/?l=ceph-devel&m=133178637520337&w=2 which forces
> getattr to always go to the MDS.
>
> diff --git a/src/client/Client.cc b/src/client/Client.cc
> index 4a9ae3c..2bb24b7 100644
> --- a/src/client/Client.cc
> +++ b/src/client/Client.cc
> @@ -3858,7 +3858,7 @@ int Client::readlink(const char *relpath, char
> *buf, loff_t \
> size)
>   int Client::_getattr(Inode *in, int mask, int uid, int gid)
>   {
> -  bool yes = in->caps_issued_mask(mask);
> +  bool yes = false; //in->caps_issued_mask(mask);
>
>     ldout(cct, 10) << "_getattr mask " << ccap_string(mask) << "
> issued=" << yes << \
> dendl;  if (yes)
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Noah Watkins Nov. 21, 2012, 4:43 p.m. UTC | #1
(Sorry for the dupe message. vger rejected due to HTML).

Thanks, I'll try this patch this morning.

Client B should perform a single stat after a notification from Client
A. But, won't Sage's patch still be required, since Client A needs the
MDS time to pass to Client B?

On Tue, Nov 20, 2012 at 12:20 PM, Sam Lang <sam.lang@inktank.com> wrote:
> On 11/20/2012 01:44 PM, Noah Watkins wrote:
>>
>> This is a description of the clock synchronization issue we are facing
>> in Hadoop:
>>
>> Components of Hadoop use mtime as a versioning mechanism. Here is an
>> example where Client B tests the expected 'version' of a file created
>> by Client A:
>>
>>    Client A: create file, write data into file.
>>    Client A: expected_mtime <-- lstat(file)
>>    Client A: broadcast expected_mtime to client B
>>    ...
>>    Client B: mtime <-- lstat(file)
>>    Client B: test expected_mtime == mtime
>
>
> Here's a patch that might work to push the setattr out to the mds every time
> (the same as Sage's patch for getattr).  This isn't quite writeback, as it
> waits for the setattr at the server to complete before returning, but I
> think that's actually what you want in this case.  It needs to be enabled by
> setting client setattr writethru = true in the config.  Also, I haven't
> tested that it sends the setattr, just a basic test of functionality.
>
> BTW, if its always client B's first stat of the file, you won't need Sage's
> patch.
>
> -sam
>
> diff --git a/src/client/Client.cc b/src/client/Client.cc
> index 8d4a5ac..a7dd8f7 100644
> --- a/src/client/Client.cc
> +++ b/src/client/Client.cc
> @@ -4165,6 +4165,7 @@ int Client::_getattr(Inode *in, int mask, int uid, int
> gid)
>
>  int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int
> gid)
>  {
> +  int orig_mask = mask;
>    int issued = in->caps_issued();
>
>    ldout(cct, 10) << "_setattr mask " << mask << " issued " <<
> ccap_string(issued) << dendl;
> @@ -4219,7 +4220,7 @@ int Client::_setattr(Inode *in, struct stat *attr, int
> mask, int uid, int gid)
>        mask &= ~(CEPH_SETATTR_MTIME|CEPH_SETATTR_ATIME);
>      }
>    }
> -  if (!mask)
> +  if (!cct->_conf->client_setattr_writethru && !mask)
>      return 0;
>
>    MetaRequest *req = new MetaRequest(CEPH_MDS_OP_SETATTR);
> @@ -4229,6 +4230,10 @@ int Client::_setattr(Inode *in, struct stat *attr,
> int mask, int uid, int gid)
>    req->set_filepath(path);
>    req->inode = in;
>
> +  // reset mask back to original if we're meant to do writethru
> +  if (cct->_conf->client_setattr_writethru)
> +    mask = orig_mask;
> +
>    if (mask & CEPH_SETATTR_MODE) {
>      req->head.args.setattr.mode = attr->st_mode;
>      req->inode_drop |= CEPH_CAP_AUTH_SHARED;
> diff --git a/src/common/config_opts.h b/src/common/config_opts.h
> index cc05095..51a2769 100644
> --- a/src/common/config_opts.h
> +++ b/src/common/config_opts.h
> @@ -178,6 +178,7 @@ OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 100)
> // MB * n  (dirty OR tx.
>  OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep
> this smallish)
>  OPTION(client_oc_max_dirty_age, OPT_DOUBLE, 5.0)      // max age in cache
> before writeback
>  OPTION(client_oc_max_objects, OPT_INT, 1000)      // max objects in cache
> +OPTION(client_setattr_writethru, OPT_BOOL, false)  // send the attributes
> to the mds server
>  // note: the max amount of "in flight" dirty data is roughly (max - target)
>  OPTION(fuse_use_invalidate_cb, OPT_BOOL, false) // use fuse 2.8+ invalidate
> callback to keep page cache consistent
>  OPTION(fuse_big_writes, OPT_BOOL, true)
>
>
>>
>> Since mtime may be set in Ceph by both client and MDS, inconsistent
>> mtime view is possible when clocks are not adequately synchronized.
>>
>> Here is a test that reproduces the problem. In the following output,
>> issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
>> set to an hour earlier than the MDS node.
>>
>> nwatkins@issdm-22:~$ ssh issdm-18 date && ./test
>> Tue Nov 20 11:40:28 PST 2012           // MDS TIME
>> local time: Tue Nov 20 10:42:47 2012  // Client TIME
>> fstat time: Tue Nov 20 11:40:28 2012  // mtime seen after file
>> creation (MDS time)
>> lstat time: Tue Nov 20 10:42:47 2012  // mtime seen after file write
>> (client time)
>>
>> Here is the code used to produce that output.
>>
>> #include <errno.h>
>> #include <sys/fcntl.h>
>> #include <sys/time.h>
>> #include <unistd.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <dirent.h>
>> #include <sys/xattr.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <assert.h>
>> #include <cephfs/libcephfs.h>
>> #include <time.h>
>>
>> int main(int argc, char **argv)
>> {
>>          struct stat st;
>>          struct ceph_mount_info *cmount;
>>          struct timeval tv;
>>
>>          /* setup */
>>          ceph_create(&cmount, "admin");
>>          ceph_conf_read_file(cmount,
>> "/users/nwatkins/Projects/ceph.conf");
>>          ceph_mount(cmount, "/");
>>
>>          /* print local time for reference */
>>          gettimeofday(&tv, NULL);
>>          printf("local time: %s", ctime(&tv.tv_sec));
>>
>>          /* create a file */
>>          char buf[256];
>>          sprintf(buf, "/somefile.%d", getpid());
>>          int fd = ceph_open(cmount, buf, O_WRONLY|O_CREAT, 0);
>>          assert(fd > 0);
>>
>>          /* get mtime for this new file */
>>          memset(&st, 0, sizeof(st));
>>          int ret = ceph_fstat(cmount, fd, &st);
>>          assert(ret == 0);
>>          printf("fstat time: %s", ctime(&st.st_mtime));
>>
>>          /* write some data into the file */
>>          ret = ceph_write(cmount, fd, buf, sizeof(buf), -1);
>>          assert(ret == sizeof(buf));
>>          ceph_close(cmount, fd);
>>
>>          memset(&st, 0, sizeof(st));
>>          ret = ceph_lstat(cmount, buf, &st);
>>          assert(ret == 0);
>>          printf("lstat time: %s", ctime(&st.st_mtime));
>>
>>          ceph_shutdown(cmount);
>>          return 0;
>> }
>>
>> Note that this output is currently using the short patch from
>> http://marc.info/?l=ceph-devel&m=133178637520337&w=2 which forces
>> getattr to always go to the MDS.
>>
>> diff --git a/src/client/Client.cc b/src/client/Client.cc
>> index 4a9ae3c..2bb24b7 100644
>> --- a/src/client/Client.cc
>> +++ b/src/client/Client.cc
>> @@ -3858,7 +3858,7 @@ int Client::readlink(const char *relpath, char
>> *buf, loff_t \
>> size)
>>   int Client::_getattr(Inode *in, int mask, int uid, int gid)
>>   {
>> -  bool yes = in->caps_issued_mask(mask);
>> +  bool yes = false; //in->caps_issued_mask(mask);
>>
>>     ldout(cct, 10) << "_getattr mask " << ccap_string(mask) << "
>> issued=" << yes << \
>> dendl;  if (yes)
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/src/client/Client.cc b/src/client/Client.cc
index 8d4a5ac..a7dd8f7 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -4165,6 +4165,7 @@  int Client::_getattr(Inode *in, int mask, int uid, 
int gid)

  int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, 
int gid)
  {
+  int orig_mask = mask;
    int issued = in->caps_issued();

    ldout(cct, 10) << "_setattr mask " << mask << " issued " << 
ccap_string(issued) << dendl;
@@ -4219,7 +4220,7 @@  int Client::_setattr(Inode *in, struct stat *attr, 
int mask, int uid, int gid)
        mask &= ~(CEPH_SETATTR_MTIME|CEPH_SETATTR_ATIME);
      }
    }
-  if (!mask)
+  if (!cct->_conf->client_setattr_writethru && !mask)
      return 0;

    MetaRequest *req = new MetaRequest(CEPH_MDS_OP_SETATTR);
@@ -4229,6 +4230,10 @@  int Client::_setattr(Inode *in, struct stat 
*attr, int mask, int uid, int gid)
    req->set_filepath(path);
    req->inode = in;

+  // reset mask back to original if we're meant to do writethru
+  if (cct->_conf->client_setattr_writethru)
+    mask = orig_mask;
+
    if (mask & CEPH_SETATTR_MODE) {
      req->head.args.setattr.mode = attr->st_mode;
      req->inode_drop |= CEPH_CAP_AUTH_SHARED;
diff --git a/src/common/config_opts.h b/src/common/config_opts.h
index cc05095..51a2769 100644
--- a/src/common/config_opts.h
+++ b/src/common/config_opts.h
@@ -178,6 +178,7 @@  OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 100) 
    // MB * n  (dirty OR tx.
  OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty 
(keep this smallish)
  OPTION(client_oc_max_dirty_age, OPT_DOUBLE, 5.0)      // max age in 
cache before writeback
  OPTION(client_oc_max_objects, OPT_INT, 1000)      // max objects in cache
+OPTION(client_setattr_writethru, OPT_BOOL, false)  // send the