fio rbd completions (Was: fio rbd hang for block sizes > 1M)

Message ID 544E6A8D.1040608@kernel.dk (mailing list archive)
State New, archived

Commit Message

Jens Axboe Oct. 27, 2014, 3:53 p.m. UTC
On 10/27/2014 09:45 AM, Ketor D wrote:
> The return code is 0 on success. I modified the code a bit and then fio runs very well.
> I think if you fix this bug, the patch will be nearly perfect!
> 
> 	ret = rbd_aio_get_return_value(fri->completion);
> 	//printf("ret=%ld\n", ret);
> 	//if (ret != (int) io_u->xfer_buflen) {
> 	if (ret != 0) {
> 		if (ret >= 0) {
> 			io_u->resid = io_u->xfer_buflen - ret;
> 			io_u->error = 0;
> 		} else
> 			io_u->error = ret;
> 	}

Weird, so it does not do partial completions, I assume. I modified -v8 to
take that into account; hopefully this just works out of the box.
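
For reference, the -v8 completion handler boils down to the following
(a condensed excerpt of the patch below; librbd reports 0 on success or
a negative error, so there is no partial transfer to account for):

static void _fio_rbd_finish_aiocb(rbd_completion_t comp, void *data)
{
	struct io_u *io_u = data;
	struct fio_rbd_iou *fri = io_u->engine_data;
	ssize_t ret;

	fri->io_complete = 1;

	/* 0 on success, < 0 on error; no short transfers to handle */
	ret = rbd_aio_get_return_value(fri->completion);
	if (ret < 0) {
		io_u->error = ret;
		io_u->resid = io_u->xfer_buflen;
	} else
		io_u->error = 0;
}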

What do the performance numbers look like for your sync test with this?

Comments

Ketor D Oct. 27, 2014, 4:20 p.m. UTC | #1
The v8 patch runs well.

The IOPS is 33032. If I just comment out the usleep(100) in master, I
can get 35245 IOPS.
The CPU usage for the two tests is the same, 120%.
So maybe this patch could be better!

Compared to master, this patch is perfect enough!


2014-10-27 23:53 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 10/27/2014 09:45 AM, Ketor D wrote:
>> The return code is 0 on success. I modified the code a bit and then fio runs very well.
>> I think if you fix this bug, the patch will be nearly perfect!
>>
>> 	ret = rbd_aio_get_return_value(fri->completion);
>> 	//printf("ret=%ld\n", ret);
>> 	//if (ret != (int) io_u->xfer_buflen) {
>> 	if (ret != 0) {
>> 		if (ret >= 0) {
>> 			io_u->resid = io_u->xfer_buflen - ret;
>> 			io_u->error = 0;
>> 		} else
>> 			io_u->error = ret;
>> 	}
>
> Weird, so it does not do partial completions, I assume. I modified -v8 to
> take that into account; hopefully this just works out of the box.
>
> What do the performance numbers look like for your sync test with this?
>
> --
> Jens Axboe
>
Jens Axboe Oct. 27, 2014, 4:55 p.m. UTC | #2
On 10/27/2014 10:20 AM, Ketor D wrote:
> The v8 patch runs well.
>
> The IOPS is 33032. If I just comment out the usleep(100) in master, I
> can get 35245 IOPS.
> The CPU usage for the two tests is the same, 120%.
> So maybe this patch could be better!
>
> Compared to master, this patch is perfect enough!

Agree, committed. I'll set up a local test here and see if we can't
recoup those last percentages. CPU usage may have been the same for your
test, but it will be more for others. A busy loop in there is not a good
idea.
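
For context, the reap paths being compared look roughly like this
(condensed from the removed and added hunks in the patch below). Master
scans the in-flight list and sleeps 100us between scans; -v8 scans once
and then blocks in librbd if nothing has completed:

	/* master: poll all in-flight io_us, sleep between scans */
	do {
		io_u_qiter(&td->io_u_all, io_u, i) {
			if (!(io_u->flags & IO_U_F_FLIGHT))
				continue;

			fov = io_u->engine_data;
			if (fov->io_complete) {
				fov->io_complete = 0;
				rbd_data->aio_events[events] = io_u;
				events++;
			}
		}
		if (events < min)
			usleep(100);
		else
			break;
	} while (1);

	/* -v8: scan once, only wait when a scan finds nothing new */
	do {
		this_events = rbd_iter_events(td, &events, min, wait);

		if (events >= min)
			break;
		if (this_events)
			continue;

		wait = 1;
	} while (1);

Once "wait" is set, rbd_iter_events() calls rbd_aio_wait_for_complete()
on still-pending completions instead of spinning or sleeping.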
Mark Kirkwood Oct. 27, 2014, 9:59 p.m. UTC | #3
On 28/10/14 05:20, Ketor D wrote:
> The v8 patch runs well.
>
> The IOPS is 33032. If I just comment out the usleep(100) in master, I
> can get 35245 IOPS.
> The CPU usage for the two tests is the same, 120%.
> So maybe this patch could be better!
>

Yeah, v8 is working for me.

I'm seeing it a bit slower for some blocksizes, but faster (or perhaps 
about the same within repeat measurement error) for others:

blocksize k |  patched iops | orig iops
------------+---------------+-----------
4           | 12265         | 11930
128         |  5800         |  7100
1024        |  1193         |  1196

Regards

Mark
Jens Axboe Oct. 27, 2014, 10:32 p.m. UTC | #4
On 10/27/2014 03:59 PM, Mark Kirkwood wrote:
> On 28/10/14 05:20, Ketor D wrote:
>> The v8 patch runs well.
>>
>> The IOPS is 33032. If I just comment out the usleep(100) in master, I
>> can get 35245 IOPS.
>> The CPU usage for the two tests is the same, 120%.
>> So maybe this patch could be better!
>>
> 
> Yeah, v8 is working for me.
> 
> I'm seeing it a bit slower for some blocksizes, but faster (or perhaps
> about the same within repeat measurement error) for others:
> 
> blocksize k |  patched iops | orig iops
> ------------+---------------+-----------
> 4           | 12265         | 11930
> 128         |  5800         |  7100
> 1024        |  1193         |  1196

As for most things, the difference should be in IOPS, not bandwidth. So
I would assume that the above are within normal variance, since 4k
should show the biggest difference, then drop off after that and match
at 128/1024k.
Mark Kirkwood Oct. 27, 2014, 11:21 p.m. UTC | #5
On 28/10/14 11:32, Jens Axboe wrote:
> On 10/27/2014 03:59 PM, Mark Kirkwood wrote:
>> On 28/10/14 05:20, Ketor D wrote:
>>> The v8 patch runs well.
>>>
>>> The IOPS is 33032. If I just comment out the usleep(100) in master, I
>>> can get 35245 IOPS.
>>> The CPU usage for the two tests is the same, 120%.
>>> So maybe this patch could be better!
>>>
>>
>> Yeah, v8 is working for me.
>>
>> I'm seeing it a bit slower for some blocksizes, but faster (or perhaps
>> about the same within repeat measurement error) for others:
>>
>> blocksize k |  patched iops | orig iops
>> ------------+---------------+-----------
>> 4           | 12265         | 11930
>> 128         |  5800         |  7100
>> 1024        |  1193         |  1196
>
> As for most things, the difference should be in IOPS, not bandwidth. So
> I would assume that the above are within normal variance, since 4k
> should show the biggest difference, then drop off after that and match
> at 128/1024k.
>


Yeah, I suspect the 4K numbers are the same because we are bottlenecked by
Ceph's small-blocksize performance, not fio itself. If Ketor has a setup
that can get higher IOPS @4K, it would be interesting to see his numbers
for patched vs orig!

Cheers

Mark
Ketor D Oct. 28, 2014, 3:23 a.m. UTC | #6
Hi Mark,
      I wish you could test my patch. I get the best performance using this patch.


2014-10-28 7:21 GMT+08:00 Mark Kirkwood <mark.kirkwood@catalyst.net.nz>:
> ng to see his numbers for patched vs orig!
Mark Kirkwood Oct. 28, 2014, 4:01 a.m. UTC | #7
On 28/10/14 16:23, Ketor D wrote:
> Hi Mark,
>        I wish you could test my patch. I get the best performance using this patch.
>
>

It is not clear-cut for me (tested reads only):

blocksize k | v8 patched iops | Ketor patch iops | orig iops
------------+-----------------+------------------+-----------
4           | 12265           | 11930            |  11516
128         |  5800           |  7100            |   6550
1024        |  1193           |  1196            |   1248

Cheers

Mark
Jens Axboe Oct. 28, 2014, 4:05 a.m. UTC | #8
> On Oct 27, 2014, at 9:23 PM, Ketor D <d.ketor@gmail.com> wrote:
> 
> Hi Mark,
>      I wish you could test my patch. I get the best performance using this patch.

There's no way we're doing a busy loop, sorry. As mentioned in a previous email, it'd be great if you would work off current git and potentially improve that. 

Ketor D Oct. 28, 2014, 4:49 a.m. UTC | #9
Agreed, the busy loop was only for testing.
I will try the current git.

Thanks!

2014-10-28 12:05 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
>
>> On Oct 27, 2014, at 9:23 PM, Ketor D <d.ketor@gmail.com> wrote:
>>
>> Hi Mark,
>>      I wish you could test my patch. I get the best performance using this patch.
>
> There's no way we're doing a busy loop, sorry. As mentioned in a previous email, it'd be great if you would work off current git and potentially improve that.
>
Jens Axboe Oct. 28, 2014, 3:14 p.m. UTC | #10
On 2014-10-27 22:49, Ketor D wrote:
> Agreed, the busy loop was only for testing.
> I will try the current git.

Committed two more rbd changes:

- Add support for rbd_invalidate_cache() (if it exists)
- Use rbd_aio_is_complete() instead of using fri->io_complete. The
latter would need some locking to ensure it's always seen, so it's
better to use the API-provided function to determine whether this IO is
done or not.

Unless we often hit the completion race, I would not expect this to make
much of a difference. But it's worth testing in any case, especially
since my two attempts at setting up ceph + rbd have failed miserably. So
I still can't test myself.
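
A rough sketch of what those two changes might look like (the
CONFIG_RBD_INVAL configure guard is an assumed name here; the actual
commits in the fio git tree are authoritative):

/* reap path: ask librbd directly whether the aio has finished */
static inline int fri_check_complete(struct rbd_data *rbd_data,
				     struct io_u *io_u,
				     unsigned int *events)
{
	struct fio_rbd_iou *fri = io_u->engine_data;

	if (rbd_aio_is_complete(fri->completion)) {
		fri->io_seen = 1;
		rbd_data->aio_events[*events] = io_u;
		(*events)++;

		rbd_aio_release(fri->completion);
		return 1;
	}

	return 0;
}

/* drop librbd's cache when fio invalidates the file, if the call exists */
static int fio_rbd_invalidate(struct thread_data *td, struct fio_file *f)
{
#ifdef CONFIG_RBD_INVAL	/* assumed configure-time check */
	struct rbd_data *rbd_data = td->io_ops->data;

	return rbd_invalidate_cache(rbd_data->image);
#else
	return 0;
#endif
}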
Ketor D Oct. 28, 2014, 3:49 p.m. UTC | #11
I cannot get the newly committed code from GitHub right now.
When I get the newest code, I will test.

2014-10-28 23:14 GMT+08:00 Jens Axboe <axboe@kernel.dk>:
> On 2014-10-27 22:49, Ketor D wrote:
>>
>> Agreed, the busy loop was only for testing.
>> I will try the current git.
>
>
> Committed two more rbd changes:
>
> - Add support for rbd_invalidate_cache() (if it exists)
> - Use rbd_aio_is_complete() instead of using fri->io_complete. The latter
> would need some locking to ensure it's always seen, so it's better to use
> the API-provided function to determine whether this IO is done or not.
>
> Unless we often hit the completion race, I would not expect this to make much
> of a difference. But it's worth testing in any case, especially since my two
> attempts at setting up ceph + rbd have failed miserably. So I still can't
> test myself.
>
> --
> Jens Axboe
>
Jens Axboe Oct. 28, 2014, 3:53 p.m. UTC | #12
On 2014-10-28 09:49, Ketor D wrote:
> I cannot get the newly committed code from GitHub right now.
> When I get the newest code, I will test.

GitHub is just a mirror; I push to:

git://git.kernel.dk/fio

The GitHub mirror is updated automatically every hour, if there are changes,
so it may lag by an hour. It should be there now, though.

Patch

diff --git a/engines/rbd.c b/engines/rbd.c
index 6fe87b8d010c..5160c32aedb0 100644
--- a/engines/rbd.c
+++ b/engines/rbd.c
@@ -11,7 +11,9 @@ 
 
 struct fio_rbd_iou {
 	struct io_u *io_u;
+	rbd_completion_t completion;
 	int io_complete;
+	int io_seen;
 };
 
 struct rbd_data {
@@ -30,35 +32,35 @@  struct rbd_options {
 
 static struct fio_option options[] = {
 	{
-	 .name     = "rbdname",
-	 .lname    = "rbd engine rbdname",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "RBD name for RBD engine",
-	 .off1     = offsetof(struct rbd_options, rbd_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name		= "rbdname",
+		.lname		= "rbd engine rbdname",
+		.type		= FIO_OPT_STR_STORE,
+		.help		= "RBD name for RBD engine",
+		.off1		= offsetof(struct rbd_options, rbd_name),
+		.category	= FIO_OPT_C_ENGINE,
+		.group		= FIO_OPT_G_RBD,
+	},
 	{
-	 .name     = "pool",
-	 .lname    = "rbd engine pool",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "Name of the pool hosting the RBD for the RBD engine",
-	 .off1     = offsetof(struct rbd_options, pool_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name     = "pool",
+		.lname    = "rbd engine pool",
+		.type     = FIO_OPT_STR_STORE,
+		.help     = "Name of the pool hosting the RBD for the RBD engine",
+		.off1     = offsetof(struct rbd_options, pool_name),
+		.category = FIO_OPT_C_ENGINE,
+		.group    = FIO_OPT_G_RBD,
+	},
 	{
-	 .name     = "clientname",
-	 .lname    = "rbd engine clientname",
-	 .type     = FIO_OPT_STR_STORE,
-	 .help     = "Name of the ceph client to access the RBD for the RBD engine",
-	 .off1     = offsetof(struct rbd_options, client_name),
-	 .category = FIO_OPT_C_ENGINE,
-	 .group    = FIO_OPT_G_RBD,
-	 },
+		.name     = "clientname",
+		.lname    = "rbd engine clientname",
+		.type     = FIO_OPT_STR_STORE,
+		.help     = "Name of the ceph client to access the RBD for the RBD engine",
+		.off1     = offsetof(struct rbd_options, client_name),
+		.category = FIO_OPT_C_ENGINE,
+		.group    = FIO_OPT_G_RBD,
+	},
 	{
-	 .name = NULL,
-	 },
+		.name = NULL,
+	},
 };
 
 static int _fio_setup_rbd_data(struct thread_data *td,
@@ -163,92 +165,99 @@  static void _fio_rbd_disconnect(struct rbd_data *rbd_data)
 	}
 }
 
-static void _fio_rbd_finish_write_aiocb(rbd_completion_t comp, void *data)
+static void _fio_rbd_finish_aiocb(rbd_completion_t comp, void *data)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
+	struct io_u *io_u = data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	ssize_t ret;
 
-	fio_rbd_iou->io_complete = 1;
+	fri->io_complete = 1;
 
-	/* if write needs to be verified - we should not release comp here
-	   without fetching the result */
+	/*
+	 * Looks like return value is 0 for success, or < 0 for
+	 * a specific error. So we have to assume that it can't do
+	 * partial completions.
+	 */
+	ret = rbd_aio_get_return_value(fri->completion);
+	if (ret < 0) {
+		io_u->error = ret;
+		io_u->resid = io_u->xfer_buflen;
+	} else
+		io_u->error = 0;
+}
 
-	rbd_aio_release(comp);
-	/* TODO handle error */
+static struct io_u *fio_rbd_event(struct thread_data *td, int event)
+{
+	struct rbd_data *rbd_data = td->io_ops->data;
 
-	return;
+	return rbd_data->aio_events[event];
 }
 
-static void _fio_rbd_finish_read_aiocb(rbd_completion_t comp, void *data)
+static inline int fri_check_complete(struct rbd_data *rbd_data,
+				     struct io_u *io_u,
+				     unsigned int *events)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	fio_rbd_iou->io_complete = 1;
+	if (fri->io_complete) {
+		fri->io_complete = 0;
+		fri->io_seen = 1;
+		rbd_data->aio_events[*events] = io_u;
+		(*events)++;
 
-	/* if read needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
-
-	/* TODO handle error */
+		rbd_aio_release(fri->completion);
+		return 1;
+	}
 
-	return;
+	return 0;
 }
 
-static void _fio_rbd_finish_sync_aiocb(rbd_completion_t comp, void *data)
+static int rbd_iter_events(struct thread_data *td, unsigned int *events,
+			   unsigned int min_evts, int wait)
 {
-	struct io_u *io_u = (struct io_u *)data;
-	struct fio_rbd_iou *fio_rbd_iou =
-	    (struct fio_rbd_iou *)io_u->engine_data;
-
-	fio_rbd_iou->io_complete = 1;
+	struct rbd_data *rbd_data = td->io_ops->data;
+	unsigned int this_events = 0;
+	struct io_u *io_u;
+	int i;
 
-	/* if sync needs to be verified - we should not release comp here
-	   without fetching the result */
-	rbd_aio_release(comp);
+	io_u_qiter(&td->io_u_all, io_u, i) {
+		struct fio_rbd_iou *fri = io_u->engine_data;
 
-	/* TODO handle error */
+		if (!(io_u->flags & IO_U_F_FLIGHT))
+			continue;
+		if (fri->io_seen)
+			continue;
 
-	return;
-}
+		if (fri_check_complete(rbd_data, io_u, events))
+			this_events++;
+		else if (wait) {
+			rbd_aio_wait_for_complete(fri->completion);
 
-static struct io_u *fio_rbd_event(struct thread_data *td, int event)
-{
-	struct rbd_data *rbd_data = td->io_ops->data;
+			if (fri_check_complete(rbd_data, io_u, events))
+				this_events++;
+		}
+		if (*events >= min_evts)
+			break;
+	}
 
-	return rbd_data->aio_events[event];
+	return this_events;
 }
 
 static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 			     unsigned int max, const struct timespec *t)
 {
-	struct rbd_data *rbd_data = td->io_ops->data;
-	unsigned int events = 0;
-	struct io_u *io_u;
-	int i;
-	struct fio_rbd_iou *fov;
+	unsigned int this_events, events = 0;
+	int wait = 0;
 
 	do {
-		io_u_qiter(&td->io_u_all, io_u, i) {
-			if (!(io_u->flags & IO_U_F_FLIGHT))
-				continue;
+		this_events = rbd_iter_events(td, &events, min, wait);
 
-			fov = (struct fio_rbd_iou *)io_u->engine_data;
-
-			if (fov->io_complete) {
-				fov->io_complete = 0;
-				rbd_data->aio_events[events] = io_u;
-				events++;
-			}
-
-		}
-		if (events < min)
-			usleep(100);
-		else
+		if (events >= min)
 			break;
+		if (this_events)
+			continue;
 
+		wait = 1;
 	} while (1);
 
 	return events;
@@ -256,17 +265,18 @@  static int fio_rbd_getevents(struct thread_data *td, unsigned int min,
 
 static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 {
-	int r = -1;
 	struct rbd_data *rbd_data = td->io_ops->data;
-	rbd_completion_t comp;
+	struct fio_rbd_iou *fri = io_u->engine_data;
+	int r = -1;
 
 	fio_ro_check(td, io_u);
 
+	fri->io_complete = 0;
+	fri->io_seen = 0;
+
 	if (io_u->ddir == DDIR_WRITE) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_write_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_WRITE failed.\n");
@@ -274,17 +284,17 @@  static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_write(rbd_data->image, io_u->offset,
-				  io_u->xfer_buflen, io_u->xfer_buf, comp);
+				  io_u->xfer_buflen, io_u->xfer_buf,
+				  fri->completion);
 		if (r < 0) {
 			log_err("rbd_aio_write failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
 	} else if (io_u->ddir == DDIR_READ) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_read_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_READ failed.\n");
@@ -292,27 +302,28 @@  static int fio_rbd_queue(struct thread_data *td, struct io_u *io_u)
 		}
 
 		r = rbd_aio_read(rbd_data->image, io_u->offset,
-				 io_u->xfer_buflen, io_u->xfer_buf, comp);
+				 io_u->xfer_buflen, io_u->xfer_buf,
+				 fri->completion);
 
 		if (r < 0) {
 			log_err("rbd_aio_read failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
 	} else if (io_u->ddir == DDIR_SYNC) {
-		r = rbd_aio_create_completion(io_u,
-					      (rbd_callback_t)
-					      _fio_rbd_finish_sync_aiocb,
-					      &comp);
+		r = rbd_aio_create_completion(io_u, _fio_rbd_finish_aiocb,
+						&fri->completion);
 		if (r < 0) {
 			log_err
 			    ("rbd_aio_create_completion for DDIR_SYNC failed.\n");
 			goto failed;
 		}
 
-		r = rbd_aio_flush(rbd_data->image, comp);
+		r = rbd_aio_flush(rbd_data->image, fri->completion);
 		if (r < 0) {
 			log_err("rbd_flush failed.\n");
+			rbd_aio_release(fri->completion);
 			goto failed;
 		}
 
@@ -344,7 +355,6 @@  static int fio_rbd_init(struct thread_data *td)
 
 failed:
 	return 1;
-
 }
 
 static void fio_rbd_cleanup(struct thread_data *td)
@@ -379,8 +389,9 @@  static int fio_rbd_setup(struct thread_data *td)
 	}
 	td->io_ops->data = rbd_data;
 
-	/* librbd does not allow us to run first in the main thread and later in a
-	 * fork child. It needs to be the same process context all the time. 
+	/* librbd does not allow us to run first in the main thread and later
+	 * in a fork child. It needs to be the same process context all the
+	 * time. 
 	 */
 	td->o.use_thread = 1;
 
@@ -439,22 +450,21 @@  static int fio_rbd_invalidate(struct thread_data *td, struct fio_file *f)
 
 static void fio_rbd_io_u_free(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o = io_u->engine_data;
+	struct fio_rbd_iou *fri = io_u->engine_data;
 
-	if (o) {
+	if (fri) {
 		io_u->engine_data = NULL;
-		free(o);
+		free(fri);
 	}
 }
 
 static int fio_rbd_io_u_init(struct thread_data *td, struct io_u *io_u)
 {
-	struct fio_rbd_iou *o;
+	struct fio_rbd_iou *fri;
 
-	o = malloc(sizeof(*o));
-	o->io_complete = 0;
-	o->io_u = io_u;
-	io_u->engine_data = o;
+	fri = calloc(1, sizeof(*fri));
+	fri->io_u = io_u;
+	io_u->engine_data = fri;
 	return 0;
 }