[00/29] RFC: iov_iter: Switch to using an ops table

Message ID 160596800145.154728.7192318545120181269.stgit@warthog.procyon.org.uk (mailing list archive)

Message

David Howells Nov. 21, 2020, 2:13 p.m. UTC
Hi Pavel, Willy, Jens, Al,

I had a go at switching the iov_iter stuff away from using a type bitmask to
using an ops table to get rid of the if-if-if-if chains that are all over
the place.  After I pushed it, someone pointed me at Pavel's two patches.

I have another iterator class that I want to add - which would lengthen the
if-if-if-if chains.  A lot of the time, there's a conditional clause at the
beginning of a function that just jumps off to a type-specific handler or
to reject the operation for that type.  An ops table can just point to that
instead.
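
To illustrate the shape of that change, here is a minimal userspace sketch. It is not code from the series: `struct demo_iter`, `struct demo_iter_ops` and every function name in it are hypothetical, and the real `struct iov_iter` carries far more state. The point is only that each iterator class supplies its own handler (including one that simply rejects the operation), so the generic entry point makes a single indirect call instead of walking a chain of type tests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified iterator; NOT the kernel's struct iov_iter. */
struct demo_iter;

struct demo_iter_ops {
	/* Copy up to len bytes out of the iterator; return bytes copied. */
	size_t (*copy_from)(struct demo_iter *it, void *dst, size_t len);
};

struct demo_iter {
	const struct demo_iter_ops *ops;	/* takes the place of a type bitmask */
	const char *data;			/* backing buffer */
	size_t count;				/* bytes remaining */
};

/* Type-specific handler for a flat memory buffer. */
static size_t flat_copy_from(struct demo_iter *it, void *dst, size_t len)
{
	if (len > it->count)
		len = it->count;
	memcpy(dst, it->data, len);
	it->data += len;
	it->count -= len;
	return len;
}

/* Type-specific handler for a class that does not support this operation. */
static size_t discard_copy_from(struct demo_iter *it, void *dst, size_t len)
{
	(void)it;
	(void)dst;
	(void)len;
	return 0;
}

static const struct demo_iter_ops flat_ops = { .copy_from = flat_copy_from };
static const struct demo_iter_ops discard_ops = { .copy_from = discard_copy_from };

/* Generic entry point: one indirect call, no if-if-if-if chain. */
static size_t demo_copy_from_iter(struct demo_iter *it, void *dst, size_t len)
{
	assert(it && it->ops && it->ops->copy_from);
	return it->ops->copy_from(it, dst, len);
}
```

Adding another iterator class in this scheme means adding another ops table rather than another branch in every function. The trade-off is the cost of the indirect call itself, which is why the benchmarking question matters.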

As far as I can tell, there's no difference in performance in most cases,
though doing AFS-based kernel compiles appears to take less time (down from
3m20 to 2m50), which might make sense as that uses iterators a lot - but
there are too many variables in that for that to be a good benchmark (I'm
dealing with a remote server, for a start).

Can someone recommend a good way to benchmark this properly?  The problem
is that the difference this makes relative to the amount of time taken to
actually do I/O is tiny.

I've tried TCP transfers using the following sink program:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <netinet/in.h>
	#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
	static unsigned char buffer[512 * 1024] __attribute__((aligned(4096)));
	int main(int argc, char *argv[])
	{
		struct sockaddr_in sin = { .sin_family = AF_INET, .sin_port = htons(5555) };
		int sfd, afd;
		sfd = socket(AF_INET, SOCK_STREAM, 0);
		OSERROR(sfd, "socket");
		OSERROR(bind(sfd, (struct sockaddr *)&sin, sizeof(sin)), "bind");
		OSERROR(listen(sfd, 1), "listen");
		for (;;) {
			afd = accept(sfd, NULL, NULL);
			if (afd != -1) {
				while (read(afd, buffer, sizeof(buffer)) > 0) {}
				close(afd);
			}
		}
	}

and send program:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <netdb.h>
	#include <netinet/in.h>
	#include <sys/stat.h>
	#include <sys/sendfile.h>
	#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
	static unsigned char buffer[512*1024] __attribute__((aligned(4096)));
	int main(int argc, char *argv[])
	{
		struct sockaddr_in sin = { .sin_family = AF_INET, .sin_port = htons(5555) };
		struct hostent *h;
		ssize_t size, r, o;
		int cfd;
		if (argc != 3) {
			fprintf(stderr, "tcp-gen <server> <size>\n");
			exit(2);
		}
		size = strtoul(argv[2], NULL, 0);
		if (size <= 0) {
			fprintf(stderr, "Bad size\n");
			exit(2);
		}
		h = gethostbyname(argv[1]);
		if (!h) {
			fprintf(stderr, "%s: %s\n", argv[1], hstrerror(h_errno));
			exit(3);
		}
		if (!h->h_addr_list[0]) {
			fprintf(stderr, "%s: No addresses\n", argv[1]);
			exit(3);
		}
		memcpy(&sin.sin_addr, h->h_addr_list[0], h->h_length);
		cfd = socket(AF_INET, SOCK_STREAM, 0);
		OSERROR(cfd, "socket");
		OSERROR(connect(cfd, (struct sockaddr *)&sin, sizeof(sin)), "connect");
		do {
			r = size > sizeof(buffer) ? sizeof(buffer) : size;
			size -= r;
			o = 0;
			do {
				ssize_t w = write(cfd, buffer + o, r - o);
				OSERROR(w, "write");
				o += w;
			} while (o < r);
		} while (size > 0);
		OSERROR(close(cfd), "close/c");
		return 0;
	}

since the socket interface uses iterators.  It seems to show no difference.
One side note, though: I've been doing 10GiB same-machine transfers, and it
takes either ~2.5s or ~0.87s and rarely in between, with or without these
patches, alternating apparently randomly between the two times.

The patches can be found here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-ops

David
---
David Howells (29):
      iov_iter: Switch to using a table of operations
      iov_iter: Split copy_page_to_iter()
      iov_iter: Split iov_iter_fault_in_readable
      iov_iter: Split the iterate_and_advance() macro
      iov_iter: Split copy_to_iter()
      iov_iter: Split copy_mc_to_iter()
      iov_iter: Split copy_from_iter()
      iov_iter: Split the iterate_all_kinds() macro
      iov_iter: Split copy_from_iter_full()
      iov_iter: Split copy_from_iter_nocache()
      iov_iter: Split copy_from_iter_flushcache()
      iov_iter: Split copy_from_iter_full_nocache()
      iov_iter: Split copy_page_from_iter()
      iov_iter: Split iov_iter_zero()
      iov_iter: Split copy_from_user_atomic()
      iov_iter: Split iov_iter_advance()
      iov_iter: Split iov_iter_revert()
      iov_iter: Split iov_iter_single_seg_count()
      iov_iter: Split iov_iter_alignment()
      iov_iter: Split iov_iter_gap_alignment()
      iov_iter: Split iov_iter_get_pages()
      iov_iter: Split iov_iter_get_pages_alloc()
      iov_iter: Split csum_and_copy_from_iter()
      iov_iter: Split csum_and_copy_from_iter_full()
      iov_iter: Split csum_and_copy_to_iter()
      iov_iter: Split iov_iter_npages()
      iov_iter: Split dup_iter()
      iov_iter: Split iov_iter_for_each_range()
      iov_iter: Remove iterate_all_kinds() and iterate_and_advance()


 lib/iov_iter.c | 1440 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 934 insertions(+), 506 deletions(-)

Comments

Pavel Begunkov Nov. 21, 2020, 2:34 p.m. UTC | #1
On 21/11/2020 14:13, David Howells wrote:
> 
> Hi Pavel, Willy, Jens, Al,
> 
> I had a go switching the iov_iter stuff away from using a type bitmask to
> using an ops table to get rid of the if-if-if-if chains that are all over
> the place.  After I pushed it, someone pointed me at Pavel's two patches.
> 
> I have another iterator class that I want to add - which would lengthen the
> if-if-if-if chains.  A lot of the time, there's a conditional clause at the
> beginning of a function that just jumps off to a type-specific handler or
> to reject the operation for that type.  An ops table can just point to that
> instead.
> 
> As far as I can tell, there's no difference in performance in most cases,
> though doing AFS-based kernel compiles appears to take less time (down from
> 3m20 to 2m50), which might make sense as that uses iterators a lot - but
> there are too many variables in that for that to be a good benchmark (I'm
> dealing with a remote server, for a start).
> 
> Can someone recommend a good way to benchmark this properly?  The problem
> is that the difference this makes relative to the amount of time taken to
> actually do I/O is tiny.

I see plenty of iov overhead when running fio's t/io_uring.c against null_blk.
Not sure whether it'll help you, but it's worth a try.

Linus Torvalds Nov. 21, 2020, 6:23 p.m. UTC | #2
On Sat, Nov 21, 2020 at 6:13 AM David Howells <dhowells@redhat.com> wrote:
>
> Can someone recommend a good way to benchmark this properly?  The problem
> is that the difference this makes relative to the amount of time taken to
> actually do I/O is tiny.

Maybe try /dev/zero -> /dev/null to try a load where the IO itself is
cheap. Or vmsplice to /dev/null?

         Linus
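
A minimal sketch of the /dev/zero -> /dev/null idea, under stated assumptions: the function name `zero_to_null_seconds` and its parameters are hypothetical, and it uses readv()/writev() on the assumption that the vectored calls are more likely to reach the iterator code than plain read()/write(). Whether that holds depends on the kernel version and on which file operations the two devices implement, so it should be checked (for example with perf) before trusting any numbers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

/*
 * Move `total` bytes from /dev/zero to /dev/null in chunks of at most
 * `bufsize` bytes.  Returns the elapsed wall-clock time in seconds, or
 * -1.0 if anything fails.
 */
static double zero_to_null_seconds(size_t total, size_t bufsize)
{
	struct timespec t0, t1;
	double elapsed = -1.0;
	char *buf;
	int in, out;

	assert(bufsize > 0);
	buf = malloc(bufsize);
	in = open("/dev/zero", O_RDONLY);
	out = open("/dev/null", O_WRONLY);
	if (!buf || in < 0 || out < 0)
		goto done;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (total > 0) {
		struct iovec iov = {
			.iov_base = buf,
			.iov_len = total < bufsize ? total : bufsize,
		};
		ssize_t got = readv(in, &iov, 1);

		if (got <= 0)
			goto done;
		iov.iov_len = (size_t)got;
		if (writev(out, &iov, 1) != got)
			goto done;
		total -= (size_t)got;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	elapsed = (double)(t1.tv_sec - t0.tv_sec) +
		  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
done:
	if (in >= 0)
		close(in);
	if (out >= 0)
		close(out);
	free(buf);
	return elapsed;
}
```

Calling this from a small `main()` with a large total (tens of GiB) and a range of buffer sizes, on both the patched and unpatched kernel, should give a comparison in which very little time is spent on actual I/O.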
Matthew Wilcox Dec. 11, 2020, 3:24 a.m. UTC | #3
On Sat, Nov 21, 2020 at 02:13:21PM +0000, David Howells wrote:
> I had a go switching the iov_iter stuff away from using a type bitmask to
> using an ops table to get rid of the if-if-if-if chains that are all over
> the place.  After I pushed it, someone pointed me at Pavel's two patches.
> 
> I have another iterator class that I want to add - which would lengthen the
> if-if-if-if chains.  A lot of the time, there's a conditional clause at the
> beginning of a function that just jumps off to a type-specific handler or
> to reject the operation for that type.  An ops table can just point to that
> instead.

So, given the performance problem, how about turning this inside out?

struct iov_step {
	union {
		void *kaddr;
		void __user *uaddr;
	};
	unsigned int len;
	bool user_addr;
	bool kmap;
	struct page *page;
};

bool iov_iterate(struct iov_step *step, struct iov_iter *i, size_t max)
{
	if (step->page)
		kunmap(step->page);
	else if (step->kmap)
		kunmap_atomic(step->kaddr);

	if (max == 0)
		return false;

	if (i->type & ITER_IOVEC) {
		step->user_addr = true;
		step->uaddr = i->iov->iov_base + i->iov_offset;
		return true;
	}
	if (i->type & ITER_BVEC) {
		... get the page ...
	} else if (i->type & ITER_KVEC) {
		... get the page ...
	} else ...

	kmap or kmap_atomic as appropriate ...
	...set kaddr & len ...

	return true;
}

size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
{
	struct iov_step step = {};
	size_t copied = 0;

	while (iov_iterate(&step, i, bytes)) {
		if (step.user_addr)
			copy_from_user(addr, step.uaddr, step.len);
		else
			memcpy(addr, step.kaddr, step.len);
		addr += step.len;	/* advance the destination */
		copied += step.len;
		bytes -= step.len;
	}
	return copied;
}
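
The inside-out shape can be tried in userspace with a simplified stand-in. This is only a sketch under assumptions: `struct seg_iter`, `struct seg_step`, `seg_iterate()` and `seg_copy_from_iter()` are hypothetical names, the segments are ordinary `struct iovec` entries in the caller's own memory, and there is no user/kernel distinction and no kmap handling. What it keeps is the control flow: a single stepping function hands back one contiguous chunk at a time, and each operation becomes a plain loop around it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>	/* struct iovec */

/* Hypothetical cursor over an array of segments; not the kernel's iov_iter. */
struct seg_iter {
	const struct iovec *iov;	/* current segment */
	size_t nr_segs;			/* segments left, including the current one */
	size_t offset;			/* bytes already consumed from *iov */
};

/* One contiguous chunk handed back to the caller. */
struct seg_step {
	const void *addr;
	size_t len;
};

/*
 * Produce the next chunk of at most max bytes and advance the cursor.
 * Returns false when max is 0 or the iterator is exhausted.
 */
static bool seg_iterate(struct seg_step *step, struct seg_iter *it, size_t max)
{
	/* Skip segments that are empty or fully consumed. */
	while (it->nr_segs && it->offset == it->iov->iov_len) {
		it->iov++;
		it->nr_segs--;
		it->offset = 0;
	}
	if (max == 0 || it->nr_segs == 0)
		return false;

	size_t avail = it->iov->iov_len - it->offset;

	step->addr = (const char *)it->iov->iov_base + it->offset;
	step->len = avail < max ? avail : max;
	it->offset += step->len;
	return true;
}

/* The copy operation is now just a loop around the stepping function. */
static size_t seg_copy_from_iter(void *dst, size_t bytes, struct seg_iter *it)
{
	struct seg_step step;
	size_t copied = 0;

	assert(dst && it);
	while (seg_iterate(&step, it, bytes)) {
		memcpy((char *)dst + copied, step.addr, step.len);
		copied += step.len;
		bytes -= step.len;
	}
	return copied;
}
```

With this structure the per-type decisions are made once per chunk inside the stepping function, and other operations (zeroing, checksumming) would reuse the same loop with a different body.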