Recvfile patch used for Samba.

On Tue, Jul 23, 2013 at 02:58:58PM -0700, Jeremy Allison wrote:
> On Tue, Jul 23, 2013 at 05:10:27PM +1000, Dave Chinner wrote:
> > So, we are nesting up to 32 page locks here. That's bad. And we are
> > nesting kmap() calls for all the pages individually - is that even
> > safe to do?
> > 
> > So, what happens when we've got 16 pages in, and the filesystem has
> > allocated space for those 16 blocks, and we get ENOSPC on the 17th?
> > Sure, you undo the state here, but what about the 16 blocks that the
> > filesystem has allocated to this file? There's no notification to
> > the filesystem that they need to be truncated away because the write
> > failed....
> > 
> > > +
> > > +	/* IOV is ready, receive the date from socket now */
> > > +	msg.msg_name = NULL;
> > > +	msg.msg_namelen = 0;
> > > +	msg.msg_iov = (struct iovec *)&iov[0];
> > > +	msg.msg_iovlen = cPagesAllocated ;
> > > +	msg.msg_control = NULL;
> > > +	msg.msg_controllen = 0;
> > > +	msg.msg_flags = MSG_KERNSPACE;
> > > +	rcvtimeo = sock->sk->sk_rcvtimeo;    
> > > +	sock->sk->sk_rcvtimeo = 8 * HZ;
> > 
> > We can hold the inode and the pages locked for 8 seconds?
> > 
> > I'll stop there. This is fundamentally broken. It's an attempt to do
> > a multi-page write operation without any of the supporting
> > structures needed to handle the failure cases properly.  The nested
> > page locking has "deadlock" written all over it, and the lack of
> > partial failure handling shouts "data corruption" and "stale data
> > exposure" to me. The fact it can block for up to 8 seconds waiting
> > for network shenanigans to be completed while holding lots of locks
> > is going to cause all sorts of problems under memory pressure.
> > 
> > Not to mention it means that all memory allocations in the msgrcv
> > path need to be done with GFP_NOFS, because GFP_KERNEL allocations
> > are almost guaranteed to deadlock on the locked pages this path
> > already holds....
> > 
> > Need I say more?
> 
> No, that's great ! :-).
> 
> Thanks for the analysis. I'd heard it wasn't
> near production quality, but not being a kernel
> engineer myself I wasn't able to make that assessment.
> 
> Having said that the OEMs that are using it does
> find it improves write speeds by a large amount (10%
> or more), so it's showing there is room for improvement
> here if the correct code can be created for recvfile.

10% is not a very large gain given the complexity it adds, and I
question whether the gain actually comes from moving the memcpy()
into the kernel.  If this recvfile code enabled zero-copy behaviour
into the page cache, then it would be worth pursuing. But it doesn't,
and so IMO the complexity is not worth the gain right now.
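
For reference, the userspace path that recvfile is meant to replace
looks roughly like the sketch below: recv() copies from the socket
into a user buffer, and pwrite() then copies that buffer into the
page cache. The helper name copy_socket_to_file() and the 64KiB
bounce buffer are assumptions made for illustration; this is not
Samba's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: receive up to len bytes from sock and write
 * them to fd starting at offset off.  Returns the number of bytes
 * written to the file, or -1 if nothing could be written.
 */
ssize_t copy_socket_to_file(int sock, int fd, off_t off, size_t len)
{
	char buf[64 * 1024];	/* bounce buffer: the copy in question */
	size_t done = 0;

	assert(sock >= 0 && fd >= 0);

	while (done < len) {
		size_t want = len - done;
		ssize_t got, put, wr;

		if (want > sizeof(buf))
			want = sizeof(buf);

		/* Copy 1: socket buffers -> user buffer. */
		got = recv(sock, buf, want, 0);
		if (got < 0 && errno == EINTR)
			continue;
		if (got < 0)
			return done ? (ssize_t)done : -1;
		if (got == 0)
			break;		/* peer closed the connection */

		/* Copy 2: user buffer -> page cache, handling short writes. */
		for (put = 0; put < got; put += wr) {
			wr = pwrite(fd, buf + put, got - put,
				    off + done + put);
			if (wr < 0 && errno == EINTR) {
				wr = 0;
				continue;
			}
			if (wr <= 0)
				return (done + put) ?
					(ssize_t)(done + put) : -1;
		}
		done += got;
	}
	return (ssize_t)done;
}
```

Moving this loop into the kernel removes the bounce buffer and a
syscall per chunk, but the data is still copied from the socket
buffers into the page cache, so it is not zero-copy.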

Indeed, I suspect the 10% gain will be from the multi-page write
behaviour that was hacked into the code. I wrote a multi-page
write prototype ~3 years ago that showed write(2) performance gains
of roughly 10% on low CPU power machines running XFS.

$ git branch |grep multi
  multipage-write
$ git checkout multipage-write 
Checking out files: 100% (45114/45114), done.
Switched to branch 'multipage-write'
$ head -4 Makefile 
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 37
EXTRAVERSION = -rc6
$

I should probably pick this up again and push it forwards. FWIW,
I've attached the first multipage-write infrastructure patch from
the above branch to show how this sort of operation needs to be done
from a filesystem and page-cache perspective to avoid locking
problems and have sane error handling.
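
To illustrate the error-handling point only, here is a toy userspace
model of the ordering: reserve space for the whole range before
touching any page, copy in ascending page order, then hand back any
reservation a short copy left unused. Every name in it (toy_fs,
toy_file, toy_multipage_write() and so on) is hypothetical, and it is
not the attached patch or any kernel API.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE	4096
#define TOY_MAX_PAGES	32

/* Toy stand-in for a filesystem with a fixed pool of free blocks. */
struct toy_fs {
	size_t free_blocks;
	size_t reserved;	/* reserved but not yet written */
};

struct toy_file {
	struct toy_fs *fs;
	size_t nr_pages;	/* pages that hold valid data */
	char data[TOY_MAX_PAGES][TOY_PAGE_SIZE];
};

static int toy_reserve(struct toy_fs *fs, size_t n)
{
	if (n > fs->free_blocks)
		return -ENOSPC;
	fs->free_blocks -= n;
	fs->reserved += n;
	return 0;
}

static void toy_unreserve(struct toy_fs *fs, size_t n)
{
	fs->reserved -= n;
	fs->free_blocks += n;
}

/*
 * Append nr pages from src.  copy_ok models a short copy: only the
 * first copy_ok pages can be copied.  Returns pages written or a
 * negative errno.
 */
long toy_multipage_write(struct toy_file *f, const char *src,
			 size_t nr, size_t copy_ok)
{
	size_t i;
	int err;

	if (nr > TOY_MAX_PAGES - f->nr_pages)
		return -EFBIG;

	/* 1. Reserve the whole range up front: ENOSPC is reported
	 *    before the file has been modified at all. */
	err = toy_reserve(f->fs, nr);
	if (err)
		return err;

	/* 2. Copy in ascending page order.  A real implementation
	 *    would lock each page here and unlock it after the copy. */
	for (i = 0; i < nr && i < copy_ok; i++) {
		memcpy(f->data[f->nr_pages], src + i * TOY_PAGE_SIZE,
		       TOY_PAGE_SIZE);
		f->nr_pages++;
		f->fs->reserved--;	/* reservation becomes a block */
	}

	/* 3. Release whatever the short copy did not use. */
	toy_unreserve(f->fs, nr - i);
	return (long)i;
}
```

With 16 free blocks, a 17 page write fails with -ENOSPC and leaves
the file untouched, which is the case the recvfile patch gets wrong.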

I believe the version that Christoph implemented for a couple of
OEMs around that time de-multiplexed the ->iomap method....

Cheers,

Dave.
