diff mbox

[2/8] swap: lock i_mutex for swap_writepage direct_IO

Message ID a59510f4552a5d3557958cdb0ce1b23b3abfc75b.1418618044.git.osandov@osandov.com (mailing list archive)
State New, archived
Headers show

Commit Message

Omar Sandoval Dec. 15, 2014, 5:26 a.m. UTC
The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
be held, but most direct_IO implementations do.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
---
 mm/page_io.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Jan Kara Dec. 15, 2014, 4:27 p.m. UTC | #1
On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> be held, but most direct_IO implementations do.
  I think you are speaking about direct IO writes only, aren't you? For DIO
reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
for direct IO writes. It uses it's internal rwlock for this (see
xfs_file_dio_aio_write()). So I think this is just wrong.

								Honza

> Signed-off-by: Omar Sandoval <osandov@osandov.com>
> ---
>  mm/page_io.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 955db8b..1630ac0 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -263,6 +263,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>  	if (sis->flags & SWP_FILE) {
>  		struct kiocb kiocb;
>  		struct file *swap_file = sis->swap_file;
> +		struct inode *inode = file_inode(swap_file);
>  		struct address_space *mapping = swap_file->f_mapping;
>  		struct bio_vec bv = {
>  			.bv_page = page,
> @@ -283,9 +284,11 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
>  
>  		set_page_writeback(page);
>  		unlock_page(page);
> +		mutex_lock(&inode->i_mutex);
>  		ret = mapping->a_ops->direct_IO(ITER_BVEC | WRITE,
>  						&kiocb, &from,
>  						kiocb.ki_pos);
> +		mutex_unlock(&inode->i_mutex);
>  		if (ret == PAGE_SIZE) {
>  			count_vm_event(PSWPOUT);
>  			ret = 0;
> -- 
> 2.1.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Dec. 15, 2014, 4:56 p.m. UTC | #2
On Mon, Dec 15, 2014 at 05:27:05PM +0100, Jan Kara wrote:
> On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> > The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> > doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> > be held, but most direct_IO implementations do.
>   I think you are speaking about direct IO writes only, aren't you? For DIO
> reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
> necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
> for direct IO writes. It uses it's internal rwlock for this (see
> xfs_file_dio_aio_write()). So I think this is just wrong.

The problem is that the use of ->direct_IO by the swap code is a gross
layering violation.  ->direct_IO is a callback for the filesystem, and
the swap code need to call ->read_iter instead of ->readpage and
->write_tier instead of ->direct_IO, and leave the locking to the
filesystem.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Omar Sandoval Dec. 15, 2014, 10:11 p.m. UTC | #3
On Mon, Dec 15, 2014 at 08:56:15AM -0800, Christoph Hellwig wrote:
> On Mon, Dec 15, 2014 at 05:27:05PM +0100, Jan Kara wrote:
> > On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> > > The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> > > doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> > > be held, but most direct_IO implementations do.
> >   I think you are speaking about direct IO writes only, aren't you? For DIO
> > reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
> > necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
> > for direct IO writes. It uses it's internal rwlock for this (see
> > xfs_file_dio_aio_write()). So I think this is just wrong.
> 
> The problem is that the use of ->direct_IO by the swap code is a gross
> layering violation.  ->direct_IO is a callback for the filesystem, and
> the swap code need to call ->read_iter instead of ->readpage and
> ->write_tier instead of ->direct_IO, and leave the locking to the
> filesystem.
>
Ok, I got the swap code working with ->read_iter/->write_iter without
too much trouble. I wanted to double check before I submit if there's
any gotchas involved with adding the O_DIRECT flag to a file pointer
after it has been opened -- swapon opens the swapfile before we know if
we're using the SWP_FILE infrastructure, and we need to add O_DIRECT so
->{read,write}_iter use direct I/O, but we can't add O_DIRECT to the
original open without excluding filesystems that support the old bmap
path but not direct I/O.
Christoph Hellwig Dec. 16, 2014, 8:35 a.m. UTC | #4
On Mon, Dec 15, 2014 at 02:11:00PM -0800, Omar Sandoval wrote:
> Ok, I got the swap code working with ->read_iter/->write_iter without
> too much trouble. I wanted to double check before I submit if there's
> any gotchas involved with adding the O_DIRECT flag to a file pointer
> after it has been opened -- swapon opens the swapfile before we know if
> we're using the SWP_FILE infrastructure, and we need to add O_DIRECT so
> ->{read,write}_iter use direct I/O, but we can't add O_DIRECT to the
> original open without excluding filesystems that support the old bmap
> path but not direct I/O.

In general just adding O_DIRECT is a problem.  However given that the
swap file is locked against any other access while in use it seems ok
in this particular case.  Just make sure to clear it on swapoff, and
write a detailed comment explaining the situation.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Dec. 20, 2014, 6:51 a.m. UTC | #5
On Mon, Dec 15, 2014 at 08:56:15AM -0800, Christoph Hellwig wrote:
> On Mon, Dec 15, 2014 at 05:27:05PM +0100, Jan Kara wrote:
> > On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> > > The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> > > doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> > > be held, but most direct_IO implementations do.
> >   I think you are speaking about direct IO writes only, aren't you? For DIO
> > reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
> > necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
> > for direct IO writes. It uses it's internal rwlock for this (see
> > xfs_file_dio_aio_write()). So I think this is just wrong.
> 
> The problem is that the use of ->direct_IO by the swap code is a gross
> layering violation.  ->direct_IO is a callback for the filesystem, and
> the swap code need to call ->read_iter instead of ->readpage and
> ->write_tier instead of ->direct_IO, and leave the locking to the
> filesystem.

The thing is, ->read_iter() and ->write_iter() might decide to fall back to 
buffered IO path.  XFS is unusual in that respect - there O_DIRECT ends up
with short write in such case.  Other filesystems, OTOH...
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Omar Sandoval Dec. 22, 2014, 7:26 a.m. UTC | #6
On Sat, Dec 20, 2014 at 06:51:33AM +0000, Al Viro wrote:
> On Mon, Dec 15, 2014 at 08:56:15AM -0800, Christoph Hellwig wrote:
> > On Mon, Dec 15, 2014 at 05:27:05PM +0100, Jan Kara wrote:
> > > On Sun 14-12-14 21:26:56, Omar Sandoval wrote:
> > > > The generic write code locks i_mutex for a direct_IO. Swap-over-NFS
> > > > doesn't grab the mutex because nfs_direct_IO doesn't expect i_mutex to
> > > > be held, but most direct_IO implementations do.
> > >   I think you are speaking about direct IO writes only, aren't you? For DIO
> > > reads we don't hold i_mutex AFAICS. And also for DIO writes we don't
> > > necessarily hold i_mutex - see for example XFS which doesn't take i_mutex
> > > for direct IO writes. It uses it's internal rwlock for this (see
> > > xfs_file_dio_aio_write()). So I think this is just wrong.
> > 
> > The problem is that the use of ->direct_IO by the swap code is a gross
> > layering violation.  ->direct_IO is a callback for the filesystem, and
> > the swap code need to call ->read_iter instead of ->readpage and
> > ->write_tier instead of ->direct_IO, and leave the locking to the
> > filesystem.
> 
> The thing is, ->read_iter() and ->write_iter() might decide to fall back to 
> buffered IO path.  XFS is unusual in that respect - there O_DIRECT ends up
> with short write in such case.  Other filesystems, OTOH...

Alright, now what? Using ->direct_IO directly is pretty much a no go
because of the different locking conventions as was pointed out. Maybe
some "no, really, just direct I/O" iocb flag?
Christoph Hellwig Dec. 23, 2014, 9:37 a.m. UTC | #7
On Sat, Dec 20, 2014 at 06:51:33AM +0000, Al Viro wrote:
> > The problem is that the use of ->direct_IO by the swap code is a gross
> > layering violation.  ->direct_IO is a callback for the filesystem, and
> > the swap code need to call ->read_iter instead of ->readpage and
> > ->write_tier instead of ->direct_IO, and leave the locking to the
> > filesystem.
> 
> The thing is, ->read_iter() and ->write_iter() might decide to fall back to 
> buffered IO path.  XFS is unusual in that respect - there O_DIRECT ends up
> with short write in such case.  Other filesystems, OTOH...

We'll just need a ->swap_activate method that makes sure we really do
direct I/O.  For all filesystems currently suporting swap just checking
that all blocks are allocated (as the ->bmap path already does) should
be enough.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b..1630ac0 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -263,6 +263,7 @@  int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	if (sis->flags & SWP_FILE) {
 		struct kiocb kiocb;
 		struct file *swap_file = sis->swap_file;
+		struct inode *inode = file_inode(swap_file);
 		struct address_space *mapping = swap_file->f_mapping;
 		struct bio_vec bv = {
 			.bv_page = page,
@@ -283,9 +284,11 @@  int __swap_writepage(struct page *page, struct writeback_control *wbc,
 
 		set_page_writeback(page);
 		unlock_page(page);
+		mutex_lock(&inode->i_mutex);
 		ret = mapping->a_ops->direct_IO(ITER_BVEC | WRITE,
 						&kiocb, &from,
 						kiocb.ki_pos);
+		mutex_unlock(&inode->i_mutex);
 		if (ret == PAGE_SIZE) {
 			count_vm_event(PSWPOUT);
 			ret = 0;