diff mbox

[2/8] swap: lock i_mutex for swap_writepage direct_IO

Message ID 20141216085624.GA25256@mew (mailing list archive)
State New, archived
Headers show

Commit Message

Omar Sandoval Dec. 16, 2014, 8:56 a.m. UTC
On Tue, Dec 16, 2014 at 12:35:43AM -0800, Christoph Hellwig wrote:
> On Mon, Dec 15, 2014 at 02:11:00PM -0800, Omar Sandoval wrote:
> > Ok, I got the swap code working with ->read_iter/->write_iter without
> > too much trouble. I wanted to double check before I submit if there's
> > any gotchas involved with adding the O_DIRECT flag to a file pointer
> > after it has been opened -- swapon opens the swapfile before we know if
> > we're using the SWP_FILE infrastructure, and we need to add O_DIRECT so
> > ->{read,write}_iter use direct I/O, but we can't add O_DIRECT to the
> > original open without excluding filesystems that support the old bmap
> > path but not direct I/O.
> 
> In general just adding O_DIRECT is a problem.  However given that the
> swap file is locked against any other access while in use it seems ok
> in this particular case.  Just make sure to clear it on swapoff, and
> write a detailed comment explaining the situation.

I'll admit that I'm a bit confused. I want to do this:


This seems to be more or less equivalent to doing a fcntl(F_SETFL) to
add the O_DIRECT flag to swap_file (which is a struct file *). Swapoff
calls filp_close on swap_file, so I don't see why it's necessary to
clear the flag.

Comments

Christoph Hellwig Dec. 17, 2014, 8:06 a.m. UTC | #1
On Tue, Dec 16, 2014 at 12:56:24AM -0800, Omar Sandoval wrote:
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1728,6 +1728,9 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
>         }
>  
>         if (mapping->a_ops->swap_activate) {
> +               if (!mapping->a_ops->direct_IO)
> +                       return -EINVAL;
> +               swap_file->f_flags |= O_DIRECT;
>                 ret = mapping->a_ops->swap_activate(sis, swap_file, span);
>                 if (!ret) {
>                         sis->flags |= SWP_FILE;

This needs to hold swap_file->f_lock, but otherwise looks good.

> This seems to be more or less equivalent to doing a fcntl(F_SETFL) to
> add the O_DIRECT flag to swap_file (which is a struct file *). Swapoff
> calls filp_close on swap_file, so I don't see why it's necessary to
> clear the flag.

filp_lose doesn't nessecarily destroy the file structure, there might be
other reference to it, e.g. from dup() or descriptor passing.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Dec. 17, 2014, 8:20 a.m. UTC | #2
On Wed, Dec 17, 2014 at 12:06:10AM -0800, Christoph Hellwig wrote:

> > This seems to be more or less equivalent to doing a fcntl(F_SETFL) to
> > add the O_DIRECT flag to swap_file (which is a struct file *). Swapoff
> > calls filp_close on swap_file, so I don't see why it's necessary to
> > clear the flag.
> 
> filp_lose doesn't nessecarily destroy the file structure, there might be
> other reference to it, e.g. from dup() or descriptor passing.

Where the hell would those other references come from?  We open the damn
thing in sys_swapon(), never put it into descriptor tables, etc. and
the only reason why we use filp_close() instead of fput() is that we
would miss ->flush() otherwise.

Said that, why not simply *open* it with O_DIRECT to start with and be done
with that?  It's not as if those guys came preopened by caller - swapon(2)
gets a pathname and does opening itself.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Dec. 17, 2014, 8:24 a.m. UTC | #3
On Wed, Dec 17, 2014 at 08:20:21AM +0000, Al Viro wrote:
> Where the hell would those other references come from?  We open the damn
> thing in sys_swapon(), never put it into descriptor tables, etc. and
> the only reason why we use filp_close() instead of fput() is that we
> would miss ->flush() otherwise.
> 
> Said that, why not simply *open* it with O_DIRECT to start with and be done
> with that?  It's not as if those guys came preopened by caller - swapon(2)
> gets a pathname and does opening itself.

Oops, should have dug deeper into the code.  For some reason I assumed
the fd is passed in from userspace.

The suggestion from Al is much better, given that we never do normal
I/O on the swapfile, just the bmap + direct bio submission which I hope
could go away in favor of the direct I/O variant in the long run.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Omar Sandoval Dec. 17, 2014, 2:58 p.m. UTC | #4
On Wed, Dec 17, 2014 at 12:24:37AM -0800, Christoph Hellwig wrote:
> On Wed, Dec 17, 2014 at 08:20:21AM +0000, Al Viro wrote:
> > Where the hell would those other references come from?  We open the damn
> > thing in sys_swapon(), never put it into descriptor tables, etc. and
> > the only reason why we use filp_close() instead of fput() is that we
> > would miss ->flush() otherwise.
> > 
> > Said that, why not simply *open* it with O_DIRECT to start with and be done
> > with that?  It's not as if those guys came preopened by caller - swapon(2)
> > gets a pathname and does opening itself.
> 
> Oops, should have dug deeper into the code.  For some reason I assumed
> the fd is passed in from userspace.
> 
> The suggestion from Al is much better, given that we never do normal
> I/O on the swapfile, just the bmap + direct bio submission which I hope
> could go away in favor of the direct I/O variant in the long run.

See my previous message. If we use O_DIRECT on the original open, then
filesystems that implement bmap but not direct_IO will no longer work.
These are the ones that I found in my tree:

adfs
befs
bfs
ecryptfs
efs
freevxfs
hpfs
isofs
minix
ntfs
omfs
qnx4
qnx6
sysv
ufs

Several of these are read only, and I can't imagine that anyone is using
a swapfile on any of the rest, but if someone is, this would be a
regression, wouldn't it?
Christoph Hellwig Dec. 17, 2014, 6:52 p.m. UTC | #5
On Wed, Dec 17, 2014 at 06:58:32AM -0800, Omar Sandoval wrote:
> See my previous message. If we use O_DIRECT on the original open, then
> filesystems that implement bmap but not direct_IO will no longer work.
> These are the ones that I found in my tree:

In the long run I don't think they are worth keeping.  But to keep you
out of that discussion you can just try an open without O_DIRECT if the
open with the flag failed.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Dec. 17, 2014, 10:03 p.m. UTC | #6
On Wed, Dec 17, 2014 at 10:52:56AM -0800, Christoph Hellwig wrote:
> On Wed, Dec 17, 2014 at 06:58:32AM -0800, Omar Sandoval wrote:
> > See my previous message. If we use O_DIRECT on the original open, then
> > filesystems that implement bmap but not direct_IO will no longer work.
> > These are the ones that I found in my tree:
> 
> In the long run I don't think they are worth keeping.  But to keep you
> out of that discussion you can just try an open without O_DIRECT if the
> open with the flag failed.

Umm...  That's one possibility, of course (and if swapon(2) is on someone's
hotpath, I really would like to see what the hell they are doing - it has
to be interesting in a sick way).

Said that, there's an interesting problem with O_DIRECT.  It's irrelevant
in this case, but it *can* be changed halfway through e.g write(2) and
AFAICS we have at least some suspicious codepaths.  Look at
ext4_file_write_iter(), for example.  We check O_DIRECT, then grab some
locks, then proceed to look at the results of that check, do some work...
and call __generic_file_write_iter(), which checks O_DIRECT again.  If
it has been cleared (or, probably worse, set) it looks like we'll get
an interesting bunch of holes.

Should we just labda-expand that call of __generic_file_write_iter() and
replace its 
        if (unlikely(file->f_flags & O_DIRECT)) {
with
	if (odirect)
to be guaranteed that it'll match the things we'd done before the call?

I'm looking through the callchains right now in search of similar places
right now, will follow up when I'm done...

BTW, speaking of read/write vs. swap - what's the story with e.g. AFS
write() checking IS_SWAPFILE() and failing with -EBUSY?  Note that
	* it's done before acquiring i_mutex, so it isn't race-free
	* it's dubious from the POSIX POV - EBUSY isn't in the error
list for write(2).
	* other filesystems generally don't have anything of that sort.
NFS does, but local ones do not...
Besides, do we even allow swapfiles on AFS?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e..5145c09 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1728,6 +1728,9 @@  static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span)
        }
 
        if (mapping->a_ops->swap_activate) {
+               if (!mapping->a_ops->direct_IO)
+                       return -EINVAL;
+               swap_file->f_flags |= O_DIRECT;
                ret = mapping->a_ops->swap_activate(sis, swap_file, span);
                if (!ret) {
                        sis->flags |= SWP_FILE;