Message ID | CAHf9xvY1TSNa-zY82whUrK6DtnFYiNx1BEF6R1kGQj1HU+wbxA@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Am Donnerstag, 4. Oktober 2012 schrieb Alex Lyakas: > Hi Jan, > as I promised, here is some code for you to look at. > > First I will describe the approach in general. > > # Get rid of the pipe. Instead, user-space passes a buffer and kernel > fills the specified user-space buffer with commands. > # When the buffer is full, kernel stops generating commands and > returns a checkpoint to the user-space. Can it just fill a second buffer while userspace command handles the first? > # User-space does whatever it wants with the returned buffer, and then > calls the kernel again, with a buffer and a checkpoint that was > returned by the kernel from previous SEND ioctl(). > # Kernel re-arms itself to the specified checkpoint, and fills the > specified buffer with commands, attaches a new checkpoint and so on. > # Eventually kernel signals to the user that there are no more > commands. Thanks,
Matrin, On Sat, Oct 6, 2012 at 11:40 AM, Martin Steigerwald <Martin@lichtvoll.de> wrote: > Am Donnerstag, 4. Oktober 2012 schrieb Alex Lyakas: >> Hi Jan, >> as I promised, here is some code for you to look at. >> >> First I will describe the approach in general. >> >> # Get rid of the pipe. Instead, user-space passes a buffer and kernel >> fills the specified user-space buffer with commands. >> # When the buffer is full, kernel stops generating commands and >> returns a checkpoint to the user-space. > > Can it just fill a second buffer while userspace command handles the first? No, at this point kernel receives only one buffer and fills it up. Can you pls elaborate more what improvement you have in mind? Like user-space sending a list of buffers to the kernel in one shot? Or that user-space handles kind of producer-consumer pool of buffers and works on full buffers while kernel fills the empty ones (this, of course, can work). In general, my direction was to make the kernel call stateless, which makes the user-space part more flexible. Thanks, Alex. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am Sonntag, 7. Oktober 2012 schrieb Alex Lyakas: > Matrin, Martin, but bet was just a typo. > On Sat, Oct 6, 2012 at 11:40 AM, Martin Steigerwald <Martin@lichtvoll.de> wrote: > > Am Donnerstag, 4. Oktober 2012 schrieb Alex Lyakas: > >> Hi Jan, > >> as I promised, here is some code for you to look at. > >> > >> First I will describe the approach in general. > >> > >> # Get rid of the pipe. Instead, user-space passes a buffer and > >> kernel fills the specified user-space buffer with commands. > >> # When the buffer is full, kernel stops generating commands and > >> returns a checkpoint to the user-space. > > > > Can it just fill a second buffer while userspace command handles the > > first? > > No, at this point kernel receives only one buffer and fills it up. > Can you pls elaborate more what improvement you have in mind? Like > user-space sending a list of buffers to the kernel in one shot? Or > that user-space handles kind of producer-consumer pool of buffers and > works on full buffers while kernel fills the empty ones (this, of > course, can work). > > In general, my direction was to make the kernel call stateless, which > makes the user-space part more flexible. Hmmm, okay. So the kernel has only one buffer. Well I had some kind of double buffering in mind. The kernel fills one buffer while the userspace application handles the other buffer. Then kernel and userspace swap buffers. Just an idea of mine. I do not do kernel development, so I do not know whether its feasible. Was happy enough that I wrote a C program that uses some kernel functions some weeks ago that actually worked :) So would using two buffers and swapping them make sense? I do not understand what you mean by producer-consumer, but it seems like thats basically what I described as double buffering here. Thanks,
Martin, I apologize for misspelling your name. Yes, user-space can handle a buffer pool of any number of buffers, like one thread fetching a free buffer, handling it to the kernel, then putting it on some ready-buffer queue and signalling to a another thread that fetches ready (full) buffers and handles them. After it is done with the buffer, it puts it back on a free-buffer queue and signals the first thread. This is exactly producer-consumer. Alex. On Sun, Oct 7, 2012 at 12:53 PM, Martin Steigerwald <Martin@lichtvoll.de> wrote: > Am Sonntag, 7. Oktober 2012 schrieb Alex Lyakas: >> Matrin, > > Martin, but bet was just a typo. > >> On Sat, Oct 6, 2012 at 11:40 AM, Martin Steigerwald > <Martin@lichtvoll.de> wrote: >> > Am Donnerstag, 4. Oktober 2012 schrieb Alex Lyakas: >> >> Hi Jan, >> >> as I promised, here is some code for you to look at. >> >> >> >> First I will describe the approach in general. >> >> >> >> # Get rid of the pipe. Instead, user-space passes a buffer and >> >> kernel fills the specified user-space buffer with commands. >> >> # When the buffer is full, kernel stops generating commands and >> >> returns a checkpoint to the user-space. >> > >> > Can it just fill a second buffer while userspace command handles the >> > first? >> >> No, at this point kernel receives only one buffer and fills it up. >> Can you pls elaborate more what improvement you have in mind? Like >> user-space sending a list of buffers to the kernel in one shot? Or >> that user-space handles kind of producer-consumer pool of buffers and >> works on full buffers while kernel fills the empty ones (this, of >> course, can work). >> >> In general, my direction was to make the kernel call stateless, which >> makes the user-space part more flexible. > > Hmmm, okay. So the kernel has only one buffer. > > Well I had some kind of double buffering in mind. The kernel fills one buffer > while the userspace application handles the other buffer. Then kernel and > userspace swap buffers. > > Just an idea of mine. I do not do kernel development, so I do not know > whether its feasible. Was happy enough that I wrote a C program that uses > some kernel functions some weeks ago that actually worked :) > > So would using two buffers and swapping them make sense? > > I do not understand what you mean by producer-consumer, but it seems like > thats basically what I described as double buffering here. > > Thanks, > -- > Martin 'Helios' Steigerwald - http://www.Lichtvoll.de > GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am Sonntag, 7. Oktober 2012 schrieb Alex Lyakas: > On Sun, Oct 7, 2012 at 12:53 PM, Martin Steigerwald <Martin@lichtvoll.de> wrote: > > Am Sonntag, 7. Oktober 2012 schrieb Alex Lyakas: […] > >> On Sat, Oct 6, 2012 at 11:40 AM, Martin Steigerwald > > > > <Martin@lichtvoll.de> wrote: > >> > Am Donnerstag, 4. Oktober 2012 schrieb Alex Lyakas: > >> >> Hi Jan, > >> >> as I promised, here is some code for you to look at. > >> >> > >> >> First I will describe the approach in general. > >> >> > >> >> # Get rid of the pipe. Instead, user-space passes a buffer and > >> >> kernel fills the specified user-space buffer with commands. > >> >> # When the buffer is full, kernel stops generating commands and > >> >> returns a checkpoint to the user-space. > >> > > >> > Can it just fill a second buffer while userspace command handles > >> > the first? > >> > >> No, at this point kernel receives only one buffer and fills it up. > >> Can you pls elaborate more what improvement you have in mind? Like > >> user-space sending a list of buffers to the kernel in one shot? Or > >> that user-space handles kind of producer-consumer pool of buffers > >> and works on full buffers while kernel fills the empty ones (this, > >> of course, can work). > >> > >> In general, my direction was to make the kernel call stateless, > >> which makes the user-space part more flexible. > > > > Hmmm, okay. So the kernel has only one buffer. > > > > Well I had some kind of double buffering in mind. The kernel fills > > one buffer while the userspace application handles the other buffer. > > Then kernel and userspace swap buffers. > > > > Just an idea of mine. I do not do kernel development, so I do not > > know whether its feasible. Was happy enough that I wrote a C program > > that uses some kernel functions some weeks ago that actually worked > > :) > > > > So would using two buffers and swapping them make sense? > Martin, > I apologize for misspelling your name. No problem. > Yes, user-space can handle a buffer pool of any number of buffers, > like one thread fetching a free buffer, handling it to the kernel, > then putting it on some ready-buffer queue and signalling to a another > thread that fetches ready (full) buffers and handles them. After it is > done with the buffer, it puts it back on a free-buffer queue and > signals the first thread. This is exactly producer-consumer. Ah, so the kernel does not need to have anything to do with it. Nice. Thanks,
Hi Alex, On Thu, October 04, 2012 at 17:59 (+0200), Alex Lyakas wrote: > as I promised, here is some code for you to look at. And quite a lot of it. I hadn't thought of such a big change when I wrote "preferably in form of a patch". As a side note, your patch doesn't follow the general kernel coding style (but read on before you rework this one). > First I will describe the approach in general. > > # Get rid of the pipe. Instead, user-space passes a buffer and kernel > fills the specified user-space buffer with commands. > # When the buffer is full, kernel stops generating commands and > returns a checkpoint to the user-space. > # User-space does whatever it wants with the returned buffer, and then > calls the kernel again, with a buffer and a checkpoint that was > returned by the kernel from previous SEND ioctl(). > # Kernel re-arms itself to the specified checkpoint, and fills the > specified buffer with commands, attaches a new checkpoint and so on. > # Eventually kernel signals to the user that there are no more commands. We had that in the very beginning of btrfs send. Having only a single ioctl saves a whole lot of system calls. > I realize this is a big change, and a new IOCTL has to be introduced > in order not to break current user-kernel protocol. > The pros as I see them: > # One data-copy is avoided (no pipe). For WRITE commands two > data-copies are avoided (no read_buf needed) I'm not sure I understand those correctly. If you're talking about the user mode part, we could simply pass stdout to the kernel, saving the unnecessary pipe and copy operations in between without introducing a new buffer. > # ERESTARTSYS issue disappears. If needed, ioctl is restarted, but > there is no problem with that, it will simply refill the buffer from > the same checkpoint. This is the subject of this thread and the thing I'd like to focus on currently. > Cons: > # Instead of one ioctl(), many ioctls() are issued to finish the send. > # Big code change Two big cons. I'd like to quota Alexander's suggestions again: On Wed, August 01, 2012 at 14:09 (+0200), Alexander Block wrote: > I have two possible solutions in my mind. > 1. Store some kind of state in the ioctl arguments so that we can > continue where we stopped when the ioctl reenters. This would however > complicate the code a lot. > 2. Spawn a thread when the ioctl is called and leave the ioctl > immediately. I don't know if ERESTARTSYS can happen in vfs_xxx calls > if they happen from a non syscall thread. What do you think about those two? I like the first suggestion. Combining single-ioctl with signal handling capabilities feels like the right choice. When we get ERESTARTSYS, we know exactly how many bytes made it to user mode. To reach a comfortable state for a restart, we can store part of the stream together with the meta information in our internal state before returning to user mode. The ioctl will be restarted sooner or later and our internal state tells us where to proceed. Thanks, -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jan, thanks for taking time to look at the code. On Mon, Oct 8, 2012 at 11:26 AM, Jan Schmidt <list.btrfs@jan-o-sch.net> wrote: > Hi Alex, > > On Thu, October 04, 2012 at 17:59 (+0200), Alex Lyakas wrote: >> as I promised, here is some code for you to look at. > > And quite a lot of it. I hadn't thought of such a big change when I wrote > "preferably in form of a patch". > > As a side note, your patch doesn't follow the general kernel coding style (but > read on before you rework this one). > >> First I will describe the approach in general. >> >> # Get rid of the pipe. Instead, user-space passes a buffer and kernel >> fills the specified user-space buffer with commands. >> # When the buffer is full, kernel stops generating commands and >> returns a checkpoint to the user-space. >> # User-space does whatever it wants with the returned buffer, and then >> calls the kernel again, with a buffer and a checkpoint that was >> returned by the kernel from previous SEND ioctl(). >> # Kernel re-arms itself to the specified checkpoint, and fills the >> specified buffer with commands, attaches a new checkpoint and so on. >> # Eventually kernel signals to the user that there are no more commands. > > We had that in the very beginning of btrfs send. Having only a single ioctl > saves a whole lot of system calls. > >> I realize this is a big change, and a new IOCTL has to be introduced >> in order not to break current user-kernel protocol. >> The pros as I see them: >> # One data-copy is avoided (no pipe). For WRITE commands two >> data-copies are avoided (no read_buf needed) > > I'm not sure I understand those correctly. If you're talking about the user mode > part, we could simply pass stdout to the kernel, saving the unnecessary pipe and > copy operations in between without introducing a new buffer. What I meant is the following: # For non-WRITE commands the flow is: put the command onto send_buf, copy to pipe, then user-space copies it out from the pipe. With my code: put command onto send_buf, then copy to user-space buffer (copy_to_user). So one data-copy is avoided (2 vs 3). # For WRITE commands: read data onto read_buf, then copy to send_buf, then copy to pipe, then user-mode copies to its buffer. With my code: read onto send_buf, then copy to user-space buffer. So 2 data-copies are avoided (2 vs 4). Does it make sense? > >> # ERESTARTSYS issue disappears. If needed, ioctl is restarted, but >> there is no problem with that, it will simply refill the buffer from >> the same checkpoint. > > This is the subject of this thread and the thing I'd like to focus on currently. > >> Cons: >> # Instead of one ioctl(), many ioctls() are issued to finish the send. >> # Big code change > > Two big cons. I'd like to quota Alexander's suggestions again: > > On Wed, August 01, 2012 at 14:09 (+0200), Alexander Block wrote: >> I have two possible solutions in my mind. >> 1. Store some kind of state in the ioctl arguments so that we can >> continue where we stopped when the ioctl reenters. This would however >> complicate the code a lot. >> 2. Spawn a thread when the ioctl is called and leave the ioctl >> immediately. I don't know if ERESTARTSYS can happen in vfs_xxx calls >> if they happen from a non syscall thread. > > What do you think about those two? I am not familiar enough with Linux kernel - will the second one work? > > I like the first suggestion. Combining single-ioctl with signal handling > capabilities feels like the right choice. When we get ERESTARTSYS, we know > exactly how many bytes made it to user mode. To reach a comfortable state for a > restart, we can store part of the stream together with the meta information in > our internal state before returning to user mode. I thought that ERESTARTSYS never returns to user mode (this is what I saw in my tests also). > The ioctl will be restarted sooner or later and our internal state tells us where to proceed. Ok, so you say that we should maintain checkpoint-like information and use it if the ioctl is automatically restarted. This is quite close to what I have done, I believe; we still need all the capabilities of re-arming tree search, saving context, skipping commands etc, that I have written. Did I understand your proposal correctly? But tell me one thing: let's say we call vfs_write() and it returns -ERESTARTSYS. Can we be sure that it wrote 0 bytes to the pipe in this call? Otherwise, we don't know how many bytes it wrote, so we cannot resume it correctly. Thanks, Alex. > > Thanks, > -Jan > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, October 08, 2012 at 13:38 (+0200), Alex Lyakas wrote: >>> I realize this is a big change, and a new IOCTL has to be introduced >>> in order not to break current user-kernel protocol. >>> The pros as I see them: >>> # One data-copy is avoided (no pipe). For WRITE commands two >>> data-copies are avoided (no read_buf needed) >> >> I'm not sure I understand those correctly. If you're talking about the user mode >> part, we could simply pass stdout to the kernel, saving the unnecessary pipe and >> copy operations in between without introducing a new buffer. > What I meant is the following: > # For non-WRITE commands the flow is: put the command onto send_buf, > copy to pipe, then user-space copies it out from the pipe. With my > code: put command onto send_buf, then copy to user-space buffer > (copy_to_user). So one data-copy is avoided (2 vs 3). > # For WRITE commands: read data onto read_buf, then copy to send_buf, > then copy to pipe, then user-mode copies to its buffer. With my code: > read onto send_buf, then copy to user-space buffer. So 2 data-copies > are avoided (2 vs 4). > Does it make sense? I'd rather just focus on the ERESTARTSYS issue for now. >> On Wed, August 01, 2012 at 14:09 (+0200), Alexander Block wrote: >>> I have two possible solutions in my mind. >>> 1. Store some kind of state in the ioctl arguments so that we can >>> continue where we stopped when the ioctl reenters. This would however >>> complicate the code a lot. >>> 2. Spawn a thread when the ioctl is called and leave the ioctl >>> immediately. I don't know if ERESTARTSYS can happen in vfs_xxx calls >>> if they happen from a non syscall thread. >> >> What do you think about those two? > I am not familiar enough with Linux kernel - will the second one work? I not an expert here, either. My uneducated guess is that even a spawned kernel thread can have signals pending, in which case we would be in the same trouble. >> I like the first suggestion. Combining single-ioctl with signal handling >> capabilities feels like the right choice. When we get ERESTARTSYS, we know >> exactly how many bytes made it to user mode. To reach a comfortable state for a >> restart, we can store part of the stream together with the meta information in >> our internal state before returning to user mode. > I thought that ERESTARTSYS never returns to user mode (this is what I > saw in my tests also). That error isn't returned to a user process, that's correct. From a kernel developer's perspective, we're returning to user space, though, waiting for the pending signal to be processed. >> The ioctl will be restarted sooner or later and our internal state tells us where to proceed. > > Ok, so you say that we should maintain checkpoint-like information and > use it if the ioctl is automatically restarted. This is quite close to > what I have done, I believe; we still need all the capabilities of > re-arming tree search, saving context, skipping commands etc, that I > have written. Did I understand your proposal correctly? In some way, yes. If possible, I'd like to (with decreasing priority) - stick with the stream format - maintain ioctl-compatibility - have a much smaller patch - store less state - stick with the pipe I haven't had time to completely understand your checkpoint concept, and one-big-chunk patches are hard to read. But the size of the added data structures scares me. Shouldn't we be able to do this with as little information as a single btrfs key and a buffer of generated but not yet pushed stream data? This would also save us skipping commands or output bytes. > But tell me one thing: let's say we call vfs_write() and it returns > -ERESTARTSYS. Can we be sure that it wrote 0 bytes to the pipe in this > call? From the version of fs/pipe.c in my working directory: yes. If we rely on this, we'd better check if this is something to rely on, but I hope it is. -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jan, On Mon, Oct 8, 2012 at 3:37 PM, Jan Schmidt <list.btrfs@jan-o-sch.net> wrote: > On Mon, October 08, 2012 at 13:38 (+0200), Alex Lyakas wrote: >>>> I realize this is a big change, and a new IOCTL has to be introduced >>>> in order not to break current user-kernel protocol. >>>> The pros as I see them: >>>> # One data-copy is avoided (no pipe). For WRITE commands two >>>> data-copies are avoided (no read_buf needed) >>> >>> I'm not sure I understand those correctly. If you're talking about the user mode >>> part, we could simply pass stdout to the kernel, saving the unnecessary pipe and >>> copy operations in between without introducing a new buffer. >> What I meant is the following: >> # For non-WRITE commands the flow is: put the command onto send_buf, >> copy to pipe, then user-space copies it out from the pipe. With my >> code: put command onto send_buf, then copy to user-space buffer >> (copy_to_user). So one data-copy is avoided (2 vs 3). >> # For WRITE commands: read data onto read_buf, then copy to send_buf, >> then copy to pipe, then user-mode copies to its buffer. With my code: >> read onto send_buf, then copy to user-space buffer. So 2 data-copies >> are avoided (2 vs 4). >> Does it make sense? > > I'd rather just focus on the ERESTARTSYS issue for now. Agree. > >>> On Wed, August 01, 2012 at 14:09 (+0200), Alexander Block wrote: >>>> I have two possible solutions in my mind. >>>> 1. Store some kind of state in the ioctl arguments so that we can >>>> continue where we stopped when the ioctl reenters. This would however >>>> complicate the code a lot. >>>> 2. Spawn a thread when the ioctl is called and leave the ioctl >>>> immediately. I don't know if ERESTARTSYS can happen in vfs_xxx calls >>>> if they happen from a non syscall thread. >>> >>> What do you think about those two? >> I am not familiar enough with Linux kernel - will the second one work? > > I not an expert here, either. My uneducated guess is that even a spawned kernel > thread can have signals pending, in which case we would be in the same trouble. > >>> I like the first suggestion. Combining single-ioctl with signal handling >>> capabilities feels like the right choice. When we get ERESTARTSYS, we know >>> exactly how many bytes made it to user mode. To reach a comfortable state for a >>> restart, we can store part of the stream together with the meta information in >>> our internal state before returning to user mode. >> I thought that ERESTARTSYS never returns to user mode (this is what I >> saw in my tests also). > > That error isn't returned to a user process, that's correct. From a kernel > developer's perspective, we're returning to user space, though, waiting for the > pending signal to be processed. > >>> The ioctl will be restarted sooner or later and our internal state tells us where to proceed. >> >> Ok, so you say that we should maintain checkpoint-like information and >> use it if the ioctl is automatically restarted. This is quite close to >> what I have done, I believe; we still need all the capabilities of >> re-arming tree search, saving context, skipping commands etc, that I >> have written. Did I understand your proposal correctly? > > In some way, yes. If possible, I'd like to (with decreasing priority) > - stick with the stream format > - maintain ioctl-compatibility > - have a much smaller patch > - store less state > - stick with the pipe > > I haven't had time to completely understand your checkpoint concept, and > one-big-chunk patches are hard to read. But the size of the added data > structures scares me. Shouldn't we be able to do this with as little information > as a single btrfs key and a buffer of generated but not yet pushed stream data? > This would also save us skipping commands or output bytes. # In case we are doing diff-send, we need to restore the state of btrfs_compare_trees() call; we need to restore both left and right tree positions. So we need two keys (this is what my struct btrfs_compare_trees_checkpoint has). # After compare trees state is restored, it will call changed_cb(), at which point we need to restore our full state. So what I did, I saved the fixed-size part of send_ctx (like cur_ino, cur_inode_gen etc.). Perhaps some of this can be figured out on the first changed_cb call, though, but I went the easy way to store everything. Since the variable-size context (new_refs,deleted_refs) is cumbersome to store, we cannot save our state between any two commands. That's why I needed to save info on how many commands to skip and how many bytes to skip inside a WRITE command. At this point I don't see a straightforward way to store less than that, but I can try to go deeper and see if I can. So now the question is where to store that. Can we be sure that if we return -ERESTARTSYS, we will be restarted in the same syscall? If yes, then we can alloc and store a pointer in one of the reserved fields of struct btrfs_ioctl_send_args. If not, then this memory might not be freed. Otherwise, we need to embed this info in struct btrfs_ioctl_send_args, which breaks the current protocol. I see code like this in arch/x86/kernel/signal.c: case -ERESTARTSYS: if (!(ka->sa.sa_flags & SA_RESTART)) { regs->ax = -EINTR; break; } And also: " If you use sigaction to establish a signal handler, you can specify how that handler should behave. If you specify the SA_RESTART flag, return from that handler will resume a primitive; otherwise, return from that handler will cause EINTR." So it looks like the user can disable the restarting and rather handle EINTR. > >> But tell me one thing: let's say we call vfs_write() and it returns >> -ERESTARTSYS. Can we be sure that it wrote 0 bytes to the pipe in this >> call? > > From the version of fs/pipe.c in my working directory: yes. If we rely on this, > we'd better check if this is something to rely on, but I hope it is. (I see that is calls pipe_iov_copy_from_user() before checking for pending signals, actually, but maybe I am missing something) So what we can do: - stick with the stream format: YES - maintain ioctl-compatibility: Probably no, since the user can disable restarting. - have a much smaller patch: should be smaller, only the checkpoint part goes in, not the buffer part - store less state: not really - stick with the pipe: YES Thanks, Alex. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c old mode 100644 new mode 100755 index fcc8c21..4e1049a --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -5165,6 +5165,135 @@ static int tree_compare_item(struct btrfs_root *left_root, #define ADVANCE 1 #define ADVANCE_ONLY_NEXT -1 +static /*const*/ struct btrfs_key s_key_tree_end_reached = { + .objectid = (u64)-1, + .type = (u8)-1, + .offset = (u64)-1 +}; + +/* + * Generate a tree comparison checkpoint, suitable for being sent over network, + * written to disk etc. + * Returns the same 'cp' pointer that was passed. + */ +struct btrfs_compare_trees_checkpoint* +btrfs_compare_trees_gen_checkpoint(struct btrfs_compare_trees_checkpoint *cp, + struct btrfs_key *left_key, int left_end_reached, + struct btrfs_key *right_key, int right_end_reached) +{ + if (left_end_reached) { + cp->left_key__objectid = cpu_to_le64(s_key_tree_end_reached.objectid); + cp->left_key__type = s_key_tree_end_reached.type; + cp->left_key__offset = cpu_to_le64(s_key_tree_end_reached.offset); + } else { + cp->left_key__objectid = cpu_to_le64(left_key->objectid); + cp->left_key__type = left_key->type; + cp->left_key__offset = cpu_to_le64(left_key->offset); + } + + if (right_end_reached) { + cp->right_key__objectid = cpu_to_le64(s_key_tree_end_reached.objectid); + cp->right_key__type = s_key_tree_end_reached.type; + cp->right_key__offset = cpu_to_le64(s_key_tree_end_reached.offset); + } else { + cp->right_key__objectid = cpu_to_le64(right_key->objectid); + cp->right_key__type = right_key->type; + cp->right_key__offset = cpu_to_le64(right_key->offset); + } + + return cp; +} + +static int btrfs_compare_trees_rearm_to_tree_end(struct btrfs_root *root, + /* Output goes here */ + struct btrfs_path *path, + struct btrfs_key *key, int *level, int *end_reached) +{ + int ret = 0; + struct btrfs_key search_key = s_key_tree_end_reached; + + ret = btrfs_search_slot(NULL, root, &search_key, path, 0/*ins_len*/, 0/*cow*/); + if (ret > 0) { + *key = s_key_tree_end_reached; /* doesn't really matter in this case */ + *level = btrfs_header_level(root->commit_root); /* doesn't really matter in this case */ + *end_reached = ADVANCE; + ret = 0; + } else { + ret = -EILSEQ; + } + + return ret; +} + +static int btrfs_compare_trees_rearm_to_item(struct btrfs_root *root, struct btrfs_key *rearm_key, + /* Output goes here */ + struct btrfs_path *path, + struct btrfs_key *key, int *level, int *end_reached) +{ + int ret = 0; + struct btrfs_key search_key = *rearm_key; + + ret = btrfs_search_slot(NULL, root, &search_key, path, 0/*ins_len*/, 0/*cow*/); + if (ret != 0) { + /* We should be able to find the key */ + ret = -EILSEQ; + } else { + *key = *rearm_key; + *level = 0; /* We found the item, so we're on level 0 */ + *end_reached = 0; + } + + return ret; +} + +static int btrfs_compare_trees_rearm_cp(struct btrfs_compare_trees_checkpoint *cp, + struct btrfs_root *left_root, struct btrfs_root *right_root, + /* Output goes here */ + struct btrfs_path *left_path, struct btrfs_path *right_path, + struct btrfs_key *left_key, int *left_level, int *left_end_reached, + struct btrfs_key *right_key, int *right_level, int *right_end_reached) +{ + int ret = 0; + struct btrfs_key rearm_key; + + /* rearm left tree */ + rearm_key.objectid = le64_to_cpu(cp->left_key__objectid); + rearm_key.type = cp->left_key__type; + rearm_key.offset = le64_to_cpu(cp->left_key__offset); + if (btrfs_comp_cpu_keys(&rearm_key, &s_key_tree_end_reached) == 0) { + ret = btrfs_compare_trees_rearm_to_tree_end(left_root, + /* output */ + left_path, + left_key, left_level, left_end_reached); + } else { + ret = btrfs_compare_trees_rearm_to_item(left_root, &rearm_key, + /* output */ + left_path, + left_key, left_level, left_end_reached); + } + if (ret) + goto out; + + /* rearm right tree */ + rearm_key.objectid = le64_to_cpu(cp->right_key__objectid); + rearm_key.type = cp->right_key__type; + rearm_key.offset = le64_to_cpu(cp->right_key__offset); + if (btrfs_comp_cpu_keys(&rearm_key, &s_key_tree_end_reached) == 0) { + ret = btrfs_compare_trees_rearm_to_tree_end(right_root, + /* output */ + right_path, + right_key, right_level, right_end_reached); + } else { + ret = btrfs_compare_trees_rearm_to_item(right_root, &rearm_key, + /* output */ + right_path, + right_key, right_level, right_end_reached); + } + +out: + return ret; +} + /* * This function compares two trees and calls the provided callback for * every changed/new/deleted item it finds. @@ -5180,6 +5309,7 @@ static int tree_compare_item(struct btrfs_root *left_root, */ int btrfs_compare_trees(struct btrfs_root *left_root, struct btrfs_root *right_root, + struct btrfs_compare_trees_checkpoint *tree_cp, btrfs_changed_cb_t changed_cb, void *ctx) { int ret; @@ -5203,6 +5333,7 @@ int btrfs_compare_trees(struct btrfs_root *left_root, u64 left_start_ctransid; u64 right_start_ctransid; u64 ctransid; + struct btrfs_compare_trees_checkpoint curr_cp; left_path = btrfs_alloc_path(); if (!left_path) { @@ -5277,30 +5408,44 @@ int btrfs_compare_trees(struct btrfs_root *left_root, * the right if possible or go up and right. */ - left_level = btrfs_header_level(left_root->commit_root); - left_root_level = left_level; - left_path->nodes[left_level] = left_root->commit_root; - extent_buffer_get(left_path->nodes[left_level]); + if (tree_cp == NULL) { + left_level = btrfs_header_level(left_root->commit_root); + left_root_level = left_level; + left_path->nodes[left_level] = left_root->commit_root; + extent_buffer_get(left_path->nodes[left_level]); - right_level = btrfs_header_level(right_root->commit_root); - right_root_level = right_level; - right_path->nodes[right_level] = right_root->commit_root; - extent_buffer_get(right_path->nodes[right_level]); + right_level = btrfs_header_level(right_root->commit_root); + right_root_level = right_level; + right_path->nodes[right_level] = right_root->commit_root; + extent_buffer_get(right_path->nodes[right_level]); - if (left_level == 0) - btrfs_item_key_to_cpu(left_path->nodes[left_level], - &left_key, left_path->slots[left_level]); - else - btrfs_node_key_to_cpu(left_path->nodes[left_level], - &left_key, left_path->slots[left_level]); - if (right_level == 0) - btrfs_item_key_to_cpu(right_path->nodes[right_level], - &right_key, right_path->slots[right_level]); - else - btrfs_node_key_to_cpu(right_path->nodes[right_level], - &right_key, right_path->slots[right_level]); + if (left_level == 0) + btrfs_item_key_to_cpu(left_path->nodes[left_level], + &left_key, left_path->slots[left_level]); + else + btrfs_node_key_to_cpu(left_path->nodes[left_level], + &left_key, left_path->slots[left_level]); + if (right_level == 0) + btrfs_item_key_to_cpu(right_path->nodes[right_level], + &right_key, right_path->slots[right_level]); + else + btrfs_node_key_to_cpu(right_path->nodes[right_level], + &right_key, right_path->slots[right_level]); - left_end_reached = right_end_reached = 0; + left_end_reached = right_end_reached = 0; + } else { + ret = btrfs_compare_trees_rearm_cp(tree_cp, + left_root, right_root, + /* output */ + left_path, right_path, + &left_key, &left_level, &left_end_reached, + &right_key, &right_level, &right_end_reached); + if (ret) + goto out; + + left_root_level = btrfs_header_level(left_root->commit_root); + right_root_level = btrfs_header_level(right_root->commit_root); + } advance_left = advance_right = 0; while (1) { @@ -5384,6 +5529,8 @@ int btrfs_compare_trees(struct btrfs_root *left_root, advance_right = 0; } + BUG_ON(!(advance_left == 0 && advance_right == 0)); + if (left_end_reached && right_end_reached) { ret = 0; goto out; @@ -5393,6 +5540,9 @@ int btrfs_compare_trees(struct btrfs_root *left_root, left_path, right_path, &right_key, BTRFS_COMPARE_TREE_DELETED, + btrfs_compare_trees_gen_checkpoint(&curr_cp, + NULL/*left_key*/, ADVANCE/*left_end_reached*/, + &right_key, 0/*right_end_reached*/), ctx); if (ret < 0) goto out; @@ -5405,6 +5555,9 @@ int btrfs_compare_trees(struct btrfs_root *left_root, left_path, right_path, &left_key, BTRFS_COMPARE_TREE_NEW, + btrfs_compare_trees_gen_checkpoint(&curr_cp, + &left_key, 0/*left_end_reached*/, + NULL/*right_key*/, ADVANCE/*right_end_reached*/), ctx); if (ret < 0) goto out; @@ -5414,12 +5567,18 @@ int btrfs_compare_trees(struct btrfs_root *left_root, } if (left_level == 0 && right_level == 0) { + /* All the cases here will have the same checkpoint */ + btrfs_compare_trees_gen_checkpoint(&curr_cp, + &left_key, 0/*left_end_reached*/, + &right_key, 0/*right_end_reached*/); + cmp = btrfs_comp_cpu_keys(&left_key, &right_key); if (cmp < 0) { ret = changed_cb(left_root, right_root, left_path, right_path, &left_key, BTRFS_COMPARE_TREE_NEW, + &curr_cp, ctx); if (ret < 0) goto out; @@ -5429,6 +5588,7 @@ int btrfs_compare_trees(struct btrfs_root *left_root, left_path, right_path, &right_key, BTRFS_COMPARE_TREE_DELETED, + &curr_cp, ctx); if (ret < 0) goto out; @@ -5443,6 +5603,7 @@ int btrfs_compare_trees(struct btrfs_root *left_root, left_path, right_path, &left_key, BTRFS_COMPARE_TREE_CHANGED, + &curr_cp, ctx); if (ret < 0) goto out; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h old mode 100644 new mode 100755 index c38734a..a1e4282 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2945,9 +2945,15 @@ typedef int (*btrfs_changed_cb_t)(struct btrfs_root *left_root, struct btrfs_path *right_path, struct btrfs_key *key, enum btrfs_compare_tree_result result, + struct btrfs_compare_trees_checkpoint *tree_cp, void *ctx); +struct btrfs_compare_trees_checkpoint* +btrfs_compare_trees_gen_checkpoint(struct btrfs_compare_trees_checkpoint *cp, + struct btrfs_key *left_key, int left_end_reached, + struct btrfs_key *right_key, int right_end_reached); int btrfs_compare_trees(struct btrfs_root *left_root, struct btrfs_root *right_root, + struct btrfs_compare_trees_checkpoint *tree_cp, btrfs_changed_cb_t cb, void *ctx); int btrfs_cow_block(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h old mode 100644 new mode 100755 index 731e287..60f32bb --- a/fs/btrfs/ioctl.h +++ b/fs/btrfs/ioctl.h @@ -363,11 +363,80 @@ struct btrfs_ioctl_received_subvol_args { __u64 reserved[16]; /* in */ }; + /* + * This is a tree comparison checkpoint. + * This information is enough to rearm the tree comparison process + * to restart from the point where it was interrupted. + * + * This struct needs to be in strict endian and alignment, because it is sent over network. + */ +struct btrfs_compare_trees_checkpoint { + __le64 left_key__objectid; + __u8 left_key__type; + __le64 left_key__offset; + + __le64 right_key__objectid; + __u8 right_key__type; + __le64 right_key__offset; +} __attribute__ ((__packed__)); + + /* + * This structure is used to restart the "send" stream + * from some previously checkpointed position. + * It needs strict endian and alignment, because it is sent over network. + */ +struct btrfs_send_checkpoint { + __le32 cp_size_bytes; + __le32 version; + + /* Tree comparison checkpoint */ + struct btrfs_compare_trees_checkpoint tree_cmp_cp; + + /* Send context checkpoint */ + __le64 cur_ino; + __le64 cur_inode_gen; + __u8 cur_inode_new; + __u8 cur_inode_new_gen; + __u8 cur_inode_deleted; + __le64 cur_inode_size; + __le64 cur_inode_mode; + __le64 send_progress; + + /* + * Kernel cannot produce a checkpoint between any two commands, so + * this allows us to skip the required number of commands since + * the previous checkpoint. + * Also, if the last command was a large WRITE command, but we can send + * only a part of it, and later send the remainder. + */ + __le32 n_cmds_since_cp; + __le32 offset_in_write_cmd_bytes; + +} __attribute__ ((__packed__)); + +#define BTRFS_SEND_CHECKPOINT_VERSION 1 + +#define BTRFS_HAS_TREE_CMP_CHECKPOINT(tree_cp) \ + ((le64_to_cpu((tree_cp)->left_key__objectid) != 0) && \ + (le64_to_cpu((tree_cp)->right_key__objectid) != 0)) + +#define BTRFS_HAS_SEND_CHECKPOINT(cp) BTRFS_HAS_TREE_CMP_CHECKPOINT(&((cp)->tree_cmp_cp)) + struct btrfs_ioctl_send_args { - __s64 send_fd; /* in */ + __u8 __user *send_buffer; /* in/out: this buffer will be filled with commands by kernel */ + __u32 send_buffer_size_bytes; /* in: the size of send_buffer */ + /* out: the amount of valid data on successful completion */ + __u64 clone_sources_count; /* in */ __u64 __user *clone_sources; /* in */ __u64 parent_root; /* in */ + + struct btrfs_send_checkpoint __user *in_cp; /* in: checkpoint, kernel must check the version and size */ + __u32 in_cp_size_bytes; /* in: the size of the checkpoint buffer */ + + struct btrfs_send_checkpoint out_cp; /* out: the checkpoint for the next batch of commands */ + __u8 end_of_data; /* out: whether we have pulled the last batch */ + __u64 flags; /* in */ __u64 reserved[4]; /* in */ }; diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c old mode 100644 new mode 100755 index 760390f..b4bf56c --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -77,14 +77,30 @@ struct clone_root { #define SEND_CTX_NAME_CACHE_CLEAN_SIZE (SEND_CTX_MAX_NAME_CACHE_SIZE * 2) struct send_ctx { - struct file *send_filp; - loff_t send_off; char *send_buf; u32 send_size; u32 send_max_size; u64 total_send_size; u64 cmd_send_size[BTRFS_SEND_C_MAX + 1]; + /* How many commands/bytes we need to skip, before we can start sending new commands */ + u32 n_cmds_to_skip; + u32 offset_in_write_cmd_bytes_to_skip; + + /* Our current checkpoint */ + struct btrfs_send_checkpoint curr_cp; + + /* User buffer to be filled with data and how much data there already is */ + u8 __user *out_buffer; + u32 out_buffer_size_bytes; + u32 out_buffer_data_bytes; + + /* + * We set this to 1, when we cannot put more data into the user buffer, + * and we return -EOVERFLOW. + */ + u8 out_buffer_full; + struct vfsmount *mnt;