Message ID | 20200325225505.23998-12-parri.andrea@gmail.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | None | expand |
From: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Sent: Wednesday, March 25, 2020 3:55 PM > > For each storvsc_device, storvsc keeps track of the channel target CPUs > associated to the device (alloced_cpus) and it uses this information to > fill a "cache" (stor_chns) mapping CPU->channel according to a certain > heuristic. Update the alloced_cpus mask and the stor_chns array when a > channel of the storvsc device is re-assigned to a different CPU. > > Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> > Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> > Cc: "Martin K. Petersen" <martin.petersen@oracle.com> > Cc: <linux-scsi@vger.kernel.org> > --- > drivers/hv/vmbus_drv.c | 4 ++ > drivers/scsi/storvsc_drv.c | 95 ++++++++++++++++++++++++++++++++++---- > include/linux/hyperv.h | 3 ++ > 3 files changed, 94 insertions(+), 8 deletions(-) > > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c > index 84d2f22c569aa..7199fee2b5869 100644 > --- a/drivers/hv/vmbus_drv.c > +++ b/drivers/hv/vmbus_drv.c > @@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel, > * in on a CPU that is different from the channel target_cpu value. > */ > > + if (channel->change_target_cpu_callback) > + (*channel->change_target_cpu_callback)(channel, > + channel->target_cpu, target_cpu); > + > channel->target_cpu = target_cpu; > channel->target_vp = hv_cpu_number_to_vp_number(target_cpu); > channel->numa_node = cpu_to_node(target_cpu); I think there's an ordering problem here. The change_target_cpu_callback will allow storvsc to flush the cache that it is keeping, but there's a window after the storvsc callback releases the spin lock and before this function changes channel->target_cpu to the new value. In that window, the cache could get refilled based on the old value of channel->target_cpu, which is exactly what we don't want. Generally with caches, you have to set the new value first, then flush the cache, and I think that works in this case. The callback function doesn't depend on the value of channel->target_cpu, and any cache filling that might happen after channel->target_cpu is set to the new value but before the callback function runs is OK. But please double-check my thinking. :-) > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > index fb41636519ee8..a680592b9d32a 100644 > --- a/drivers/scsi/storvsc_drv.c > +++ b/drivers/scsi/storvsc_drv.c > @@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device( > > } > > +void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new) > +{ > + struct storvsc_device *stor_device; > + struct vmbus_channel *cur_chn; > + bool old_is_alloced = false; > + struct hv_device *device; > + unsigned long flags; > + int cpu; > + > + device = channel->primary_channel ? > + channel->primary_channel->device_obj > + : channel->device_obj; > + stor_device = get_out_stor_device(device); > + if (!stor_device) > + return; > + > + /* See storvsc_do_io() -> get_og_chn(). */ > + spin_lock_irqsave(&device->channel->lock, flags); > + > + /* > + * Determines if the storvsc device has other channels assigned to > + * the "old" CPU to update the alloced_cpus mask and the stor_chns > + * array. > + */ > + if (device->channel != channel && device->channel->target_cpu == old) { > + cur_chn = device->channel; > + old_is_alloced = true; > + goto old_is_alloced; > + } > + list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) { > + if (cur_chn == channel) > + continue; > + if (cur_chn->target_cpu == old) { > + old_is_alloced = true; > + goto old_is_alloced; > + } > + } > + > +old_is_alloced: > + if (old_is_alloced) > + WRITE_ONCE(stor_device->stor_chns[old], cur_chn); > + else > + cpumask_clear_cpu(old, &stor_device->alloced_cpus); I think target_cpu_store() can get called in parallel on multiple CPUs for different channels on the same storvsc device, but multiple changes to a single channel are serialized by higher levels of sysfs. So this function could run after multiple channels have been changed, in which case there's not just a single "old" value, and the above algorithm might not work, especially if channel->target_cpu is updated before calling this function per my earlier comment. I can see a couple of possible ways to deal with this. One is to put the update of channel->target_cpu in this function, within the spin lock boundaries so that the cache flush and target_cpu update are atomic. Another idea is to process multiple changes in this function, by building a temp copy of alloced_cpus by walking the channel list, use XOR to create a cpumask with changes, and then process all the changes in a loop instead of just handling a single change as with the current code at the old_is_alloced label. But I haven't completely thought through this idea. > + > + /* "Flush" the stor_chns array. */ > + for_each_possible_cpu(cpu) { > + if (stor_device->stor_chns[cpu] && !cpumask_test_cpu( > + cpu, &stor_device->alloced_cpus)) > + WRITE_ONCE(stor_device->stor_chns[cpu], NULL); > + } > + > + WRITE_ONCE(stor_device->stor_chns[new], channel); > + cpumask_set_cpu(new, &stor_device->alloced_cpus); > + > + spin_unlock_irqrestore(&device->channel->lock, flags); > +} > + > static void handle_sc_creation(struct vmbus_channel *new_sc) > { > struct hv_device *device = new_sc->primary_channel->device_obj; > @@ -648,6 +705,8 @@ static void handle_sc_creation(struct vmbus_channel *new_sc) > return; > } > > + new_sc->change_target_cpu_callback = storvsc_change_target_cpu; > + > /* Add the sub-channel to the array of available channels. */ > stor_device->stor_chns[new_sc->target_cpu] = new_sc; > cpumask_set_cpu(new_sc->target_cpu, &stor_device->alloced_cpus); > @@ -876,6 +935,8 @@ static int storvsc_channel_init(struct hv_device *device, bool is_fc) > if (stor_device->stor_chns == NULL) > return -ENOMEM; > > + device->channel->change_target_cpu_callback = storvsc_change_target_cpu; > + > stor_device->stor_chns[device->channel->target_cpu] = device->channel; > cpumask_set_cpu(device->channel->target_cpu, > &stor_device->alloced_cpus); > @@ -1248,8 +1309,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device > *stor_device, > const struct cpumask *node_mask; > int num_channels, tgt_cpu; > > - if (stor_device->num_sc == 0) > + if (stor_device->num_sc == 0) { > + stor_device->stor_chns[q_num] = stor_device->device->channel; > return stor_device->device->channel; > + } > > /* > * Our channel array is sparsley populated and we > @@ -1258,7 +1321,6 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device > *stor_device, > * The strategy is simple: > * I. Ensure NUMA locality > * II. Distribute evenly (best effort) > - * III. Mapping is persistent. > */ > > node_mask = cpumask_of_node(cpu_to_node(q_num)); > @@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device > *stor_device, > if (cpumask_test_cpu(tgt_cpu, node_mask)) > num_channels++; > } > - if (num_channels == 0) > + if (num_channels == 0) { > + stor_device->stor_chns[q_num] = stor_device->device->channel; Is the above added line just fixing a bug in the existing code? I'm not seeing how it would derive from the other changes in this patch. > return stor_device->device->channel; > + } > > hash_qnum = q_num; > while (hash_qnum >= num_channels) > @@ -1295,6 +1359,7 @@ static int storvsc_do_io(struct hv_device *device, > struct storvsc_device *stor_device; > struct vstor_packet *vstor_packet; > struct vmbus_channel *outgoing_channel, *channel; > + unsigned long flags; > int ret = 0; > const struct cpumask *node_mask; > int tgt_cpu; > @@ -1308,10 +1373,11 @@ static int storvsc_do_io(struct hv_device *device, > > request->device = device; > /* > - * Select an an appropriate channel to send the request out. > + * Select an appropriate channel to send the request out. > */ > - if (stor_device->stor_chns[q_num] != NULL) { > - outgoing_channel = stor_device->stor_chns[q_num]; > + /* See storvsc_change_target_cpu(). */ > + outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]); > + if (outgoing_channel != NULL) { > if (outgoing_channel->target_cpu == q_num) { > /* > * Ideally, we want to pick a different channel if > @@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device, > continue; > if (tgt_cpu == q_num) > continue; > - channel = stor_device->stor_chns[tgt_cpu]; > + channel = READ_ONCE( > + stor_device->stor_chns[tgt_cpu]); > + if (channel == NULL) > + continue; The channel == NULL case is new because a cache flush could be happening in parallel on another CPU. I'm wondering about the tradeoffs of continuing in the loop (as you have coded in this patch) vs. a "goto" back to the top level "if" statement. With the "continue" you might finish the loop without finding any matches, and fall through to the next approach. But it's only a single I/O operation, and if it comes up with a less than optimal channel choice, it's no big deal. So I guess it's really a wash. > if (hv_get_avail_to_write_percent( > &channel->outbound) > > ring_avail_percent_lowater) { > @@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device, > for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) { > if (cpumask_test_cpu(tgt_cpu, node_mask)) > continue; > - channel = stor_device->stor_chns[tgt_cpu]; > + channel = READ_ONCE( > + stor_device->stor_chns[tgt_cpu]); > + if (channel == NULL) > + continue; Same comment here. > if (hv_get_avail_to_write_percent( > &channel->outbound) > > ring_avail_percent_lowater) { > @@ -1360,7 +1432,14 @@ static int storvsc_do_io(struct hv_device *device, > } > } > } else { > + spin_lock_irqsave(&device->channel->lock, flags); > + outgoing_channel = stor_device->stor_chns[q_num]; > + if (outgoing_channel != NULL) { > + spin_unlock_irqrestore(&device->channel->lock, flags); > + goto found_channel; > + } > outgoing_channel = get_og_chn(stor_device, q_num); > + spin_unlock_irqrestore(&device->channel->lock, flags); > } > > found_channel: > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h > index edfcd42319ef3..9018b89614b78 100644 > --- a/include/linux/hyperv.h > +++ b/include/linux/hyperv.h > @@ -773,6 +773,9 @@ struct vmbus_channel { > void (*onchannel_callback)(void *context); > void *channel_callback_context; > > + void (*change_target_cpu_callback)(struct vmbus_channel *channel, > + u32 old, u32 new); > + > /* > * Synchronize channel scheduling and channel removal; see the inline > * comments in vmbus_chan_sched() and vmbus_reset_channel_cb(). > -- > 2.24.0
> > @@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel, > > * in on a CPU that is different from the channel target_cpu value. > > */ > > > > + if (channel->change_target_cpu_callback) > > + (*channel->change_target_cpu_callback)(channel, > > + channel->target_cpu, target_cpu); > > + > > channel->target_cpu = target_cpu; > > channel->target_vp = hv_cpu_number_to_vp_number(target_cpu); > > channel->numa_node = cpu_to_node(target_cpu); > > I think there's an ordering problem here. The change_target_cpu_callback > will allow storvsc to flush the cache that it is keeping, but there's a window > after the storvsc callback releases the spin lock and before this function > changes channel->target_cpu to the new value. In that window, the cache > could get refilled based on the old value of channel->target_cpu, which is > exactly what we don't want. Generally with caches, you have to set the new > value first, then flush the cache, and I think that works in this case. The > callback function doesn't depend on the value of channel->target_cpu, > and any cache filling that might happen after channel->target_cpu is set > to the new value but before the callback function runs is OK. But please > double-check my thinking. :-) Sorry, I don't see the problem. AFAICT, the "cache" gets refilled based on the values of alloced_cpus and on the current state of the cache but not based on the value of channel->target_cpu. The callback invocation uses the value of the "old" target_cpu; I think I ended up placing the callback call where it is for not having to introduce a local variable "old_cpu". ;-) > > @@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device( > > > > } > > > > +void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new) > > +{ > > + struct storvsc_device *stor_device; > > + struct vmbus_channel *cur_chn; > > + bool old_is_alloced = false; > > + struct hv_device *device; > > + unsigned long flags; > > + int cpu; > > + > > + device = channel->primary_channel ? > > + channel->primary_channel->device_obj > > + : channel->device_obj; > > + stor_device = get_out_stor_device(device); > > + if (!stor_device) > > + return; > > + > > + /* See storvsc_do_io() -> get_og_chn(). */ > > + spin_lock_irqsave(&device->channel->lock, flags); > > + > > + /* > > + * Determines if the storvsc device has other channels assigned to > > + * the "old" CPU to update the alloced_cpus mask and the stor_chns > > + * array. > > + */ > > + if (device->channel != channel && device->channel->target_cpu == old) { > > + cur_chn = device->channel; > > + old_is_alloced = true; > > + goto old_is_alloced; > > + } > > + list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) { > > + if (cur_chn == channel) > > + continue; > > + if (cur_chn->target_cpu == old) { > > + old_is_alloced = true; > > + goto old_is_alloced; > > + } > > + } > > + > > +old_is_alloced: > > + if (old_is_alloced) > > + WRITE_ONCE(stor_device->stor_chns[old], cur_chn); > > + else > > + cpumask_clear_cpu(old, &stor_device->alloced_cpus); > > I think target_cpu_store() can get called in parallel on multiple CPUs for different > channels on the same storvsc device, but multiple changes to a single channel are > serialized by higher levels of sysfs. So this function could run after multiple > channels have been changed, in which case there's not just a single "old" value, > and the above algorithm might not work, especially if channel->target_cpu is > updated before calling this function per my earlier comment. I can see a > couple of possible ways to deal with this. One is to put the update of > channel->target_cpu in this function, within the spin lock boundaries so > that the cache flush and target_cpu update are atomic. Another idea is to > process multiple changes in this function, by building a temp copy of > alloced_cpus by walking the channel list, use XOR to create a cpumask > with changes, and then process all the changes in a loop instead of > just handling a single change as with the current code at the old_is_alloced > label. But I haven't completely thought through this idea. Same here: the invocations of target_cpu_store() are serialized on the per-connection channel_mutex... > > @@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device > > *stor_device, > > if (cpumask_test_cpu(tgt_cpu, node_mask)) > > num_channels++; > > } > > - if (num_channels == 0) > > + if (num_channels == 0) { > > + stor_device->stor_chns[q_num] = stor_device->device->channel; > > Is the above added line just fixing a bug in the existing code? I'm not seeing how > it would derive from the other changes in this patch. It was rather intended as an optimization: Each time I/O for a device is initiated on a CPU that have "num_channels == 0" channel, the current code ends up calling get_og_chn() (in the attempt to fill the cache) and returns the device's primary channel. In the current code, the cost of this operations is basically the cost of parsing alloced_cpus, but with the changes introduced here this also involves acquiring (and releasing) the primary channel's lock. I should probably put my hands forward and say that I haven't observed any measurable effects due this addition in my experiments; OTOH, caching the returned/"found" value made sense... > > @@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device, > > continue; > > if (tgt_cpu == q_num) > > continue; > > - channel = stor_device->stor_chns[tgt_cpu]; > > + channel = READ_ONCE( > > + stor_device->stor_chns[tgt_cpu]); > > + if (channel == NULL) > > + continue; > > The channel == NULL case is new because a cache flush could be happening > in parallel on another CPU. I'm wondering about the tradeoffs of > continuing in the loop (as you have coded in this patch) vs. a "goto" back to > the top level "if" statement. With the "continue" you might finish the > loop without finding any matches, and fall through to the next approach. > But it's only a single I/O operation, and if it comes up with a less than > optimal channel choice, it's no big deal. So I guess it's really a wash. Yes, I considered both approaches; they both "worked" here. I was a bit concerned about the number of "possible" gotos (again, mainly a theoretical issue, since I can imagine that the cash flushes will be relatively "rare" events in most cases and, in any case, they happen to be serialized); the "continue" looked like a suitable and simpler approach/compromise, at least for the time being. > > > if (hv_get_avail_to_write_percent( > > &channel->outbound) > > > ring_avail_percent_lowater) { > > @@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device, > > for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) { > > if (cpumask_test_cpu(tgt_cpu, node_mask)) > > continue; > > - channel = stor_device->stor_chns[tgt_cpu]; > > + channel = READ_ONCE( > > + stor_device->stor_chns[tgt_cpu]); > > + if (channel == NULL) > > + continue; > > Same comment here. Similarly here. Thoughts? Thanks, Andrea
From: Andrea Parri <parri.andrea@gmail.com> Sent: Monday, March 30, 2020 11:55 AM > > > > @@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel > *channel, > > > * in on a CPU that is different from the channel target_cpu value. > > > */ > > > > > > + if (channel->change_target_cpu_callback) > > > + (*channel->change_target_cpu_callback)(channel, > > > + channel->target_cpu, target_cpu); > > > + > > > channel->target_cpu = target_cpu; > > > channel->target_vp = hv_cpu_number_to_vp_number(target_cpu); > > > channel->numa_node = cpu_to_node(target_cpu); > > > > I think there's an ordering problem here. The change_target_cpu_callback > > will allow storvsc to flush the cache that it is keeping, but there's a window > > after the storvsc callback releases the spin lock and before this function > > changes channel->target_cpu to the new value. In that window, the cache > > could get refilled based on the old value of channel->target_cpu, which is > > exactly what we don't want. Generally with caches, you have to set the new > > value first, then flush the cache, and I think that works in this case. The > > callback function doesn't depend on the value of channel->target_cpu, > > and any cache filling that might happen after channel->target_cpu is set > > to the new value but before the callback function runs is OK. But please > > double-check my thinking. :-) > > Sorry, I don't see the problem. AFAICT, the "cache" gets refilled based > on the values of alloced_cpus and on the current state of the cache but > not based on the value of channel->target_cpu. The callback invocation > uses the value of the "old" target_cpu; I think I ended up placing the > callback call where it is for not having to introduce a local variable > "old_cpu". ;-) > You are right. My comment is bogus. > > > > @@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device( > > > > > > } > > > > > > +void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new) > > > +{ > > > + struct storvsc_device *stor_device; > > > + struct vmbus_channel *cur_chn; > > > + bool old_is_alloced = false; > > > + struct hv_device *device; > > > + unsigned long flags; > > > + int cpu; > > > + > > > + device = channel->primary_channel ? > > > + channel->primary_channel->device_obj > > > + : channel->device_obj; > > > + stor_device = get_out_stor_device(device); > > > + if (!stor_device) > > > + return; > > > + > > > + /* See storvsc_do_io() -> get_og_chn(). */ > > > + spin_lock_irqsave(&device->channel->lock, flags); > > > + > > > + /* > > > + * Determines if the storvsc device has other channels assigned to > > > + * the "old" CPU to update the alloced_cpus mask and the stor_chns > > > + * array. > > > + */ > > > + if (device->channel != channel && device->channel->target_cpu == old) { > > > + cur_chn = device->channel; > > > + old_is_alloced = true; > > > + goto old_is_alloced; > > > + } > > > + list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) { > > > + if (cur_chn == channel) > > > + continue; > > > + if (cur_chn->target_cpu == old) { > > > + old_is_alloced = true; > > > + goto old_is_alloced; > > > + } > > > + } > > > + > > > +old_is_alloced: > > > + if (old_is_alloced) > > > + WRITE_ONCE(stor_device->stor_chns[old], cur_chn); > > > + else > > > + cpumask_clear_cpu(old, &stor_device->alloced_cpus); > > > > I think target_cpu_store() can get called in parallel on multiple CPUs for different > > channels on the same storvsc device, but multiple changes to a single channel are > > serialized by higher levels of sysfs. So this function could run after multiple > > channels have been changed, in which case there's not just a single "old" value, > > and the above algorithm might not work, especially if channel->target_cpu is > > updated before calling this function per my earlier comment. I can see a > > couple of possible ways to deal with this. One is to put the update of > > channel->target_cpu in this function, within the spin lock boundaries so > > that the cache flush and target_cpu update are atomic. Another idea is to > > process multiple changes in this function, by building a temp copy of > > alloced_cpus by walking the channel list, use XOR to create a cpumask > > with changes, and then process all the changes in a loop instead of > > just handling a single change as with the current code at the old_is_alloced > > label. But I haven't completely thought through this idea. > > Same here: the invocations of target_cpu_store() are serialized on the > per-connection channel_mutex... Agreed. My comment is not valid. > > > > > @@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct > storvsc_device > > > *stor_device, > > > if (cpumask_test_cpu(tgt_cpu, node_mask)) > > > num_channels++; > > > } > > > - if (num_channels == 0) > > > + if (num_channels == 0) { > > > + stor_device->stor_chns[q_num] = stor_device->device->channel; > > > > Is the above added line just fixing a bug in the existing code? I'm not seeing how > > it would derive from the other changes in this patch. > > It was rather intended as an optimization: Each time I/O for a device > is initiated on a CPU that have "num_channels == 0" channel, the current > code ends up calling get_og_chn() (in the attempt to fill the cache) and > returns the device's primary channel. In the current code, the cost of > this operations is basically the cost of parsing alloced_cpus, but with > the changes introduced here this also involves acquiring (and releasing) > the primary channel's lock. I should probably put my hands forward and > say that I haven't observed any measurable effects due this addition in > my experiments; OTOH, caching the returned/"found" value made sense... OK. That's what I thought. The existing code does not produce an incorrect result, but the cache isn't working as intended. This fixes it. > > > > > @@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device, > > > continue; > > > if (tgt_cpu == q_num) > > > continue; > > > - channel = stor_device->stor_chns[tgt_cpu]; > > > + channel = READ_ONCE( > > > + stor_device->stor_chns[tgt_cpu]); > > > + if (channel == NULL) > > > + continue; > > > > The channel == NULL case is new because a cache flush could be happening > > in parallel on another CPU. I'm wondering about the tradeoffs of > > continuing in the loop (as you have coded in this patch) vs. a "goto" back to > > the top level "if" statement. With the "continue" you might finish the > > loop without finding any matches, and fall through to the next approach. > > But it's only a single I/O operation, and if it comes up with a less than > > optimal channel choice, it's no big deal. So I guess it's really a wash. > > Yes, I considered both approaches; they both "worked" here. I was a > bit concerned about the number of "possible" gotos (again, mainly a > theoretical issue, since I can imagine that the cash flushes will be > relatively "rare" events in most cases and, in any case, they happen > to be serialized); the "continue" looked like a suitable and simpler > approach/compromise, at least for the time being. Yes, I'm OK with your patch "as is". I was just thinking about the alternative, and evidently you did too. > > > > > > > if (hv_get_avail_to_write_percent( > > > &channel->outbound) > > > > ring_avail_percent_lowater) { > > > @@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device, > > > for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) { > > > if (cpumask_test_cpu(tgt_cpu, node_mask)) > > > continue; > > > - channel = stor_device->stor_chns[tgt_cpu]; > > > + channel = READ_ONCE( > > > + stor_device->stor_chns[tgt_cpu]); > > > + if (channel == NULL) > > > + continue; > > > > Same comment here. > > Similarly here. Agreed. > > Thoughts? > > Thanks, > Andrea
On Mon, Mar 30, 2020 at 07:49:00PM +0000, Michael Kelley wrote: > From: Andrea Parri <parri.andrea@gmail.com> Sent: Monday, March 30, 2020 11:55 AM > > > > > > @@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel > > *channel, > > > > * in on a CPU that is different from the channel target_cpu value. > > > > */ > > > > > > > > + if (channel->change_target_cpu_callback) > > > > + (*channel->change_target_cpu_callback)(channel, > > > > + channel->target_cpu, target_cpu); > > > > + > > > > channel->target_cpu = target_cpu; > > > > channel->target_vp = hv_cpu_number_to_vp_number(target_cpu); > > > > channel->numa_node = cpu_to_node(target_cpu); > > > > > > I think there's an ordering problem here. The change_target_cpu_callback > > > will allow storvsc to flush the cache that it is keeping, but there's a window > > > after the storvsc callback releases the spin lock and before this function > > > changes channel->target_cpu to the new value. In that window, the cache > > > could get refilled based on the old value of channel->target_cpu, which is > > > exactly what we don't want. Generally with caches, you have to set the new > > > value first, then flush the cache, and I think that works in this case. The > > > callback function doesn't depend on the value of channel->target_cpu, > > > and any cache filling that might happen after channel->target_cpu is set > > > to the new value but before the callback function runs is OK. But please > > > double-check my thinking. :-) > > > > Sorry, I don't see the problem. AFAICT, the "cache" gets refilled based > > on the values of alloced_cpus and on the current state of the cache but > > not based on the value of channel->target_cpu. The callback invocation > > uses the value of the "old" target_cpu; I think I ended up placing the > > callback call where it is for not having to introduce a local variable > > "old_cpu". ;-) > > > > You are right. My comment is bogus. > > > > > > > @@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device( > > > > > > > > } > > > > > > > > +void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new) > > > > +{ > > > > + struct storvsc_device *stor_device; > > > > + struct vmbus_channel *cur_chn; > > > > + bool old_is_alloced = false; > > > > + struct hv_device *device; > > > > + unsigned long flags; > > > > + int cpu; > > > > + > > > > + device = channel->primary_channel ? > > > > + channel->primary_channel->device_obj > > > > + : channel->device_obj; > > > > + stor_device = get_out_stor_device(device); > > > > + if (!stor_device) > > > > + return; > > > > + > > > > + /* See storvsc_do_io() -> get_og_chn(). */ > > > > + spin_lock_irqsave(&device->channel->lock, flags); > > > > + > > > > + /* > > > > + * Determines if the storvsc device has other channels assigned to > > > > + * the "old" CPU to update the alloced_cpus mask and the stor_chns > > > > + * array. > > > > + */ > > > > + if (device->channel != channel && device->channel->target_cpu == old) { > > > > + cur_chn = device->channel; > > > > + old_is_alloced = true; > > > > + goto old_is_alloced; > > > > + } > > > > + list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) { > > > > + if (cur_chn == channel) > > > > + continue; > > > > + if (cur_chn->target_cpu == old) { > > > > + old_is_alloced = true; > > > > + goto old_is_alloced; > > > > + } > > > > + } > > > > + > > > > +old_is_alloced: > > > > + if (old_is_alloced) > > > > + WRITE_ONCE(stor_device->stor_chns[old], cur_chn); > > > > + else > > > > + cpumask_clear_cpu(old, &stor_device->alloced_cpus); > > > > > > I think target_cpu_store() can get called in parallel on multiple CPUs for different > > > channels on the same storvsc device, but multiple changes to a single channel are > > > serialized by higher levels of sysfs. So this function could run after multiple > > > channels have been changed, in which case there's not just a single "old" value, > > > and the above algorithm might not work, especially if channel->target_cpu is > > > updated before calling this function per my earlier comment. I can see a > > > couple of possible ways to deal with this. One is to put the update of > > > channel->target_cpu in this function, within the spin lock boundaries so > > > that the cache flush and target_cpu update are atomic. Another idea is to > > > process multiple changes in this function, by building a temp copy of > > > alloced_cpus by walking the channel list, use XOR to create a cpumask > > > with changes, and then process all the changes in a loop instead of > > > just handling a single change as with the current code at the old_is_alloced > > > label. But I haven't completely thought through this idea. > > > > Same here: the invocations of target_cpu_store() are serialized on the > > per-connection channel_mutex... > > Agreed. My comment is not valid. > > > > > > > > > @@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct > > storvsc_device > > > > *stor_device, > > > > if (cpumask_test_cpu(tgt_cpu, node_mask)) > > > > num_channels++; > > > > } > > > > - if (num_channels == 0) > > > > + if (num_channels == 0) { > > > > + stor_device->stor_chns[q_num] = stor_device->device->channel; > > > > > > Is the above added line just fixing a bug in the existing code? I'm not seeing how > > > it would derive from the other changes in this patch. > > > > It was rather intended as an optimization: Each time I/O for a device > > is initiated on a CPU that have "num_channels == 0" channel, the current > > code ends up calling get_og_chn() (in the attempt to fill the cache) and > > returns the device's primary channel. In the current code, the cost of > > this operations is basically the cost of parsing alloced_cpus, but with > > the changes introduced here this also involves acquiring (and releasing) > > the primary channel's lock. I should probably put my hands forward and > > say that I haven't observed any measurable effects due this addition in > > my experiments; OTOH, caching the returned/"found" value made sense... > > OK. That's what I thought. The existing code does not produce an incorrect > result, but the cache isn't working as intended. This fixes it. > > > > > > > > > @@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device, > > > > continue; > > > > if (tgt_cpu == q_num) > > > > continue; > > > > - channel = stor_device->stor_chns[tgt_cpu]; > > > > + channel = READ_ONCE( > > > > + stor_device->stor_chns[tgt_cpu]); > > > > + if (channel == NULL) > > > > + continue; > > > > > > The channel == NULL case is new because a cache flush could be happening > > > in parallel on another CPU. I'm wondering about the tradeoffs of > > > continuing in the loop (as you have coded in this patch) vs. a "goto" back to > > > the top level "if" statement. With the "continue" you might finish the > > > loop without finding any matches, and fall through to the next approach. > > > But it's only a single I/O operation, and if it comes up with a less than > > > optimal channel choice, it's no big deal. So I guess it's really a wash. > > > > Yes, I considered both approaches; they both "worked" here. I was a > > bit concerned about the number of "possible" gotos (again, mainly a > > theoretical issue, since I can imagine that the cash flushes will be > > relatively "rare" events in most cases and, in any case, they happen > > to be serialized); the "continue" looked like a suitable and simpler > > approach/compromise, at least for the time being. > > Yes, I'm OK with your patch "as is". I was just thinking about the > alternative, and evidently you did too. Thank you for the confirmation. I'm wondering: could take this as a Reviewed-by: for this patch? ;-) Thanks, Andrea > > > > > > > > > > > > if (hv_get_avail_to_write_percent( > > > > &channel->outbound) > > > > > ring_avail_percent_lowater) { > > > > @@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device, > > > > for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) { > > > > if (cpumask_test_cpu(tgt_cpu, node_mask)) > > > > continue; > > > > - channel = stor_device->stor_chns[tgt_cpu]; > > > > + channel = READ_ONCE( > > > > + stor_device->stor_chns[tgt_cpu]); > > > > + if (channel == NULL) > > > > + continue; > > > > > > Same comment here. > > > > Similarly here. > > Agreed. > > > > > Thoughts? > > > > Thanks, > > Andrea
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 84d2f22c569aa..7199fee2b5869 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -1721,6 +1721,10 @@ static ssize_t target_cpu_store(struct vmbus_channel *channel, * in on a CPU that is different from the channel target_cpu value. */ + if (channel->change_target_cpu_callback) + (*channel->change_target_cpu_callback)(channel, + channel->target_cpu, target_cpu); + channel->target_cpu = target_cpu; channel->target_vp = hv_cpu_number_to_vp_number(target_cpu); channel->numa_node = cpu_to_node(target_cpu); diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index fb41636519ee8..a680592b9d32a 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -621,6 +621,63 @@ static inline struct storvsc_device *get_in_stor_device( } +void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old, u32 new) +{ + struct storvsc_device *stor_device; + struct vmbus_channel *cur_chn; + bool old_is_alloced = false; + struct hv_device *device; + unsigned long flags; + int cpu; + + device = channel->primary_channel ? + channel->primary_channel->device_obj + : channel->device_obj; + stor_device = get_out_stor_device(device); + if (!stor_device) + return; + + /* See storvsc_do_io() -> get_og_chn(). */ + spin_lock_irqsave(&device->channel->lock, flags); + + /* + * Determines if the storvsc device has other channels assigned to + * the "old" CPU to update the alloced_cpus mask and the stor_chns + * array. + */ + if (device->channel != channel && device->channel->target_cpu == old) { + cur_chn = device->channel; + old_is_alloced = true; + goto old_is_alloced; + } + list_for_each_entry(cur_chn, &device->channel->sc_list, sc_list) { + if (cur_chn == channel) + continue; + if (cur_chn->target_cpu == old) { + old_is_alloced = true; + goto old_is_alloced; + } + } + +old_is_alloced: + if (old_is_alloced) + WRITE_ONCE(stor_device->stor_chns[old], cur_chn); + else + cpumask_clear_cpu(old, &stor_device->alloced_cpus); + + /* "Flush" the stor_chns array. */ + for_each_possible_cpu(cpu) { + if (stor_device->stor_chns[cpu] && !cpumask_test_cpu( + cpu, &stor_device->alloced_cpus)) + WRITE_ONCE(stor_device->stor_chns[cpu], NULL); + } + + WRITE_ONCE(stor_device->stor_chns[new], channel); + cpumask_set_cpu(new, &stor_device->alloced_cpus); + + spin_unlock_irqrestore(&device->channel->lock, flags); +} + static void handle_sc_creation(struct vmbus_channel *new_sc) { struct hv_device *device = new_sc->primary_channel->device_obj; @@ -648,6 +705,8 @@ static void handle_sc_creation(struct vmbus_channel *new_sc) return; } + new_sc->change_target_cpu_callback = storvsc_change_target_cpu; + /* Add the sub-channel to the array of available channels. */ stor_device->stor_chns[new_sc->target_cpu] = new_sc; cpumask_set_cpu(new_sc->target_cpu, &stor_device->alloced_cpus); @@ -876,6 +935,8 @@ static int storvsc_channel_init(struct hv_device *device, bool is_fc) if (stor_device->stor_chns == NULL) return -ENOMEM; + device->channel->change_target_cpu_callback = storvsc_change_target_cpu; + stor_device->stor_chns[device->channel->target_cpu] = device->channel; cpumask_set_cpu(device->channel->target_cpu, &stor_device->alloced_cpus); @@ -1248,8 +1309,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device, const struct cpumask *node_mask; int num_channels, tgt_cpu; - if (stor_device->num_sc == 0) + if (stor_device->num_sc == 0) { + stor_device->stor_chns[q_num] = stor_device->device->channel; return stor_device->device->channel; + } /* * Our channel array is sparsley populated and we @@ -1258,7 +1321,6 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device, * The strategy is simple: * I. Ensure NUMA locality * II. Distribute evenly (best effort) - * III. Mapping is persistent. */ node_mask = cpumask_of_node(cpu_to_node(q_num)); @@ -1268,8 +1330,10 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device, if (cpumask_test_cpu(tgt_cpu, node_mask)) num_channels++; } - if (num_channels == 0) + if (num_channels == 0) { + stor_device->stor_chns[q_num] = stor_device->device->channel; return stor_device->device->channel; + } hash_qnum = q_num; while (hash_qnum >= num_channels) @@ -1295,6 +1359,7 @@ static int storvsc_do_io(struct hv_device *device, struct storvsc_device *stor_device; struct vstor_packet *vstor_packet; struct vmbus_channel *outgoing_channel, *channel; + unsigned long flags; int ret = 0; const struct cpumask *node_mask; int tgt_cpu; @@ -1308,10 +1373,11 @@ static int storvsc_do_io(struct hv_device *device, request->device = device; /* - * Select an an appropriate channel to send the request out. + * Select an appropriate channel to send the request out. */ - if (stor_device->stor_chns[q_num] != NULL) { - outgoing_channel = stor_device->stor_chns[q_num]; + /* See storvsc_change_target_cpu(). */ + outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]); + if (outgoing_channel != NULL) { if (outgoing_channel->target_cpu == q_num) { /* * Ideally, we want to pick a different channel if @@ -1324,7 +1390,10 @@ static int storvsc_do_io(struct hv_device *device, continue; if (tgt_cpu == q_num) continue; - channel = stor_device->stor_chns[tgt_cpu]; + channel = READ_ONCE( + stor_device->stor_chns[tgt_cpu]); + if (channel == NULL) + continue; if (hv_get_avail_to_write_percent( &channel->outbound) > ring_avail_percent_lowater) { @@ -1350,7 +1419,10 @@ static int storvsc_do_io(struct hv_device *device, for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) { if (cpumask_test_cpu(tgt_cpu, node_mask)) continue; - channel = stor_device->stor_chns[tgt_cpu]; + channel = READ_ONCE( + stor_device->stor_chns[tgt_cpu]); + if (channel == NULL) + continue; if (hv_get_avail_to_write_percent( &channel->outbound) > ring_avail_percent_lowater) { @@ -1360,7 +1432,14 @@ static int storvsc_do_io(struct hv_device *device, } } } else { + spin_lock_irqsave(&device->channel->lock, flags); + outgoing_channel = stor_device->stor_chns[q_num]; + if (outgoing_channel != NULL) { + spin_unlock_irqrestore(&device->channel->lock, flags); + goto found_channel; + } outgoing_channel = get_og_chn(stor_device, q_num); + spin_unlock_irqrestore(&device->channel->lock, flags); } found_channel: diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index edfcd42319ef3..9018b89614b78 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -773,6 +773,9 @@ struct vmbus_channel { void (*onchannel_callback)(void *context); void *channel_callback_context; + void (*change_target_cpu_callback)(struct vmbus_channel *channel, + u32 old, u32 new); + /* * Synchronize channel scheduling and channel removal; see the inline * comments in vmbus_chan_sched() and vmbus_reset_channel_cb().
For each storvsc_device, storvsc keeps track of the channel target CPUs associated to the device (alloced_cpus) and it uses this information to fill a "cache" (stor_chns) mapping CPU->channel according to a certain heuristic. Update the alloced_cpus mask and the stor_chns array when a channel of the storvsc device is re-assigned to a different CPU. Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: <linux-scsi@vger.kernel.org> --- drivers/hv/vmbus_drv.c | 4 ++ drivers/scsi/storvsc_drv.c | 95 ++++++++++++++++++++++++++++++++++---- include/linux/hyperv.h | 3 ++ 3 files changed, 94 insertions(+), 8 deletions(-)