Message ID | 20180817122410.8437-1-chris@chris-wilson.co.uk |
---|---|
State | New, archived |
Series | drm/i915/execlists: Micro-optimise "idle" context switch |
Hi Chris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v4.18 next-20180817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Chris-Wilson/drm-i915-execlists-Micro-optimise-idle-context-switch/20180817-205726
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-x005-201832 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   drivers/gpu/drm/i915/intel_lrc.c: In function 'execlists_dequeue':
>> drivers/gpu/drm/i915/intel_lrc.c:730:9: error: implicit declaration of function 'intel_engine_signaled'; did you mean 'intel_engine_is_idle'? [-Werror=implicit-function-declaration]
            intel_engine_signaled(engine,
            ^~~~~~~~~~~~~~~~~~~~~
            intel_engine_is_idle
   cc1: some warnings being treated as errors

vim +730 drivers/gpu/drm/i915/intel_lrc.c

   580	
   581	static void execlists_dequeue(struct intel_engine_cs *engine)
   582	{
   583		struct intel_engine_execlists * const execlists = &engine->execlists;
   584		struct execlist_port *port = execlists->port;
   585		const struct execlist_port * const last_port =
   586			&execlists->port[execlists->port_mask];
   587		struct i915_request *last = port_request(port);
   588		struct rb_node *rb;
   589		bool submit = false;
   590	
   591		/*
   592		 * Hardware submission is through 2 ports. Conceptually each port
   593		 * has a (RING_START, RING_HEAD, RING_TAIL) tuple. RING_START is
   594		 * static for a context, and unique to each, so we only execute
   595		 * requests belonging to a single context from each ring. RING_HEAD
   596		 * is maintained by the CS in the context image, it marks the place
   597		 * where it got up to last time, and through RING_TAIL we tell the CS
   598		 * where we want to execute up to this time.
   599		 *
   600		 * In this list the requests are in order of execution. Consecutive
   601		 * requests from the same context are adjacent in the ringbuffer. We
   602		 * can combine these requests into a single RING_TAIL update:
   603		 *
   604		 *              RING_HEAD...req1...req2
   605		 *                                    ^- RING_TAIL
   606		 * since to execute req2 the CS must first execute req1.
   607		 *
   608		 * Our goal then is to point each port to the end of a consecutive
   609		 * sequence of requests as being the most optimal (fewest wake ups
   610		 * and context switches) submission.
   611		 */
   612	
   613		if (last) {
   614			/*
   615			 * Don't resubmit or switch until all outstanding
   616			 * preemptions (lite-restore) are seen. Then we
   617			 * know the next preemption status we see corresponds
   618			 * to this ELSP update.
   619			 */
   620			GEM_BUG_ON(!execlists_is_active(execlists,
   621							EXECLISTS_ACTIVE_USER));
   622			GEM_BUG_ON(!port_count(&port[0]));
   623	
   624			/*
   625			 * If we write to ELSP a second time before the HW has had
   626			 * a chance to respond to the previous write, we can confuse
   627			 * the HW and hit "undefined behaviour". After writing to ELSP,
   628			 * we must then wait until we see a context-switch event from
   629			 * the HW to indicate that it has had a chance to respond.
   630			 */
   631			if (!execlists_is_active(execlists, EXECLISTS_ACTIVE_HWACK))
   632				return;
   633	
   634			if (need_preempt(engine, last, execlists->queue_priority)) {
   635				inject_preempt_context(engine);
   636				return;
   637			}
   638	
   639			/*
   640			 * In theory, we could coalesce more requests onto
   641			 * the second port (the first port is active, with
   642			 * no preemptions pending). However, that means we
   643			 * then have to deal with the possible lite-restore
   644			 * of the second port (as we submit the ELSP, there
   645			 * may be a context-switch) but also we may complete
   646			 * the resubmission before the context-switch. Ergo,
   647			 * coalescing onto the second port will cause a
   648			 * preemption event, but we cannot predict whether
   649			 * that will affect port[0] or port[1].
   650			 *
   651			 * If the second port is already active, we can wait
   652			 * until the next context-switch before contemplating
   653			 * new requests. The GPU will be busy and we should be
   654			 * able to resubmit the new ELSP before it idles,
   655			 * avoiding pipeline bubbles (momentary pauses where
   656			 * the driver is unable to keep up the supply of new
   657			 * work). However, we have to double check that the
   658			 * priorities of the ports haven't been switch.
   659			 */
   660			if (port_count(&port[1]))
   661				return;
   662	
   663			/*
   664			 * WaIdleLiteRestore:bdw,skl
   665			 * Apply the wa NOOPs to prevent
   666			 * ring:HEAD == rq:TAIL as we resubmit the
   667			 * request. See gen8_emit_breadcrumb() for
   668			 * where we prepare the padding after the
   669			 * end of the request.
   670			 */
   671			last->tail = last->wa_tail;
   672		}
   673	
   674		while ((rb = rb_first_cached(&execlists->queue))) {
   675			struct i915_priolist *p = to_priolist(rb);
   676			struct i915_request *rq, *rn;
   677	
   678			list_for_each_entry_safe(rq, rn, &p->requests, sched.link) {
   679				/*
   680				 * Can we combine this request with the current port?
   681				 * It has to be the same context/ringbuffer and not
   682				 * have any exceptions (e.g. GVT saying never to
   683				 * combine contexts).
   684				 *
   685				 * If we can combine the requests, we can execute both
   686				 * by updating the RING_TAIL to point to the end of the
   687				 * second request, and so we never need to tell the
   688				 * hardware about the first.
   689				 */
   690				if (last &&
   691				    !can_merge_ctx(rq->hw_context, last->hw_context)) {
   692					/*
   693					 * If we are on the second port and cannot
   694					 * combine this request with the last, then we
   695					 * are done.
   696					 */
   697					if (port == last_port) {
   698						__list_del_many(&p->requests,
   699								&rq->sched.link);
   700						goto done;
   701					}
   702	
   703					/*
   704					 * If GVT overrides us we only ever submit
   705					 * port[0], leaving port[1] empty. Note that we
   706					 * also have to be careful that we don't queue
   707					 * the same context (even though a different
   708					 * request) to the second port.
   709					 */
   710					if (ctx_single_port_submission(last->hw_context) ||
   711					    ctx_single_port_submission(rq->hw_context)) {
   712						__list_del_many(&p->requests,
   713								&rq->sched.link);
   714						goto done;
   715					}
   716	
   717					GEM_BUG_ON(last->hw_context == rq->hw_context);
   718	
   719					/*
   720					 * Avoid reloading the previous context if we
   721					 * know it has just completed and we want
   722					 * to switch over to a new context. The CS
   723					 * interrupt is likely waiting for us to
   724					 * release the local irq lock and so we will
   725					 * proceed with the submission momentarily,
   726					 * which is quicker than reloading the context
   727					 * on the gpu.
   728					 */
   729					if (!submit &&
 > 730					    intel_engine_signaled(engine,
   731								  last->global_seqno)) {
   732						GEM_BUG_ON(!list_is_first(&rq->sched.link,
   733									  &p->requests));
   734						return;
   735					}
   736	
   737					if (submit)
   738						port_assign(port, last);
   739					port++;
   740	
   741					GEM_BUG_ON(port_isset(port));
   742				}
   743	
   744				INIT_LIST_HEAD(&rq->sched.link);
   745				__i915_request_submit(rq);
   746				trace_i915_request_in(rq, port_index(port, execlists));
   747				last = rq;
   748				submit = true;
   749			}
   750	
   751			rb_erase_cached(&p->node, &execlists->queue);
   752			INIT_LIST_HEAD(&p->requests);
   753			if (p->priority != I915_PRIORITY_NORMAL)
   754				kmem_cache_free(engine->i915->priorities, p);
   755		}
   756	
   757	done:
   758		/*
   759		 * Here be a bit of magic! Or sleight-of-hand, whichever you prefer.
   760		 *
   761		 * We choose queue_priority such that if we add a request of greater
   762		 * priority than this, we kick the submission tasklet to decide on
   763		 * the right order of submitting the requests to hardware. We must
   764		 * also be prepared to reorder requests as they are in-flight on the
   765		 * HW. We derive the queue_priority then as the first "hole" in
   766		 * the HW submission ports and if there are no available slots,
   767		 * the priority of the lowest executing request, i.e. last.
   768		 *
   769		 * When we do receive a higher priority request ready to run from the
   770		 * user, see queue_request(), the queue_priority is bumped to that
   771		 * request triggering preemption on the next dequeue (or subsequent
   772		 * interrupt for secondary ports).
   773		 */
   774		execlists->queue_priority =
   775			port != execlists->port ? rq_prio(last) : INT_MIN;
   776	
   777		if (submit) {
   778			port_assign(port, last);
   779			execlists_submit_ports(engine);
   780		}
   781	
   782		/* We must always keep the beast fed if we have work piled up */
   783		GEM_BUG_ON(rb_first_cached(&execlists->queue) &&
   784			   !port_isset(execlists->port));
   785	
   786		/* Re-evaluate the executing context setup after each preemptive kick */
   787		if (last)
   788			execlists_user_begin(execlists, execlists->port);
   789	
   790		/* If the engine is now idle, so should be the flag; and vice versa. */
   791		GEM_BUG_ON(execlists_is_active(&engine->execlists,
   792					       EXECLISTS_ACTIVE_USER) ==
   793			   !port_isset(engine->execlists.port));
   794	}
   795	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
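A note on the error above: 'intel_engine_signaled' is not declared anywhere in the tree the robot built, so the patch presumably depends on a prerequisite change that introduces that helper. For orientation only, here is a minimal sketch of the shape such a helper could take, assuming it follows the existing i915 pattern of comparing the engine's breadcrumb seqno against a request's global_seqno via i915_seqno_passed(); this is an assumption for illustration, not the author's actual prerequisite patch:

/*
 * Hypothetical sketch only -- the real helper is not in the tested tree.
 * Assumes the usual i915 convention: an engine has "signaled" a seqno
 * once the breadcrumb it last wrote to the HWSP has passed that seqno.
 */
static inline bool intel_engine_signaled(struct intel_engine_cs *engine,
					 u32 seqno)
{
	/* i915_seqno_passed() is the wrap-safe (s32) seqno comparison */
	return i915_seqno_passed(intel_engine_get_seqno(engine), seqno);
}

With a helper of that shape available (or the real prerequisite applied first), the new hunk at intel_lrc.c:730 compiles; without it, the build fails as shown.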
Hi Chris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v4.18 next-20180817]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Chris-Wilson/drm-i915-execlists-Micro-optimise-idle-context-switch/20180817-205726
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-s0-201832 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   drivers/gpu//drm/i915/intel_lrc.c: In function 'execlists_dequeue':
>> drivers/gpu//drm/i915/intel_lrc.c:730:9: error: implicit declaration of function 'intel_engine_signaled' [-Werror=implicit-function-declaration]
            intel_engine_signaled(engine,
            ^~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/intel_engine_signaled +730 drivers/gpu//drm/i915/intel_lrc.c

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 36050f085071..682268d4249d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -711,6 +711,24 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
 
 				GEM_BUG_ON(last->hw_context == rq->hw_context);
 
+				/*
+				 * Avoid reloading the previous context if we
+				 * know it has just completed and we want
+				 * to switch over to a new context. The CS
+				 * interrupt is likely waiting for us to
+				 * release the local irq lock and so we will
+				 * proceed with the submission momentarily,
+				 * which is quicker than reloading the context
+				 * on the gpu.
+				 */
+				if (!submit &&
+				    intel_engine_signaled(engine,
+							  last->global_seqno)) {
+					GEM_BUG_ON(!list_is_first(&rq->sched.link,
+								  &p->requests));
+					return;
+				}
+
 				if (submit)
 					port_assign(port, last);
 				port++;
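The early return added above hinges on last->global_seqno having already been signaled, i.e. the engine's breadcrumb having advanced past it. Checks of that kind rest on a wrap-safe seqno comparison; the following self-contained snippet (plain userspace C with made-up values, illustrating the idiom rather than quoting driver code) shows why the signed difference keeps working when the 32-bit counter rolls over:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Wrap-safe "has seq1 advanced past seq2?" test in the style of the
 * kernel's i915_seqno_passed(): the signed 32-bit difference remains
 * correct when the counter wraps around.
 */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

int main(void)
{
	uint32_t breadcrumb = 0xfffffffeu;	/* just before the wrap */

	printf("%d\n", seqno_passed(breadcrumb, 0xfffffff0u)); /* 1: already passed */
	printf("%d\n", seqno_passed(breadcrumb, 0x00000005u)); /* 0: still in the future (post-wrap seqno) */

	breadcrumb += 5;			/* wraps around to 0x00000003 */
	printf("%d\n", seqno_passed(breadcrumb, 0xfffffffeu)); /* 1: passed, even across the wrap */

	return 0;
}

In the dequeue path, the same comparison lets the driver recognise that the request in port[0] has already completed, so it returns early and lets the imminent context-switch interrupt drive the next submission instead of resubmitting (and thereby reloading) the completed context.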
On gen9, we see an effect where, when we perform an element switch just
as the first context completes execution, that switch takes twice as
long, as if it first reloads the completed context. That is, we observe
the cost of context1 -> idle -> context1 -> context2 as being twice the
cost of the same operation on gen8.

The impact of this is incredibly rare outside of microbenchmarks that
are focused on assessing the throughput of context switches.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
---
I think this is a microbenchmark too far, as there is no real world
impact: both the likelihood of submission at that precise point in
time and the context switch being a significant fraction of the batch
runtime make the effect minuscule in practice. It is also not
foolproof, even for gem_ctx_switch:

kbl
  ctx1 -> idle -> ctx2:                     ~25us
  ctx1 -> idle -> ctx1 -> ctx2 (unpatched): ~53us
  ctx1 -> idle -> ctx1 -> ctx2 (patched):   30-40us

bxt
  ctx1 -> idle -> ctx2:                     ~40us
  ctx1 -> idle -> ctx1 -> ctx2 (unpatched): ~80us
  ctx1 -> idle -> ctx1 -> ctx2 (patched):   60-70us

So consider this more of a plea for ideas: why does bdw behave better?
Are we missing a flag, a fox or a chicken?
-Chris
---
 drivers/gpu/drm/i915/intel_lrc.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)