Message ID: 20210106084739.63318-4-songmuchun@bytedance.com (mailing list archive)
State: New, archived
Series: Fix some bugs about HugeTLB code
On Wed 06-01-21 16:47:36, Muchun Song wrote:
> There is a race condition between __free_huge_page()
> and dissolve_free_huge_page().
>
> CPU0:                                       CPU1:
>
> // page_count(page) == 1
> put_page(page)
>   __free_huge_page(page)
>                                             dissolve_free_huge_page(page)
>                                               spin_lock(&hugetlb_lock)
>                                               // PageHuge(page) && !page_count(page)
>                                               update_and_free_page(page)
>                                               // page is freed to the buddy
>                                               spin_unlock(&hugetlb_lock)
>     spin_lock(&hugetlb_lock)
>     clear_page_huge_active(page)
>     enqueue_huge_page(page)
>     // It is wrong, the page is already freed
>     spin_unlock(&hugetlb_lock)
>
> The race windows is between put_page() and spin_lock() which
> is in the __free_huge_page().

The race window really is between put_page and dissolve_free_huge_page. And the result is that the put_page path would clobber an unrelated page (either a free or an already reused page), which is quite serious. Fortunately, pages are dissolved very rarely. I believe that a user would need to be privileged to hit this intentionally.

> We should make sure that the page is already on the free list
> when it is dissolved.

Another option would be to check for PageHuge in __free_huge_page. Have
you considered that rather than adding yet another state? The scope of the
spinlock would have to be extended. If that sounds more tricky, then can
we check the page->lru in the dissolve path? If the page is still
PageHuge and the reference count is 0, then there shouldn't be many
options for where it can be queued, right?

> Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4741d60f8955..8ff138c17129 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -79,6 +79,21 @@ DEFINE_SPINLOCK(hugetlb_lock);
>  static int num_fault_mutexes;
>  struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
>
> +static inline bool PageHugeFreed(struct page *head)
> +{
> +	return (unsigned long)head[3].mapping == -1U;
> +}
> +
> +static inline void SetPageHugeFreed(struct page *head)
> +{
> +	head[3].mapping = (void *)-1U;
> +}
> +
> +static inline void ClearPageHugeFreed(struct page *head)
> +{
> +	head[3].mapping = NULL;
> +}
> +
>  /* Forward declaration */
>  static int hugetlb_acct_memory(struct hstate *h, long delta);
>
> @@ -1028,6 +1043,7 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
>  	list_move(&page->lru, &h->hugepage_freelists[nid]);
>  	h->free_huge_pages++;
>  	h->free_huge_pages_node[nid]++;
> +	SetPageHugeFreed(page);
>  }
>
>  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> @@ -1044,6 +1060,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
>
>  	list_move(&page->lru, &h->hugepage_activelist);
>  	set_page_refcounted(page);
> +	ClearPageHugeFreed(page);
>  	h->free_huge_pages--;
>  	h->free_huge_pages_node[nid]--;
>  	return page;
> @@ -1291,6 +1308,17 @@ static inline void destroy_compound_gigantic_page(struct page *page,
>  						unsigned int order) { }
>  #endif
>
> +/*
> + * Because we reuse the mapping field of some tail page structs, we should
> + * reset those mapping to initial value before @head is freed to the buddy
> + * allocator. The invalid value will be checked in the free_tail_pages_check().
> + */
> +static inline void reset_tail_page_mapping(struct hstate *h, struct page *head)
> +{
> +	if (!hstate_is_gigantic(h))
> +		head[3].mapping = TAIL_MAPPING;
> +}
> +
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	int i;
> @@ -1298,6 +1326,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
>
> +	reset_tail_page_mapping(h, page);
>  	h->nr_huge_pages--;
>  	h->nr_huge_pages_node[page_to_nid(page)]--;
>  	for (i = 0; i < pages_per_huge_page(h); i++) {
> @@ -1504,6 +1533,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  	spin_lock(&hugetlb_lock);
>  	h->nr_huge_pages++;
>  	h->nr_huge_pages_node[nid]++;
> +	ClearPageHugeFreed(page);
>  	spin_unlock(&hugetlb_lock);
>  }
>
> @@ -1770,6 +1800,14 @@ int dissolve_free_huge_page(struct page *page)
>  		int nid = page_to_nid(head);
>  		if (h->free_huge_pages - h->resv_huge_pages == 0)
>  			goto out;
> +
> +		/*
> +		 * We should make sure that the page is already on the free list
> +		 * when it is dissolved.
> +		 */
> +		if (unlikely(!PageHugeFreed(head)))
> +			goto out;
> +
>  		/*
>  		 * Move PageHWPoison flag from head page to the raw error page,
>  		 * which makes any subpages rather than the error page reusable.
> --
> 2.11.0
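For concreteness, the dissolve-side check Michal floats above might look roughly like the sketch below. This is a hypothetical illustration, not code from the thread, and the discussion that follows explains why a list_empty() test turns out not to work:

	/* Hypothetical sketch, inside dissolve_free_huge_page() */
	spin_lock(&hugetlb_lock);
	if (PageHuge(page) && !page_count(page)) {
		struct page *head = compound_head(page);

		/*
		 * Back off unless the page is visibly queued.  As the
		 * thread explains below, hugetlb uses list_move(), so
		 * head->lru is never empty and this test cannot tell
		 * the free list from the active list.
		 */
		if (list_empty(&head->lru)) {
			spin_unlock(&hugetlb_lock);
			return -EBUSY;
		}
		/* ... proceed with update_and_free_page() as before ... */
	}
	spin_unlock(&hugetlb_lock);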
On 1/6/21 8:56 AM, Michal Hocko wrote:
> On Wed 06-01-21 16:47:36, Muchun Song wrote:
[...]
>> We should make sure that the page is already on the free list
>> when it is dissolved.
>
> Another option would be to check for PageHuge in __free_huge_page. Have
> you considered that rather than adding yet another state? The scope of the
> spinlock would have to be extended. If that sounds more tricky, then can
> we check the page->lru in the dissolve path? If the page is still
> PageHuge and the reference count is 0, then there shouldn't be many
> options for where it can be queued, right?

The tricky part with expanding lock scope will be the potential call to
hugepage_subpool_put_pages as it may also try to acquire the hugetlb_lock.

I am not sure what you mean by 'check the page->lru'? If we knew the page
was on the free list, then we could dissolve. But, I do not think there
is an easy way to determine that from page->lru. A hugetlb page is either
going to be on the active list or free list.

[...]

>> +/*
>> + * Because we reuse the mapping field of some tail page structs, we should
>> + * reset those mapping to initial value before @head is freed to the buddy
>> + * allocator. The invalid value will be checked in the free_tail_pages_check().
>> + */

When I suggested using head[3].mapping for this state, I was not aware of
this requirement. My suggestion was only following the convention used in
PageHugeTemporary. I would not have made the suggestion if I had realized
this was required. Sorry.
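The requirement Mike refers to comes from the tail-page sanity checks run when a compound page is returned to the buddy allocator. Roughly, from free_tail_pages_check() in mm/page_alloc.c of this era (paraphrased; consult the actual source for the exact logic):

	switch (page - head_page) {
	case 1:
		/* first tail page: ->mapping overlays compound_mapcount() */
		break;
	case 2:
		/* second tail page: ->mapping is deferred_list.next, ignored */
		break;
	default:
		/* all later tail pages, including head[3] */
		if (page->mapping != TAIL_MAPPING) {
			bad_page(page, "corrupted mapping in tail page");
			goto out;
		}
		break;
	}

This is why head[3].mapping has to be put back to TAIL_MAPPING before the page reaches the buddy allocator, while head[2].mapping escapes the check.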
On Thu, Jan 7, 2021 at 5:00 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 1/6/21 8:56 AM, Michal Hocko wrote:
[...]
> > Another option would be to check for PageHuge in __free_huge_page. Have
> > you considered that rather than adding yet another state? The scope of the
> > spinlock would have to be extended. If that sounds more tricky, then can
> > we check the page->lru in the dissolve path? If the page is still
> > PageHuge and the reference count is 0, then there shouldn't be many
> > options for where it can be queued, right?
>
> The tricky part with expanding lock scope will be the potential call to
> hugepage_subpool_put_pages as it may also try to acquire the hugetlb_lock.
>
> I am not sure what you mean by 'check the page->lru'? If we knew the page
> was on the free list, then we could dissolve. But, I do not think there
> is an easy way to determine that from page->lru. A hugetlb page is either
> going to be on the active list or free list.
>
> [...]
>
> >> +/*
> >> + * Because we reuse the mapping field of some tail page structs, we should
> >> + * reset those mapping to initial value before @head is freed to the buddy
> >> + * allocator. The invalid value will be checked in the free_tail_pages_check().
> >> + */
>
> When I suggested using head[3].mapping for this state, I was not aware of
> this requirement. My suggestion was only following the convention used in
> PageHugeTemporary. I would not have made the suggestion if I had realized
> this was required. Sorry.

Yeah, PageHugeTemporary is lucky. free_tail_pages_check() does not check
head[2].mapping. I will revert to the previous version (using head[3].private).

Thanks.
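The reverted helpers Muchun mentions could look something like this sketch. It is illustrative only: it assumes the earlier version stored the flag in the page private field of a tail page, which free_tail_pages_check() does not inspect, so no reset is needed before the page goes back to the buddy allocator. (The exact tail page may still need care; page private of some tail pages is claimed elsewhere, e.g. by hugetlb cgroup accounting.)

	static inline bool PageHugeFreed(struct page *head)
	{
		return page_private(head + 3) == -1UL;
	}

	static inline void SetPageHugeFreed(struct page *head)
	{
		set_page_private(head + 3, -1UL);
	}

	static inline void ClearPageHugeFreed(struct page *head)
	{
		set_page_private(head + 3, 0);
	}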
On Thu, Jan 7, 2021 at 12:56 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 06-01-21 16:47:36, Muchun Song wrote:
[...]
> > We should make sure that the page is already on the free list
> > when it is dissolved.
>
> Another option would be to check for PageHuge in __free_huge_page. Have
> you considered that rather than adding yet another state? The scope of the
> spinlock would have to be extended. If that sounds more tricky, then can
> we check the page->lru in the dissolve path? If the page is still
> PageHuge and the reference count is 0, then there shouldn't be many
> options for where it can be queued, right?

Did you mean that we iterate over the free list to check whether the page
is on the free list? If so, I do not think that is a better solution than
introducing another state, because if there are a lot of pages on the free
list, it may take some time to do it while holding hugetlb_lock. Right?

Actually, we have some tail page structs to store the state. At least
it's not in short supply right now.

Thanks.
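The alternative Muchun is arguing against would be a linear walk of the per-node free list under hugetlb_lock, along these lines (hypothetical sketch; the helper name is made up):

	/* Hypothetical: cost grows with the number of free huge pages */
	static bool page_on_free_list(struct hstate *h, struct page *page)
	{
		struct page *p;
		int nid = page_to_nid(page);

		lockdep_assert_held(&hugetlb_lock);
		list_for_each_entry(p, &h->hugepage_freelists[nid], lru)
			if (p == page)
				return true;
		return false;
	}

With a tail-page flag, the same question is answered in O(1) instead.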
On Wed 06-01-21 12:58:29, Mike Kravetz wrote:
> On 1/6/21 8:56 AM, Michal Hocko wrote:
[...]
> The tricky part with expanding lock scope will be the potential call to
> hugepage_subpool_put_pages as it may also try to acquire the hugetlb_lock.

Can we rearrange the code and move hugepage_subpool_put_pages after all
this is done? Or is there any strong reason for the particular ordering?

> I am not sure what you mean by 'check the page->lru'? If we knew the page
> was on the free list, then we could dissolve. But, I do not think there
> is an easy way to determine that from page->lru. A hugetlb page is either
> going to be on the active list or free list.

Can it be on the active list with ref count = 0?
On Thu 07-01-21 13:39:38, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 12:56 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Another option would be to check for PageHuge in __free_huge_page. Have
> > you considered that rather than adding yet another state? The scope of the
> > spinlock would have to be extended. If that sounds more tricky, then can
> > we check the page->lru in the dissolve path? If the page is still
> > PageHuge and the reference count is 0, then there shouldn't be many
> > options for where it can be queued, right?
>
> Did you mean that we iterate over the free list to check whether the page
> is on the free list?

No, I meant to check that the page is enqueued, which along with ref
count = 0 should mean it has been released to the pool, unless I am
missing something.
On Thu, Jan 7, 2021 at 4:41 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-01-21 13:39:38, Muchun Song wrote:
[...]
> > Did you mean that we iterate over the free list to check whether the page
> > is on the free list?
>
> No, I meant to check that the page is enqueued, which along with ref
> count = 0 should mean it has been released to the pool, unless I am
> missing something.

The page can be on the free list or active list or empty when it is freed
to the pool. How to check whether it is on the free list?
On Thu 07-01-21 16:53:13, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 4:41 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > No, I meant to check that the page is enqueued, which along with ref
> > count = 0 should mean it has been released to the pool, unless I am
> > missing something.
>
> The page can be on the free list or active list or empty when it is freed
> to the pool. How to check whether it is on the free list?

As I've said, I might be missing something here. But if the page is freed,
why does it matter whether it is on an active list or a free list from the
dissolve operation POV?
On Thu, Jan 7, 2021 at 7:18 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-01-21 16:53:13, Muchun Song wrote:
[...]
> > The page can be on the free list or active list or empty when it is freed
> > to the pool. How to check whether it is on the free list?
>
> As I've said, I might be missing something here. But if the page is freed,
> why does it matter whether it is on an active list or a free list from the
> dissolve operation POV?

As you said, "check the page->lru". I have a question: how to check the
page->lru in the dissolve path?

BTW, dissolve_free_huge_page aims to free the page to the buddy allocator.
put_page (for a HugeTLB page) aims to free the page to the hugepage pool.
On Thu 07-01-21 19:38:00, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 7:18 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> As you said, "check the page->lru". I have a question: how to check the
> page->lru in the dissolve path?

list_empty?

> BTW, dissolve_free_huge_page aims to free the page to the buddy allocator.
> put_page (for a HugeTLB page) aims to free the page to the hugepage pool.

Right. Can we simply back off in the dissolving path when ref count is 0
&& PageHuge() if list_empty(page->lru)? Is there any other scenario when
all the above is true and the page is not being freed?
On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-01-21 19:38:00, Muchun Song wrote:
[...]
> > As you said, "check the page->lru". I have a question: how to check the
> > page->lru in the dissolve path?
>
> list_empty?

No.

> > BTW, dissolve_free_huge_page aims to free the page to the buddy allocator.
> > put_page (for a HugeTLB page) aims to free the page to the hugepage pool.
>
> Right. Can we simply back off in the dissolving path when ref count is 0
> && PageHuge() if list_empty(page->lru)? Is there any other scenario when
> all the above is true and the page is not being freed?

The list_empty(&page->lru) may always return false. The page before
freeing is on the active list (hstate->hugepage_activelist). Then it is
on the free list after freeing. So list_empty(&page->lru) is always false.
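The reason is visible in the hunks quoted earlier in the thread: once a huge page has entered the pool, hugetlb only ever moves it between its lists. From the patch context above:

	/* enqueue_huge_page() */
	list_move(&page->lru, &h->hugepage_freelists[nid]);

	/* dequeue_huge_page_node_exact() */
	list_move(&page->lru, &h->hugepage_activelist);

list_move() re-links the entry onto the destination list rather than leaving it detached, so list_empty(&page->lru) is false whether the page sits on the free list or the active list, and the test cannot distinguish the two.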
On Thu 07-01-21 20:59:33, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 8:38 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Right. Can we simply back off in the dissolving path when ref count is 0
> > && PageHuge() if list_empty(page->lru)? Is there any other scenario when
> > all the above is true and the page is not being freed?
>
> The list_empty(&page->lru) may always return false. The page before
> freeing is on the active list (hstate->hugepage_activelist). Then it is
> on the free list after freeing. So list_empty(&page->lru) is always false.

The point I was trying to make is that the page has to be enqueued when it
is dissolved and freed. If the page is not enqueued then something is
racing. But then I have realized that this is not a great check to detect
the race, because pages are going to be released to the buddy allocator,
and that will reuse page->lru again. So scratch that, and sorry for the
detour.

But that made me think some more, and one way to reliably detect the race
should be a PageHuge() check in the freeing path. This is what the
dissolve path does already. PageHuge becomes false during
update_and_free_page() while holding the hugetlb_lock. So can we use that?
On Thu, Jan 7, 2021 at 10:11 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-01-21 20:59:33, Muchun Song wrote:
[...]
> But that made me think some more, and one way to reliably detect the race
> should be a PageHuge() check in the freeing path. This is what the
> dissolve path does already. PageHuge becomes false during
> update_and_free_page() while holding the hugetlb_lock. So can we use that?

It may make the thing complex. Apart from freeing it to the buddy
allocator, free_huge_page also does something else for us. If we detect
the race in the freeing path and it is not a HugeTLB page, the freeing
path just returns. We should also move those things to the dissolve path.
Right?

But I find a tricky problem to solve. See free_huge_page(). If we are in
non-task context, we should schedule a work to free the page. We reuse
the page->mapping. If the page is already freed by the dissolve path, we
should not touch the page->mapping. So we need to check PageHuge(). The
check and the llist_add() should be protected by hugetlb_lock, but we
cannot do that. Right? If dissolve happens after the page is linked to
the list, we should also remove it from the list (hpage_freelist). It
seems to make the thing more complex.
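For reference, the deferred-free path Muchun describes looks roughly like this in kernels of this era (paraphrased from mm/hugetlb.c; note how the llist node is overlaid on page->mapping, which is why a page concurrently freed by the dissolve path must not be touched here):

	static LLIST_HEAD(hpage_freelist);

	static void free_hpage_workfn(struct work_struct *work)
	{
		struct llist_node *node = llist_del_all(&hpage_freelist);

		while (node) {
			struct page *page;

			/* the llist node was overlaid on page->mapping */
			page = container_of((struct address_space **)node,
					    struct page, mapping);
			node = node->next;
			__free_huge_page(page);
		}
	}
	static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

	void free_huge_page(struct page *page)
	{
		/*
		 * Defer freeing from non-task context to avoid taking
		 * hugetlb_lock where that is not safe.
		 */
		if (!in_task()) {
			if (llist_add((struct llist_node *)&page->mapping,
				      &hpage_freelist))
				schedule_work(&free_hpage_work);
			return;
		}

		__free_huge_page(page);
	}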
On 1/7/21 12:40 AM, Michal Hocko wrote:
> On Wed 06-01-21 12:58:29, Mike Kravetz wrote:
[...]
>> The tricky part with expanding lock scope will be the potential call to
>> hugepage_subpool_put_pages as it may also try to acquire the hugetlb_lock.
>
> Can we rearrange the code and move hugepage_subpool_put_pages after all
> this is done? Or is there any strong reason for the particular ordering?

The reservation code is so fragile, I always get nervous when making any
changes. However, the straightforward patch below passes some simple
testing. The only difference I can see is that global counts are adjusted
before sub-pool counts. This should not be an issue as global and
sub-pool counts are adjusted independently (not under the same lock).
Allocation code checks sub-pool counts before global counts. So, there is
a SMALL potential that a racing allocation which previously succeeded
would now fail. I do not think this is an issue in practice.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3b38ea958e95..658593840212 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1395,6 +1395,11 @@ static void __free_huge_page(struct page *page)
 		(struct hugepage_subpool *)page_private(page);
 	bool restore_reserve;
 
+	spin_lock(&hugetlb_lock);
+	/* check for race with dissolve_free_huge_page/update_and_free_page */
+	if (!PageHuge(page))
+		return;
+
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(page_mapcount(page), page);
 
@@ -1403,26 +1408,6 @@ static void __free_huge_page(struct page *page)
 	restore_reserve = PagePrivate(page);
 	ClearPagePrivate(page);
 
-	/*
-	 * If PagePrivate() was set on page, page allocation consumed a
-	 * reservation. If the page was associated with a subpool, there
-	 * would have been a page reserved in the subpool before allocation
-	 * via hugepage_subpool_get_pages(). Since we are 'restoring' the
-	 * reservtion, do not call hugepage_subpool_put_pages() as this will
-	 * remove the reserved page from the subpool.
-	 */
-	if (!restore_reserve) {
-		/*
-		 * A return code of zero implies that the subpool will be
-		 * under its minimum size if the reservation is not restored
-		 * after page is free. Therefore, force restore_reserve
-		 * operation.
-		 */
-		if (hugepage_subpool_put_pages(spool, 1) == 0)
-			restore_reserve = true;
-	}
-
-	spin_lock(&hugetlb_lock);
 	clear_page_huge_active(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
@@ -1446,6 +1431,28 @@ static void __free_huge_page(struct page *page)
 		enqueue_huge_page(h, page);
 	}
 	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * If PagePrivate() was set on page, page allocation consumed a
+	 * reservation. If the page was associated with a subpool, there
+	 * would have been a page reserved in the subpool before allocation
+	 * via hugepage_subpool_get_pages(). Since we are 'restoring' the
+	 * reservtion, do not call hugepage_subpool_put_pages() as this will
+	 * remove the reserved page from the subpool.
+	 */
+	if (!restore_reserve) {
+		/*
+		 * A return code of zero implies that the subpool will be
+		 * under its minimum size if the reservation is not restored
+		 * after page is free. Therefore, we need to add 1 to the
+		 * global reserve count.
+		 */
+		if (hugepage_subpool_put_pages(spool, 1) == 0) {
+			spin_lock(&hugetlb_lock);
+			h->resv_huge_pages++;
+			spin_unlock(&hugetlb_lock);
+		}
+	}
 }
 
 /*
On 1/7/21 7:11 AM, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 10:11 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> But I find a tricky problem to solve. See free_huge_page(). If we are in
> non-task context, we should schedule a work to free the page. We reuse
> the page->mapping. If the page is already freed by the dissolve path, we
> should not touch the page->mapping. So we need to check PageHuge(). The
> check and the llist_add() should be protected by hugetlb_lock, but we
> cannot do that. Right? If dissolve happens after the page is linked to
> the list, we should also remove it from the list (hpage_freelist). It
> seems to make the thing more complex.

You are correct. This is also an issue/potential problem with this race.
It seems that adding the state information might be the least complex way
to address the issue.
On Fri, Jan 8, 2021 at 9:08 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 1/7/21 7:11 AM, Muchun Song wrote:
[...]
> You are correct. This is also an issue/potential problem with this race.
> It seems that adding the state information might be the least complex way
> to address the issue.

Yeah, I agree with you. Adding a state is a simple solution.
On Thu 07-01-21 16:52:19, Mike Kravetz wrote:
> On 1/7/21 12:40 AM, Michal Hocko wrote:
[...]
> The reservation code is so fragile, I always get nervous when making any
> changes. However, the straightforward patch below passes some simple
> testing.
[...]
> +	spin_lock(&hugetlb_lock);
> +	/* check for race with dissolve_free_huge_page/update_and_free_page */
> +	if (!PageHuge(page))
> +		return;
> +

This really wants to unlock the lock, right? But this is indeed what I
had in mind.
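In other words, the early return in the sketch above presumably needs to become something like this (hypothetical fixup, following Michal's remark):

	spin_lock(&hugetlb_lock);
	/* check for race with dissolve_free_huge_page/update_and_free_page */
	if (!PageHuge(page)) {
		spin_unlock(&hugetlb_lock);
		return;
	}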
On Thu 07-01-21 23:11:22, Muchun Song wrote:
> On Thu, Jan 7, 2021 at 10:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > But that made me think some more, and one way to reliably detect the race
> > should be a PageHuge() check in the freeing path. This is what the dissolve
> > path does already. PageHuge becomes false during update_and_free_page()
> > while holding the hugetlb_lock. So can we use that?
>
> It may make things complex. Apart from freeing the page to the
> buddy allocator, free_huge_page also does something else for
> us. If we detect the race in the freeing path and the page is not a
> HugeTLB page, the freeing path just returns. We would also have to
> move those things to the dissolve path. Right?

Not sure what you mean. Dissolving is a subset of the freeing path. It
effectively only implements the update_and_free_page branch (aka free
to buddy). It skips some of the existing steps because it believes it
sees a freed page. But as you have pointed out, this is racy, and I
strongly suspect it is simply wrong in those assumptions. E.g. hugetlb
cgroup accounting can go wrong, right?

The more I think about it, the more I think that the dissolving path
should simply share a common helper with __free_huge_page that releases
the huge page to the allocator. The only thing that should be specific
to the dissolving path is HWPoison handling. It shouldn't be playing
with accounting and whatnot. Btw the HWPoison handling is suspicious as
well; a lost race would mean it doesn't happen. But maybe there is some
fixup handled later on...

> But I found a tricky problem to solve. See free_huge_page().
> If we are in non-task context, we should schedule a work item
> to free the page, and we reuse page->mapping for that. If the page
> has already been freed by the dissolve path, we must not touch
> page->mapping, so we need to check PageHuge().
> The check and the llist_add() should be protected by
> hugetlb_lock, but we cannot do that. Right? And if dissolve
> happens after the page is linked to the list, we would also have to
> remove it from the list (hpage_freelist). It seems to make
> things more complex.

I am not sure I follow you here, but yes, PageHuge under hugetlb_lock
should be the reliable way to check for the race. I am not sure why we
really need to care about the mapping or other state.
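A hypothetical shape of the common helper floated above: both the
regular freeing path and the dissolve path would funnel the final
release through one function, leaving only HWPoison handling as
dissolve-specific. The name and the exact bookkeeping are illustrative
only; they are not taken from any posted patch.

/*
 * Sketch under the assumptions above: one helper, called with
 * hugetlb_lock held, that detaches the page and releases it to the
 * buddy allocator, so the two callers cannot disagree on accounting.
 */
static void remove_and_free_huge_page(struct hstate *h, struct page *page)
{
	lockdep_assert_held(&hugetlb_lock);

	/* Detach from whichever hstate list the page is on. */
	list_del(&page->lru);

	/* Shared bookkeeping and the release to buddy live in one place. */
	update_and_free_page(h, page);
}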
On Fri, Jan 8, 2021 at 4:43 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-01-21 23:11:22, Muchun Song wrote:
> [...]
> Not sure what you mean. Dissolving is a subset of the freeing path. It
> effectively only implements the update_and_free_page branch (aka free
> to buddy). It skips some of the existing steps because it believes it
> sees a freed page. But as you have pointed out, this is racy, and I
> strongly suspect it is simply wrong in those assumptions. E.g. hugetlb
> cgroup accounting can go wrong, right?

OK, I see what you mean. update_and_free_page() should do the freeing,
similar to __free_huge_page().

> The more I think about it, the more I think that the dissolving path
> should simply share a common helper with __free_huge_page that releases
> the huge page to the allocator. The only thing that should be specific
> to the dissolving path is HWPoison handling. It shouldn't be playing
> with accounting and whatnot. Btw the HWPoison handling is suspicious as
> well; a lost race would mean it doesn't happen. But maybe there is some
> fixup handled later on...
>
> [...]
>
> I am not sure I follow you here, but yes, PageHuge under hugetlb_lock
> should be the reliable way to check for the race. I am not sure why we
> really need to care about the mapping or other state.

CPU0:                                  CPU1:

free_huge_page(page)
  if (PageHuge(page))
                                       dissolve_free_huge_page(page)
                                         spin_lock(&hugetlb_lock)
                                         update_and_free_page(page)
                                         spin_unlock(&hugetlb_lock)
  llist_add(page->mapping)
  // the mapping is corrupted

The PageHuge(page) check and the llist_add() should be protected by
hugetlb_lock. Right? If so, we cannot hold hugetlb_lock in the
free_huge_page() path.

> --
> Michal Hocko
> SUSE Labs
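To make the diagram concrete, this is the free_huge_page() wrapper
being discussed, abridged from the kernel of this era (see the commit
Muchun links below): in non-task context the page is queued for
deferred freeing by reusing page->mapping as an llist node, with no
hugetlb_lock held, which is exactly the window the diagram describes.

/* Abridged sketch of the wrapper around __free_huge_page(). */
void free_huge_page(struct page *page)
{
	/*
	 * Defer freeing in non-task context to avoid a deadlock on
	 * hugetlb_lock, which is taken without disabling interrupts.
	 */
	if (!in_task()) {
		/*
		 * page->mapping is reused as the llist node here. If the
		 * dissolve path freed the page meanwhile, this clobbers
		 * the state of an unrelated (free or reused) page.
		 */
		if (llist_add((struct llist_node *)&page->mapping,
			      &hpage_freelist))
			schedule_work(&free_hpage_work);
		return;
	}

	__free_huge_page(page);
}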
On Fri 08-01-21 17:01:03, Muchun Song wrote:
> On Fri, Jan 8, 2021 at 4:43 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > I am not sure I follow you here, but yes, PageHuge under hugetlb_lock
> > should be the reliable way to check for the race. I am not sure why we
> > really need to care about the mapping or other state.
>
> CPU0:                                  CPU1:
>
> free_huge_page(page)
>   if (PageHuge(page))
>                                        dissolve_free_huge_page(page)
>                                          spin_lock(&hugetlb_lock)
>                                          update_and_free_page(page)
>                                          spin_unlock(&hugetlb_lock)
>   llist_add(page->mapping)
>   // the mapping is corrupted
>
> The PageHuge(page) check and the llist_add() should be protected by
> hugetlb_lock. Right? If so, we cannot hold hugetlb_lock in the
> free_huge_page() path.

OK, I see. I completely forgot about this snowflake. I thought that
free_huge_page was a typo missing the initial __. Anyway, you are right
that this path needs the check as well. But I don't see why we couldn't
use the lock here. The lock can be held only inside the !in_task()
branch.

Although it would be much nicer if the lock was held at this layer
rather than in both free_huge_page and __free_huge_page. But that clean
up can be done on top.
On Fri, Jan 8, 2021 at 5:31 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 08-01-21 17:01:03, Muchun Song wrote:
> [...]
> OK, I see. I completely forgot about this snowflake. I thought that
> free_huge_page was a typo missing the initial __. Anyway, you are right
> that this path needs the check as well. But I don't see why we couldn't
> use the lock here. The lock can be held only inside the !in_task()
> branch.

Because we hold the hugetlb_lock without disabling irqs. So if an
interrupt occurs after we take the lock, and the interrupt handler also
frees a HugeTLB page, it leads to a deadlock.

task context:                          interrupt context:

put_page(page)
  spin_lock(&hugetlb_lock)
                                       put_page(page)
                                         spin_lock(&hugetlb_lock) // deadlock
  spin_unlock(&hugetlb_lock)

> Although it would be much nicer if the lock was held at this layer
> rather than in both free_huge_page and __free_huge_page. But that clean
> up can be done on top.
>
> --
> Michal Hocko
> SUSE Labs
On Fri 08-01-21 18:08:57, Muchun Song wrote:
> On Fri, Jan 8, 2021 at 5:31 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > But I don't see why we couldn't use the lock here. The lock can be
> > held only inside the !in_task() branch.
>
> Because we hold the hugetlb_lock without disabling irqs. So if an
> interrupt occurs after we take the lock, and the interrupt handler also
> frees a HugeTLB page, it leads to a deadlock.

There is nothing really to prevent making hugetlb_lock irq safe, is
there?
On Fri, Jan 8, 2021 at 7:44 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 08-01-21 18:08:57, Muchun Song wrote:
> [...]
> > Because we hold the hugetlb_lock without disabling irqs. So if an
> > interrupt occurs after we take the lock, and the interrupt handler also
> > frees a HugeTLB page, it leads to a deadlock.
>
> There is nothing really to prevent making hugetlb_lock irq safe, is
> there?

Yeah, we could make the hugetlb_lock irq safe. But why has this not
been done? Maybe the commit changelog can provide more information.

See https://github.com/torvalds/linux/commit/c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d

> --
> Michal Hocko
> SUSE Labs
On Fri 08-01-21 19:52:54, Muchun Song wrote:
> On Fri, Jan 8, 2021 at 7:44 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > There is nothing really to prevent making hugetlb_lock irq safe, is
> > there?
>
> Yeah, we could make the hugetlb_lock irq safe. But why has this not
> been done? Maybe the commit changelog can provide more information.
>
> See https://github.com/torvalds/linux/commit/c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d

Dang! Maybe it is time to finally stop stacking one workaround on top
of the other and put this code into shape. The amount of hackery and
subtle detail has just grown beyond healthy!
On Fri, Jan 8, 2021 at 8:04 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 08-01-21 19:52:54, Muchun Song wrote:
> [...]
> > Yeah, we could make the hugetlb_lock irq safe. But why has this not
> > been done? Maybe the commit changelog can provide more information.
> >
> > See https://github.com/torvalds/linux/commit/c77c0a8ac4c522638a8242fcb9de9496e3cdbb2d
>
> Dang! Maybe it is time to finally stop stacking one workaround on top
> of the other and put this code into shape. The amount of hackery and
> subtle detail has just grown beyond healthy!

From a readability point of view, making the hugetlb_lock irq safe
would be an improvement (in the future). The details are always
messy. :)

> --
> Michal Hocko
> SUSE Labs
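For reference, a minimal sketch of the irq-safe direction discussed
above, assuming the freeing logic otherwise stays where it is. This is
an illustration of the proposal, not a posted patch: with interrupts
disabled while the lock is held, a put_page() arriving from interrupt
context can no longer deadlock against a holder on the same CPU, so the
deferred-work detour through page->mapping would become unnecessary.

/* Sketch only: free_huge_page() with an irq-safe hugetlb_lock. */
void free_huge_page(struct page *page)
{
	unsigned long flags;

	spin_lock_irqsave(&hugetlb_lock, flags);
	/*
	 * The PageHuge() check and the freeing work would go here, now
	 * safe against both the dissolve path and interrupt-context
	 * frees on this CPU.
	 */
	spin_unlock_irqrestore(&hugetlb_lock, flags);
}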
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4741d60f8955..8ff138c17129 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -79,6 +79,21 @@ DEFINE_SPINLOCK(hugetlb_lock);
 static int num_fault_mutexes;
 struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
+static inline bool PageHugeFreed(struct page *head)
+{
+	return (unsigned long)head[3].mapping == -1U;
+}
+
+static inline void SetPageHugeFreed(struct page *head)
+{
+	head[3].mapping = (void *)-1U;
+}
+
+static inline void ClearPageHugeFreed(struct page *head)
+{
+	head[3].mapping = NULL;
+}
+
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 
@@ -1028,6 +1043,7 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 	list_move(&page->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
+	SetPageHugeFreed(page);
 }
 
 static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
@@ -1044,6 +1060,7 @@ static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
 
 	list_move(&page->lru, &h->hugepage_activelist);
 	set_page_refcounted(page);
+	ClearPageHugeFreed(page);
 	h->free_huge_pages--;
 	h->free_huge_pages_node[nid]--;
 	return page;
@@ -1291,6 +1308,17 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 						unsigned int order) { }
 #endif
 
+/*
+ * Because we reuse the mapping field of some tail page structs, we should
+ * reset those mapping to initial value before @head is freed to the buddy
+ * allocator. The invalid value will be checked in the free_tail_pages_check().
+ */
+static inline void reset_tail_page_mapping(struct hstate *h, struct page *head)
+{
+	if (!hstate_is_gigantic(h))
+		head[3].mapping = TAIL_MAPPING;
+}
+
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -1298,6 +1326,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
+	reset_tail_page_mapping(h, page);
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[page_to_nid(page)]--;
 	for (i = 0; i < pages_per_huge_page(h); i++) {
@@ -1504,6 +1533,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	spin_lock(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
+	ClearPageHugeFreed(page);
 	spin_unlock(&hugetlb_lock);
 }
 
@@ -1770,6 +1800,14 @@ int dissolve_free_huge_page(struct page *page)
 		int nid = page_to_nid(head);
 		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
+
+		/*
+		 * We should make sure that the page is already on the free list
+		 * when it is dissolved.
+		 */
+		if (unlikely(!PageHugeFreed(head)))
+			goto out;
+
 		/*
 		 * Move PageHWPoison flag from head page to the raw error page,
 		 * which makes any subpages rather than the error page reusable.
There is a race condition between __free_huge_page()
and dissolve_free_huge_page().

CPU0:                                  CPU1:

                                       // page_count(page) == 1
                                       put_page(page)
                                         __free_huge_page(page)
dissolve_free_huge_page(page)
  spin_lock(&hugetlb_lock)
  // PageHuge(page) && !page_count(page)
  update_and_free_page(page)
  // page is freed to the buddy
  spin_unlock(&hugetlb_lock)
                                           spin_lock(&hugetlb_lock)
                                           clear_page_huge_active(page)
                                           enqueue_huge_page(page)
                                           // It is wrong, the page is already freed
                                           spin_unlock(&hugetlb_lock)

The race window is between put_page() and the spin_lock() in
__free_huge_page().

We should make sure that the page is already on the free list
when it is dissolved.

Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)