[v2,1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()

Message ID 20210722123635.60608-2-david@redhat.com (mailing list archive)
State New, archived
Series util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()

Commit Message

David Hildenbrand July 22, 2021, 12:36 p.m. UTC
Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
does not require a SIGBUS handler, doesn't actually touch page content,
and avoids context switches; it is, therefore, faster and easier to handle
than our current approach.

While MADV_POPULATE_WRITE is, in general, faster than manual
prefaulting, and especially faster with 4k pages, there is still value in
prefaulting using multiple threads to speed up preallocation.

More details on MADV_POPULATE_WRITE can be found in the Linux commit
4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
page tables") and in the man page proposal [1].

[1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com

This resolves the TODO in do_touch_pages().

In the future, we might want to look into using fallocate(), eventually
combined with MADV_POPULATE_READ, when dealing with shared file
mappings.

Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/qemu/osdep.h |  7 ++++
 util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 74 insertions(+), 21 deletions(-)
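
The sense-then-populate approach described in the commit message can be sketched outside of QEMU roughly as follows. This is a minimal illustration, not the patch itself: `SKETCH_MADV_POPULATE_WRITE`, `sketch_madvise()`, `populate_write_possible()`, `prefault()` and `prefault_anon_demo()` are hypothetical names, and the fallback loop only mirrors the idea of the manual page touching, without the SIGBUS handling or threading of the real code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for QEMU_MADV_POPULATE_WRITE / QEMU_MADV_INVALID. */
#define SKETCH_MADV_INVALID (-1)
#ifdef MADV_POPULATE_WRITE
#define SKETCH_MADV_POPULATE_WRITE MADV_POPULATE_WRITE
#else
#define SKETCH_MADV_POPULATE_WRITE SKETCH_MADV_INVALID
#endif

/* Fail with EINVAL when the headers do not know the advice at all. */
static int sketch_madvise(void *addr, size_t len, int advice)
{
    if (advice == SKETCH_MADV_INVALID) {
        errno = EINVAL;
        return -1;
    }
    return madvise(addr, len, advice);
}

/* As in the patch: only EINVAL is taken to mean "cannot be used here". */
static bool populate_write_possible(char *area, size_t pagesize)
{
    return sketch_madvise(area, pagesize, SKETCH_MADV_POPULATE_WRITE) == 0 ||
           errno != EINVAL;
}

/* Prefault [area, area + len). Returns 0 on success, -errno on failure. */
static int prefault(char *area, size_t len, size_t pagesize)
{
    assert(pagesize > 0 && len % pagesize == 0);

    if (populate_write_possible(area, pagesize)) {
        return sketch_madvise(area, len, SKETCH_MADV_POPULATE_WRITE) ? -errno : 0;
    }
    /* Fallback: touch one byte per page without changing its content. */
    for (size_t off = 0; off < len; off += pagesize) {
        *(volatile char *)(area + off) = *(area + off);
    }
    return 0;
}

/* Map 'npages' anonymous pages, prefault them, and unmap again. */
static int prefault_anon_demo(size_t npages)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * pagesize;
    char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int ret;

    if (area == MAP_FAILED) {
        return -errno;
    }
    ret = prefault(area, len, pagesize);
    munmap(area, len);
    return ret;
}
```

Calling `prefault_anon_demo(16)` should return 0 both on kernels that support MADV_POPULATE_WRITE and on older ones, where the sketch falls back to touching pages.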

Comments

Daniel P. Berrangé July 22, 2021, 1:31 p.m. UTC | #1
On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
> Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
> does not require a SIGBUS handler, doesn't actually touch page content,
> and avoids context switches; it is, therefore, faster and easier to handle
> than our current approach.
> 
> While MADV_POPULATE_WRITE is, in general, faster than manual
> prefaulting, and especially faster with 4k pages, there is still value in
> prefaulting using multiple threads to speed up preallocation.
> 
> More details on MADV_POPULATE_WRITE can be found in the Linux commit
> 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
> page tables") and in the man page proposal [1].
> 
> [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
> 
> This resolves the TODO in do_touch_pages().
> 
> In the future, we might want to look into using fallocate(), eventually
> combined with MADV_POPULATE_READ, when dealing with shared file
> mappings.
> 
> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/qemu/osdep.h |  7 ++++
>  util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
>  2 files changed, 74 insertions(+), 21 deletions(-)


> @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
>      return NULL;
>  }
>  
> +static void *do_madv_populate_write_pages(void *arg)
> +{
> +    MemsetThread *memset_args = (MemsetThread *)arg;
> +    const size_t size = memset_args->numpages * memset_args->hpagesize;
> +    char * const addr = memset_args->addr;
> +    int ret;
> +
> +    if (!size) {
> +        return NULL;
> +    }
> +
> +    /* See do_touch_pages(). */
> +    qemu_mutex_lock(&page_mutex);
> +    while (!threads_created_flag) {
> +        qemu_cond_wait(&page_cond, &page_mutex);
> +    }
> +    qemu_mutex_unlock(&page_mutex);
> +
> +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
> +    if (ret) {
> +        memset_thread_failed = true;

I'm wondering if this use of memset_thread_failed is sufficient.

This is pre-existing from the current impl, and ends up being
used to set the bool result of 'touch_all_pages'. The caller
of that then does

    if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
        error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
            "pages available to allocate guest RAM");
    }

this was reasonable with the old impl, because the only reason
we ever see 'memset_thread_failed==true' is if we got SIGBUS
due to ENOMEM.

My concern is that madvise() has a bunch of possible errno
codes returned on failure, and we're not distinguishing
them. In the past this kind of thing has burnt us making
failures hard to debug.

Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'?

Then, we can make 'touch_all_pages' have an 'Error **errp'
parameter, and it can directly call

 error_setg_errno(errp, memset_thread_errno, ....some message...)

when memset_thread_errno is non-zero, and thus we can remove
the generic message from the caller of touch_all_pages.

If you agree, it'd be best to refactor the existing code to
use this pattern in an initial patch.
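
A rough, standalone sketch of the pattern proposed above, for illustration only. QEMU's `Error` API is not available outside the tree, so `describe_failure()` is a hypothetical stand-in for the `error_setg_errno()` call, `record_failure()` is a hypothetical helper, and the message text is made up. Only the name `memset_thread_errno` comes from the proposal.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Replaces 'bool memset_thread_failed'; 0 means no thread has failed. */
static atomic_int memset_thread_errno;

/* Called by a worker thread on failure; the first reported errno wins. */
static void record_failure(int err)
{
    int expected = 0;

    assert(err > 0);
    atomic_compare_exchange_strong(&memset_thread_errno, &expected, err);
}

/*
 * Stand-in for error_setg_errno(): formats a message that carries the
 * errno description. Returns the recorded errno, or 0 if nothing failed.
 */
static int describe_failure(char *buf, size_t len)
{
    int err = atomic_load(&memset_thread_errno);

    if (err != 0) {
        snprintf(buf, len, "os_mem_prealloc: preallocating memory failed: %s",
                 strerror(err));
    }
    return err;
}
```

A worker would call `record_failure(errno)` where the patch currently sets `memset_thread_failed = true`, and the function joining the threads would report the stored value once.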


Regards,
Daniel
David Hildenbrand July 22, 2021, 1:39 p.m. UTC | #2
On 22.07.21 15:31, Daniel P. Berrangé wrote:
> On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
>> Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
>> does not require a SIGBUS handler, doesn't actually touch page content,
>> and avoids context switches; it is, therefore, faster and easier to handle
>> than our current approach.
>>
>> While MADV_POPULATE_WRITE is, in general, faster than manual
>> prefaulting, and especially faster with 4k pages, there is still value in
>> prefaulting using multiple threads to speed up preallocation.
>>
>> More details on MADV_POPULATE_WRITE can be found in the Linux commit
>> 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
>> page tables") and in the man page proposal [1].
>>
>> [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
>>
>> This resolves the TODO in do_touch_pages().
>>
>> In the future, we might want to look into using fallocate(), eventually
>> combined with MADV_POPULATE_READ, when dealing with shared file
>> mappings.
>>
>> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/qemu/osdep.h |  7 ++++
>>   util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
>>   2 files changed, 74 insertions(+), 21 deletions(-)
> 
> 
>> @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
>>       return NULL;
>>   }
>>   
>> +static void *do_madv_populate_write_pages(void *arg)
>> +{
>> +    MemsetThread *memset_args = (MemsetThread *)arg;
>> +    const size_t size = memset_args->numpages * memset_args->hpagesize;
>> +    char * const addr = memset_args->addr;
>> +    int ret;
>> +
>> +    if (!size) {
>> +        return NULL;
>> +    }
>> +
>> +    /* See do_touch_pages(). */
>> +    qemu_mutex_lock(&page_mutex);
>> +    while (!threads_created_flag) {
>> +        qemu_cond_wait(&page_cond, &page_mutex);
>> +    }
>> +    qemu_mutex_unlock(&page_mutex);
>> +
>> +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
>> +    if (ret) {
>> +        memset_thread_failed = true;
> 
> I'm wondering if this use of memset_thread_failed is sufficient.
> 
> This is pre-existing from the current impl, and ends up being
> used to set the bool result of 'touch_all_pages'. The caller
> of that then does
> 
>      if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
>          error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
>              "pages available to allocate guest RAM");
>      }
> 
> this was reasonable with the old impl, because the only reason
> we ever see 'memset_thread_failed==true' is if we got SIGBUS
> due to ENOMEM.
> 
> My concern is that madvise() has a bunch of possible errno
> codes returned on failure, and we're not distinguishing
> them. In the past this kind of thing has burnt us making
> failures hard to debug.
> 
> Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'
> 
> Then, we can make 'touch_all_pages' have an 'Error **errp'
> parameter, and it can directly call
> 
>   error_setg_errno(errp, memset_thread_errno, ....some message...)
> 
> when memset_thread_errno is non-zero, and thus we can remove
> the generic message from the caller of touch_all_pages.
> 
> If you agree, it'd be best to refactor the existing code to
> use this pattern in an initial patch.

We could also simply trace the return value, which should be 
comparatively easy to add. We should be getting either -ENOMEM or 
-EHWPOISON. And the latter is highly unlikely to happen when actually 
preallocating.

We made sure that we don't end up with -EINVAL as we're sensing if
MADV_POPULATE_WRITE works on the mapping.

So when it comes to debugging, I'd actually prefer tracing -errno, as 
the real error will be of little help to end users.

Makes sense?
Daniel P. Berrangé July 22, 2021, 1:47 p.m. UTC | #3
On Thu, Jul 22, 2021 at 03:39:50PM +0200, David Hildenbrand wrote:
> On 22.07.21 15:31, Daniel P. Berrangé wrote:
> > On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
> > > Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
> > > does not require a SIGBUS handler, doesn't actually touch page content,
> > > and avoids context switches; it is, therefore, faster and easier to handle
> > > than our current approach.
> > > 
> > > While MADV_POPULATE_WRITE is, in general, faster than manual
> > > prefaulting, and especially faster with 4k pages, there is still value in
> > > prefaulting using multiple threads to speed up preallocation.
> > > 
> > > More details on MADV_POPULATE_WRITE can be found in the Linux commit
> > > 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
> > > page tables") and in the man page proposal [1].
> > > 
> > > [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
> > > 
> > > This resolves the TODO in do_touch_pages().
> > > 
> > > In the future, we might want to look into using fallocate(), eventually
> > > combined with MADV_POPULATE_READ, when dealing with shared file
> > > mappings.
> > > 
> > > Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >   include/qemu/osdep.h |  7 ++++
> > >   util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
> > >   2 files changed, 74 insertions(+), 21 deletions(-)
> > 
> > 
> > > @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
> > >       return NULL;
> > >   }
> > > +static void *do_madv_populate_write_pages(void *arg)
> > > +{
> > > +    MemsetThread *memset_args = (MemsetThread *)arg;
> > > +    const size_t size = memset_args->numpages * memset_args->hpagesize;
> > > +    char * const addr = memset_args->addr;
> > > +    int ret;
> > > +
> > > +    if (!size) {
> > > +        return NULL;
> > > +    }
> > > +
> > > +    /* See do_touch_pages(). */
> > > +    qemu_mutex_lock(&page_mutex);
> > > +    while (!threads_created_flag) {
> > > +        qemu_cond_wait(&page_cond, &page_mutex);
> > > +    }
> > > +    qemu_mutex_unlock(&page_mutex);
> > > +
> > > +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
> > > +    if (ret) {
> > > +        memset_thread_failed = true;
> > 
> > I'm wondering if this use of memset_thread_failed is sufficient.
> > 
> > This is pre-existing from the current impl, and ends up being
> > used to set the bool result of 'touch_all_pages'. The caller
> > of that then does
> > 
> >      if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
> >          error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
> >              "pages available to allocate guest RAM");
> >      }
> > 
> > this was reasonable with the old impl, because the only reason
> > we ever see 'memset_thread_failed==true' is if we got SIGBUS
> > due to ENOMEM.
> > 
> > My concern is that madvise() has a bunch of possible errno
> > codes returned on failure, and we're not distinguishing
> > them. In the past this kind of thing has burnt us making
> > failures hard to debug.
> > 
> > Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'
> > 
> > Then, we can make 'touch_all_pages' have an 'Error **errp'
> > parameter, and it can directly call
> > 
> >   error_setg_errno(errp, memset_thead_errno, ....some message...)
> > 
> > when memset_thread_errno is non-zero, and thus we can remove
> > the generic message from the caller of touch_all_pages.
> > 
> > If you agree, it'd be best to refactor the existing code to
> > use this pattern in an initial patch.
> 
> We could also simply trace the return value, which should be comparatively
> easy to add. We should be getting either -ENOMEM or -EHWPOISON. And the
> latter is highly unlikely to happen when actually preallocating.
> 
> We made sure that we don't end up with -EINVAL as we're sensing if
> MADV_POPULATE_WRITE works on the mapping.

Those are the "normal" usage scenarios. I'm wondering about the
abnormal scenarios where QEMU code is mistakenly screwed up or
libvirt / mgmt app makes some config mistake. E.g. we can get
things like EPERM if selinux or seccomp block the madvise
syscall by mistake (common if QEMU is inside docker for example),
or we can get EINVAL if the 'addr' is not page aligned, and so on.

> So when it comes to debugging, I'd actually prefer tracing -errno, as the
> real error will be of little help to end users.

I don't care about the end users interpreting it, rather us as maintainers
who get a bug report containing insufficient info to diagnose the root
cause.
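
To illustrate one of the abnormal cases mentioned above, the following sketch shows madvise() failing with EINVAL on Linux when the start address is not page aligned. `madvise_errno_at_offset()` is a hypothetical helper, and MADV_NORMAL is used only because it is harmless; it is not what the patch passes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map two anonymous pages, call madvise() on one page worth of memory
 * starting 'offset' bytes into the mapping, and return the resulting
 * errno (0 on success).
 */
static int madvise_errno_at_offset(size_t offset)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    char *area = mmap(NULL, 2 * pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int err = 0;

    assert(offset < pagesize);
    if (area == MAP_FAILED) {
        return errno;
    }
    if (madvise(area + offset, pagesize, MADV_NORMAL) != 0) {
        err = errno;
    }
    munmap(area, 2 * pagesize);
    return err;
}
```

With an offset of 0 the call succeeds, while an offset of 1 yields EINVAL, which a plain boolean failure flag cannot tell apart from ENOMEM.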

Regards,
Daniel
David Hildenbrand July 22, 2021, 2:13 p.m. UTC | #4
On 22.07.21 15:47, Daniel P. Berrangé wrote:
> On Thu, Jul 22, 2021 at 03:39:50PM +0200, David Hildenbrand wrote:
>> On 22.07.21 15:31, Daniel P. Berrangé wrote:
>>> On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
>>>> Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
>>>> does not require a SIGBUS handler, doesn't actually touch page content,
>>>> and avoids context switches; it is, therefore, faster and easier to handle
>>>> than our current approach.
>>>>
>>>> While MADV_POPULATE_WRITE is, in general, faster than manual
>>>> prefaulting, and especially faster with 4k pages, there is still value in
>>>> prefaulting using multiple threads to speed up preallocation.
>>>>
>>>> More details on MADV_POPULATE_WRITE can be found in the Linux commit
>>>> 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
>>>> page tables") and in the man page proposal [1].
>>>>
>>>> [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
>>>>
>>>> This resolves the TODO in do_touch_pages().
>>>>
>>>> In the future, we might want to look into using fallocate(), eventually
>>>> combined with MADV_POPULATE_READ, when dealing with shared file
>>>> mappings.
>>>>
>>>> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    include/qemu/osdep.h |  7 ++++
>>>>    util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
>>>>    2 files changed, 74 insertions(+), 21 deletions(-)
>>>
>>>
>>>> @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
>>>>        return NULL;
>>>>    }
>>>> +static void *do_madv_populate_write_pages(void *arg)
>>>> +{
>>>> +    MemsetThread *memset_args = (MemsetThread *)arg;
>>>> +    const size_t size = memset_args->numpages * memset_args->hpagesize;
>>>> +    char * const addr = memset_args->addr;
>>>> +    int ret;
>>>> +
>>>> +    if (!size) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    /* See do_touch_pages(). */
>>>> +    qemu_mutex_lock(&page_mutex);
>>>> +    while (!threads_created_flag) {
>>>> +        qemu_cond_wait(&page_cond, &page_mutex);
>>>> +    }
>>>> +    qemu_mutex_unlock(&page_mutex);
>>>> +
>>>> +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
>>>> +    if (ret) {
>>>> +        memset_thread_failed = true;
>>>
>>> I'm wondering if this use of memset_thread_failed is sufficient.
>>>
>>> This is pre-existing from the current impl, and ends up being
>>> used to set the bool result of 'touch_all_pages'. The caller
>>> of that then does
>>>
>>>       if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
>>>           error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
>>>               "pages available to allocate guest RAM");
>>>       }
>>>
>>> this was reasonable with the old impl, because the only reason
>>> we ever see 'memset_thread_failed==true' is if we got SIGBUS
>>> due to ENOMEM.
>>>
>>> My concern is that madvise() has a bunch of possible errno
>>> codes returned on failure, and we're not distinguishing
>>> them. In the past this kind of thing has burnt us making
>>> failures hard to debug.
>>>
>>> Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'
>>>
>>> Then, we can make 'touch_all_pages' have an 'Error **errp'
>>> parameter, and it can directly call
>>>
>>>    error_setg_errno(errp, memset_thread_errno, ....some message...)
>>>
>>> when memset_thread_errno is non-zero, and thus we can remove
>>> the generic message from the caller of touch_all_pages.
>>>
>>> If you agree, it'd be best to refactor the existing code to
>>> use this pattern in an initial patch.
>>
>> We could also simply trace the return value, which should be comparatively
>> easy to add. We should be getting either -ENOMEM or -EHWPOISON. And the
>> latter is highly unlikely to happen when actually preallocating.
>>
>> We made sure that we don't end up with -EINVAL as we're sensing if
>> MADV_POPULATE_WRITE works on the mapping.
> 
> Those are the "normal" usage scenarios. I'm wondering about the
> abnormal scenarios where QEMU code is mistakenly screwed up or
> libvirt / mgmt app makes some config mistake. E.g. we can get
> things like EPERM if selinux or seccomp block the madvise
> syscall by mistake (common if QEMU is inside docker for example),
> or we can get EINVAL if the 'addr' is not page aligned, and so on.
> 
>> So when it comes to debugging, I'd actually prefer tracing -errno, as the
>> real error will be of little help to end users.
> 
> I don't care about the end users interpreting it, rather us as maintainers
> who get a bug report containing insufficient info to diagnose the root
> cause.

Well, okay. I'll have a look at how this turns out.

Patch

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 60718fc342..d1660d67fa 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -471,6 +471,11 @@  static inline void qemu_cleanup_generic_vfree(void *p)
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
 #endif
+#ifdef MADV_POPULATE_WRITE
+#define QEMU_MADV_POPULATE_WRITE MADV_POPULATE_WRITE
+#else
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -484,6 +489,7 @@  static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -497,6 +503,7 @@  static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #endif
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e8bdb02e1d..1cb80bf94c 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -484,10 +484,6 @@  static void *do_touch_pages(void *arg)
              *
              * 'volatile' to stop compiler optimizing this away
              * to a no-op
-             *
-             * TODO: get a better solution from kernel so we
-             * don't need to write at all so we don't cause
-             * wear on the storage backing the region...
              */
             *(volatile char *)addr = *addr;
             addr += hpagesize;
@@ -497,6 +493,31 @@  static void *do_touch_pages(void *arg)
     return NULL;
 }
 
+static void *do_madv_populate_write_pages(void *arg)
+{
+    MemsetThread *memset_args = (MemsetThread *)arg;
+    const size_t size = memset_args->numpages * memset_args->hpagesize;
+    char * const addr = memset_args->addr;
+    int ret;
+
+    if (!size) {
+        return NULL;
+    }
+
+    /* See do_touch_pages(). */
+    qemu_mutex_lock(&page_mutex);
+    while (!threads_created_flag) {
+        qemu_cond_wait(&page_cond, &page_mutex);
+    }
+    qemu_mutex_unlock(&page_mutex);
+
+    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
+    if (ret) {
+        memset_thread_failed = true;
+    }
+    return NULL;
+}
+
 static inline int get_memset_num_threads(int smp_cpus)
 {
     long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
@@ -510,10 +531,11 @@  static inline int get_memset_num_threads(int smp_cpus)
 }
 
 static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-                            int smp_cpus)
+                            int smp_cpus, bool use_madv_populate_write)
 {
     static gsize initialized = 0;
     size_t numpages_per_thread, leftover;
+    void *(*touch_fn)(void *);
     char *addr = area;
     int i = 0;
 
@@ -523,6 +545,12 @@  static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         g_once_init_leave(&initialized, 1);
     }
 
+    if (use_madv_populate_write) {
+        touch_fn = do_madv_populate_write_pages;
+    } else {
+        touch_fn = do_touch_pages;
+    }
+
     memset_thread_failed = false;
     threads_created_flag = false;
     memset_num_threads = get_memset_num_threads(smp_cpus);
@@ -534,7 +562,7 @@  static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
         memset_thread[i].numpages = numpages_per_thread + (i < leftover);
         memset_thread[i].hpagesize = hpagesize;
         qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
-                           do_touch_pages, &memset_thread[i],
+                           touch_fn, &memset_thread[i],
                            QEMU_THREAD_JOINABLE);
         addr += memset_thread[i].numpages * hpagesize;
     }
@@ -553,6 +581,12 @@  static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
     return memset_thread_failed;
 }
 
+static bool madv_populate_write_possible(char *area, size_t pagesize)
+{
+    return !qemu_madvise(area, pagesize, QEMU_MADV_POPULATE_WRITE) ||
+           errno != EINVAL;
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
                      Error **errp)
 {
@@ -560,29 +594,41 @@  void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
     struct sigaction act, oldact;
     size_t hpagesize = qemu_fd_getpagesize(fd);
     size_t numpages = DIV_ROUND_UP(memory, hpagesize);
+    bool use_madv_populate_write;
 
-    memset(&act, 0, sizeof(act));
-    act.sa_handler = &sigbus_handler;
-    act.sa_flags = 0;
-
-    ret = sigaction(SIGBUS, &act, &oldact);
-    if (ret) {
-        error_setg_errno(errp, errno,
-            "os_mem_prealloc: failed to install signal handler");
-        return;
+    /*
+     * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
+     * some special mappings, such as mapping /dev/mem.
+     */
+    use_madv_populate_write = madv_populate_write_possible(area, hpagesize);
+
+    if (!use_madv_populate_write) {
+        memset(&act, 0, sizeof(act));
+        act.sa_handler = &sigbus_handler;
+        act.sa_flags = 0;
+
+        ret = sigaction(SIGBUS, &act, &oldact);
+        if (ret) {
+            error_setg_errno(errp, errno,
+                "os_mem_prealloc: failed to install signal handler");
+            return;
+        }
     }
 
     /* touch pages simultaneously */
-    if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
+    if (touch_all_pages(area, hpagesize, numpages, smp_cpus,
+                        use_madv_populate_write)) {
         error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
             "pages available to allocate guest RAM");
     }
 
-    ret = sigaction(SIGBUS, &oldact, NULL);
-    if (ret) {
-        /* Terminate QEMU since it can't recover from error */
-        perror("os_mem_prealloc: failed to reinstall signal handler");
-        exit(1);
+    if (!use_madv_populate_write) {
+        ret = sigaction(SIGBUS, &oldact, NULL);
+        if (ret) {
+            /* Terminate QEMU since it can't recover from error */
+            perror("os_mem_prealloc: failed to reinstall signal handler");
+            exit(1);
+        }
     }
 }
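
For readers following the `numpages_per_thread + (i < leftover)` expression in touch_all_pages(), here is a small sketch of how pages can be split across worker threads. It assumes that `numpages_per_thread` and `leftover` are the quotient and remainder of `numpages` divided by the thread count, which the hunks above do not show; `pages_for_thread()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of pages handled by thread 'i' out of 'num_threads': every thread
 * gets the quotient, and the first 'leftover' threads get one extra page.
 */
static size_t pages_for_thread(size_t numpages, int num_threads, int i)
{
    assert(num_threads > 0 && i >= 0 && i < num_threads);

    size_t numpages_per_thread = numpages / (size_t)num_threads;
    size_t leftover = numpages % (size_t)num_threads;

    return numpages_per_thread + ((size_t)i < leftover);
}
```

With 10 pages and 4 threads this gives 3, 3, 2 and 2 pages. With fewer pages than threads, the trailing threads get 0 pages, which is the case the `if (!size)` early return in do_madv_populate_write_pages() covers.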