Performance drop in parallel application

Questions about Wine on Linux
MarkoM
Level 1
Posts: 7
Joined: Tue Mar 17, 2020 10:41 am

Performance drop in parallel application

Post by MarkoM »

Hi,

I am trying to run an application through Wine. The application supports parallelisation, and the number of CPUs to use (n_cpus) can be passed to it. What I have noticed with n_cpus=4 is that the application takes about 2x longer under Wine than on Windows. I have also noticed that the performance drop starts when the parallelisation begins: the 4 extra spawned processes are mostly in sleep state (Linux top command), yet top also reports ~100-300% CPU time, so I am not sure whether that is just a reporting issue. I have only started using Wine recently, so my questions are: what could cause such a drop, and what is the best way to debug it? I also read that wineserver could be a bottleneck, and I tried enabling esync (WINEESYNC=1), but it didn't help...
I hope someone can help me with this.
Bob Wya
Level 12
Posts: 3068
Joined: Sat Oct 16, 2010 7:40 pm

Re: Performance drop in parallel application

Post by Bob Wya »

@MarkoM

Did you use Wine Staging? Regular Wine does not have the esync patchset available.
You can check whether esync is enabled with:

Code: Select all

export WINEDEBUG=+esync
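For example, something along these lines should show esync messages at startup if the build actually carries the patchset (the application name is only a placeholder):

Code: Select all

WINEESYNC=1 WINEDEBUG=+esync wine your_application.exe 2>&1 | grep -i esync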
If wineserver were the bottleneck, you would expect to see high CPU time for that process.
You can run wineserver as a foreground process to troubleshoot it; see:

Code: Select all

wineserver --help
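For example, something like this keeps wineserver attached to the terminal with request tracing enabled (run it against the same WINEPREFIX the application uses; the application name is again a placeholder):

Code: Select all

wineserver -k        # stop the wineserver currently managing this prefix
wineserver -f -d1 &  # restart it non-daemonized, debug level 1, output on this terminal
wine your_application.exe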
There is also fsync, which can potentially be more efficient than esync, but it requires a custom (patched) Linux kernel.
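In builds that carry the fsync patches (e.g. Proton or wine-tkg, not plain Wine 5.0), it is typically toggled with an environment variable, along the lines of:

Code: Select all

WINEFSYNC=1 wine your_application.exe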

Bob
MarkoM
Level 1
Posts: 7
Joined: Tue Mar 17, 2020 10:41 am

Re: Performance drop in parallel application

Post by MarkoM »

I used regular Wine 5.0. Thanks, I will check this.
MarkoM
Level 1
Posts: 7
Joined: Tue Mar 17, 2020 10:41 am

Re: Performance drop in parallel application

Post by MarkoM »

It seems that wineserver is not the problem: I don't see high CPU usage for it, and esync didn't help. But what I do see (and find strange, as I have never developed a parallel Windows application) is that the 4 processes spawned by the parent constantly get new PIDs (they spawn, run, exit, and then the cycle starts again). Could this repeated process creation, instead of one long-lived process handling a message queue, be the cause of the performance drop? Can it be handled in Wine?
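(For reference, the PID churn can be watched with something like the following; the process name is just an example.)

Code: Select all

# constantly changing PIDs and start times mean the workers are being re-spawned
watch -n 1 'ps -eo pid,lstart,pcpu,stat,comm | grep -i yourapp'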
Bob Wya
Level 12
Posts: 3068
Joined: Sat Oct 16, 2010 7:40 pm

Re: Performance drop in parallel application

Post by Bob Wya »

@MarkoM,

So, if I understand you correctly, the parallel processes are being re-spawned rapidly.
I presume this behaviour can also be replicated when running your application under native Windows?

You might want to check out the perf profiling tool to see which component is causing the Wine bottleneck.
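A rough first pass could look like this (record while the parallel phase is running; adjust the duration to taste):

Code: Select all

sudo perf top                        # live view of the hottest symbols system-wide
sudo perf record -a -g -- sleep 30   # or record everything for 30 seconds
sudo perf report                     # then inspect the recorded profile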
MarkoM wrote: Could this repeated process creation, instead of one long-lived process handling a message queue, be the cause of the performance drop?
Que? :?

Bob
MarkoM
Level 1
Posts: 7
Joined: Tue Mar 17, 2020 10:41 am

Re: Performance drop in parallel application

Post by MarkoM »

Bob Wya wrote: Fri Mar 20, 2020 11:08 am
MarkoM wrote: Could this repeated process creation, instead of one long-lived process handling a message queue, be the cause of the performance drop?
Que? :?

Bob
I was thinking that normally a worker process is spawned once and then keeps handling tasks (some sort of task/message/job queue where it processes one item after another). But here the processes are constantly recreated (each new task/message/job requires creating a new process), so maybe that is the reason for the performance drop compared to the Windows machine?
And yes, I can also see this in the Windows environment: the processes are re-spawned there as well.
MarkoM
Level 1
Posts: 7
Joined: Tue Mar 17, 2020 10:41 am

Re: Performance drop in parallel application

Post by MarkoM »

I did profiling with perf, and it seems that the most time-consuming parts are

Code: Select all

 
 [kernel.kallsyms]         [k] native_queued_spin_lock_slowpath
 [kernel.kallsyms]         [k] __entry_trampoline_start
 [kernel.kallsyms]         [k] syscall_return_via_sysret
 ntdll.dll.so              [.] RtlLeaveCriticalSection
 ntdll.dll.so              [.] RtlpWaitForCriticalSection
 [kernel.kallsyms]         [k] try_to_wake_up
 ntdll.dll.so              [.] RtlEnterCriticalSection
Among those, the first two

Code: Select all

native_queued_spin_lock_slowpath
and

Code: Select all

__entry_trampoline_start
take the most time.
Is there a possibility to solve this in Wine?
ObsequiousNewt
Level 2
Posts: 15
Joined: Wed Oct 15, 2014 8:19 pm

Re: Performance drop in parallel application

Post by ObsequiousNewt »

I'm not sure that's nearly enough information to derive anything useful from. Actual time percentages would be helpful; call stacks even more so. In particular what's calling RtlEnterCriticalSection()?
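For reference, call stacks can be captured with something along these lines (DWARF unwinding usually gives more complete stacks for Wine code; adjust as needed):

Code: Select all

sudo perf record -a -g --call-graph dwarf -- sleep 30   # record call graphs while the parallel phase runs
sudo perf report -g --stdio                             # print the expanded call tree as text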

Note also that symbols marked with [k] are in the kernel and not Wine. That doesn't necessarily mean it's not Wine's fault, but just knowing that there's a lot of time spent in the kernel doesn't really help us understand why.

Does using a different "n_cpus" value than 1 or 4 yield interesting results?
MarkoM
Level 1
Posts: 7
Joined: Tue Mar 17, 2020 10:41 am

Re: Performance drop in parallel application

Post by MarkoM »

ObsequiousNewt wrote: Thu Mar 26, 2020 9:32 am I'm not sure that's nearly enough information to derive anything useful from. Actual time percentages would be helpful; call stacks even more so. In particular what's calling RtlEnterCriticalSection()?

Note also that symbols marked with [k] are in the kernel and not Wine. That doesn't necessarily mean it's not Wine's fault, but just knowing that there's a lot of time spent in the kernel doesn't really help us understand why.

Does using a different "n_cpus" value than 1 or 4 yield interesting results?
n_cpus=1 is even slower (no parallelisation, so that is logical). However, I have sometimes noticed that under Wine some of the extra (n_cpus) processes can also hang; with n_cpus=1 that issue is gone. RtlEnterCriticalSection() is called by those n_cpus processes that are being regularly re-spawned.
MarkoM
Level 1
Posts: 7
Joined: Tue Mar 17, 2020 10:41 am

Re: Performance drop in parallel application

Post by MarkoM »

Collecting call stacks with perf (I am no expert there, so I may be doing something wrong) revealed these function calls (maybe I missed something in the previous perf analysis). What can be concluded from this?

Code: Select all

# Children      Self  Command          Shared Object            Symbol                                                     
# ........  ........  ...............  .......................  ...........................................................
#
    36.08%     1.48%  application_subprocess  libc-2.27.so             [.] syscall
            |          
             --34.61%--syscall
                       |          
                       |--26.58%--entry_SYSCALL_64_after_hwframe
                       |          |          
                       |           --25.49%--do_syscall_64
                       |                     |          
                       |                      --24.62%--sys_futex
                       |                                |          
                       |                                 --23.12%--do_futex
                       |                                           |          
                       |                                           |--14.64%--futex_wake
                       |                                           |          |          
                       |                                           |          |--10.93%--wake_up_q
                       |                                           |          |          |          
                       |                                           |          |           --10.73%--try_to_wake_up
                       |                                           |          |                     |          
                       |                                           |          |                     |--5.28%--__task_rq_lock
                       |                                           |          |                     |          |          
                       |                                           |          |                     |           --5.14%--_raw_spin_lock
                       |                                           |          |                     |                     |          
                       |                                           |          |                     |                      --5.08%--native_queued_spin_lock_slowpath
                       |                                           |          |                     |          
                       |                                           |          |                      --2.13%--ttwu_do_activate