The server is multi-threaded and uses UDP. Each thread listens on its own socket bound to its own port; the port numbers are consecutive, from StartPort to StartPort + Number_of_threads - 1. The software can be briefly described like this:
procedure ListenThread(lpParameter: Pointer); stdcall;
begin
  NT := Integer(lpParameter); // thread num
  socket();
  bind("0.0.0.0", port := StartPort + NT);
  while not shutdown do // shutdown is changed in the main thread
  begin
    RecvFrom();
    if copy(buffer, 1, 10) = 'querytype1' then
    begin
      // get some parameters
      // generate reply
      // log to file named querytype1_port.log
    end
    else if copy(buffer, 1, 10) = 'querytype2' then
    begin
      // get some parameters
      // generate reply
      // log to file named querytype2_port.log
    end;
  end;
  Closesocket();
  CloseHandle();
end;

// Main procedure
begin
  StartPort := 20000;
  NumServers := 10;
  WSAStartup();
  for i := 0 to NumServers - 1 do
  begin
    hThread[i] := CreateThread(nil, 0, @ListenThread, Pointer(I), 0, lpThreadId);
    sleep(10);
  end;
  WaitForMultipleObjects();
  WSAcleanup();
end;
The problem is that sometimes, under load, a packet is received and processed by the wrong thread, i.e. on the wrong socket. How did I find it? The log file querytype2_20001 contains a record for a packet sent from some IP. I looked for the same IP in querytype1_20001 (that query must be made first), but I found it in querytype1_20000 instead! To be clear, the port number is stored in a global array of structures, Servers[NT].port, and each element of it is accessed by only one thread (NT is a local integer which is set only once, during thread creation, and never changes). I had never noticed this bug before; it appears only on my 2 latest server installations, Debian 8 and Debian 9, set up in December and now in February. It has also never happened under Windows.
Now I have written a stress test for my app: 32 threads simultaneously send 1000 querytype1 packets each to the even port numbers, and the counters on the odd ports are still zero, so I can't reproduce it artificially on the same system (with a different port range). But according to the log files, it happened yesterday. A client sending querytype1 to one server and querytype2 to another is impossible; the IP of both queries must be logged by the same server (i.e. thread/port/socket).
The total number of queries logged by my server (all threads) differs by 0.5% from some outside estimates. Each port (thread/socket) differs by about 1% to 8% from independent estimates and, importantly, if a port receives a high number of packets, the difference is negative (packets sent to this port were logged by other, less loaded threads), while ports with a low number of requests show higher values. The total was about 150000 packets from 50000 IPs over 24 hours, which is not much. I have worked with a single server (single port) processing about 100pps.
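For reference, the stress test described above could look roughly like the following Free Pascal sketch. It is not the actual test program. The use of the Free Pascal Sockets unit, the target address 127.0.0.1, the port range starting at 20000, and the names TSender and TargetPort are all assumptions made for illustration.

```pascal
program UdpStressSketch;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cthreads,{$endif}
  Classes, SysUtils, Sockets;

const
  StartPort        = 20000;        // assumed: base port of the server under test
  NumServers       = 10;           // assumed: ports StartPort..StartPort+NumServers-1
  NumSenders       = 32;           // sender threads, as in the description
  PacketsPerSender = 1000;         // packets per sender, as in the description
  ServerIP         = '127.0.0.1';  // assumed: server runs on the same machine

// Hypothetical helper: maps a sender index to an even-offset port only
// (StartPort, StartPort + 2, ...), so odd ports should receive nothing.
function TargetPort(SenderIndex: Integer): Word;
begin
  Result := StartPort + 2 * (SenderIndex mod ((NumServers + 1) div 2));
end;

type
  // Hypothetical sender thread: one UDP socket, one fixed target port.
  TSender = class(TThread)
  private
    FIndex: Integer;
  protected
    procedure Execute; override;
  public
    constructor Create(AIndex: Integer);
  end;

constructor TSender.Create(AIndex: Integer);
begin
  FIndex := AIndex;
  inherited Create(False); // not suspended: the thread starts right away
end;

procedure TSender.Execute;
var
  Sock: LongInt;
  Addr: TInetSockAddr;
  Payload: AnsiString;
  I: Integer;
begin
  Sock := fpSocket(AF_INET, SOCK_DGRAM, 0);
  if Sock < 0 then
    Exit;
  FillChar(Addr, SizeOf(Addr), 0);
  Addr.sin_family := AF_INET;
  Addr.sin_port := htons(TargetPort(FIndex));
  Addr.sin_addr := StrToNetAddr(ServerIP);
  Payload := 'querytype1'; // the server matches on the first 10 characters
  for I := 1 to PacketsPerSender do
    fpSendTo(Sock, PAnsiChar(Payload), Length(Payload), 0, @Addr, SizeOf(Addr));
  CloseSocket(Sock);
end;

var
  Senders: array[0..NumSenders - 1] of TSender;
  I: Integer;
begin
  for I := 0 to NumSenders - 1 do
    Senders[I] := TSender.Create(I);
  for I := 0 to NumSenders - 1 do
  begin
    Senders[I].WaitFor;
    Senders[I].Free;
  end;
  WriteLn('Sent ', NumSenders * PacketsPerSender, ' packets to even ports only');
end.
```

After a run, any non-zero querytype1 counter on an odd port would mean a packet was handled by a thread it was not addressed to. UDP delivery is not guaranteed even on loopback, so the even-port counters may add up to slightly less than the number of packets sent.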
Could somebody guess how this could happen? Maybe there is some problem with the packet queue or the network stack?