🤴🏾 🧖🏾 👩🏽‍🏭 Yes, my old laptop is several times more powerful than your production server 🛀 🚘 📰

These are the claims I heard from our developers. The most interesting thing is that this turned out to be true, giving rise to a lengthy investigation. It will be about SQL servers, which spin on our VMware.

Actually, to ensure that the production server is hopelessly behind the laptop is easy. Run (not on tempdb and not on a base with Delayed Durability enabled) the code:

set nocount on
create table _t (v varchar(100))
declare @n int=300000
while @n>0 begin 
  insert into _t select 'What a slowpoke!'
  delete from _t
  set @n=@n-1
  end
GO
drop table _t

On my desktop, it runs for 5 seconds, and on the production server - 28 seconds. Because SQL must wait for the physical end of the transaction log entry, and here we are doing very short transactions. Roughly speaking, we drove a large, powerful truck into city traffic, and we see how it is famously overtaken by pizza delivery scooters - throughput is not important here, only latency is important. And not a single network storage, no matter how many zeros in its price, will be able to win latency against the local SSD.

(in the comments it turned out that I lied - I had delayed durability in both places. Without delayed durability it turns out:
Desktop - 39 seconds, 15K tr / sec, 0.065ms / io roundtrip
PROD - 360 seconds, 1600 tr / sec, 0.6ms
I should have paid attention too soon)

However, in this case, we are dealing with the ~~trivial zeros of the zeta Riemann function~~ with a trivial example. In the example that the developers brought me, there was another. I became convinced that they were right, and began to clear out from the example all their specifics related to business logic. At some point, I realized that I could completely throw away their code, and write my own - which demonstrates the same problem - it runs 3-4 times slower on production:

create function dbo.isPrime (@n bigint)
returns int
as
  begin
  if @n = 1 return 0
  if @n = 2 return 1
  if @n = 3 return 1
  if @n % 2 = 0 return 0
  declare @sq int
  set @sq = sqrt(@n)+1 -- check odds up to sqrt
  declare @dv int = 1
  while @dv < @sq 
    begin
	set @dv=@dv+2
	if @n % @dv = 0 return 0
	end
  return 1
  end
GO
declare @dt datetime set @dt=getdate()
select dbo.isPrime(1000000000000037)
select datediff(ms,@dt,getdate()) as ms
GO

If everything is fine with you, then the verification of the simplicity of the number will be performed 6-7-8 seconds. So it was on a number of servers. But on some, the check took 25-40 seconds. Interestingly, there were no servers where the execution would take, say, 14 seconds - the code worked either very fast or very slow, that is, the problem was, let's say, black and white.

What I've done? Useful in VMware metrics. Everything was fine there - there were plenty of resources, Ready time = 0, everything was enough, during the test on both fast and slow servers CPU = 100 on one vCPU. I took a test to calculate the number of Pi - the test showed the same results on any servers. It smelled of black magic more and more.

Having got out on the DEV farm, I began to play as servers. It turned out that vMotion from host to host can "cure" the server, but can and vice versa, turn a "fast" server into a "slow" one. It seems like this - some hosts have a problem ... but ... no. Some virtual machine was slowing down on the host, for example, A but it was working fast on host B. And another virtual machine, on the contrary, was working fast on A and slowed down on B! The “fast” and “slow” cars were often spinning on the host!

From that moment on, the air distinctly smelled of sulfur. After all, the problem could not be attributed to any virtual machine (windows patches, for example) - because it turned into a “fast” one with vMotion. But the problem also could not be attributed to the host - because it could have both “fast” and “slow” machines. It was also not connected with the load - I managed to get a “slow” machine on the host, where there was nothing besides it at all.

Out of desperation, I launched Process Explorer from Sysinternals and looked at the SQL stack. On slow machines, the line immediately caught my eye:

ntoskrnl.exe! KeSynchronizeExecution + 0x5bf6
ntoskrnl.exe! KeWaitForMultipleObjects + 0x109d
ntoskrnl.exe! KeWaitForMultipleObjects + 0xb3f
ntosKrnx3 0!
ntoskrnl.exe! KeQuerySystemTimePrecise + 0x881 <- !!!
ntoskrnl.exe! ObDereferenceObjectDeferDelete + 0x28a
ntoskrnl.exe! KeSynchronizeExecution + 0x2de2
sqllang.dll! CDiagThreadSafe :: PxlvlReplace + 0x1a20
... skipped
sqldk.dll! SystemThread :: MakeMiniSOSThread + 0xa54
KERNEL32.DLL! BaseThreadInitThunk + 0x14
ntdll.dll! RtlUserThreadStart + 0x21

It was already something. The program was written:

    class Program
    {
        [DllImport("kernel32.dll")]
        static extern void GetSystemTimePreciseAsFileTime(out FILE_TIME lpSystemTimeAsFileTime);

        [StructLayout(LayoutKind.Sequential)]
        struct FILE_TIME
        {
            public int ftTimeLow;
            public int ftTimeHigh;
        }

        static void Main(string[] args)
        {
            for (int i = 0; i < 16; i++)
            {
                int counter = 0;

                var stopwatch = Stopwatch.StartNew();

                while (stopwatch.ElapsedMilliseconds < 1000)
                {
                    GetSystemTimePreciseAsFileTime(out var fileTime);
                    counter++;
                }

                if (i > 0)
                {
                    Console.WriteLine("{0}", counter);
                }
            }
        }
    }

This program showed even brighter deceleration - on “fast” machines it shows 16-18 million cycles per second, while on slow ones it’s one and a half million, or even 700 thousand. That is, the difference is 10-20 times (!!!). This was already a small victory: in any case, there was no threat of getting stuck between Microsoft and VMware support so that they would transfer arrows to each other.

Further, progress stopped - vacation, important matters, viral hysteria and a sharp increase in load. I often mentioned a magic problem to my colleagues, but at times it seemed that they didn’t always trust me - the statement that VMware slowed down the code 10-20 times was too monstrous.

I tried to unearth myself what was slowing down. At times it seemed to me that I found a solution - turning on and off Hot plugs, changing the memory size or the number of processors often turned the machine into a “fast” one. But not forever. But what turned out to be true is that it is enough to go out and knock on the wheel - that is, change any parameter of the virtual machine.

Finally, my American colleagues suddenly found the root cause.

Hosts differed in frequency!

This is usually not scary. But: when moving from a 'native' host to a host with a 'different' frequency, VMware should adjust the result of GetTimePrecise.
As a rule, this is not scary, unless there is an application that requests the exact time millions of times per second, like a SQL server.
But this is not scary, since SQL server does not always do this (see the Conclusion)

But there are cases when this rake painfully hits. And yet, yes, by tapping the wheel (changing something in the VM settings) I forced VMware to 'recount' the configuration, and the frequency of the current host became the 'native' frequency of the machine.

Decision

www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf

When you disable virtualization of the TSC, reading the TSC from within the virtual machine returns the physical machine’s TSC value, and writing the TSC from within the virtual machine has no effect. Migrating the virtual machine to another host, resuming it from suspended state, or reverting to a snapshot causes the TSC to jump discontinuously. Some guest operating systems fail to boot, or exhibit other timekeeping problems, when TSC virtualization is disabled. In the past, this feature has sometimes been recommended to improve performance of applications that read the TSC frequently, but performance of the virtual TSC has been improved substantially in current products. The feature has also been recommended for use when performing measurements that require a precise source of real time in the virtual machine.

In short, add the parameter

monitor_control.virtual_rdtsc = FALSE

Conclusion

You probably have a question: what for SQL call GetTimePrecise so often?

I do not have the SQL server source, but the logic says this. SQL is almost an OS with cooperative concurrency, where each thread has to give way from time to time. Where is it better to do it? Where there is a natural expectation - lock or IO. Well, what if we spin the computational cycles? Then the obvious and almost the only place is in the interpreter (this is not quite an interpreter), after the execution of the next operator.

As a rule, SQL server is not used for ~~nailing~~ clean calculations and this is not a problem. But cycles with working with all kinds of temporary tablets (which are immediately cached) turn the code into a sequence of very quickly executed statements.

By the way, if you wrap the function in NATIVELY COMPILED, then it stops asking for time, and its speed increases by a factor of 10. But what about cooperative multitasking? But for natively compiled code, I had to do PREEMPTIVE MULTITASKING in SQL.

Yes, my old laptop is several times more powerful than your production server

Decision

Conclusion

More articles: