Discussion:
Reasonable values for CRITICAL_SECTION spin-count
Oliver S.
2005-10-19 18:32:35 UTC
Does anyone have some information on reasonable values for the spin-count
parameter for CRITICAL_SECTIONs? I'm assuming that the CRITICAL_SECTION
is held only for a small number of instructions, i.e. below 100, and that
none of these instructions imposes a large stall on the processing while
the CRITICAL_SECTION is held.
Oliver S.
2005-10-19 21:06:02 UTC
Post by Oliver S.
Does anyone have some information on reasonable values for the spin-count
parameter for CRITICAL_SECTIONs? I'm assuming that the CRITICAL_SECTION
is held only for a small number of instructions, i.e. below 100, and that
none of these instructions imposes a large stall on the processing while
the CRITICAL_SECTION is held.
I have a nice idea: I simply measure the time of a WaitForSingleObject on
an event-object which is initially set (maybe a hundred times in a loop,
taking the smallest value) in clock-cycles, and set the spin-parameter so
that the default spinning takes half the time of a WaitForSingleObject.
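
A minimal sketch of that measurement (assuming MSVC's __rdtsc intrinsic
from <intrin.h>; FastestSignalledWFSO is just a name I made up here, and
error checking is omitted):

#include <windows.h>
#include <intrin.h>

// Fastest of 100 WaitForSingleObject calls on a signalled auto-reset
// event, in clock-cycles; every wait returns immediately, so this is
// the pure round-trip cost of the kernel call.
unsigned __int64 FastestSignalledWFSO()
{
    HANDLE hEvt = ::CreateEvent( NULL, FALSE, TRUE, NULL ); // auto-reset, set
    unsigned __int64 dwlFastest = (unsigned __int64)-1;
    for( int tries = 100; tries; tries-- )
    {
        unsigned __int64 dwlStart = __rdtsc();
        ::WaitForSingleObject( hEvt, INFINITE );
        unsigned __int64 dwlTicks = __rdtsc() - dwlStart;
        ::SetEvent( hEvt );             // re-signal for the next round
        if( dwlTicks < dwlFastest )
            dwlFastest = dwlTicks;
    }
    ::CloseHandle( hEvt );
    return dwlFastest;
}

The spin-parameter would then be calibrated so that spinning costs about
FastestSignalledWFSO() / 2 cycles.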
Stefan Kuhr
2005-10-21 09:05:55 UTC
Hi Oliver,
Post by Oliver S.
Post by Oliver S.
Does anyone have some information on reasonable values for the spin-count
parameter for CRITICAL_SECTIONs? I'm assuming that the CRITICAL_SECTION
is held only for a small number of instructions, i.e. below 100, and that
none of these instructions imposes a large stall on the processing while
the CRITICAL_SECTION is held.
I have a nice idea: I simply measure the time of a WaitForSingleObject on
an event-object which is initially set (maybe a hundred times in a loop,
taking the smallest value) in clock-cycles, and set the spin-parameter so
that the default spinning takes half the time of a WaitForSingleObject.
I would guess that the time you measure will vary with the usage
scenario of the section of code you want to protect (i.e. the number of
threads that typically access the critsect and the number of
instructions for which they typically own it). So if you protect
both very short and very long sequences of instructions with one and the
same critsect, you will probably get bad performance for the long
instruction sequences if you initialize the critsect for short sequences,
and vice versa.

In my docs, there is sort of a "best practices" tip:

"The heap manager uses a spin count of roughly 4000 for its per-heap
critical sections. This gives great performance and scalability in
almost all worst-case scenarios."

Therefore in my code I always use 4000 and make a lame excuse for the
hardcoded number by quoting this from the docs in a comment. YMMV.
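
In code that's just the following (a sketch; InitializeCriticalSectionAndSpinCount
is the documented Win32 call, 4000 the number quoted above):

#include <windows.h>

CRITICAL_SECTION g_cs;

void InitLocks()
{
    // "The heap manager uses a spin count of roughly 4000 for its
    // per-heap critical sections." -- copied for lack of a better number
    ::InitializeCriticalSectionAndSpinCount( &g_cs, 4000 );
}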
--
Stefan Kuhr

"Lesen schadet der Dummheit"
Oliver S.
2005-10-21 20:53:40 UTC
Post by Stefan Kuhr
I would guess that the time you will measure will vary with the usage
scenario of the section of code you want to protect ...
Of course, but I won't stay in a spin loop longer than I would wait in
the kernel under the most advantageous circumstances. I think it's a very
reasonable decision to wait half the time the fastest possible
WaitForSingleObject would last.
I determine the appropriate spin-count in several successive steps:
- First I run WaitForSingleObject a hundred times on a pre-set autoreset
  event and measure the fastest WFSO with the timestamp-counter of
  the CPU.
- Then I start calling the spin-loop with spin-counts which go from
  one up, doubling the spin-count every iteration until the time taken
  for the spin-loop is above the time of the fastest WFSO. To be
  accurate, I do this a hundred times to account for any interrupts,
  DPCs and other threads preempting this.
- Then I do the above again, but in the opposite direction, by first
  subtracting a fourth of the previously found spin-count value, then an
  eighth and so on, until this subtraction-value comes to zero. So I get
  a spin-count which matches the fastest possible WFSO.
To prevent the thread doing this test from being migrated to another CPU,
whereby the timestamp-counters wouldn't be based on the same zero-point,
I fix the thread to CPU number one while performing all of the above.
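
The pinning itself is just an affinity-mask dance (a sketch;
SetThreadAffinityMask returns the previous mask, which gets restored
afterwards):

#include <windows.h>

// Run fn pinned to CPU 0 so all timestamp readings come from one TSC;
// restore the previous affinity afterwards. Error checks omitted.
template<class Fn> void RunPinnedToCpu0( Fn fn )
{
    DWORD_PTR dwpPrev = ::SetThreadAffinityMask( ::GetCurrentThread(), 1 );
    fn();                              // the WFSO and spin-count probes
    ::SetThreadAffinityMask( ::GetCurrentThread(), dwpPrev );
}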
Post by Stefan Kuhr
"The heap manager uses a spin count of roughly 4000 for its per-heap
critical sections.
I think that's a ridiculously high value.
Stefan Kuhr
2005-10-21 22:55:31 UTC
Hi Oliver,
Post by Oliver S.
"The heap manager uses a spin count of roughly 4000 for itsent per-heap
critical sections.
I think that's a riciculous high value.
What you describe is really very interesting and it all makes sense to
me. What are the numbers you measured? Do they differ by an order of
magnitude from 4000? Do you do a test run on your developer machine,
take the results and hardcode the spin count in your code, or do you
instead use some adaptive approach, measuring these things on the fly
and using the results of previous runs of your code in later runs of
the same code within the same process?

This spin-count number thingamajig is something that is IMHO
underdocumented; I would really like to know what Ivan would tell us
about this and what the best practices at MS are regarding
InitializeCriticalSectionAndSpinCount.
--
Stefan
Oliver S.
2005-10-22 04:07:07 UTC
What you describe is really very interesting and it all makes sense to me.
What a lot of developers adjusting these spin-counts aren't aware of
is that the spin-count isn't there to let you spin for a period in the
magnitude of the time the mutex is held, but just to prevent a kernel-call
to WFSO; if we spun longer than this call usually lasts, spinning wouldn't
give the best performance-improvement over non-spinning. But they do the
former and give the spin-count such high turn-numbers.
What are the numbers you measured?
Do they differ by an order of magnitude from 4000?
I've got a uniprocessor-system, so spinning wouldn't make sense on it,
and thus my spin-count-detection is guarded by a ...
if( (GetSystemInfo( &systemInfo ),
     systemInfo.dwNumberOfProcessors) > 1 )
... to prevent any spinning on uniprocessor-systems.
But if I remove that guard, the fastest WFSO takes 782 clock-cycles on my
Athlon, and the corresponding spin-count is 44! If we take a magnitude-base
of ten, that's two magnitudes lower than 4000.
But my detection-routine probes the fastest possible WFSO on the same
core of a single CPU in the SMP-set, and it even sets the event-object by
itself; I think that's not optimal. It would be more reasonable to have
two probe-threads running in different combinations, i.e. on different
threads of the same core, on different cores of the same CPU and on
different CPUs. These probe-threads should alternately wake each other up
through SetEvent and WaitForSingleObject, and both threads should measure
the fastest time for a wake-up. The average time over all the mentioned
combinations should be the reference for the following spin-count
detection.
And to be even better and to account for NUMA-systems, where cachelines
holding lock-flags (they're set with atomic CAS-operations) travel for
different times between different combinations of CPUs, cores or threads,
these two threads should run on all possible combinations. So when we
have an eight-way dual-core-system, there would be 16 * 15 combinations;
as this test is done only once, it should be ok to go through this
complex process (see the sketch below).
But nevertheless, I think I get to the right magnitude for spinning with
my simple approach.
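
A sketch of one such pairwise probe (made-up names; the creator pins the
two threads to the CPU pair under test with SetThreadAffinityMask, and
each thread times a full ping-pong round trip on its own TSC, which
sidesteps cross-CPU TSC skew):

#include <windows.h>
#include <intrin.h>

struct PingPong
{
    HANDLE           hMyEvt, hOtherEvt;  // two auto-reset events
    unsigned __int64 dwlFastest;         // fastest round trip in cycles
};

DWORD WINAPI ProbeThread( LPVOID pv )
{
    PingPong *pp = (PingPong *)pv;
    pp->dwlFastest = (unsigned __int64)-1;
    for( int tries = 100; tries; tries-- )
    {
        unsigned __int64 dwlStart = __rdtsc();
        ::SetEvent( pp->hOtherEvt );                    // wake the partner
        ::WaitForSingleObject( pp->hMyEvt, INFINITE );  // wait to be woken
        unsigned __int64 dwlTicks = __rdtsc() - dwlStart;
        if( dwlTicks < pp->dwlFastest )
            pp->dwlFastest = dwlTicks;
    }
    ::SetEvent( pp->hOtherEvt );    // release the partner's last wait
    return 0;
}

Averaging the two threads' fastest values over all CPU pairs gives the
reference time for the spin-count detection.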
Do you do a test run on your developer machine, take the results
and hardcode the spin count in your code, or do you instead use some
adaptive approach, measuring these things on the fly and using the
results of previous runs of your code in later runs of the same code
within the same process?
I use my own mutex-class which has a static (i.e. "global") method
called CheckPlatformConditions(). This method checks whether we're on a
multiprocessor-system and whether we haven't detected the spin-count yet.
So this method can be called from everywhere, even when we have been
using that mutex and thereby missed appropriate spinning up to now.
This spin-count number thingamajig is something that is IMHO
underdocumented; I would really like to know what Ivan would
tell us about this and what the best practices at MS are regarding
InitializeCriticalSectionAndSpinCount.
I've also been looking for information on this, but I found nothing.
Chris Thomasson
2005-10-22 08:00:08 UTC
Post by Oliver S.
And to be even better and to account for NUMA-systems, where cachelines
holding lock-flags (they're set with atomic CAS-operations) travel for
different times between different combinations of CPUs, cores or threads,
these two threads should run on all possible combinations.
I did something similar for an old database server. I used the data to
decide when it was a good time to block on a series of contended stacks.
IIRC, it gave me a somewhat measurable performance boost for some
transactions, most of the time. It was used to address some of the
starvation issues that crop up, wrt memory locality, processor distance,
etc., when two processors in a NUMA-based system contend for the same
stack. The stack design was basically the same as the one described here:

http://www.hpl.hp.com/techreports/2004/HPL-2004-105.html
Post by Oliver S.
So when we have an eight-way dual-core-system, there would be 16 * 15
combinations; as this test is done only once, it should be ok to go
through this complex process.
IMO, the process is probably worth it; server startup is a good time to do a
lot of things...

;)
Oliver S.
2005-10-22 10:35:38 UTC
Post by Chris Thomasson
So when we have an eight-way dual-core-system, there would be 16 * 15
combinations; as this test is done only once, it should be ok to go
through this complex process.
IMO, the process is probably worth it; server startup is a good time to
do a lot of things...
The process is complex in contrast to a simple lock/unlock-pair.
But I think it won't run for more than a single-digit number of
milliseconds, so this should be ok for every app.
Oliver S.
2005-10-23 03:41:18 UTC
Post by Oliver S.
And to be even better and to account for NUMA-systems, where cachelines
holding lock-flags (they're set with atomic CAS-operations) travel for
different times between different combinations of CPUs, cores or threads,
these two threads should run on all possible combinations. So when we
have an eight-way dual-core-system, there would be 16 * 15 combinations;
as this test is done only once, it should be ok to go through this
complex process.
I implemented this test; so here it is:

struct DefaultSpinDetection
{
    static int const DO_NOTHING         = 1,
                     DO_NOTHING_ACK     = 2,
                     DO_WAKEUP_SPINNING = 3,
                     FINISH             = 4;
    HANDLE       hEvtWakeUp;
    int volatile state;
};

DWORD MutexImpl::sm_dwDefaultSpinRepeat = 0;

void MutexImpl::CheckPlatformConditions()
{
    SYSTEM_INFO systemInfo;
    unsigned    processors;

    if( sm_dwDefaultSpinRepeat == 0 &&
        (GetSystemInfo( &systemInfo ),
         processors = (unsigned)systemInfo.dwNumberOfProcessors) > 1 )
    {
        DWORDLONG            dwlSumTicks;
        ns_Win32::XHANDLE    xhEvtWakeUp;
        DefaultSpinDetection dsd;
        ns_Win32::XHANDLE    xhDSDThread;
        HANDLE               hOurThread;
        DWORD_PTR            dwpPreviousAffinityMask;
        int                  processorCombinations;
        unsigned             ourProcessor, dsdProcessor;
        DWORDLONG            dwlFastestWFSOTicks;
        int                  tries;
        DWORDLONG            dwlStartTick, dwlEndTick, dwlTicks;
        double               avgFastestWFSOTicks;
        DWORDLONG            dwlAvgFastestWFSOTicks;
        DWORDLONG            dwlFastestSpinTicks;
        DWORDLONG            dwlSpinTicksLimit;
        DWORD                dwSpinRepeat, dwSpinRepeatReduce;
        LONG                 lLockAndWaiterCounter;

        // Probe the fastest wake-up for every CPU pair and average it.
        dwlSumTicks    = 0;
        xhEvtWakeUp.h  = ::CreateEvent( NULL, FALSE, FALSE, NULL );
        dsd.hEvtWakeUp = xhEvtWakeUp.h;
        dsd.state      = DefaultSpinDetection::DO_NOTHING;
        xhDSDThread.h  = ::CreateThread( NULL, 0, DSDThread, &dsd, 0, NULL );
        hOurThread     = ::GetCurrentThread();
        dwpPreviousAffinityMask = ::SetThreadAffinityMask( hOurThread, 1 );

        for( (processorCombinations = 0, ourProcessor = 0);
             ourProcessor < processors; ourProcessor++ )
            for( (::SetThreadAffinityMask( hOurThread,
                      (DWORD_PTR)1 << ourProcessor ),
                  dsdProcessor = ourProcessor + 1);
                 dsdProcessor < processors; dsdProcessor++ )
            {
                ::SetThreadAffinityMask( xhDSDThread.h,
                    (DWORD_PTR)1 << dsdProcessor );

                // Handshake: wait until the helper thread has acknowledged.
                dsd.state = DefaultSpinDetection::DO_NOTHING;
                while( dsd.state != DefaultSpinDetection::DO_NOTHING_ACK );

                ::ResetEvent( xhEvtWakeUp.h );
                dsd.state = DefaultSpinDetection::DO_WAKEUP_SPINNING;

                // Take the fastest of 100 waits to filter out interrupts,
                // DPCs and preemption.
                for( (dwlFastestWFSOTicks = (DWORDLONG)(LONGLONG)-1,
                      tries = 100);
                     tries; tries-- )
                {
                    dwlStartTick = GetTSC();
                    ::WaitForSingleObject( xhEvtWakeUp.h, INFINITE );
                    dwlEndTick = GetTSC();

                    if( (dwlTicks = dwlEndTick - dwlStartTick) < dwlFastestWFSOTicks )
                        dwlFastestWFSOTicks = dwlTicks;
                }

                dwlSumTicks += dwlFastestWFSOTicks;
                processorCombinations++;
            }

        ::SetThreadAffinityMask( hOurThread, dwpPreviousAffinityMask );
        dsd.state = DefaultSpinDetection::FINISH;

        avgFastestWFSOTicks    = (double)(LONGLONG)dwlSumTicks
                                 / processorCombinations;
        dwlAvgFastestWFSOTicks = (DWORDLONG)(LONGLONG)(avgFastestWFSOTicks + 0.5);

        // Double the spin-count until spinning costs more than half the
        // average fastest WFSO. The lock word stays non-zero, so every
        // probe spins the full count.
        for( (lLockAndWaiterCounter = (LONG)0x80000000u,
              dwlSpinTicksLimit = dwlAvgFastestWFSOTicks / 2,
              dwSpinRepeat = 0); ;
             dwSpinRepeat = (dwSpinRepeat != 0) ? (dwSpinRepeat * 2) : 1 )
        {
            for( (dwlFastestSpinTicks = (DWORDLONG)(LONGLONG)-1, tries = 100);
                 tries; tries-- )
            {
                dwlStartTick = GetTSC();
                SpinUntilAccessible( &lLockAndWaiterCounter, dwSpinRepeat );
                dwlEndTick = GetTSC();

                if( (dwlTicks = dwlEndTick - dwlStartTick) < dwlFastestSpinTicks )
                    dwlFastestSpinTicks = dwlTicks;
            }

            if( dwlFastestSpinTicks > dwlSpinTicksLimit )
                break;
        }

        // Refine downwards: subtract a fourth, then an eighth, and so on.
        for( dwSpinRepeatReduce = dwSpinRepeat / 4;
             dwSpinRepeatReduce > 0;
             dwSpinRepeatReduce /= 2 )
        {
            for( (dwlFastestSpinTicks = (DWORDLONG)(LONGLONG)-1, tries = 100);
                 tries; tries-- )
            {
                dwlStartTick = GetTSC();
                SpinUntilAccessible( &lLockAndWaiterCounter, dwSpinRepeat );
                dwlEndTick = GetTSC();

                if( (dwlTicks = dwlEndTick - dwlStartTick) < dwlFastestSpinTicks )
                    dwlFastestSpinTicks = dwlTicks;
            }

            if( dwlFastestSpinTicks > dwlSpinTicksLimit )
                dwSpinRepeat -= dwSpinRepeatReduce;
        }

        sm_dwDefaultSpinRepeat = dwSpinRepeat;
    }
}

__declspec(naked)
bool __fastcall MutexImpl::SpinUntilAccessible(
    LONG volatile *plLockAndWaiterCounter, DWORD dwSpinRepeat )
{
    // __fastcall: ecx = plLockAndWaiterCounter, edx = dwSpinRepeat
    __asm
    {
        jmp spinLoopStart

    spinLoop:
        pause                   ; be polite to a hyperthreaded sibling

    spinLoopStart:
        mov eax, [ecx]          ; read the lock word
        cmp eax, 0
        je  accessible          ; zero means the mutex looks free

        sub edx, 1
        jnc spinLoop            ; spin until the count underflows

        mov eax, 0              ; gave up: return false
        ret

    accessible:
        mov eax, 1              ; return true
        ret
    }
}

__declspec(naked)
DWORDLONG __fastcall MutexImpl::GetTSC()
{
    __asm
    {
        rdtsc                   ; 64-bit timestamp counter in edx:eax
        ret
    }
}
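
The DSDThread helper isn't shown above; here is a plausible reconstruction
(an assumption on my part, derived from how the main routine drives the
state variable, not the posted code): it acknowledges the affinity
handshake, keeps the auto-reset event signalled during the measurement
and exits on FINISH.

DWORD WINAPI DSDThread( LPVOID lpvParam )
{
    DefaultSpinDetection *pdsd = (DefaultSpinDetection *)lpvParam;

    for( ; ; )
        switch( pdsd->state )
        {
        case DefaultSpinDetection::DO_NOTHING:
            // confirm we have arrived on the newly assigned CPU
            pdsd->state = DefaultSpinDetection::DO_NOTHING_ACK;
            break;

        case DefaultSpinDetection::DO_WAKEUP_SPINNING:
            // keep the auto-reset event signalled so each WFSO of the
            // measuring thread is satisfied by a wake-up from this CPU
            ::SetEvent( pdsd->hEvtWakeUp );
            break;

        case DefaultSpinDetection::FINISH:
            return 0;

        default:    // DO_NOTHING_ACK: wait for the next command
            break;
        }
}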
m
2005-10-23 17:00:49 UTC
BTW: read MSDN on InitializeCriticalSectionAndSpinCount - it never spins
on single-processor machines.

There is a fundamental flaw in your logic:

You are measuring the shortest possible call to WFSO and saying 'if I
spin for more than half of this time, then I should just give up and call
WFSO'. This approach assumes that the guarded block takes a significantly
longer time to execute than the operations required to acquire the lock,
and that contention is insignificant.

The short answer is that each CS requires optimal tuning for its use and
no 'catch all' solution will work. If you don't want to do this work, or
need somewhere to start, copy MS and use 4000 - it will work for most
things.

Remember that you may be pre-empted at any time, and that CSs are not
FIFO or even deterministic. I won't even start talking about thread
priorities.
Michel
2005-10-23 18:03:49 UTC
I think there are more flaws than the one you mentioned. E.g. even if
you spin for longer than half the time of a WFSO, having the spincount
tuned correctly can avoid having your thread migrate between CPUs, which
might be costly for a host of reasons, especially if the lock overhead is
larger than the work done under the lock.

And I guess if there were an optimal spincount that could be calculated,
the system would do it at startup or install time and use it behind
your back to increase performance.

/Michel
Slava M. Usov
2005-10-23 20:03:00 UTC
"m" <***@b.c> wrote in message news:Oi5pSN$***@TK2MSFTNGP14.phx.gbl...

[...]
Post by m
The short answer is that each CS requires optimal tuning for its use and
no 'catch all' solution will work.
This person has already been told that "high-performance general-purpose"
solutions do not exist. But he's too ignorant to understand this and too
arrogant to listen. The very request for something "reasonable" in
something "high-performance" is highly illustrative of this.

S
Oliver S.
2005-10-23 20:38:45 UTC
Post by Slava M. Usov
This person has already been told that "high-performance
general-purpose" solutions do not exist.
The person that told me that assumed a certain meaning of
"high-performance" which isn't mine. High performance in general-purpose
algorithms doesn't mean that these algorithms might be competitive with
special-purpose algorithms, but that they're as fast as possible for a
general-purpose algorithm. And you made the same stupid mistake of
assuming this definition.
Oliver S.
2005-10-24 03:48:33 UTC
Post by m
BTW: read MSDN on InitializeCriticalSectionAndSpinCount
- it never spins on single-processor machines.
I know why it's like that and I'm aware of that. The WFSO-speed-detection
I did on a uniprocessor-system was for experimental purposes only.
Post by m
You are measuring the shortest possible call to WFSO and saying that 'if I
spin for more than half of this time, then I should just give up and call
WFSO'.
Exactly!
Post by m
This approach assumes that the guarded block takes a significantly longer
time to execute than the operations required to acquire the lock, ...
The time the mutex is held absolutely doesn't matter. It's all about
preventing kernel-calls which would consume more CPU-cycles than we would
have consumed through spinning! Why do you think we should spin the 4000
times you mentioned below? This value is so ridiculously high that we
could just as well go to WFSO.
Of course a mutex might be held so long that spinning the pre-determined
rounds wouldn't succeed in most cases. In these cases spinning should be
disabled completely, because we would consume more CPU-cycles than going
through WFSO. But CSes are usually held for such an extremely short time
that this spinning makes sense most of the time.
Post by m
... and that contention is insignificant.
I didn't mention that when I'm trying to get access to the mutex, I'm
spinning the predetermined number of times until I can see the mutex to
be enterable. If the compare-and-swap operation after this spinning fails
because the atomic mutex-value has changed, the spinning is resumed for
another round with the same spin-count value. That's ok because in this
situation we're facing the same two alternatives again: spinning or
WFSOing.
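
A sketch of that acquisition path (hypothetical member names;
InterlockedCompareExchange is the Win32 CAS, sm_dwDefaultSpinRepeat the
detected spin-count, and the event-wait stands in for my real slow path):

void MutexImpl::Enter()
{
    for( ; ; )
    {
        if( SpinUntilAccessible( &m_lLockAndWaiterCounter,
                                 sm_dwDefaultSpinRepeat ) )
        {
            // the lock word looked free; try to take it atomically
            if( ::InterlockedCompareExchange( &m_lLockAndWaiterCounter,
                                              1, 0 ) == 0 )
                return;     // CAS succeeded, we own the mutex
            continue;       // lost the race: spin another round
        }
        // spinning exhausted: block in the kernel until released
        ::WaitForSingleObject( m_hEvtLock, INFINITE );
    }
}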
Post by m
The short answer is that each CS requires optimal tuning for its use
and no 'catch all' solution will work.
That's pure nonsense! Spinning is only there to prevent any kernel-calls;
and therefore we should never spin longer than we would wait WFSOing. So
my idea is perfectly ok.
Post by m
If you don't want to do this work, or need somewhere to start,
copy MS and use 4000 - it will work for most things.
4000 is a ridiculously high spin-count value; even more so because on
hyperthreading-platforms, EnterCriticalSection employs the
pause-instruction to free up execution-resources, and thereby the
spinning takes a magnitude longer on those systems! When the time the CS
is held by other threads is very short, 4000 doesn't hurt much - but it's
still an unnecessary performance-degradation.
And I have some doubts on the qualification of the developers working on
those issues at M$, so I won't copy anything. That's because M$ isn't
supplying a condition-variable in Win32.
Post by m
Remember that you may be pre-empted at any time, and that CSs are not
FIFO or even deterministic. I won't even start talking about thread
priorities.
That's a different issue.
Michel
2005-10-24 19:10:55 UTC
Post by Oliver S.
That's pure nonsense! Spinning is only there to prevent any kernel-calls;
and therefore we should never spin longer than we would wait WFSOing. So
my idea is perfectly ok.
No, it's not there only to prevent kernel calls. It's also there to
prevent your thread from giving up its remaining timeslice. This is to
prevent unnecessary context switches and possible thread migrations,
which might be expensive compared to spinning plus the actual work, even
if the spinning is longer than a kernel-call. The kernel call will
suspend your thread and make it non-runnable until the event is set, and
it will take some time before you get scheduled again; other threads at
the same priority will run, and your thread might be scheduled on another
CPU. So you cannot relate the cost of a kernel-call (WFSO on a
nonsignalled object) to the spincount, at least not in my book and as far
as I understand.

/Michel
Pavel Lebedinsky [MSFT]
2005-10-25 02:27:31 UTC
No, it's not there only to prevent kernel calls. It's also there to
prevent your thread from giving up its remaining timeslice. This is to
prevent unnecessary context switches and possible thread migrations,
which might be expensive compared to spinning plus the actual work, even
if the spinning is longer than a kernel-call. The kernel call will
suspend your thread and make it non-runnable until the event is set, and
it will take some time before you get scheduled again; other threads at
the same priority will run, and your thread might be scheduled on another
CPU. So you cannot relate the cost of a kernel-call (WFSO on a
nonsignalled object) to the spincount, at least not in my book and as far
as I understand.
Yes, the idea is basically that spin time should be less than context
switch time (that is, how long it would take to get another thread
running and doing useful work on this processor). On top of that there
might be additional indirect costs, such as various cache misses, but
these are difficult to measure.

I'm not sure where the 4000 value came from. Probably it was based
on the estimated average context switch time back when the code was
written. For Windows Vista there will probably be some changes and
tweaks in this area, but in general I doubt that using, say, 2000 vs.
4000 in your code would make a huge difference.
--
This posting is provided "AS IS" with no warranties, and confers no
rights.
Slava M. Usov
2005-10-25 14:26:54 UTC
Post by Pavel Lebedinsky [MSFT]
Yes, the idea is basically that spin time should be less than context switch
time (that is, how long it would take to get another thread running and
doing useful work on this processor).
This is only partly true. The idea is that the entire sequence

acquire spinlock; do something; release spinlock

should not be subject to context switches MOST of the time. There are two
important corollaries.

First, threads have finite quanta on Windows; so the entire sequence and "do
something" in particular must be short and fast.

Second, "do something" must not do -- MOST of the time! -- anything that
results in context switches.

Both are necessary conditions for a successful spinlock use case. Both
mean that "do something" is subject to very serious constraints, so
anything "general purpose" is clearly out of the question.

If "do something" satisfies these conditions, then the spin time or spin
count must be such that the time spent in "acquire spinlock" is not much
more than the time "do something" takes MOST of the time; it should
specifically be NOT MUCH MORE to detect the case when a concurrent "do
something" has taken an unusual code path or suffered a context switch, in
which case burning any more time in "acquire spinlock" is totally wasteful.
So, it should not be much more, but it can be equal or even less [say, it
can be an average]. That depends on many things, and has to be validated
statistically.
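
Concretely, a use case that satisfies these conditions looks something
like this (a sketch with made-up names; the guarded work is a few
instructions, makes no kernel calls and raises nothing):

struct Node { Node *next; };
Node *g_head = NULL;
CRITICAL_SECTION g_lock;    // initialized elsewhere with a spin count

void PushNode( Node *n )
{
    ::EnterCriticalSection( &g_lock );    // "acquire spinlock"
    n->next = g_head;                     // "do something": short,
    g_head  = n;                          // no kernel calls, no throws
    ::LeaveCriticalSection( &g_lock );    // "release spinlock"
}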

In any case, this spin time/count has nothing to do with whatever time
WaitForSingleObject() takes, with zero or any other timeout.

S
Oliver S.
2005-10-26 00:23:02 UTC
Post by Slava M. Usov
Post by Pavel Lebedinsky [MSFT]
Yes, the idea is basically that spin time should be less than context
switch time (that is, how long it would take to get another thread
running and doing useful work on this processor).
This is only partly true. The idea is that the entire sequence
acquire spinlock; do something; release spinlock should not be
subject to context switches MOST of the time.
We're talking about cases where spinning might be useful. And this makes
sense only if we're likely to complete the acquisition of the lock in
less than the time of a context-switch. And that time is an extremely low
fraction of a timeslice. So deciding whether to spin according to whether
the mutex-held time is mostly in the magnitude of a timeslice (which can
be considered a bug) can only give you a decision against spinning, but
not for it.
Post by Slava M. Usov
First, threads have finite quanta on Windows; so the entire
sequence and "do something" in particular must be short and fast.
The mutex-held time becomes critical for spinning far before we get into
the magnitude of a timeslice. Imagine we have a mutex-held time three
times longer than the time of a context-switch; then spinning for the
time of a context-switch will fail most of the time and we shouldn't
spin; and if we spun beyond the time of a context-switch, we would lower
the overall throughput of the system by deferring other threads' work. So
Pavel is absolutely right when he says that spinning should not take
longer than the time of a context-switch.
Post by Slava M. Usov
Second, "do something" must not do -- MOST of the
time! -- anything that results in context switches.
That's not an additional rule, but a specialization of your
first rule, because this would just increase the mutex-held time.
Post by Slava M. Usov
If "do something" satisfies these conditions, then the spin time or spin
count must be such that the time spent in "acquire spinlock" is not much
more than the time "do something" takes MOST of the time;
That's right from the view of the thread trying to get hold of the
mutex, but absolutely wrong from the view of the overall processing.
There might be other threads that could do useful processing, and we're
about to defer their work by spinning beyond the time of a
context-switch.
Post by Slava M. Usov
In any case, this spin time/count has nothing to do with whatever
time WaitForSingleObject() takes, with zero or any other timeout.
Ok, that was a mistake, but your ideas are even more intuitive and shallow.
m
2005-10-26 15:04:28 UTC
The only reason why I will say anything to this is to prevent you from
confusing others.
Post by Oliver S.
Post by Slava M. Usov
Post by Pavel Lebedinsky [MSFT]
Yes, the idea is basically that spin time should be less than context
switch time (that is, how long it would take to get another thread
running and doing useful work on this processor).
This is only partly true. The idea is that the entire sequence
acquire spinlock; do something; release spinlock should not be
subject to context switches MOST of the time.
We're talking about cases where spinning might be useful. And this makes
sense only if we're likely to complete the acquisition of the lock in
less than the time of a context-switch. And that time is an extremely low
fraction of a timeslice. So deciding whether to spin according to whether
the mutex-held time is mostly in the magnitude of a timeslice (which can
be considered a bug) can only give you a decision against spinning, but
not for it.
The cases where spinning _might_ be useful include all cases on SMP
machines. At any time the owning thread might be one instruction away
from releasing the lock, no matter how long, or how many context
switches, the 'do something' takes.
Post by Oliver S.
Post by Slava M. Usov
First, threads have finite quanta on Windows; so the entire sequence and
"do something" in particular must be short and fast.
The mutex-held time becomes critical for spinning far before we get into
the magnitude of a timeslice. Imagine we have a mutex-held time three
times longer than the time of a context-switch; then spinning for the
time of a context-switch will fail most of the time and we shouldn't
spin; and if we spun beyond the time of a context-switch, we would lower
the overall throughput of the system by deferring other threads' work. So
Pavel is absolutely right when he says that spinning should not take
longer than the time of a context-switch.
The point was that since there is a finite quantum, in order to avoid
being pre-empted while holding the lock (during 'do something'), the
do-something should be as short as possible. This has nothing to do with
the determination of a spin-count; it is just a general statement about
how to use critical sections effectively.

Do you actually understand thread scheduling in Windows?
Post by Oliver S.
Post by Slava M. Usov
Second, "do something" must not do -- MOST of the
time! -- anything that results in context switches.
That's not an additional rule, but a specialization of your
first rule, because this would just increase the mutex-held time.
Post by Slava M. Usov
If "do something" satisfies these conditions, then the spin time or spin
count must be such that the time spent in "acquire spinlock" is not much
more than the time "do something" takes MOST of the time;
That's right from the view of the thread trying to get hold of the
mutex, but absolutely wrong from the view of the overall processing.
There might be other threads that could do useful processing, and we're
about to defer their work by spinning beyond the time of a context-switch.
No - that is an entirely different requirement. It is trivial to produce
an extremely short sequence of code that will cause a context switch:

Sleep(1)

You are correct in saying that the elapsed time for the execution of such
code will be long, but the sequence itself is not - therefore it is a
separate requirement.

I would also add that the 'do something' should not raise exceptions
except in the case of a catastrophic failure; again, to save time.

How do you define overall system performance?
Post by Oliver S.
Post by Slava M. Usov
In any case, this spin time/count has nothing to do with whatever
time WaitForSingleObject() takes, with zero or any other timeout.
Ok, that was a mistake, but your ideas are even more intuitive and shallow.
That is a great line! Let me give you a sarcastic paraphrase:

I was totally wrong. The whole point of my argument was that
critical-section spin-counts should be based on the time required for
WFSO on a signalled object, but since that is so obviously stupid, I have
decided to insult you. Then I decided to try to confuse the issue by
pretending that I was suggesting something based on the time for a
context switch.
Slava M. Usov
2005-10-26 17:37:12 UTC
Post by Oliver S.
We're talking about cases where spinning might be useful. And this makes
sense only if we're likely to complete the acquisition of the lock in
less than the time of a context-switch. And that time is an extremely low
fraction of a timeslice.
This is BS because you cannot measure the real performance hit of a context
switch. Not in your "general purpose" kindergarten. So "the time of a
context-switch" is but a fiction. This invalidates the rest of your blabber.

The conditions on the spin time are explained in my previous message. If you
can't understand that, just memorize.

[...]
Post by Oliver S.
Ok, that was a mistake, but your ideas are even more intuitive and shallow.
Don't attribute your inability to understand to what you don't understand.
Grow up, learn a few things, and above all learn to listen. Start by
learning that "high performance general purpose" things do not exist.

S
Oliver S.
2005-10-27 14:25:19 UTC
Post by Slava M. Usov
This is BS because you cannot measure the real performance
hit of a context switch.
Not exactly, but you can get a good approximation; and if we couldn't
get an approximate count on a value which isn't likely to scatter too
much, no one would try to do spinning and there wouldn't be any
discussion about this.
I've written two routines to measure the time of a voluntary cs (one
thread relinquishing its execution and another eligible thread getting
re-started) and the time of a preempted cs. This code is somewhat tricky,
but not black magic.
Post by Slava M. Usov
Not in your "general purpose" kindergarten. So "the time of
a context-switch" is but a fiction.
Tell that to Pavel! I think M$ would have much to learn from you!
Why aren't you already working there, if you're that ingenious?
Post by Slava M. Usov
The conditions on the spin time are explained in my previous message.
These conditions are partially no criteria for deciding whether to spin
(only whether not to spin), and for the rest of the cases you're very
likely to defer useful work and thereby lower the overall throughput of
the system. So your assumptions are useless on the one side and
counterproductive on the other.
Post by Slava M. Usov
Don't attribute your inability to understand to what you don't understand.
Grow up, learn a few things, and above all learn to listen. Start by
learning that "high performance general purpose" things do not exist.
Only in your definition of this term. But it looks questionable if
someone has to define his own terms to doubt the statements of others.
You're making up your own world in which to measure the statements of
others, to protect yourself against criticism.
Chris Thomasson
2005-10-28 01:16:16 UTC
Post by Oliver S.
Post by Slava M. Usov
This is BS because you cannot measure the real performance
hit of a context switch.
Not exactly, but you can get a good approximation; and if we couldn't
get an approximate count on a value which isn't likely to scatter too
much, no one would try to do spinning and there wouldn't be any
discussion about this.
I've written two routines to measure the time of a voluntary cs (one
thread relinquishing its execution and another eligible thread getting
re-started) and the time of a preempted cs. This code is somewhat tricky,
but not black magic.
Post by Slava M. Usov
Not in your "general purpose" kindergarten. So "the time of
a context-switch" is but a fiction.
Tell that to Pavel! I think M$ would have much to learn from you!
Why aren't you already working there, if you're that ingenious?
^^^^^^^^^^^^^^^^^^
Post by Oliver S.
Post by Slava M. Usov
Post by Oliver S.
And I have some doubts on the qualification of the developers working on
those issues at M$, so I won't copy anything. That's because M$ isn't
supplying a condition-variable in Win32.
You said you have doubts wrt Microsoft developer capability levels...
Now you're calling them ingenious...

Which is it? Humm...

:/
Oliver S.
2005-10-29 16:14:35 UTC
Post by Chris Thomasson
You said you have doubts wrt Microsoft developer capability levels...
Not in general, but on those working on kernel-synchronization things.
Post by Chris Thomasson
Now you're calling them ingenious...
Where did I say that?
Oliver S.
2005-10-26 00:55:30 UTC
Post by Pavel Lebedinsky [MSFT]
Yes, the idea is basically that spin time should be less than context
switch time (that is, how long it would take to get another thread
running and doing useful work on this processor). On top of that there
might be additional indirect costs such as various cache misses but
these are difficult to measure.
I'm absolutely convinced that this is the thing spin-time should be
related to (my initial idea focused on the throughput of the single
thread trying to acquire the mutex, but not on the overall throughput of
the system). But how long should this spin-time be, exactly? If we don't
spin, we're likely to lower the overall throughput by causing a
context-switch which might be unnecessary. If we spin for the time of a
context-switch, we're likely to make the context-switch cost double the
time. So should we spin half of the time?
Post by Pavel Lebedinsky [MSFT]
I'm not sure where the 4000 value came from. Probably it was based
on the estimated average context switch time back when the code was
written.
I measured the context-switch from one thread to another on the same CPU
to cost about 800 clock-cycles on my uniprocessor Athlon-XP, when a
thread chooses to relinquish its execution-resources by doing WFSO and
there's another thread which is eligible to be re-started immediately. It
would be interesting to know what timing an SMP-system has on this,
especially because the pools of runnable threads are managed globally and
the cachelines of the management-data of these pools are moved between
the CPUs, so this time might vary depending on from where a thread has
been put onto the runnable-pool on a NUMA-system. And furthermore,
there's the likelihood of thread-migration, which could make the
context-switch very costly.
Oliver S.
2005-10-31 07:09:44 UTC
Post by Oliver S.
I measured the context-switch from one thread to another on the same CPU
to cost about 800 clock-cycles on my uniprocessor Athlon-XP, when a
thread chooses to relinquish its execution-resources by doing WFSO and
there's another thread which is eligible to be re-started immediately. It
would be interesting to know what timing an SMP-system has on this,
especially because the pools of runnable threads are managed globally and
the cachelines of the management-data of these pools are moved between
the CPUs, so this time might vary depending on from where a thread has
been put onto the runnable-pool on a NUMA-system.
As I don't own an SMP-system, I helped myself to some worst-case numbers
for cases where the cache doesn't hold any OS-specific contents, by
simply polluting the caches myself in several nuances. These are the
numbers for my uniprocessor Athlon-XP (1.4 GHz, SiS745-chipset):

no cache-pollution:              ~750 clock-cycles
clean pollution of the L1-cache: ~750 clock-cycles
dirty pollution of the L1-cache: ~1330 clock-cycles
clean pollution of the L2-cache: ~750 clock-cycles
dirty pollution of the L2-cache: ~2760 clock-cycles

As "clean pollution" I consider overwriting the cachelines of a cache
with content from higher memory-levels (L2-cache or memory) without
causing any cacheline to become "dirty", i.e. one that must be written
back to memory (I'm aware that write-through caches never become dirty).
Dirty pollution simply overwrites the cachelines so that a cache-level
becomes completely dirty (assuming my XP-OS uses proper page-colouring -
but without that, ***@home wouldn't run competitively under XP *g*).

This is my current testing-code (the only problem I see with it is that
there might always be some other threads in the running queue between the
two threads flipping each other; the best would be to hand the whole task
to the OS-initialization *g*):


#include <windows.h>
#include <cstring>
#include "../common/xstddef.h"
#include "../common/Win32/xhandle.h"

std::size_t const CACHE_POLLUTION_SIZE = 64 * 1024;
bool const DO_DIRTY_CACHE_POLLUTION = true;

DWORDLONG __fastcall GetTSC();
void __fastcall memrd( void *dest, std::size_t count );
DWORD WINAPI FlippingThread( LPVOID lpvThreadParam );

struct ThreadInfo
{
    void               *pvCacheTrashingBuffer;
    std::size_t         szCacheTrashingBuffer;
    bool                fDoDirtyCachePollution;
    LONG volatile       lRemainingTests;
    BOOL volatile      *pfPrevTickIntact;
    DWORDLONG volatile *pdwlPrevTick;
    DWORDLONG volatile *pdwlFastestContextSwitch;
    HANDLE              hEvtWaitFor;
    HANDLE              hEvtWakeUpOtherThread;
};

void __cdecl main()
{
    using namespace ns_OClasses;
    using namespace ns_OClasses::ns_Win32;

    std::size_t        szCacheTrashingBuffer;
    void              *pvCacheTrashingBuffer;
    bool               fDoDirtyCachePollution;
    BOOL volatile      fPrevTickIntact;
    DWORDLONG volatile dwlPrevTick;
    DWORD              dwTests;
    DWORDLONG volatile dwlFastestContextSwitch;
    XHANDLE            ahEvtFlippingWaitFor[2];
    ThreadInfo         atiFlippingThreads[2];
    XHANDLE            ahThrFlippingThreads[2];

    szCacheTrashingBuffer  = ::CACHE_POLLUTION_SIZE;
    pvCacheTrashingBuffer  = ::VirtualAlloc( NULL, szCacheTrashingBuffer,
                                 MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE );
    std::memset( pvCacheTrashingBuffer, 0, szCacheTrashingBuffer );
    fDoDirtyCachePollution = ::DO_DIRTY_CACHE_POLLUTION;

    fPrevTickIntact         = FALSE;
    dwlFastestContextSwitch = (DWORDLONG)(LONGLONG)-1;
    dwTests                 = 100000;

    ahEvtFlippingWaitFor[0].h = ::CreateEvent( NULL, FALSE, FALSE, NULL );
    ahEvtFlippingWaitFor[1].h = ::CreateEvent( NULL, FALSE, FALSE, NULL );

    atiFlippingThreads[0].pvCacheTrashingBuffer    = pvCacheTrashingBuffer;
    atiFlippingThreads[0].szCacheTrashingBuffer    = szCacheTrashingBuffer;
    atiFlippingThreads[0].fDoDirtyCachePollution   = fDoDirtyCachePollution;
    atiFlippingThreads[0].lRemainingTests          = dwTests - (dwTests / 2);
    atiFlippingThreads[0].pfPrevTickIntact         = &fPrevTickIntact;
    atiFlippingThreads[0].pdwlPrevTick             = &dwlPrevTick;
    atiFlippingThreads[0].pdwlFastestContextSwitch = &dwlFastestContextSwitch;
    atiFlippingThreads[0].hEvtWaitFor              = ahEvtFlippingWaitFor[0].h;
    atiFlippingThreads[0].hEvtWakeUpOtherThread    = ahEvtFlippingWaitFor[1].h;

    atiFlippingThreads[1].pvCacheTrashingBuffer    = pvCacheTrashingBuffer;
    atiFlippingThreads[1].szCacheTrashingBuffer    = szCacheTrashingBuffer;
    atiFlippingThreads[1].fDoDirtyCachePollution   = fDoDirtyCachePollution;
    atiFlippingThreads[1].lRemainingTests          = dwTests / 2;
    atiFlippingThreads[1].pfPrevTickIntact         = &fPrevTickIntact;
    atiFlippingThreads[1].pdwlPrevTick             = &dwlPrevTick;
    atiFlippingThreads[1].pdwlFastestContextSwitch = &dwlFastestContextSwitch;
    atiFlippingThreads[1].hEvtWaitFor              = ahEvtFlippingWaitFor[1].h;
    atiFlippingThreads[1].hEvtWakeUpOtherThread    = ahEvtFlippingWaitFor[0].h;

    ahThrFlippingThreads[0].h = ::CreateThread( NULL, 0, FlippingThread,
                                    &atiFlippingThreads[0], 0, NULL );
    ahThrFlippingThreads[1].h = ::CreateThread( NULL, 0, FlippingThread,
                                    &atiFlippingThreads[1], 0, NULL );

    // both threads on the same CPU, so every wake-up is a context switch
    ::SetThreadAffinityMask( ahThrFlippingThreads[0].h, 1 );
    ::SetThreadAffinityMask( ahThrFlippingThreads[1].h, 1 );

    ::SetEvent( ahEvtFlippingWaitFor[0].h );

    ::WaitForMultipleObjects( 2, (HANDLE *)ahThrFlippingThreads, TRUE, INFINITE );
}

DWORD WINAPI FlippingThread( LPVOID lpvThreadParam )
{
    ThreadInfo *pti = (ThreadInfo *)lpvThreadParam;
    DWORDLONG   dwlTicks;

    for( ; ; )
    {
        if( pti->lRemainingTests-- == 0 )
            return 0;

        ::WaitForSingleObject( pti->hEvtWaitFor, INFINITE );

        // time from the partner's last tick to our wake-up
        if( *pti->pfPrevTickIntact &&
            (dwlTicks = (GetTSC() - *pti->pdwlPrevTick) & 0xFFFFFFFFFFFFu)
                < *pti->pdwlFastestContextSwitch )
            *pti->pdwlFastestContextSwitch = dwlTicks;

        // pollute the caches before handing the CPU to the partner
        if( pti->fDoDirtyCachePollution )
            std::memset( pti->pvCacheTrashingBuffer, 0,
                         pti->szCacheTrashingBuffer );
        else
            memrd( pti->pvCacheTrashingBuffer, pti->szCacheTrashingBuffer );

        ::SetEvent( pti->hEvtWakeUpOtherThread );

        *pti->pfPrevTickIntact = FALSE;
        ns_OClasses::load_store_fence();
        *pti->pdwlPrevTick = GetTSC();
        ns_OClasses::load_store_fence();
        *pti->pfPrevTickIntact = TRUE;
    }
}

__declspec(naked)
DWORDLONG __fastcall GetTSC()
{
    __asm
    {
        rdtsc                   ; 64-bit timestamp counter in edx:eax
        ret
    }
}

__declspec(naked)
void __fastcall memrd( void *dest, std::size_t count )
{
    // __fastcall: ecx = dest, edx = count; walk the buffer byte by byte
    // so the cachelines are loaded ("clean pollution") but never dirtied
    __asm
    {
        cmp edx, 1
        jb  byeBye

    killAgain:
        mov al, [ecx]
        add ecx, 1              ; advance through the buffer
        sub edx, 1
        jnz killAgain

    byeBye:
        ret
    }
}

m
2005-10-25 22:04:53 UTC
This does not deserve a reply.