x86 synchronization primitives

Kragen Sitaker kragen@pobox.com
Fri, 6 Aug 1999 12:17:58 -0400 (EDT)


There are several different kinds of synchronization supported by Intel
x86 SMPs (essentially all of which are also useful for avoiding
interrupt-handler races on uniprocessor machines too, but not as widely
needed, since interrupt handlers tend not to do much).

This document may be incomplete; I may have left out something important.

(See http://kids.intel.com/design/intarch/techinfo/pentium/instsum.htm
for more info.)

The simplest is the XCHG instruction, which can be used to atomically
exchange two registers or a register and a memory location.  This makes
it possible to implement multiple exclusion; reserve a particular
location in RAM as a mutex, and initially set it to 1.  To acquire the
mutex, set a register to 0, and XCHG it with the location in RAM.  If
what you get back is a 1, then you have successfully acquired the
mutex; otherwise, someone else has it.  You can return the mutex simply
by setting the location to a nonzero value.

Intel's doc says, "When a memory operand is used with the XCHG
instruction, the processor's LOCK signal is automatically asserted.
This instruction is thus useful for implementing semaphores or similar
data structures for process synchronization."

I don't know how efficient this is; it sounds like it could have an
adverse effect on the speed of execution of the other processor.

LOCK:       ; mutex pointer is in EBX; clobbers EAX
	XOR EAX, EAX
	XCHG EAX, [EBX]
	AND EAX, EAX
	JZ LOCK          ; if we got a zero, spin-wait
	RET
	
UNLOCK:     ; mutex pointer is in EBX
	MOV [EBX], 1
	RET

XADD swaps two operands and then stores the sum in the destination
operand, setting the flags from the result.  This can be combined with
the LOCK prefix to lock the bus, if I understand correctly.

Intel's recommended use is to allow two processors to execute one DO
loop, but that seems rather dubious to me; not only do you have to
steal the loop counter from the other processor's cache every
iteration, you also probably have to steal the data you're iterating
over, don't you?

Still, it should be useful for allowing fairly light-weight
synchronization when incrementing or decrementing some counter:
	MOV EAX, 1       ; MT-safe ++*ebx
	XADD [EBX], EAX  ; you can even use flags to see if the result is 0
                         ; or negative

Since you get the old value of the counter, you can use this to implement, for
example, a shared stack.

The Pentium and K5 have these CMPXCHG and CMPXCHG8B instructions that
are evidently useful in implementing lock-free synchronization.  (See
Henry Massalin's dissertation for some details of doing this on a
68000-based machine in the Synthesis OS.  The thesis is available at
ftp://ftp.cse.ogi.edu/pub/esr/reports/cucs-039-92.PS.  It appears he
may now be known as Alexia Massalin.)

CMPXCHG [memaddr], reg compares a memory location to EAX (or AX, or
AL); if they are the same, it writes the source operand to the memory
location.  This can obviously be used in the same way as XCHG, but it
can be used in another very interesting way as well, for lock-free
synchronization.

Suppose you have a process that updates a shared data structure.  To
ensure atomicity, it generates a private updated copy of the data
structure; when it is finished, it atomically updates a single pointer
which used to point to the old data structure so that it now points to
the new data structure.

The straightforward way of doing this will be useful if there's some
possibility of the process failing, and it gives you atomicity.  But we
can modify this procedure only a little bit to allow multiple
simultaneous updates while ensuring correctness.

The process simply atomically compares the pointer to the value it had
when it started its work, and if so, makes the pointer point to the new
data structure.  If some other process has updated the data structure
in the mean time, the comparison will fail and the exchange will not
happen.  In this case, the process must start over from the
newly-updated data structure.

This obviously occasionally means more work; an update may have to be
done twice, or even an arbitrary number of times, and large updates
competing with small updates will usually lose.  But (according to
Massalin's thesis) it's considerably faster than acquiring and
releasing a lock, and most of the time no retry is needed.

You can use this for things other than pointers, of course.

The CMPXCHG8B instruction does the same thing, but with eight-byte
operands.  The Illinois-Intel Multithreading Library
(http://support.intel.com/technology/itj/q11998/articles/art_5d.htm)
uses this to implement a lock-free task stack.

I think the CMPXCHG8B instruction is not quite adequate to avoid all
use of locks, but I'm not sure.

This instruction is notorious for its use in the f00fc7c8 bug; this bug
allows unprivileged users to crash a Pentium.  Apparently the code is
LOCK CMPXCHG8B dest, src, where dest is a register.  This causes an
invalid opcode exception, but the CPU can't run the invalid opcode
exception handler because the bus is locked. (There are OS-level
workarounds.)

The LOCK prefix apparently ensures that only one processor can access
the system bus at a time.  It can be used with several other
instructions; the bit-test operations, BT, BTS (bit test and set), BTC
(bit test and complement), and BTR (bit test and reset) can be used in
the same way as XCHG to implement locks; ADD, ADC, SUB, SBB, INC, and
DEC can be used in the same way as XADD to implement shared counters
(although XADD is more flexible, since you get the old value of the
counter in a register); NOT, NEG, OR, AND, and XOR can be used
similarly to the BT family.

Grunch.  I'm sick of this stuff for today.  :)

More info on Intel x86 opcodes is at
http://webdia.cem.itesm.mx/dia/ac/aortiz/cb95852/nasmdoca.html.