Wednesday, June 5, 2013

Atomics versus Locks

I've written in the past about using intrinsics to improve performance.  And I've discussed using this functionality to build your own synchronization.  I'm not alone in this knowledge, and I want to further expand the answer to a stackoverflow question.  But sometimes a little knowledge is a dangerous thing.  The trouble came from a paper that looked at, among other things, the performance of incrementing a value using locks versus using interlocked compare-and-swap for two threads.

Alright, compare-and-swap (CAS) should be faster, as one can approximate lock acquire and release as this operation.  So let's test and see how the results turn out.  First, the base code:

unsigned long long counter = 0;

int main(int argc, char** argv)
{
    unsigned long long start, end, o;

    start = rdtsc();
    do {
        counter++;
    } while (counter < MAX);
    end = rdtsc();
   
    printf("%lld\n", (end - start));  
    return 0;
}


This version is the "Single Thread" result.  But the first variation is to switch counter to be volatile, as this variable needs this property when we move to multithreaded.  By doing so, the compiler has to read, modify, then write the result to memory on every loop iteration.  Overhead, versus keeping the value in a register, like the assembly that follows:

loop: add    $0x1,%rax
      cmp    $0x1dcd64ff,%rax
      jbe    loop

And the volatile code adds: mov    0x1429(%rip),%rax and a corresponding write after.  Already this is enough to measurably impact performance.  Let's move forward and switch to performing our increment using compare-and-swap.

    do {
        unsigned long long new, old = counter;
       
        if (old < MAX)
        {
            new = old + 1;
            __sync_val_compare_and_swap(&counter, old, new);
        }
        else
            i = 0;
       
    } while (i == 1);


Now, we only want to increment when we haven't reached max.  If we have, then terminate.  But the assembly is now unpleasant:

loop: mov    0x144b(%rip),%rax        # 401a90 <counter>
      cmp    $0x1dcd64ff,%rax
      jbe    swap
...
swap: lea    0x1(%rax),%rcx
      lock cmpxchg %rcx,0x140b(%rip)        # 401a90 <counter>
      jmp    loop

First, the optimized code has split this into two pieces.  The check and the increment, and these have been split.  A fun exercise would be to convince the compiler to switch the jbe into ja and combine the pieces back together.  But the fact is, the impact of using the atomic operation looks like 6x.

As stated before, a lock is approximately twice the cost of the atomic operation and that result is roughly observed (along with some further overhead).  Each version is then modified to spawn a second thread that does the same work as the first.  All of my results are cycles per increment and follow (taken from a single run, reproducibility not guaranteed):

     TEST         -      -O3      -      -O0
Single Thread          -   1 cyc / inc -   7 cyc / inc
Single Thread volatile -   6 cyc / inc -  16 cyc / inc
Single Thread w/ CAS   -  38 cyc / inc -  46 cyc / inc
Single Thread w/ lock  -  90 cyc / inc -  92 cyc / inc
Two Threads   w/ CAS   - 250 cyc / inc - 251 cyc / inc
Two Threads   w/ lock  - 730 cyc / inc - 630 cyc / inc

What can we conclude?  First, adding -O3 (optimized) had almost no effect on results when the synchronization mechanisms are introduced.  Yes, unoptimized code is bad, but it doesn't matter when all of your work is synchronization and the effects that it imposes on the great processor performance tricks.  

Second, but why did unoptimized lock work better?  Because there was a different thread interleaving.  Perhaps the OS put one of threads asleep and the other thread ran happily for a while.  The point is that if you are synchronizing, then that is your overhead.

Third, and here is the dangerous knowledge, you might conclude that you want to use CAS instead of the lock.  Yet, this test was highly artificial.  The synchronized operation took just a couple of cycles optimized, so we could observe the overhead of synchronization.  Suppose your work is now 1250 cycles, now using CAS is only roughly a 33% improvement (1250 + 250 versus 1250 + 730) and not a 3x (250 versus 730).  Furthermore, how often is your synchronization a single variable?

To conclude, IF what you want to synchronize is a single variable, then using CAS can work.  Have I done this?  Yes.  Do I normally?  No.  I have this tool in my programming toolbox, for those times that I really need it.

Anyway, why not use atomic increment instead?  Because we want to ensure that the final value is exactly MAX, no more and no less.

Wednesday, May 22, 2013

More on Haswell

Since I am already excited about some of the features of the new Haswell family, it was great to see an even deeper look into microarchitecture.  There are the usual improvements, like expanding the instruction window or increasing cache bandwidth.  So read for yourselves at Ars Technica.

Tuesday, April 16, 2013

When whitespace matters

In the process of grading student programming projects, I discovered that whitespace can matter in how C code is compiled.  Some programmers may be aware that strings can be concatenated together:

const char* t = "this string" " is one";

Because the whitespace is ignored.  Furthermore, programmers may encode strings via macros, and replace the instance with the macro:

#define A_STRING " is one"

Now, in the latest version of g++, new C++ support can treat the item following the string literal as either a macro or a user defined literal.  The compiler makes the determination based on whether there is whitespace.

const char* t = "this string"A_STRING; // compiler error
const char* t = "this string" A_STRING; // expected behavior

I like consistent use of whitespace in code for stylistic reasons, but introducing a singular dependency on whitespace is odd for C/C++, in contrast to python that is highly whitespace dependent.

Wednesday, April 3, 2013

Architecture Abstraction Leaks

As a computer architect, I like to think that the hardware abstraction is generally opaque.  If a programmer really needs performance or certain effects, then it is possible to reach through the layer, but in general the computer gives you what you want.  Sometimes, a programmer can do simple things that pull the covers back and reveal the architecture in all of its gritty detail.  One example would be following row versus column indexing when looping over large pieces of data.

In this post, I wanted to draw your attention to a question posed on stackoverflow, Why is processing a sorted array faster than an unsorted array?  In this question, we are ignoring the time to sort the array.  The operation being applied to each array element is relatively simple and is conditional on the value of the element.  By sorting the array, the same operations could be applied to contiguous elements.  And as the same operations are being applied, the code in question was having a clear win from branch prediction.

But the branch predictor is abstracted hardware, so once again the programmer just learned about the architecture when he / she didn't intend to.

Wednesday, March 27, 2013

Transactional Memory and Intel's TSX

It was some years ago that I sat in the audience and heard AMD present a sketch for how transactional memory (TM) would be implemented in the x64 ISA.  More recently a fellow student mentioned that Intel has some new extensions entering the x64 ISA for locks, etc.  I'm always a fan of properly implemented locks, as they so often limit performance and scalability.  So let's dig into Intel's TSX and why I actually want to go buy a gadget when it's released.

A programmer can delineate the transactional section with XBEGIN and XEND instructions.  Within the transactional section, all reads and writes are added to a read- or a write-set accordingly.  The granularity for tracking is a cache line.  If another processor makes a read request to a line in the write-set or either request to a read-set, then the transaction aborts.

Transactions can be semi-nested.  A transaction can only commit if the outer transaction is complete.  Internally nested transactions do not commit on XEND.  If any transaction in the nest aborts, then the entire transaction aborts.  If |XBEGIN| equals |XEND|, then the entire transaction commits and becomes globally visible.  Transactions can be explicitly aborted by the XABORT instruction, which enables the code to abort early when it can determine that the transaction will or should fail.

As I understand it, TSX is being built on top of the existing cache coherence mechanisms.  Each cache line gains an additional bit to indicate if it is part of a transaction.  Each memory operation is treated normally between the processor and the coherence hierarchy with several caveats.  If a dirty, transactional block is evicted, then the transaction fails.  If a dirty, transactional block is demand downgraded from modified to shared or invalid, then the transaction fails.  In this case, a new message would indicate that the request to forward the data fails and the request should be satisfied by memory.

If the transaction commits, then the transactional bits are cleared on each cache line.  And the lines operate normally according to existing cache coherence mechanisms.

Wrapping this up, TSX is an ISA extension that almost every program can take advantage of and therefore has an appeal toward conducting personal testing, just like building my own x64 machine back in 2005.

Monday, March 4, 2013

Maps and Red-Black Trees

When cooperating on a prior work of research, Brainy: effective selection of data structures, I learned about the cost of selecting particular data structures.  I expect that every Computer Scientist knows the general cost of using standard data structures like trees and linked lists.  Now, the extension that Brainy provided was to recognize that a "tree" could have different implementations and these implementations can have different costs for a given workload.  I, and I expect others, learned about AVL, splay, red-black and other tree types.

All this is good until your type hides its implementation and you reason about it in abstract.  A map is commonly misinterpreted by its implementation.  A map is a key-value store, where a "key" has an associated "value".  An address book can be understood as a map of name to address / phone number / etc.

Now, the map is assumed to be implemented as a hash table, to give O(1) look-up cost.  However in the C++ standard library, map uses a red-black tree for O(log n).  Before you go and request that the libraries be changed, understand that experimental measurements suggest that the red-black implementation wins out when n > 500, as the hash collisions dominate the idealized O(1) cost.  This is the problem that Brainy attempts to solve: the programmer selects the functionality, e.g., map and Brainy selects the best implementation given your program and architecture.

So go ahead and use a map, but recognize that you may not have O(1) cost and definitely won't as n grows increasingly large.

Monday, January 28, 2013

Repost: When Beauty is Not Truth

While it was not a field discussed in the article, When Beauty Is Not Truth, nonetheless I wonder about the focus I put on elegance in programming (starting with the name of the blog).  So let's consider a couple of quick things about Computer Science and the beauty of the code.  I should add that the article discusses a rough equivalence between beauty, elegance, and simplicity.

First, beautiful code is not always correct.

Second, beautiful code, by virtue of its simplicity, is less likely to have bugs.

Third, beautiful code is more readable, which facilitates comprehension.

(Now back to preparing the lecture for today's class)