The Cost of Volatile

Assessing the scalability of programs and algorithms on multicore hardware is critical. There is an extensive literature on locks and locking schemes, but the exact cost of volatile is less clear.

This year, for the software composition seminar, I proposed a small project on the topic. The project was realized by Stefan Nüsch, who did a nice job; his results shed some light on the matter.

Essentially, we devised a benchmark where multiple threads would access objects within a pool. Each thread has a set of objects to work with. To generate contention, the sets could be fully disjoint, have a partial overlap, or have a full overlap. The ratio of reads and writes per thread was also configurable.

On an AMD 64 Dual Core, the graph looks as follows:

[Figure: bench_amd64]

On an i7 Quad Core, the graph looks as follows:

[Figure: bench_i7]

We clearly see that different architectures have different performance profiles.

In future work, we could try to reproduce false sharing and assess the impact of other forms of data locality.

More details about the benchmark and methodology can be found in his presentation. The code is on GitHub.

Here are some links about the semantics of volatile, and memory management in general.

Understanding the Visibility of Side-Effects

This entry is all about the Java Memory Model and the happens-before partial order it formalizes.

The official specification of the JMM is authoritative but unreadable. Fortunately, there are useful resources around that aim at making it more accessible.

Happens-Before

The JMM defines a partial order called happens-before on all actions in the program. Actions are reads and writes to variables, locks and unlocks of monitors etc. The rules for happens-before are as follows:

  • Program order rule. Each action in a thread happens-before every action in that thread that comes later in the program order.
  • Monitor lock rule. An unlock on a monitor lock happens-before every subsequent lock on that same monitor lock.
  • Volatile variable rule. A write to a volatile field happens-before every subsequent read of that same field.
  • Thread start rule. A call to Thread.start on a thread happens-before every action in the started thread.
  • Thread termination rule. Any action in a thread happens-before any other thread detects that thread has terminated, either by successfully returning from Thread.join or by Thread.isAlive returning false.

The happens-before partial order provides clear guidance about which values are allowed to be returned when data is read. In other words: it formalizes the visibility of side-effects across threads.

A read is allowed to return the value of a write 1) if that write is the last write to that variable before the read along some path in the happens-before order, or 2) if the write is not ordered with respect to that read in the happens-before order.
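
To make these rules concrete, here is a minimal sketch (a hypothetical class, not from the post) of the classic publication idiom in Java: the volatile write creates the happens-before edge that makes the plain write visible.

class Publication {
    private int data = 0;
    private volatile boolean ready = false;

    void writer() {              // runs in thread A
        data = 42;               // (1) plain write
        ready = true;            // (2) volatile write; (1) happens-before (2) by program order
    }

    void reader() {              // runs in thread B
        if (ready) {             // (3) volatile read; (2) happens-before (3) by the volatile rule
            assert data == 42;   // by transitivity, (1) happens-before (3): 42 is visible
        }
    }
}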

Instructions can be reordered as long as the happens-before order is preserved. The code below

t1 = 1;
t2 = 2;
t3 = t1+t2;
t4 = 2*t1;

can for instance be reordered as follows

t2 = 2;
t1 = 1;
t4 = 2*t1;
t3 = t1+t2;

If two threads share data without using any synchronization, the writes of one thread are unordered with respect to the other thread; according to condition 2), the execution is not deterministic and the writes of one thread might or might not be visible to the other thread.

Let us consider the two threads below, and one given execution:

  T1        T2

s = s+1 

          s = s+1

s = s+1 

The second increment might or might not see the previous increment in T1. Similarly, the third increment might or might not see the side-effect of the second increment.

In improperly synchronized programs, a read will return the value of one of the possible matching writes, according to conditions 1) and 2). This corresponds to a data race. In properly synchronized programs, a read has one unique ancestor write.

  T1        T2

lock m
  |
s = s+1 
  |
unlock m  
          \  
          lock m
             |
          s = s+1
             |
          unlock m
         /
lock m
  |
s = s+1 
  |
unlock m                  

Implementing Happens-Before

Conditions 1) and 2) specify that side-effects might or might not be visible consistently to threads in improperly synchronized programs. They allow the implementation to avoid reading shared memory on every access, either by using CPU caches or compiler optimizations. The specification, however, never speaks of specific implementation choices like caching and optimizations.

For instance, the code

t2 = s+1
t3 = s*2

could be rewritten by a compiler optimization as

t = s
t2 = t+1
t3 = t*2

where t caches the value read from s.
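
Taken to the extreme, this caching is what makes a non-volatile stop flag hang a program. A minimal sketch (hypothetical class) of the well-known hoisting effect:

class Spin {
    boolean stop = false;        // without volatile, the compiler may cache the read

    void worker() {
        while (!stop) { }        // may be optimized into: if (!stop) while (true) { }
    }                            // declaring stop volatile forbids this transformation
}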

Typically, the compiler will emit memory barriers to flush the CPU caches. The JSR-133 Cookbook provides guidance about implementation issues.

Data Races

Failure to properly synchronize accesses to shared data is called a data race. Conversely, a program is data race free if all accesses to shared state are synchronized. But what does data race freedom exactly mean?

Is the program data race free if,

  1. accesses to shared data happen within critical sections?
  2. shared data is declared volatile?
  3. accesses to shared data are never concurrent?

Issues about the memory model are typically discussed using the double-checked locking pattern and the “pausable” thread pattern. They provide only partial answers to these questions.
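
As a refresher, here is a sketch of the double-checked locking idiom as it is usually written for Java 5 and later; without volatile, a thread could observe a non-null reference to a not-yet-fully-constructed object:

class Singleton {
    private static volatile Singleton instance;   // volatile is what makes the idiom correct

    static Singleton getInstance() {
        Singleton local = instance;                // first check, without locking
        if (local == null) {
            synchronized (Singleton.class) {
                local = instance;                  // second check, under the lock
                if (local == null) {
                    instance = local = new Singleton();
                }
            }
        }
        return local;
    }
}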

Point 1 is true. If the accesses to shared data are protected by the same critical section, there is no possible data race.

Point 2 is also true. If you declare variables as volatile, side-effects will be made visible to other threads and there will be no data race. But remember: data race freedom does not mean that the behavior is correct. Consider the trivial example below:

counter = counter + 1;

Making the counter volatile won’t suffice to ensure all increments are recorded.
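
The increment is a read-modify-write of three separate actions, and two threads can interleave them; volatile only guarantees visibility, not atomicity. A sketch contrasting it with an atomic increment:

import java.util.concurrent.atomic.AtomicInteger;

class Counters {
    volatile int counter = 0;                        // data race free, yet increments can be lost:
    void racyIncrement() { counter = counter + 1; }  // read, add, and write are three actions

    final AtomicInteger atomicCounter = new AtomicInteger();
    void safeIncrement() { atomicCounter.incrementAndGet(); }  // one atomic read-modify-write
}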

Point 3 holds, but requires some clarification of the underlying assumptions (answers to my stackoverflow question failed to be clear-cut and authoritative). Let us consider the situation where multiple threads manipulate an object X that is plain data, but alternate their temporal execution (so that X is never accessed concurrently) via another object Y that relies on concurrency control (wait, notify, synchronized). Should the fields of object X be volatile or not?

If we assume that the threads are synchronized using at least one concurrency control primitive, and not just temporal alternation thanks, say, to well-timed sleep statements, this implies the alternate acquisition of at least one lock m, as for instance in:

while (true) {
    synchronized (m) { m.wait(); }       // a thread must hold m's monitor to wait on it
    counter = counter + 1;
    synchronized (m) { m.notify(); }
}

There is a happens-before dependency between the increment statement and the release of the wait lock. By construction, when the lock is released, exactly one thread will acquire it and wait again. So exactly one increment will happen-after the release of the wait lock. Consequently, each increment is guaranteed to see the value of the previous increment; no need for volatile.


Writing Immutable Objects with Elegance

Edit: This blog post has been turned into a draft paper.

Today, I got burned by design issues in Smalltalk’s dictionaries. They implement object comparison (=) using value equality, but are at the same time mutable. The internal state of a dictionary consists of an array of associations. In addition to the problems of equality, the internal state can be accessed and aliased with Dictionary>>associationsDo: and Dictionary>>add:

We should instead have an immutable dictionary using value comparison, and a mutable dictionary using identity comparison. Mixing styles is too dangerous. For the mutable dictionary, the internal state should never be accessible, and data should be copied to ensure no dependencies on mutable state are established.

In an old post I ranted about the conflict between the imperative object paradigm, which promotes mutable state, and the functional paradigm, which promotes immutable data structures. It is mostly a problem of readability, though.

Assignment Syntax

To make the use of immutable objects more readable, we could generalize operator assignment syntax like +=, -=, <<= to any message. Postfixing a message with = would imply that the reference pointing to the receiver of the message is updated with the value returned by the message.

aReference message=: 5
<-->
aReference := aReference message: 5

aDictionary at: key put=: value
<-->
aDictionary := aDictionary at: key put: value

When the selector has multiple arguments, the = postfix goes on the last keyword. This is simple, efficient syntactic sugar.

What about self-sends? According to the previous examples, self would be updated with the value of the message. While it might sound counter-intuitive or plain absurd, this is what we need:

Point>>moveX: offsetX y: offsetY
    self x=: self x + offsetX.
    self y=: self y + offsetY.
    ^ self

Assuming #x: and #y: return copies of the object, with this semantics the reference to self on the last line corresponds to the last copy created.

The only difference between self-sends and regular sends is the implicit contract that is assumed for the method. In the regular case, the message send can return any object. Any invocation of the returned object will happen with a regular send that will in the worst case raise a message not understood exception. For self-sends to work as in the example above, messages #x: and #y: must return instances of Point, so that the activation frame can be updated correctly. Updating the activation frame rebinds self to the new object, but preserves the temporary variables.

(I believe this would have an impact on closures. More investigation is needed. The precise semantics could maybe be simulated with continuations.)

Copy Syntax

The previous proposal still leaves the burden of copying objects to the developer. In the previous examples, #x:, #y: and #at:put: would need to first clone the receiver (self), then update it.

Point>>x: newX
    ^ (self clone) basicX: newX; yourself

Ugly, right? Following a similar syntactic approach, message sends could be prefixed with % to indicate that the message must be delivered to a clone of the receiver:

aReference %message: 5 <--> aReference clone message: 5

We know that cloning is broken. However, it is not the subject of this post, so we will assume that we have a reasonable implementation of #clone. With % and = we have all the ingredients to implement immutable structures easily.

Point>>x: newX
    self basicX: newX

Point>>moveX: offsetX y: offsetY
    self %x=: self x + offsetX.
    self %y=: self y + offsetY.
    ^ self

(The setter Point>>x: is actually superfluous, since it is equivalent to #basicX:. It serves as an example only.)

For an even more concise syntax, a third prefix/postfix could be introduced.

aReference ~message: 5 <--> aReference %message=: 5

Nested Objects

The proposed syntactic sugar has limited benefits for more complex mutations of objects with nested objects. Let’s consider an immutable circle with an immutable point as center.

Circle>>moveX: offsetX y: offsetY
    self ~center: (self center %moveX: offsetX y: offsetY).
    ^ self

But what we would really like to write is

Circle>>moveX: offsetX y: offsetY
    self center ~moveX: offsetX y: offsetY.
    ^ self

Handling this situation so that the receiver “self center” is replaced with the new point implies first the replacement of “self” with a new circle. The replacement of the receiver “self center” (which is not an L-value) could be achieved if, by convention, the corresponding setter is used. The above code would then implicitly execute “self ~center: xxx” to replace “self center”. This corresponds to the intended behavior. In other words,

self a ~m: args <--> self ~a: (self a %m: args)
self a b ~m: args <--> self ~a: (self a %b: (self a b %m: args))
etc.

The ~ can appear only before the last message send. The statement “self a ~b m: args” would be ill-defined.


Implementation of Semaphores

For the needs of the experimental Pinocchio research VM, we had to add support for threading and concurrency. We implemented green threads, à la Squeak, so there is no “real” multicore concurrency going on. The VM relies on AST interpretation instead of bytecode. With green threads, the interpretation of an AST node can always be considered atomic: no two AST nodes can be interpreted concurrently. This is unlike Java and its memory model, where individual bytecodes can be interleaved, possibly with nasty side-effects (e.g. manipulation of a long is not atomic). Thread preemption can happen anytime between the evaluation of AST nodes.

How can we add support for semaphores?

The Pharo design

The Pharo design can be informally summarized like this: when a semaphore is instantiated, its counter is set to one. Whenever a block needs to be evaluated in an exclusive way, the counter is checked. If the counter > 0, it is decreased and the block is evaluated. If the counter = 0, the thread is suspended and added to the list of threads currently waiting on this semaphore. When the critical block has been evaluated, the list of suspended threads is checked. If there are no suspended threads, the counter is incremented back to 1. Otherwise, one of the suspended threads is picked and resumed.

This implementation explains why Semaphore extends LinkedList. I’m not sure it’s the best design decision: a semaphore is not conceptually a list, and the list protocol should not be exposed by a semaphore. It uses inheritance for implementation reuse, but composition would have been just fine here if the semaphore internally held a linked list (and maybe used a pluggable ownership type system to check that the list does not leak out of the semaphore…).

Also, semaphores must be created using the forMutualExclusion factory method. This method instantiates and initializes the semaphore to allow exactly one execution at a time (hence the term mutual exclusion), but nothing would prevent you from initializing the semaphore so that up to N blocks can be executed concurrently.
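
For comparison, Java’s java.util.concurrent.Semaphore exposes the same generalization directly. A minimal sketch:

import java.util.concurrent.Semaphore;

class CriticalSection {
    // one permit corresponds to forMutualExclusion; N permits allow N concurrent blocks
    private final Semaphore semaphore = new Semaphore(1);

    void critical(Runnable block) throws InterruptedException {
        semaphore.acquire();      // wait: decrement the counter, or suspend when it is zero
        try {
            block.run();
        } finally {
            semaphore.release();  // signal: increment the counter, or resume a waiting thread
        }
    }
}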

The respective code for wait and signal (which decrement and increment the counter, respectively) is:

wait
    excessSignals > 0
        ifTrue: [excessSignals := excessSignals - 1]
        ifFalse: [self addLastLink: Processor activeProcess suspend]

signal
    self isEmpty
        ifTrue: [excessSignals := excessSignals + 1]
        ifFalse: [Processor resume: self removeFirstLink]

They are, however, implemented as primitives. I suspect this is not for performance reasons, but for the sake of concurrency correctness: these operations themselves need to be atomic. If they were implemented in Smalltalk, a thread could be preempted in the middle of one of them, breaking the semaphore’s design.

The test-and-set design

These two methods and the internal counter suggest that an implementation relying on more generic concurrency primitives is possible. Typical concurrency primitives for this are test-and-set or compare-and-swap.
We added a primitive testAndSet to Boolean, and implemented the Semaphore with busy waiting (also sometimes called a spin lock):
 
critical: aBlock
    | v |
    "we spin on the lock until we can enter the semaphore"
    [ lock testAndSet ] whileTrue: [ PThread current yield ].
    "we evaluate the block and make sure we reset the flag when we leave it"
    [ v := aBlock value ] ensure: [ lock value: false ].
    ^ v
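
The same busy-waiting design can be expressed in Java, with AtomicBoolean playing the role of the testAndSet primitive. A sketch (SpinLockSemaphore is a hypothetical class):

import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;

class SpinLockSemaphore {
    private final AtomicBoolean lock = new AtomicBoolean(false);

    <T> T critical(Callable<T> block) throws Exception {
        while (lock.getAndSet(true)) {   // spin until we flip the flag from false to true
            Thread.yield();
        }
        try {
            return block.call();          // evaluate the block
        } finally {
            lock.set(false);              // reset the flag when we leave (like #ensure:)
        }
    }
}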

The design could be improved to avoid busy waiting. Instead of yielding, the thread would be suspended and added to a list. In the ensure block, the flag would be reset and one of the threads would be resumed. The resumed thread would however still need to testAndSet the lock, to ensure that no other thread has entered the semaphore in the meantime, possibly delaying the resumed thread forever. So if fairness is required, this algorithm is not optimal.

The bakery design

You can also implement critical sections without the support of other concurrency primitives. The most famous such algorithm is probably Lamport’s bakery algorithm:

What is significant about the bakery algorithm is that it implements mutual exclusion without relying on any lower-level mutual exclusion.  Assuming that reads and writes of a memory location are atomic actions, as previous mutual exclusion algorithms had done, is tantamount to assuming mutually exclusive access to the location.  So a mutual exclusion algorithm that assumes atomic reads and writes is assuming lower-level mutual exclusion.  Such an algorithm cannot really be said to solve the mutual exclusion problem.  Before the bakery algorithm, people believed that the mutual exclusion problem was unsolvable–that you could implement mutual exclusion only by using lower-level mutual exclusion.

In our case with green threads, reads and writes are atomic because they are single AST nodes, but this isn’t necessarily the case in general.
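
For illustration, here is a sketch of the bakery algorithm in Java (hypothetical class). We use AtomicIntegerArray so that threads see each other’s writes; the algorithm itself tolerates non-atomic reads and writes, which is precisely Lamport’s point:

import java.util.concurrent.atomic.AtomicIntegerArray;

class Bakery {
    private final int n;                        // number of threads
    private final AtomicIntegerArray choosing;  // 1 while a thread picks its ticket
    private final AtomicIntegerArray ticket;    // 0 means "not interested"

    Bakery(int n) {
        this.n = n;
        choosing = new AtomicIntegerArray(n);
        ticket = new AtomicIntegerArray(n);
    }

    void lock(int i) {
        choosing.set(i, 1);
        int max = 0;                            // take a ticket one above the current maximum
        for (int j = 0; j < n; j++) max = Math.max(max, ticket.get(j));
        ticket.set(i, max + 1);
        choosing.set(i, 0);
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            while (choosing.get(j) == 1) Thread.yield();        // wait until j has its ticket
            while (ticket.get(j) != 0                           // j wants in, and either
                    && (ticket.get(j) < ticket.get(i)           // holds a smaller ticket
                        || (ticket.get(j) == ticket.get(i) && j < i)))  // or wins the tie
                Thread.yield();
        }
    }

    void unlock(int i) {
        ticket.set(i, 0);
    }
}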

Here ends this little trip into basic concurrency. There is a rich literature on the topic, which is truly fantastic, and we might explore and implement more sophisticated abstractions later on.

References

The Java Memory Model
Thread Synchronization and Critical Section Problem
A New Solution of Dijkstra’s Concurrent Programming Problem

Playing with Agents

Clojure is a programming language with a rich set of features. The rationale behind Clojure and its set of features can be summarized with these four points:

  • functional programming (which favors immutable structures)
  • polymorphism, with global mutable references, dynamic typing and multi-method dispatch
  • transactional memory to see consistent snapshots of the world
  • advanced concurrency mechanisms at the language level (CAS, agents, watchers)

I already discussed the TM part a bit in my previous post. Here I move on to watchers and agents, a mechanism to run asynchronous tasks in separate threads. Agents are like workers in the thread pool pattern: they execute tasks sequentially, one after the other. Submitting a task to an agent from within a transacted section (dosync) is safe: the send is held until the transaction commits. Waiting for an agent to complete can be achieved with a special instruction (await agent-name). Watchers are special agents which run a task each time a reference is changed. The task to run is specified when the watcher is constructed.

Watchers

Here is a version of my test where each update to the shared map triggers the watcher. When 100 occurrences have been counted, the watcher prints a message.

(ns org.ewe.sharedMapWatcher (:gen-class))

(def my-watcher (agent 0))

(defn my-watcher-action [current-value reference]
  (let [new-value (inc current-value)]
    (if (= new-value 100) (println "100 words have been counted"))
    new-value))

(def store-map-ref (ref (sorted-map)))

(defn store-it [hash]
  (dosync
    (let [store-map @store-map-ref]
      (if (contains? store-map hash)
        (let [count-ref (store-map hash)
              new-count (inc @count-ref)]
          (ref-set count-ref new-count))
        (let [new-ref (ref 0)]
          (add-watcher new-ref :send-off my-watcher my-watcher-action)
          (ref-set new-ref 1)
          (ref-set store-map-ref (assoc store-map hash new-ref)))))))

(defn build-list [n]
  (if (= n 0)
    (list (Thread. #(store-it n)))
    (conj (build-list (dec n)) (Thread. #(store-it (mod n 4))))))

(defn -main []
  (time
    (do
      (let [NUM 100
            threads (build-list NUM)]
        (dotimes [iter NUM] (.start (nth threads iter)))
        (dotimes [iter NUM] (.join (nth threads iter)))
        (await my-watcher)
        (println "count = " (count @store-map-ref))
        (doseq [[hash count-ref] @store-map-ref]
          (println hash " -> " @count-ref))))))

Message passing

Agents are asynchronous by nature. We may be tempted to think of them like actors in the actor model, but they aren’t. Quoting another post, “an actor is a process with a mailbox which dequeues and processes messages individually. The actor process contains all the logic for processing messages. […] Clojure’s agent inverts ownership of message handling logic: agents are submitted functions which are stored in a mailbox and then executed in order.”

Agents can be used to create a fully asynchronous processing model. An agent can be passed to another agent’s function to create chains of message passing. It is for instance possible to send a function to an agent and specify the sending agent as the callback agent (the *agent* var refers to the currently executing agent).

(ns org.ewe.asyncAgent2 (:gen-class))
(import '(java.util Date))

(def agent-1 (agent 0))
(def agent-2 (agent nil))

(defn print-time [s]
  (println (. (new Date) (toString)) s))

(defn callback [current-value result]
  result)

(defn compute-and-callback [current-value callback-agent]
  (let [result 1234]
    (Thread/sleep 10000)
    (send callback-agent callback result))
  nil)

(defn async-compute [current-value]
  (send agent-2 compute-and-callback *agent*)
  current-value)

(defn perform []
  (dosync (send agent-1 async-compute))

  (await agent-2)
  (print-time "agent-2 is over")

  (await agent-1)
  (print-time "agent-1 is over")

  (println "agent-1:" @agent-1)
  (println "agent-2:" @agent-2))

(defn -main []
  (time (perform)))

Producer-consumer

Producer-consumer is a popular approach to schedule tasks concurrently. The shared list of items then acts as a coordinator between the producers and the consumers. Producer-consumer comes in two flavors: fully or partially transactional.

In fully transactional mode, the consumption of an item is entirely transacted. If the consumption fails (e.g. an error during processing), the item remains in the queue. Traditional lock-based implementations don’t enforce this: the shared list is correctly synchronized, but once an item has been popped, it is removed definitively. On the other hand, messaging systems (JMS, etc.) do propose a fully transactional model.
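
A sketch of the traditional lock-based flavor in Java, using a BlockingQueue; the process method is a placeholder. Once take() returns, the item is removed, even if processing later fails:

import java.util.concurrent.BlockingQueue;

class Consumer implements Runnable {
    private final BlockingQueue<String> queue;

    Consumer(BlockingQueue<String> queue) { this.queue = queue; }

    public void run() {
        try {
            while (true) {
                String item = queue.take();   // the item is removed here, definitively
                process(item);                // if this throws, the item is lost;
            }                                 // a transactional consumer would put it back
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(String item) { /* placeholder for the actual treatment */ }
}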

A fully transactional producer-consumer implementation becomes easy with a TM. The problem however (in my understanding) is that no two items can be fetched and processed concurrently, as that would result in a concurrent modification of the shared list of items. The performance of such a system would be the same as a completely sequential system. Of course, agents could be used to decouple item pushes and pops from their actual treatment (by means of asynchronous treatment), but then the system isn’t fully transactional anymore: if the agent fails, the item is lost.

Conclusion

“All concurrency issues boil down to coordinating access to mutable state. The less mutable state, the easier it is to ensure thread safety.”

— Java Concurrency in Practice (from Bill Clementson’s blog)

Transactional memory and asynchronous mechanisms are means to coordinate access to shared data. But they still don’t solve the problem of shared data on their own. I will need to do more experiments to figure out exactly what kinds of problems are really simplified by the agent, actor and TM models.

A First Look at Clojure

Transactional memory (TM) offers the promise to simplify concurrent programming. Lock-based programming, the de-facto standard, is indeed hard to get right. Common problems with lock-based programming are deadlocks, priority inversion, and lack of composition. Lack of composition refers to the fact that two structures that are individually correctly synchronized do not necessarily compose into a new structure that is itself correctly synchronized. Transferring an amount between two accounts is an example: even if each account is synchronized correctly, the transfer is not safe.
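
A sketch of this composition problem in Java (hypothetical classes): each method is correctly synchronized, yet the composed transfer is not:

class Account {
    private int balance;
    synchronized void withdraw(int amount) { balance -= amount; }
    synchronized void deposit(int amount)  { balance += amount; }
    synchronized int  balance()            { return balance; }
}

class Bank {
    // Correctly synchronized pieces do not compose: between the two calls,
    // another thread can observe the amount in neither account, and the
    // invariant "the sum of balances is constant" is protected by no lock.
    static void transfer(Account from, Account to, int amount) {
        from.withdraw(amount);
        to.deposit(amount);
    }
}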

TM is however not a silver bullet either. One major open point of TM implementations is how to ensure progress (e.g. no livelock) of the system. TM is indeed like optimistic locking: the problems of contention and starvation are still present. Non-blocking algorithms can offer different levels of guarantees regarding system-wide progress. Also, depending on the particular implementation, a process may still read inconsistent state, which could lead to inappropriate decisions. Some TMs use multi-version concurrency control to solve this issue.

As TMs have become more and more mature, we now see a shift from the distributed-systems area to the language area. How can TM be integrated nicely into an existing language? What abstractions are required?

Language support

Here are a few questions related to language support:

  • Operational – what are the exact semantics of the TM? Does it provide snapshot isolation? What is the effect of exceptions within transactions? How can we facilitate the usage of blocking IO (normally prohibited) in transactions? How do non-transacted and transacted code interoperate?
  • Transaction demarcation – how can transactions be controlled (e.g. start, stop, but also retry)? Are there other parameters to support (e.g. a retry timeout)?
  • State manipulation – how is shared state accessed and modified? What are the mechanisms and data structures?

State modification – conceptual views

I tend to think there are two ways to see shared state manipulation. One is the Clojure way, where you deal with immutable data but mutable references. Mutating shared data is always done through a reference, and always as part of a transaction.

Another view is the object database view (also the ORM view), where a transaction provides a snapshot of the objects, which can then be manipulated as usual.

Let’s consider a very simple example: a shared map <String, Integer> is supposed to count the number of occurrences of each word. While indexing, documents are split into smaller parts and processed concurrently. The map then needs to be updated concurrently.

Immutable state + mutable ref (Clojure):

(defn store-it [word]
  (dosync
    (let [store-map @store-map-ref
          old-count (store-map word)
          new-count (if old-count (inc old-count) 1)]
      (ref-set store-map-ref (assoc store-map word new-count)))))

Object view (pseudocode):

storeIt( word )
{
    transaction
    {
        oldCount = storeMapRef.get( word, 0 );
        storeMapRef.put( word, oldCount+1 );
    }
}

State modification – contention

As mentioned already, TM is not a silver bullet. We still need to think of the intent of our code from a multi-threaded perspective. In the example above, the complete map was considered as one piece of shared data. The concurrent modification of the occurrence counts for two different words would conflict, and one of the transactions would be rolled back. To have finer-grained control and reduce contention (disjoint-access parallelism), we can consider the occurrence count not as a primitive, but as a reference to a counter. The map is altered only when a new word is inserted and a new counter is created. Incrementing the occurrence of a word is a modification of the counter, which means that two counters can be incremented concurrently.

Immutable state + mutable ref objects (Clojure):

(defn store-it [word]
  (dosync
    (let [store-map @store-map-ref]
      (if (contains? store-map word)
        (let [count-ref (store-map word)
              new-count (inc @count-ref)]
          (ref-set count-ref new-count))
        (ref-set store-map-ref (assoc store-map word (ref 1)))))))

Object view (pseudocode):

storeIt( word )
{
    transaction
    {
        if( storeMapRef.contains( word ) )
        {
            counterRef = storeMapRef.get( word );
            counterRef.inc();
        }
        else
        {
            storeMapRef.put( word, new Counter(0) );
        }
    }
}

Conclusion

TM is indeed a promising way to do concurrent programming. There are open points to watch, and more evidence is needed on the performance of TM-based systems in real cases. An understanding of the concurrency at the application level is still required, even with TM: poorly designed applications may result in systems with high contention. The transaction facilities are there, but the application must still use them in a consistent way to implement its very own logic. Different conceptual views exist to access shared data with TM. Further experiments are needed to decide which one is better.
