Real-time scheduling
I wanted a simple, lightweight scheduler that I could tune up for crypto-currency trading applications. I knew from past reading that a data structure called a hashed wheel timer was worth investigating, and that Netty had one, but I was not aware of anything standalone. So I was happy to find Alex Petrov's hashed-wheel-timer. I prefer little libraries to big frameworks, and this one was ideal: the core is just one class, with tests and JMH benchmarks built around it to compare its performance to the JDK's default ScheduledExecutorService implementation.
Hashed wheel timers
The hashed wheel timer is an approximated timer built on a ring buffer whose slots each hold a list of tasks. A single thread advances around the ring, processing each slot's tasks in turn and waiting for the resolution time between "ticks", so the spacing in time between adjacent slots equals the resolution. To schedule a timer, you compute how many turns of the wheel must pass before it expires and place the task in the corresponding slot. In a wheel with 10 slots and 1 millisecond resolution, a timer that expires after 25 milliseconds goes in slot #5 on turn 2: 2 * 10 + 5 = 25 milliseconds. These slides provide a good summary.
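To make the slot arithmetic concrete, here is a minimal sketch -- illustrative only, not code from the library -- of how a timeout maps onto a slot and a number of remaining wheel turns:

public class WheelMath {
    public static void main(String[] args) {
        long resolutionMillis = 1; // wheel resolution
        int wheelSize = 10;        // number of slots
        long delayMillis = 25;     // requested timeout

        long ticks = delayMillis / resolutionMillis; // 25 ticks until expiry
        int slot = (int) (ticks % wheelSize);        // lands in slot #5
        long rounds = ticks / wheelSize;             // after 2 full turns

        // prints: slot=5, rounds=2  ->  2 * 10 + 5 = 25 ms
        System.out.printf("slot=%d, rounds=%d%n", slot, rounds);
    }
}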
Fluent API
The HashedWheelTimer implements ScheduledExecutorService with a few helper methods. I did not want to expose it directly to my code or assume that any replacement would implement the same API, so I wrote a wrapper which provides a fluent API and takes advantage of the newer Java 8 java.time interfaces. The basic interface has just two calls:
public interface Scheduler {
    Schedule<Void> schedule(Runnable runnable);

    <T> Schedule<T> schedule(Callable<T> callable);
}
which start building a Schedule object. From here you can set up daily runs, repeats, an initial delay, and so on. You then call start() to insert it into the scheduler. For example, a simple repeating timer that fires five times at 50-millisecond intervals and counts down a latch looks like this:
CountDownLatch latch = new CountDownLatch(5);
HashedWheelTimer timerWheel = new HashedWheelTimer();
Scheduler scheduler = new ScheduledExecutorServiceScheduler(timerWheel);

scheduler.schedule(latch::countDown)
         .repeat(Duration.ofMillis(50), 5)
         .start();

latch.await();
The full fluent interface:
public interface Schedule<T> {
    Exclusions EMPTY_EXCLUSIONS = time -> false;

    Schedule<T> delay(Duration duration);
    Schedule<T> at(LocalTime time);
    Schedule<T> repeat(Duration interval, long numTimes);
    Schedule<T> repeatIndefinitely(Duration interval);
    Schedule<T> withExecutionListener(ExecutionListener listener);
    Schedule<T> withExclusions(Exclusions exclusions);

    long getTaskId();
    void start();
}
For trading purposes, the ability to run tasks daily at a fixed time or at a fixed interval, with a delay, covers almost everything you need. The one special feature is Exclusions:
public interface Exclusions {
    boolean skipRun(Clock currentTime);
}
This is a generalization of holiday calendars, something particularly important in financial applications tied to exchange and bank settlement calendars.
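As an illustration -- the class name and dates below are hypothetical, not part of the library -- an Exclusions implementation that skips weekends and a couple of fixed-date holidays might look like this:

import java.time.Clock;
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.MonthDay;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class HolidayCalendar implements Exclusions {

    // Example fixed-date holidays; a real trading calendar would be loaded
    // from exchange or settlement data.
    private static final Set<MonthDay> HOLIDAYS = new HashSet<>(Arrays.asList(
            MonthDay.of(1, 1),     // New Year's Day
            MonthDay.of(12, 25))); // Christmas Day

    @Override
    public boolean skipRun(Clock currentTime) {
        LocalDate today = LocalDate.now(currentTime);
        DayOfWeek day = today.getDayOfWeek();
        return day == DayOfWeek.SATURDAY
                || day == DayOfWeek.SUNDAY
                || HOLIDAYS.contains(MonthDay.from(today));
    }
}

Wired into the fluent API, a daily end-of-day task that respects the calendar would then look something like this (publishEndOfDayReport is a stand-in for whatever the task actually does):

scheduler.schedule(this::publishEndOfDayReport)
         .at(LocalTime.of(17, 30))
         .withExclusions(new HolidayCalendar())
         .start();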
You can find the full code for the fluent API -- still a work in progress -- on GitHub.
Nano-improvements to System.nanoTime()
By design the Hashed Wheel Timer makes a lot of System.nanoTime() calls, especially with the BusySpinWait strategy:
public static class BusySpinWait implements WaitStrategy {
    @Override
    public void waitUntil(long deadline) throws InterruptedException {
        while (deadline >= System.nanoTime()) {
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedException();
            }
        }
    }
}
In theory these could be replaced by a JNI call with assembler to RDTSC, but before taking the time to drop in a native call -- with the added complexity -- I wanted to evaluate the overhead of all those nanoTime() calls.
Bill Torpey set out to make just this comparison in 2014 in "Measuring Latency in Linux", and found that at the time there was still an advantage to the native call, though a marginal one: he observed 39-41 nanoseconds for a JNI call vs. 54-57 nanoseconds for System.nanoTime(). Three years later, does that result still hold? I had some doubts, because Peter Lawrey had removed his JNI-based clock from the OpenHFT library and, when asked, indicated that the complexity-vs.-benefit trade-off no longer made sense.
Thankfully Bill provided the source code on GitHub, with a runner for macOS and Linux, so it's possible to re-run his experiment three years later.
Here is test case #1, MacBook Pro, 2.8 GHz Intel Core i7 running macOS Sierra 10.12.6 and JDK 1.8.0_111:
ClockBench.java
Method samples min max avg median stdev
System.nanoTime 255 32.00 36.00 34.79 34.00 1.06
CLOCK_REALTIME 255 0.00 1000.00 47.06 500.00 211.76
cpuid+rdtsc 255 136.00 158.00 142.51 147.00 8.52
rdtscp 255 51.00 76.00 64.33 63.50 9.82
rdtsc 255 39.00 55.00 44.94 47.00 6.52
Using CPU frequency = 1.000000
And here's test case #2, an Intel Core i7-7740X (Kaby Lake-X, quad-core, 4.3 GHz) running Ubuntu Linux, kernel 4.10.0-32-generic, JDK 1.8.0_131:
ClockBench.java
Method samples min max avg median stdev
System.nanoTime 255 13.00 16.00 14.36 14.50 0.57
CLOCK_REALTIME 255 16.00 19.00 17.76 17.50 0.52
cpuid+rdtsc 255 123.00 131.00 126.47 127.00 1.33
rdtscp 255 45.00 52.00 48.80 48.50 1.15
rdtsc 255 25.00 32.00 28.40 28.50 1.49
Using CPU frequency = 1.000000
Three years later, System.nanoTime() is about double the speed of rdtsc on Linux, and faster on macOS as well, though by a smaller margin. That suggests Peter is right not to bother with JNI-based clocks for nanosecond timing on Linux, but why?
My speculation is that this has to do with System.nanoTime() being one of the JVM intrinsics, which, as explained here, are basically little assembly macros that the JIT drops in place of JNI calls. Because of this, once the JIT does its job and compiles the bytecode to assembly, what you have is the plain RDTSC-style assembly in place, vs. a JNI call into C to execute the RDTSC assembly, which carries the extra overhead of crossing the Java-native boundary.
The only possible improvement I can think of here would involve implementing the entire WaitStrategy as a native method, conceptually:
while RDTSC_In_Nanoseconds <= deadline
    if JVM_IsInterrupted
        raise_interrupted_exception()
The advantage of this approach is that the JNI call is made once per waitUntil(long) call, rather than once per iteration of the busy-spin loop, so you amortize the cost of crossing the Java-C boundary, at the expense of greater complexity. It would also give you more precision: you would have a better chance of terminating closer to your deadline.
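Conceptually, the Java side might look like the sketch below -- a hypothetical NativeBusySpinWait, not anything in the library -- assuming a native spinUntilNanos function that spins on RDTSC and reports whether the thread was interrupted (the System.loadLibrary call and the C implementation are omitted):

public static class NativeBusySpinWait implements WaitStrategy {

    // Assumed native implementation: spins on RDTSC until the deadline passes,
    // polling the thread's interrupt flag, and returns true if interrupted.
    private static native boolean spinUntilNanos(long deadline);

    @Override
    public void waitUntil(long deadline) throws InterruptedException {
        if (spinUntilNanos(deadline)) {
            throw new InterruptedException();
        }
    }
}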
A richer exploration of this topic can be found in Aleksey Shipilёv's "Nanotrusting the Nanotime" -- but for now we'll just leave System.nanoTime() alone on the assumption that the HotSpot JIT authors have our backs here.