Thursday, December 29, 2011

Strategies to find suitable holes for memory segment allocation requests

There are various strategies used by a modern OS to find suitable memory holes for segment allocation requests from programs.

There are normally four of them:

  • First Fit
  • Best Fit
  • Quick Fit 
  • Buddy System

The Buddy System is one of the smartest and most efficient strategies and is widely used. The same logic might be applied to our trading system designs to achieve the best efficiency and thus the lowest possible latency.

Buddy System - Memory Hole Searching Strategy



In the buddy system, allocation and deallocation of memory is always in units that are a power of 2. A request for a segment is rounded up to the nearest power of 2 that is greater than or equal to the requested amount. The memory manager maintains n (n >= 1) lists of holes: List(i), for i = 0, ..., n-1, holds all holes of size 2^i.

A hole may be removed from List(i) and split into two holes of size 2^(i-1) (called 'buddies', see the picture above), which are then entered in List(i-1). Conversely, a pair of buddies of size 2^i may be removed from List(i), coalesced into a single larger hole, and the new hole of size 2^(i+1) entered in List(i+1).

To allocate a hole of size 2^i, the search starts at List(i). If the list is not empty, a hole from the list is allocated. Otherwise, a hole of size 2^(i+1) is taken from List(i+1) and split into two; one half is put in List(i) and the other is allocated. Deallocation works in the reverse fashion: to free a hole of size 2^i, put it in List(i); if its buddy is already there, remove both, coalesce them, and insert the coalesced hole into List(i+1). This insertion may in turn cause two buddies in List(i+1) to coalesce, be removed from List(i+1), and be inserted into List(i+2), and so on.
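To make the scheme concrete, here is a minimal C sketch of a buddy allocator over a fixed pool (an illustrative toy, not production code: block headers, alignment, error handling, and thread safety are all omitted, and the 1 KiB pool with a 16-byte minimum block are arbitrary choices). List(i) is kept as a linked list of free offsets of size 2^i, and a block's buddy is found by XOR-ing its offset with its size:

#include <stdio.h>
#include <stdlib.h>

#define MIN_ORDER 4     /* smallest block: 2^4  = 16 bytes   */
#define MAX_ORDER 10    /* whole pool:     2^10 = 1024 bytes */

/* List(i) holds the pool offsets of free holes of size 2^i. */
typedef struct node { size_t off; struct node *next; } node;
static node *freelist[MAX_ORDER + 1];

static void push(int i, size_t off)
{
    node *n = malloc(sizeof *n);
    n->off = off;
    n->next = freelist[i];
    freelist[i] = n;
}

/* Take any hole from List(i); returns 0 on success. */
static int pop(int i, size_t *off)
{
    node *n = freelist[i];
    if (!n) return -1;
    *off = n->off;
    freelist[i] = n->next;
    free(n);
    return 0;
}

/* Remove the hole at a specific offset (used to claim a buddy). */
static int take(int i, size_t off)
{
    for (node **p = &freelist[i]; *p; p = &(*p)->next)
        if ((*p)->off == off) {
            node *n = *p;
            *p = n->next;
            free(n);
            return 0;
        }
    return -1;
}

/* Allocate a hole of size 2^i: use List(i) if possible, otherwise
 * split a larger hole, keeping one half in List(i). */
static int buddy_alloc(int i, size_t *off)
{
    if (i > MAX_ORDER) return -1;
    if (pop(i, off) == 0) return 0;
    if (buddy_alloc(i + 1, off) != 0) return -1;
    push(i, *off + ((size_t)1 << i));   /* one half becomes a free hole */
    return 0;                           /* the other half is allocated  */
}

/* Free a hole of size 2^i: coalesce with its buddy while possible.
 * A buddy's offset differs from ours only in bit i. */
static void buddy_free(int i, size_t off)
{
    size_t buddy = off ^ ((size_t)1 << i);
    if (i < MAX_ORDER && take(i, buddy) == 0)
        buddy_free(i + 1, off < buddy ? off : buddy);
    else
        push(i, off);
}

int main(void)
{
    push(MAX_ORDER, 0);                 /* start with one 1 KiB hole */

    size_t a, b;
    buddy_alloc(6, &a);                 /* 64-byte request: splits   */
    buddy_alloc(6, &b);                 /* 1024 -> 512 -> ... -> 64  */
    printf("64-byte blocks at offsets %zu and %zu\n", a, b);

    buddy_free(6, a);
    buddy_free(6, b);                   /* coalesces back to 1 KiB   */
    printf("restored: List(%d) holds offset %zu\n",
           MAX_ORDER, freelist[MAX_ORDER]->off);
    return 0;
}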

Wednesday, December 14, 2011

Pros and cons of disabling C-STATE (and C1E)

For the BIOS to have full control of all the features of the newer CPUs, these features all need to be enabled.
Maybe this will help (taken from another post):
That was the case for older CPUs, but the i3/i5/i7 benefit from both: SpeedStep is better for changing the multiplier/voltage, while C-states bring additional benefits on the new Intel CPUs. Instead of the whole CPU being either on, off, or idle, individual parts of the CPU can now be turned on/off or set to idle, and this works in conjunction with Intel's Turbo Mode.
So basically they used to do the same job, but there are benefits to having both enabled on the new i3/i5/i7 CPUs.
You will want to set the CxE Function to C6 to get these new benefits alongside having SpeedStep enabled (they can work independently of each other, but it is best to have both enabled). Be warned, though: with newer EVGA BIOSes, having the CxE Function enabled will allow the higher Turbo Mode multipliers to kick in and could make your overclock unstable. If this is the case, disable the CxE Function, but you can keep SpeedStep enabled if it still works. On the X58 Classified, the voltage part of SpeedStep does not work with a manually set voltage; it does, however, still work on the E758 3X SLI board with a manually set vCore voltage. This is simply due to the components used and how the boards are set up for their target segments, the Classified being primarily an overclocking board where power-saving features are secondary. There are still workarounds for the X58 Classified using the ECP; this should allow you to overclock the CPU while using an AUTO voltage, which would let the voltage part of SpeedStep work.
More details can be found at http://www.techsupportforum.com/forums/f15/pros-and-cons-of-disabling-c-state-and-c1e-559253.html
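Deep C-states save power but add wake-up latency, which is exactly why low-latency trading hosts often disable them. On Linux you can inspect which C-states the kernel may use, and their exit latencies, through the sysfs cpuidle interface; here is a minimal C sketch (assuming that interface is present on your kernel):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char path[128], name[64], lat[32];

    for (int i = 0; ; i++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", i);
        FILE *f = fopen(path, "r");
        if (!f) break;                          /* no more states */
        if (!fgets(name, sizeof name, f)) name[0] = 0;
        fclose(f);

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/latency", i);
        f = fopen(path, "r");
        if (!f) break;
        if (!fgets(lat, sizeof lat, f)) lat[0] = 0;
        fclose(f);

        name[strcspn(name, "\n")] = 0;          /* strip trailing newlines */
        lat[strcspn(lat, "\n")] = 0;
        printf("state%d: %-8s exit latency %s us\n", i, name, lat);
    }
    return 0;
}

Deeper states (e.g. C6) will report much larger exit latencies than C1, which is the cost the forum post above is weighing against the power savings.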

Monday, December 12, 2011

Understand OS Scheduling for better system performance

Modern operating systems, which favor interactive workloads over real-time ones, normally adopt a Round-Robin (RR) scheduling strategy, which is more effective at allocating CPU resources to active processes than FCFS (First Come, First Served).
Round Robin Scheduling

For example, the picture below shows how the CPU is allocated to processes. The response times for P1, P2, P3, P4, and P5 are 30, 24, 42, 14, and 18 time units, respectively. The average response time is 25.6 time units, which is better than that of FCFS scheduling (28.4). Nevertheless, RR scheduling leads to more context switches.
Process CPU Allocation
Please also note that the OS applies so-called fair-share scheduling among users and groups. To let a high-priority process gain more CPU time slices, it is better to have a dedicated user run only that process. To better exploit Round-Robin scheduling, the time slice needs to be chosen carefully so that the core logic of your process finishes within a single slice and is never scheduled out to wait for the next one. In this way, your high-frequency trading solution can run more efficiently and claim more CPU power to finish its tasks.
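On Linux, one way to act on this is to request the real-time round-robin policy for your process and check the quantum the kernel actually grants. A minimal sketch (assumes root or CAP_SYS_NICE, and a kernel exposing SCHED_RR; the priority of 10 is an arbitrary choice):

#include <stdio.h>
#include <sched.h>
#include <time.h>

int main(void)
{
    /* SCHED_RR priorities run 1..99; 10 is arbitrary for this demo. */
    struct sched_param sp = { .sched_priority = 10 };

    /* Fails with EPERM unless run as root (or with CAP_SYS_NICE). */
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }

    /* Ask the kernel for the round-robin quantum of this process. */
    struct timespec ts;
    if (sched_rr_get_interval(0, &ts) == 0)
        printf("RR time slice: %ld.%09ld seconds\n",
               (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}

If your core processing loop fits comfortably inside the reported slice, it will normally run to completion without being preempted mid-task.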

Thursday, December 8, 2011

Better utilize CPU L1 Cache & L2 Cache for Performance Tuning

To better utilize the CPU L1 and L2 caches for performance tuning, we need to understand a few important points:
1. Cache and RAM architecture, such as cache sizes and why caches are needed in modern CPUs. One key fact is that the CPU is so much faster than main memory that the performance bottleneck of current systems is memory access and bus speed.
CPU Cache & RAM Architecture
2. Cache miss: a cache miss is a failed attempt to read or write a piece of data in the cache, which results in a main-memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss. Details can be found at http://en.wikipedia.org/wiki/CPU_cache#Cache_miss
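As a quick illustration of how much data read misses cost, here is a small hypothetical C sketch (not taken from the report below). Both loops sum the same matrix, but C stores arrays row-major, so the second loop strides across rows and misses the cache on nearly every access:

#include <stdio.h>
#include <time.h>

#define N 2048
static double m[N][N];              /* 32 MiB: far larger than L2 */

static double sum_row_major(void)   /* walks memory sequentially  */
{
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

static double sum_col_major(void)   /* strides N*8 bytes per read */
{
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void)
{
    clock_t t0 = clock();
    double a = sum_row_major();
    clock_t t1 = clock();
    double b = sum_col_major();
    clock_t t2 = clock();

    printf("row-major %.3fs, column-major %.3fs (sums %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, a, b);
    return 0;
}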

A research report, "Optimization Study for Multicores" by Muneeb Anwar Khan, shows how caches can be better utilized to achieve better system performance and thus reduce latency. Please note that Acumem is the profiling tool he used to identify the problematic code.

One simple example: look at the source code below.
Original Source Code


Problem 1:
The report shows Problem 1's fetches to be 30.8% of all the memory fetches in this application, with a fetch utilization of 43.1% in the first highlighted statement. The instruction stats show its misses to be 34% of all the cache misses in the application, and a fetch and miss ratio of 21.2%. Reducing the fetch and miss ratio would greatly help improve bandwidth issues.

Problem 2:
The report points out the poor fetch utilization of the second highlighted statement. While its miss and fetch ratios are an identical 15.4%, it has an extremely poor fetch utilization of only 12.8%.

Let us see the revised code, based on the identification of these two problems:
Revised Source Code

What is the performance improvement? With just a few simple modifications that eliminate the unnecessary caching of unused data, the speedup is about 2.9x.
Speedup Results

By better utilizing the CPU caches, a further reduction in latency for high-frequency trading platforms can be achieved within the scope of the CPU host itself.
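The report's code listings above are embedded as images, but the underlying idea of fetch utilization can be sketched independently. In this hypothetical C example (not the report's code), a hot loop that reads only one field of a large struct wastes most of every fetched cache line; splitting the hot field into its own array fixes that:

#include <stdio.h>
#include <stddef.h>

#define N 1000000

/* Array of structs: the struct is 64 bytes, one cache line on typical
 * x86, so a loop reading only `price` uses 8 of every 64 fetched
 * bytes -- 12.5% fetch utilization. */
struct order { double price; double qty; long id; char venue[40]; };
static struct order orders[N];

static double sum_prices_aos(void)
{
    double s = 0;
    for (size_t i = 0; i < N; i++)
        s += orders[i].price;
    return s;
}

/* Struct of arrays: prices are packed contiguously, so every byte of
 * every fetched cache line is used. */
static double prices[N];

static double sum_prices_soa(void)
{
    double s = 0;
    for (size_t i = 0; i < N; i++)
        s += prices[i];
    return s;
}

int main(void)
{
    for (size_t i = 0; i < N; i++) {
        orders[i].price = 1.0;
        prices[i] = 1.0;
    }
    printf("aos %.0f, soa %.0f\n", sum_prices_aos(), sum_prices_soa());
    return 0;
}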

Friday, December 2, 2011

SSD Read/Write Performance

Information from http://www.tomshardware.com/reviews/best-ssd-price-per-gb-ssd-performance,2942.html

One random read from an SSD takes about 20 to 50 microseconds. Hopefully SSDs will become much faster in the near future, allowing the OS to read from and write to them as if they were slightly slower RAM.
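One way to verify such numbers on your own hardware is to time random reads that bypass the OS page cache. The sketch below is Linux-specific and assumes a pre-existing test file at the hypothetical path /tmp/testfile, at least 1 GiB in size, sitting on the SSD under test:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK   4096                    /* O_DIRECT needs aligned I/O  */
#define SPAN    (1024L * 1024 * 1024)   /* sample within the first GiB */
#define SAMPLES 1000

int main(void)
{
    /* /tmp/testfile is a placeholder: point it at a file on the SSD. */
    int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) return 1;

    struct timespec t0, t1;
    double total_us = 0;

    srand(42);
    for (int i = 0; i < SAMPLES; i++) {
        /* random block-aligned offset inside the sampled region */
        off_t off = ((off_t)(rand() % (int)(SPAN / BLOCK))) * BLOCK;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pread(fd, buf, BLOCK, off) != BLOCK) { perror("pread"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        total_us += (t1.tv_sec - t0.tv_sec) * 1e6
                  + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }

    printf("average random %d-byte read: %.1f us\n",
           BLOCK, total_us / SAMPLES);
    free(buf);
    close(fd);
    return 0;
}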