Programming: How to improve application performance by understanding the CPU Cache levels
A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost to access data from the main memory. A cache is a smaller, faster memory, closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have different independent caches, including instruction and data caches, where the data cache is usually organized as a hierarchy of more cache levels.
Levels of CPU Cache:
L1 Cache
L1 cache resides in each core. It is the fastest accessible memory. L1 cache has split into two types as follows,
Instruction cache — stores the executable instruction which makes the instruction fetch operation faster.
Data cache — stores the data to process which speeds up the data fetch operation.
L2 Cache
L2 cache is larger and slower than the L1 cache. It resides in the core or in the motherboard. Latest Intel Core processors have L2 cache integrated into core along with L1 cache.
L3 Cache
It is the largest cache and resided outside of the core. Accessing the L3 cache is slower than L2 cache as it is larger and the processor has to search through all the data available in the L3 cache.
Intel® Core™ i7–4770S processor
In the Intel® Core™ i7–4770S processor, the L1 cache and L2 cache are integrated into the core and the L3 cache resided in the processor.
Hit rate and Miss rate
A successful fetch from a cache is called as hit rate.
A unsuccessful fetch from a cache is called as miss rate.
If the processor couldn’t find what it is looking for in L1 cache then that is a miss rate for L1 cache. Then the processor advances to the L2 cache and look for the data. If the data available then the processor will use it or the processor will advance to L3 for the data lookup.
Likewise, if the data is not available in
Impact of CPU Cache in Programming
Consider the following C# code
The above code process a certain string in two ways. It has to do following operations,
- Append the given string to the string builders for the given loop count times.
- The method “ GetStringByCombinedExecution” appends the string to both string builders in the same loop and prints time taken in milliseconds.
- The method “ GetStringBySeperateExecution” appends the string to both string builders in different loops and prints time taken in milliseconds.
Lets set value for the loopCount as 10, and below is the result. It shows the combined execution is faster than separate execution.
Lets set value for the loopCount as 100, and below is the result. The numbers giving different results now, the separate execution is faster now.
Lets set value for the loopCount as 1000, and below is the result. The separate execution is faster.
Lets set value for the loopCount as 10000, and below is the result. The separate execution is faster by 10 milliseconds.
Whats happening? Why the numbers are different just by splitting the execution?
Because of CPU Cache.
In the scenario of combined execution, both the string builder objects may not be available all the time in the L1 cache. So it adds up the miss rate and the hit rate will be declined which causes the processor to take more number of CPU cycles to complete the operation.
While the execution happens separately, the hit rate will be high as the process happens only on the single instance of the string builder it will be available in the L1 Cache. So the processor is able to complete the operation in less number of CPU cycles.
Conclusion:
The loops are unavoidable and have to be implemented in day-to-day programming. Recently I did a performance optimization in my work. From which I understood the efficiency of my application is very slow compared to the PoC I did. The only change I had in my application is, I was doing more operations in the same loop with different objects. So after long research and analysis, I started to fragment the operations in separate loops which actually gave a huge performance boost in our application. We came down from 10 seconds to 200 milliseconds after we split up all the operations to separate loops, which triggered me to write this post.