performance - Empty loop is slower than a non-empty one in C


While trying to figure out how long it takes to execute a single line of C code, I noticed something that looked strange:

  #include <stdio.h>
  #include <stdint.h>
  #include <time.h>

  int main(int argc, char *argv[]) {
      clock_t start, end;
      uint64_t i;
      double free_time;
      int A = 1;
      int B = 1;

      start = clock();
      for (i = 0; i < (1 << 31) - 1; i++)
          ;
      end = clock();
      free_time = (double)(end - start) / CLOCKS_PER_SEC;
      printf("%f\n", free_time);

      start = clock();
      for (i = 0; i < (1 << 31) - 1; i++) {
          A += B % 2;
      }
      end = clock();
      free_time = (double)(end - start) / CLOCKS_PER_SEC;
      printf("%f\n", free_time);

      return 0;
  }

which, when executed, prints:

  5.873425
  4.826874

Why does the empty loop use more time than the second one, which has an instruction inside it? Of course I have tried many variants, but every time the empty loop takes more time than the one with a single instruction inside.

Note that I have tried swapping the order of the loops and adding some warm-up code, and it does not change the result at all.

I am using Code::Blocks as IDE with the GNU gcc compiler, on Linux Ubuntu 14.04, and I have a quad-core Intel i5 at 2.3 GHz (I have tried running the program on a single core; it does not change the result).

The fact is that modern processors are complicated: all the instructions being executed interact with each other in complicated and interesting ways. Thanks to the OP and to another user, it was apparently found that the short loop takes 11 cycles per iteration while the long one takes 9 cycles (note that 11/9 ≈ 1.22, which roughly matches the measured ratio of 5.87 s to 4.83 s). For the long loop, 9 cycles is plenty of time even though there are lots of operations. For the short loop, there must be some stall caused by it being so short, and just adding a nop makes the loop long enough to avoid the stall.

One thing that happens, if we look at the code:

  0x00000000004005af <+50>:    addq   $0x1,-0x20(%rbp)
  0x00000000004005b4 <+55>:    cmpq   $0x7fffffff,-0x20(%rbp)
  0x00000000004005bc <+63>:    jb     0x4005af <main+50>

We read i and write it back (addq). We immediately read it again and compare it (cmpq), and then we loop. But the loop uses branch prediction, so at the time the addq is executed the processor is not actually sure that it is allowed to write to i (because the branch prediction could be wrong).

Then we compare with i. The processor will try to avoid reading i from memory, because reading it takes a long time. Instead, a bit of hardware will remember that we just wrote to i, and instead of reading i, the cmpq instruction gets its data from the store instruction. Unfortunately, at this point we are not yet sure whether the write to i has actually happened or not! So that could introduce a stall here.

The problem here is that the conditional jump, the addq which leads to a conditional store, and the cmpq which is not sure where to get its data from are all very close together. They are unusually close together. It could be that they are so close together that the processor cannot figure out at this point whether to take i from the store instruction or to read it from memory, so it reads it from memory, which is slower because it has to wait for the store to finish. Adding just one nop gives the processor enough time.
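
As a concrete illustration, here is a minimal sketch (my own, not from the original post) of what "adding a nop" means in practice, assuming GCC's extended inline assembly on x86-64 and an unoptimized (-O0) build so that i stays in memory as in the disassembly above:

  #include <stdint.h>

  int main(void) {
      uint64_t i;
      /* Pad the otherwise-empty body with a single nop; the extra
         instruction lengthens each iteration just enough for the store
         to i to finish before the compare needs to read it. */
      for (i = 0; i < 0x7fffffffu; i++) {
          __asm__ volatile ("nop");
      }
      return 0;
  }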

Usually you think that there is RAM, and there is cache. On a modern Intel processor, a memory read can get its data from (slowest to fastest):

  1. Memory (RAM)
  2. L3 cache (optional)
  3. L2 cache
  4. L1 cache
  5. A previous store instruction that has not yet been written to the L1 cache.

So this is what the processor does internally in the short, slow loop:

  1. Read i from the L1 cache
  2. Add 1 to i
  3. Write i to the L1 cache
  4. Wait until i is written to the L1 cache
  5. Read i from the L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less

And this is what the processor does in the long, fast loop:

  1. Lots of stuff
  2. Read i from the L1 cache
  3. Add 1 to i
  4. Do a "store" instruction which will write i to the L1 cache
  5. Read i directly from the "store" instruction, without touching the L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less
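
To tie this together, here is a small self-contained harness I sketched (not part of the original question or answer) that times the empty loop, the nop-padded loop, and the loop with A += B % 2 side by side. It assumes GCC on x86-64 Linux and an unoptimized build (e.g. gcc -O0 loops.c -o loops), since with optimization enabled the compiler may remove the loops entirely:

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  #define LIMIT 0x7fffffffu   /* same iteration count as the question: 2^31 - 1 */

  static double time_empty(void) {
      uint64_t i;
      clock_t start = clock();
      for (i = 0; i < LIMIT; i++)
          ;                                 /* empty body: the slow case */
      return (double)(clock() - start) / CLOCKS_PER_SEC;
  }

  static double time_nop(void) {
      uint64_t i;
      clock_t start = clock();
      for (i = 0; i < LIMIT; i++)
          __asm__ volatile ("nop");         /* one nop per iteration */
      return (double)(clock() - start) / CLOCKS_PER_SEC;
  }

  static double time_work(int *a, int b) {
      uint64_t i;
      clock_t start = clock();
      for (i = 0; i < LIMIT; i++)
          *a += b % 2;                      /* the "non-empty" loop from the question */
      return (double)(clock() - start) / CLOCKS_PER_SEC;
  }

  int main(void) {
      int A = 1, B = 1;
      printf("empty: %f s\n", time_empty());
      printf("nop:   %f s\n", time_nop());
      printf("work:  %f s\n", time_work(&A, B));
      return 0;
  }

If the explanation above is right, the expectation is that the empty loop comes out slowest, while the nop-padded and non-empty loops take roughly the same time per iteration.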
