
@travisdowns
Last active October 17, 2019 00:53

In a recent paper, a method is described for calculating the loop-carried dependencies (LCD), if any, of an assembly-level loop, as follows:

OSACA can detect LCDs by creating a DAG of a code comprising two back-to-back copies of the loop body. It can thus analyze all possible paths from each vertex of the first kernel section and detect cyclic LCDs if there exists a dependency chain from one instruction form to its corresponding duplicate.

However, I don't think this is sufficient to catch all loop-carried dependencies. In particular, you can have a case where a dependency is loop carried, but where no vertex (instruction) in the second iteration depends on the same vertex1 in the first. Rather, some other vertex in the second iteration depends on a vertex in the first, and only in some subsequent iteration is the cycle completed.

An example:

add eax, 1  ; A (deps: E-previous)
add eax, 1  ; B (deps: A)
add eax, 1  ; C (deps: B)

; swap eax, ebx (exploded version of xchg eax, ebx)
mov ecx, eax ; D (deps: C)
mov eax, ebx ; E (deps: F-previous)
mov ebx, ecx ; F (deps: D)

Is there a loop-carried dependency here? Yes, there is: the chain of add instructions in all even iterations depends on the result of the chain in the preceding even iteration, and the same is true for odd iterations. In particular, there is an intra-iteration chain A->B->C->D->F, but this chain only depends on E from the previous iteration, which doesn't appear in the chain. The other chain is simply E, which depends on F from the previous iteration. So the dependency is carried, but it takes two iterations for any dependency to flow back to the same vertex.

We see then that there are two equivalent carried dependency chains of length 3 (if we assume the movs are eliminated with zero latency), but each takes 2 iterations, so the total LCD length is 1.5. This also shows you can have non-integer LCD lengths when measured per iteration!
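The 1.5-per-iteration figure can be checked mechanically. Here is a small sketch (my own illustration, not OSACA code) that folds the example into a dependency graph whose edges carry an iteration distance (0 = same iteration, 1 = previous iteration); the carried cost per iteration of any cycle is then the sum of latencies on the cycle divided by the sum of iteration distances:

```python
# Latency of each instruction; the movs are assumed eliminated (zero latency).
latency = {'A': 1, 'B': 1, 'C': 1, 'D': 0, 'E': 0, 'F': 0}

# edges[producer] = list of (consumer, iterations crossed by the edge).
edges = {
    'A': [('B', 0)],   # B depends on A, same iteration
    'B': [('C', 0)],
    'C': [('D', 0)],
    'D': [('F', 0)],
    'E': [('A', 1)],   # A depends on E from the previous iteration
    'F': [('E', 1)],   # E depends on F from the previous iteration
}

def cycles_from(start):
    """Enumerate simple cycles through `start` by iterative DFS,
    yielding (vertices on the cycle, total iteration distance)."""
    stack = [(start, [start], 0)]
    while stack:
        node, path, dist = stack.pop()
        for nxt, d in edges.get(node, []):
            if nxt == start and dist + d > 0:
                yield path, dist + d
            elif nxt not in path:
                stack.append((nxt, path + [nxt], dist + d))

# The LCD cost per iteration is the maximum over all cycles of
# total latency / total iteration distance.
best = 0.0
for v in edges:
    for path, dist in cycles_from(v):
        best = max(best, sum(latency[u] for u in path) / dist)

print(best)  # 1.5: a latency-3 chain that closes only every 2 iterations
```

The only cycle here is A->B->C->D->F->E->A with latency 3 and iteration distance 2, giving the 1.5 cycles/iteration above.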

I would show the OSACA generated graphs here, but I couldn't get it to run, so you'll have to use your imagination. If anyone wants to contribute ASCII art, I'll include it :).


1 By same I mean "its corresponding duplicate" as in the quoted portion of the paper.

@JanLJL

JanLJL commented Oct 10, 2019

Hi Travis,
I'm one of the developers of OSACA, and first of all I want to thank you for your interest in our tool :)!
We are always happy to discuss strengths and weaknesses, and are open to any suggestions for enhancement.

Second of all, you are right with your hypothesis: OSACA cannot identify the dependency chain you constructed.
I quickly reproduced your example

movl    $111,%ebx       #IACA/OSACA START MARKER
.byte   100,103,144     #IACA/OSACA START MARKER

addl    $1, %eax
addl    $1, %eax
addl    $1, %eax

# swap eax, ebx
movl    %eax, %ecx
movl    %ebx, %eax
movl    %ecx, %ebx

movl    $222,%ebx       #IACA/OSACA END MARKER
.byte   100,103,144     #IACA/OSACA END MARKER

and this is what the output, including the dependency graph, looks like:

Open Source Architecture Code Analyzer (OSACA) - v0.3
Analyzed file:      two_it_dep.s
Architecture:       csx
Timestamp:          2019-10-04 07:40:39

 P - Throughput of LOAD operation can be hidden behind a past or future STORE instruction
 * - Instruction micro-ops not bound to a port
 X - No throughput/latency information for this instruction in data file


Throughput Analysis Report
--------------------------
                             Port pressure in cycles                              
     |  0   - 0DV  |  1   |  2   -  2D  |  3   -  3D  |  4   |  5   |  6   |  7   |
----------------------------------------------------------------------------------
   4 | 0.25        | 0.25 |             |             |      | 0.25 | 0.25 |      |   addl    $1, %eax
   5 | 0.25        | 0.25 |             |             |      | 0.25 | 0.25 |      |   addl    $1, %eax
   6 | 0.25        | 0.25 |             |             |      | 0.25 | 0.25 |      |   addl    $1, %eax
   8 |             |      |             |             |      |      |      |      | * movl    %eax, %ecx
   9 |             |      |             |             |      |      |      |      | * movl    %ebx, %eax
  10 |             |      |             |             |      |      |      |      | * movl    %ecx, %ebx

       0.75          0.75                                      0.75   0.75         


Latency Analysis Report
-----------------------
   4 |  1.0 | | addl    $1, %eax
   5 |  1.0 | | addl    $1, %eax
   6 |  1.0 | | addl    $1, %eax
   8 |  1.0 | | movl    %eax, %ecx
  10 |  1.0 | | movl    %ecx, %ebx

        5.0


Loop-Carried Dependencies Analysis Report
-----------------------------------------

[two_it_dep dependency graph image]

As you can see, no loop-carried dependency is detected, which is because we only unroll the loop twice.
Your example and similar ones could therefore be covered by unrolling the loop N times, though this still would not be totally sound for all possible loop-carried dependencies, since N is finite.
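As a sketch of what that unrolling approach looks like (my own hypothetical code, not OSACA's implementation), here is the register example encoded as a dependency list and unrolled N times; with N = 2 no instruction reaches its own duplicate, while with N = 3 the carried chain is found, since the cycle closes only every 2 iterations:

```python
# deps[instr] = (producer, carried): instr depends on producer, either
# from the same iteration or (carried=True) from the previous one.
# A..F as in the register example above.
deps = {'A': ('E', True), 'B': ('A', False), 'C': ('B', False),
        'D': ('C', False), 'E': ('F', True), 'F': ('D', False)}

def has_lcd(n_copies):
    # Build the unrolled DAG: edge (producer, copy') -> (consumer, copy).
    succ = {}
    for copy in range(n_copies):
        for instr, (dep, carried) in deps.items():
            src_copy = copy - 1 if carried else copy
            if src_copy >= 0:
                succ.setdefault((dep, src_copy), []).append((instr, copy))
    # DFS from each instruction in copy 0, looking for any later duplicate.
    for instr in deps:
        seen, stack = set(), [(instr, 0)]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node[0] == instr and node[1] > 0:
                return True
            stack.extend(succ.get(node, []))
    return False

print(has_lcd(2), has_lcd(3))  # False True
```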
This also applies to dependencies through memory, e.g.:

.Loop:
  movq    (%rbx), %rax
  addq    $1, %rax
  movq    %rax, 8(%rbx)
  addq    $8, %rbx
  cmpq    %rbx, %r15
  jb      .Loop

Here, in each iteration the value in rax is stored to memory (movq %rax, 8(%rbx)) and will be loaded again in the following iteration (movq (%rbx), %rax).
This is not typical for the types of code we analyze, but at least simple cases should be recognized in future releases.
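For the simple case above, a must-alias check is easy to state. A hypothetical sketch (not OSACA code; offsets and stride are read off the example): with addresses of the form base + constant offset and a constant per-iteration stride on the base, a load in iteration i+k must alias a store in iteration i exactly when load_offset + k * stride == store_offset:

```python
stride = 8          # addq $8, %rbx advances the base each iteration
store_offset = 8    # movq %rax, 8(%rbx)
load_offset = 0     # movq (%rbx), %rax

def carried_distance(load_off, store_off, stride):
    """Return the iteration distance k > 0 at which the load must alias
    the store, or None if no such positive distance exists."""
    diff = store_off - load_off
    if stride != 0 and diff % stride == 0 and diff // stride > 0:
        return diff // stride
    return None

print(carried_distance(load_offset, store_offset, stride))  # 1
```

This only handles the case where the base register is the same and its stride is a known constant; anything else needs dynamic information or annotations.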

Nevertheless, we are aware of these weaknesses, and they are on our list of improvements for the future!

@travisdowns
Author

Thanks for the comments, Jan!

I created this issue to discuss the register dependency case, which I guess is a better place than a gist, as I could not, e.g., notify you by username here.

@travisdowns
Author

I created this other issue for the memory dependency case. On Twitter, Georg Hager mentioned that the "skips an iteration" dependencies described here usually go through memory, but I'm not sure about that. It could be, but in any case I guess at a fundamental level the issues are more or less orthogonal, so it makes sense to have two separate issues?

BTW, while I think you can solve the first issue fairly easily, I am not sure how you could solve the memory dependence issue completely, since this is static analysis without knowledge of memory addresses: certainly you can prove that certain loads must alias certain earlier stores, but of course this is not sufficient in general, since other loads may alias depending on unknown register values. I left a more detailed comment in the issue.

@cod3monk

You're correct. We will not be able to solve the memory dependencies fully, but for regular codes most situations should be solvable, and annotations could resolve any other situations.

@travisdowns
Author

travisdowns commented Oct 17, 2019

for regular codes most situations should be solvable

Here, I would have to disagree, unless HPC code is very different from most other code. Most memory->memory dependencies in a loop would be dynamic, due to occasionally or often overlapping writes and reads, and not something you could determine statically. That's because if you could determine it statically, so could the compiler (in fact the compiler has even more semantics available at the HLL level, so it has an easier job compared to looking at the assembly), and it would try as hard as it could to keep these values in a register. If it does need to spill due to register pressure, it would choose read-only values first, since memory write-read dependencies are bad due to their long latency.

So you would find this case only in the unusual situation where the compiler has to spill so many values that they include locations both read and written in the same loop.

Now, you do find this case all the time in regular code, but not inside high-performance loops: you find it, e.g., across function call boundaries, where arguments or return values are written on one side and read on the other, but I think that's outside the scope of OSACA.

All that to say I wouldn't put too much faith in static analysis for memory deps: annotations I think are much more promising.

I admit I could be totally wrong about this, e.g., by not considering some scenario that leads to lots of RW dependencies without being able to enregister the values.
