@FioraAeterna
Last active August 29, 2015 14:06
Dolphin optimization ideas
  1. Track which registers a block clobbers without using -- then, when linking, don't store those, because we don't need them. (do this with PPCAnalyst)

  2. Track which float registers don't need to be converted to doubles (i.e. are only used by single-precision ops that take single-precision input) and don't convert them. (do this with PPCAnalyst)

  3. Track which float registers don't need to be movddup'd to create a top half (for PS1), then avoid the redundant movddup where possible. In a PR.

  4. Support movbe in loads (requires backpatcher modifications).

5. Support reordering for other things that can be merged (rlwinm in addition to cmp).

6. Support branch merging for boolean ops (I've seen the and instruction used a lot; some compilers seem to prefer it over rlwinm, so it depends on the game).

  7. Thinking about it, the fact that we go through GPRs for basically all loads/stores (because we use bswap, and not pshufb) is probably hurting AMD CPUs a lot. Don't those have higher latencies for GPR/XMM transfers? I tested this and it's in a PR, but it doesn't seem to help Intel (seems to help AMD a bit?).

8. When we're out of registers and need to spill one, prefer spilling those that aren't dirty.

  9. Track when floats go directly from load to store and bypass PPC_FP for them.

  10. Use PEXT for CR register conversion?

11. Move carry out of XER entirely; keep it separate so we can write and clear it in fewer ops.

  12. Keep some common float constants in xmm registers? This might not be worth the trouble, though it feels gross to see like 20 fmadd instructions in a row re-loading the same rounding constants.

  13. If we're storing integer constants at the end of a block (flushing), merge neighboring 32-bit stores into 64-bit or 128-bit, especially when they're zero? I've tried this and have a hacky implementation locally, though it doesn't seem to get used /that/ much.

  14. Optimize dcbst/dcbi; most games don't need the JIT invalidation, which can take up to ~5% of time in some games.

  15. Optimizations that take advantage of PPC calling convention; for example, on block transfers that are function calls, keep function parameters in x86 registers.
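
A rough sketch of the spill idea above (prefer spilling registers that aren't dirty). This is not Dolphin's actual register cache: `CachedReg` and `PickSpillVictim` are hypothetical names made up for illustration, and a real allocator would probably also weigh how soon each register is used again.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical register-cache entry; not Dolphin's real data structure.
struct CachedReg {
    bool bound;   // currently holds a guest register
    bool locked;  // needed by the instruction being compiled; must not be spilled
    bool dirty;   // host copy is newer than the guest state in memory
};

// Returns the index of the host register to spill, or -1 if nothing can be spilled.
// A clean register is preferred because evicting it needs no store back to memory;
// the first spillable dirty register is kept only as a fallback.
int PickSpillVictim(const std::vector<CachedReg>& regs) {
    int dirty_fallback = -1;
    for (std::size_t i = 0; i < regs.size(); ++i) {
        const CachedReg& r = regs[i];
        if (!r.bound || r.locked)
            continue;  // nothing to evict here, or not allowed to
        if (!r.dirty)
            return static_cast<int>(i);  // clean: can simply be dropped
        if (dirty_fallback < 0)
            dirty_fallback = static_cast<int>(i);
    }
    return dirty_fallback;
}
```

With this ordering a dirty register only gets written back when every spillable register is dirty.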
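
A minimal sketch of the store-merging idea above (merging neighboring 32-bit constant stores at flush time). It is not the local hacky implementation mentioned in the list: `PendingStore`, `EmittedStore` and `MergeConstantStores` are hypothetical names, only the 64-bit case is shown, and it assumes a little-endian host where the store at the lower offset becomes the low half of the merged value.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of a constant store queued for the end of a block.
struct PendingStore {
    uint32_t offset;  // byte offset into the guest register state
    uint32_t value;   // 32-bit constant to write
};

// What would actually be emitted: a 4-byte or an 8-byte store.
struct EmittedStore {
    uint32_t offset;
    uint64_t value;
    int bytes;
};

// Input must be sorted by offset with no duplicate offsets.
// Pairs of stores exactly 4 bytes apart are fused into one 8-byte store.
std::vector<EmittedStore> MergeConstantStores(const std::vector<PendingStore>& in) {
    std::vector<EmittedStore> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i + 1 < in.size() && in[i + 1].offset == in[i].offset + 4) {
            // Little-endian host: lower address holds the low 32 bits.
            uint64_t merged = uint64_t(in[i].value) | (uint64_t(in[i + 1].value) << 32);
            out.push_back({in[i].offset, merged, 8});
            ++i;  // the neighbor has been consumed
        } else {
            out.push_back({in[i].offset, in[i].value, 4});
        }
    }
    return out;
}
```

Zero is the convenient case because x86-64 can store a 64-bit zero with the sign-extended 32-bit immediate form of mov, while an arbitrary 64-bit constant has to go through a register first.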
