- Alex Bennée
- Senior Virtualisation Engineer @ Linaro
- QEMU’s TCG (emulation)/KVM
- IRC: stsquad/ajb-linaro
- using Python
<<noweb inclusions>>
- source for slides on request
“QEMU is a generic and open source machine emulator and virtualizer.” - from https://www.qemu.org
+------------------------------+
| cBLU |
| +--------------------------+ |
| | cYEL | |
| | Guest Linux App | |
| | (ARM/Linux binary) | |
| | | |
| +--------------------------+ |
| |
| +--------------------------+ |
| | cPNK | |
| | JIT Code (x86) | |
| | | |
| +--------------------------+ |
| |
| +--------------------------+ |
| | cGRE | |
| | QEMU | |
| | | |
| +--------------------------+ |
+------------------------------+
^
--=------------------|-------------------=----
V
+------------------------------+
| cBLU |
| Host Linux Kernel (x86) |
| |
+------------------------------+
+------------------------------+
| cBLU |
| +--------------------------+ |
| | cRED | |
| | Guest Memory | |
| | (ARM) | |
| | +---------+ +---------+ | |
| | |cYEL | |cYEL | | |
| | | App | | App | | |
| | | | | | | |
| | +---------+ +---------+ | |
| |------=-------------------| |
| | +----------------------+ | |
(User) | | |cBLU | | |
| | | Guest Kernel | | |
| | | | | |
| | +----------------------+ | |
| +--------------------------+ |
| |
| +--------------------------+ |
| | cPNK | |
| | JIT Code (x86) | |
| | | |
| +--------------------------+ |
| |
| +--------------------------+ |
| | cGRE QEMU | |
| +--------------------------+ |
+------------------------------+
----------------------=-----------------------
+------------------------------+
| cBLU |
(Kern)| Host Linux Kernel (x86) |
| |
+------------------------------+
:header-args+: :dir ~/lsrc/qemu.git
:header-args+: :var QEMU=”./aarch64-linux-user/qemu-aarch64”
:header-args+: :var BINARY=”./aarch64-linux-user/tests/sha1”
:header-args+: :eval no-export
:header-args+: :noweb no-export
:header-args+: :var trace=”trace.out”
Running
echo "BINARY=${BINARY}"
${QEMU} -d cpu -D $trace ${BINARY}
Gives
BINARY=./aarch64-linux-user/tests/sha1 SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
head -n 12 $trace
PC=00000000004003f0 X00=0000000000000000 X01=0000000000000000 X02=0000000000000000 X03=0000000000000000 X04=0000000000000000 X05=0000000000000000 X06=0000000000000000 X07=0000000000000000 X08=0000000000000000 X09=0000000000000000 X10=0000000000000000 X11=0000000000000000 X12=0000000000000000 X13=0000000000000000 X14=0000000000000000 X15=0000000000000000 X16=0000000000000000 X17=0000000000000000 X18=0000000000000000 X19=0000000000000000 X20=0000000000000000 X21=0000000000000000 X22=0000000000000000 X23=0000000000000000 X24=0000000000000000 X25=0000000000000000 X26=0000000000000000 X27=0000000000000000 X28=0000000000000000 X29=0000000000000000 X30=0000000000000000 SP=00000040008515a0 PSTATE=40000000 -Z-- S EL0t
Running:
echo $(du -h $trace)
echo $(ag -i "PC=" $trace | wc -l) PCs
Gives
65K trace.out 799 PCs:header-args+: :var trace=”trace.out” :header-args+: :noweb no-export :header-args+: :var heatmap=generated-filepath(file=”graphs/default-execution-heatmap.png”)
import numpy as np
from re import compile
pcs = []
extract_re = compile("\s+PC=([0-9A-Fa-f]{4,16})")
with open(trace, "r") as data:
for line in data:
match = extract_re.match(line)
if match and match.group(1):
pc = int(match.group(1), 16)
pcs.append(pc)
array = np.array(pcs)
print ("Collected %d samples from %s" % (len(pcs), trace))
Result
Collected 799 samples from trace.out
<<load-trace-data>>
base = np.amin(array, axis=0)
top = np.amax(array, axis=0)
base_page = base >> 12
top_page = top >> 12
print("Active region is %d pages from %#x to %#x" %
(top_page - base_page, base, top))
Result
Collected 799 samples from trace.out Active region is 97 pages from 0x4001d8 to 0x461f50
<<examine-data>>
counts = np.bincount(array)
mf = counts.argmax()
print("Most frequent PC = %#x (%d hits)" % (mf, counts[mf]))
Result
Collected 799 samples from trace.out Active region is 97 pages from 0x4001d8 to 0x461f50 Most frequent PC = 0x41b600 (12 hits)
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
<<examine-frequency>>
from math import sqrt
# trim low addresses and pad array to page size
active = counts[base_page<<12:]
extra = 4096 - (len(active) % 4096)
padded = np.pad(active, (0, extra),
'constant', constant_values=(0))
# reduce the map (4 insns/bucket)
bytes_bucket = 16
reduced = np.add.reduce(padded.reshape((-1, bytes_bucket)), 1)
# format as a 2d grid
side = int(sqrt(len(reduced)))
result = reduced[0:side ** 2].reshape((-1, side))
print("Reduced to %d x %d grid (%d bytes/bucket)"
% (side, len(result), bytes_bucket))
<<setup-matplotlib>>
<<prepare-data>>
ax = plt.subplot(111)
im = plt.imshow(result, cmap="hot")
<<tweak-labels>>
# add "temp" bar to right
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax)
plt.savefig(heatmap)
Run Loop
qemu_tcg_cpu_thread_fn()
+--------------+
| |
| V
| +--------------------+
| | Find Next Block |
| |cGRE |
| | +----------------+
| | | Generate Block |
| | |cRED | Add new code to buffer
| | | +--------------------------------------=---+
| | | | |
| | +----------------+ |
| | | |
| | | |
| +--------------------+ +---------------------------+ |
| | Run Block +-------->|cBLU | |
| |cGRE | | Translated Code Buffer | |
| | | | | |
| | | | | |
| | |<--------+ | |
| +---------+----------+ +------------+--------------+ |
| | : : |
| | : :<---+
+--------------+ : :
: :
: :
:header-args+: :dir ~/org/presentations
Guest Instructions TCG Micro Ops Host Instructions
(ARM) (x86)
+--------------------------+ +------------------------------+ +----------------------------+
| cYEL | | cGRE | |cPNK |
Guest PC -->| ldr q0, [x21, x0] | +->| mov_i64 tmp2,x21 | +->| movl -0x14(%r14), %ebp |
| ldr q1, [x20, x0] | | | mov_i64 tmp3,x0 | | | testl %ebp, %ebp |
| fmul v0.4s, v0.4s, v1.4s | | | add_i64 tmp2,tmp2,tmp3 | | | jl 0x5555559b87af |
| str q0, [x19, x0] | | | movi_i64 tmp7,$0x8 | | | movq 0xe8(%r14), %rbp |
| add x0, x0, #0x10 (16) | | | add_i64 tmp6,tmp2,tmp7 | | | movq 0x40(%r14), %rbx |
| cmp x0, #0x400000 | | | qemu_ld_i64 tmp4,tmp2,leq,0 | | | addq %rbx, %rbp |
| b.ne loop | | | qemu_ld_i64 tmp5,tmp6,leq,0 | | | leaq 8(%rbp), %r12 |
| | | | st_i64 tmp4,env,$0x898 | | | movq (%rbp), %rbp |
+------------+-------------+ | | st_i64 tmp5,env,$0x8a0 | | | movq (%r12), %r12 |
| | | ... | | | movq %rbp, 0x898(%r14) |
| | | ... | | | movq %r12, 0x8a0(%r14) |
| | | | | | movq 0xe0(%r14), %rbp |
| | +----+-------------------------+ | | addq %rbx, %rbp |
| | | | | leaq 8(%rbp), %rbx |
v | v | | movq (%rbp), %rbp |
+------+-------+ | +------------+ +------------+ | | movq (%rbx), %rbx |
|{io} cGRE| | |{io} cGRE| |{io} cGRE| | | movq %rbp, 0x8a8(%r14) |
| | | | | | | | | ... |
| TCG Disas +-------+ | TCG Opt +--->| TCG Gen +--+ | ... |
| | | | | | | |
| | | | | | | |
+--------------+ +------------+ +------------+ +----------------------------+
:header-args+: :dir ~/org/presentations
:header-args+: :var trace=”trace-nochain.out”
${QEMU} -d cpu,nochain -D $trace ${BINARY}
Result
36M trace-nochain.out 859943 PCs
Collected 859943 samples from trace-nochain.out Active region is 97 pages from 0x4001d8 to 0x461f50 Most frequent PC = 0x417e40 (64003 hits):header-args+: :var heatmap=generated-filepath(file=”graphs/nochain-execution-heatmap.png”) :header-args+: :var heatmap=generated-filepath(file=”graphs/nochain-execution-heatmap-log.png”)
<<setup-matplotlib>>
<<prepare-data>>
from matplotlib.colors import LogNorm
log = LogNorm(vmin=1, vmax=np.max(result))
ax = plt.subplot(111)
ax.set_facecolor("black")
im = plt.imshow(result, cmap="hot", norm=log)
<<tweak-labels>>
# add "temp" bar to right with 2n ticks
from matplotlib.ticker import LogLocator, LogFormatterSciNotation
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
locator = LogLocator(base=2)
formatter = LogFormatterSciNotation(base=2)
plt.colorbar(im, cax=cax, ticks=locator, format=formatter)
plt.savefig(heatmap)
${QEMU} -singlestep -d cpu,nochain -D $trace ${BINARY}
-ENOSPC
:header-args+: :var trace=”trace-nochain-dfilter.out”${QEMU} -d cpu,nochain -D $trace -dfilter 0x400580..0x402fdc ${BINARY}
Result
Collected 589537 samples from trace-nochain-dfilter.out Active region is 2 pages from 0x400580 to 0x402fcc Most frequent PC = 0x400580 (64001 hits):header-args+: :var heatmap=generated-filepath(file=”graphs/dfilter-execution-heatmap.png”)
-d cpu | -d cpu,nochain | -singlestep -d cpu,nochain | |
---|---|---|---|
Time | quick | slow | very slow |
Size | small | medium | huuuuge |
Covr | trans | exec (tb) | insn exec |
exec_tb(void *tb, uintptr_t pc) "tb:%p pc=0x%"PRIxPTR
trace_exec_tb(tb, tb->pc);
- Plain text (textual log file)
- Simple (binary log file)
- dtrace/ftrace
- ust (LTTNG)
- syslog
./configure --enable-trace-backends=simple,ftrace,log
- compact files
- faster output
- lots of interesting trace points
- TCG integration
${QEMU} -d nochain,trace:exec_tb -D $trace ${BINARY}
Gives
8.4M trace-event-nochain.out:header-args+: :python python3 :header-args+: :var trace=”trace-memory-events.out” :header-args+: :var BINARY=”./aarch64-linux-user/tests/fcvt”
${QEMU} -trace enable=guest_mem_before_exec,file=$trace ${BINARY}
./scripts/simpletrace.py trace-events-all $trace | head -n 3
guest_mem_before_exec 0.000 pid=29877 __cpu=0xptr vaddr=0x4000851590 info=0x3 guest_mem_before_exec 3.981 pid=29877 __cpu=0xptr vaddr=0x400420 info=0x3 guest_mem_before_exec 0.798 pid=29877 __cpu=0xptr vaddr=0x400428 info=0x3
import sys
sys.path.append("scripts/") # in QEMU_SRC
import simpletrace
class MemAnalyser(simpletrace.Analyzer):
def __init__(self):
self.stores = []
self.scounts = { 0:0, 1:0, 2:0, 3:0 }
self.loads = []
self.lcounts = { 0:0, 1:0, 2:0, 3:0 }
def guest_mem_before_exec(self, ts, cpu, vaddr, info):
store = info & (1 << 5)
size = info & 0x7
if store:
self.stores.append((ts, vaddr, size))
self.scounts[size] += 1
else:
self.loads.append((ts, vaddr, size))
self.lcounts[size] += 1
<<define-memtrace-analyser>>
# Gather data from the trace
mema = MemAnalyser()
simpletrace.process("trace-events-all", trace, mema)
print("Loads: %d -> %s" % (len(mema.loads), mema.load_counts))
print("Stores: %d -> %s" % (len(mema.stores), mema.store_counts))
Gives
Loads: 4064403 -> {0: 1082887, 1: 5561, 2: 646870, 3: 2329085} Stores: 2377689 -> {0: 179729, 1: 4970, 2: 462047, 3: 1730943}
(qemu) info trace-events bdrv_co_p*
bdrv_co_pwrite_zeroes : state 0
bdrv_co_pwritev : state 0
bdrv_co_preadv : state 0
(qemu) trace-event bdrv_co_pwritev on
(qemu) trace-event bdrv_co_preadv on
(qemu) trace-file set blocks.out
<<simpletrace-import>>
class BlockAnalyser(simpletrace.Analyzer):
def __init__(self):
self.rtotal = 0
self.wtotal = 0
self.reads = []
self.writes = []
def bdrv_co_pwritev(self, ts, pid, bs, offset, nbytes, flgs):
self.wtotal += nbytes
self.writes.append((ts, self.wtotal))
def bdrv_co_preadv(self, ts, pid, bs, offset, nbytes, flgs):
self.rtotal += nbytes
self.reads.append((ts, self.rtotal))
cold = BlockAnalyser()
hot = BlockAnalyser()
simpletrace.process("trace-events-all", "coldblocks.out", cold)
simpletrace.process("trace-events-all", "hotblocks.out", hot)
<<get-block-data>>
plt.subplot(2, 1, 1)
plt.plot(cold_reads[:,0], cold_reads[:,1], color="blue")
plt.plot(cold_writes[:,0], cold_writes[:,1], color="red")
plt.title('QEMU Compile')
plt.ylabel('Cold Cache')
plt.yticks([])
plt.xticks([])
plt.legend(labels=("read", "write"))
plt.subplot(2, 1, 2)
plt.plot(hot_reads[:,0], hot_reads[:,1], color="blue")
plt.plot(hot_writes[:,0], hot_writes[:,1], color="red")
plt.ylabel('Hot Cache')
plt.yticks([])
plt.xlabel('time')
plt.xticks([])
plt.legend(labels=("read", "write"))
plt.savefig(plot)
- latency (TCG code is slower)
- icount pegs time to insns
- not cycle accurate!
- system emulation only
${QEMU} ${QEMU_ARGS} -icount 1
With icount
<<simpletrace-import>>
class MyAnalyser(simpletrace.Analyzer):
def __init__(self):
self.irqs = []
self.reads = []
self.rtotal = 0
def tcg_irq(self, ts, cpu, cpu_id):
self.irqs.append(ts)
def bdrv_co_preadv(self, ts, pid, bs, offset, nbytes, flgs):
self.rtotal += nbytes
self.reads.append((ts, self.rtotal))
tickless = MyAnalyser()
ticked = MyAnalyser()
simpletrace.process("trace-events-all", "tickless-irqs.out", tickless)
simpletrace.process("trace-events-all", "ticked-irqs.out", ticked)
ic_tickless = MyAnalyser()
ic_ticked = MyAnalyser()
simpletrace.process("trace-events-all", "tickless-icount-irqs.out", ic_tickless)
simpletrace.process("trace-events-all", "ticked-icount-irqs.out", ic_ticked)
- builds on record/replay
- uses gdbstub
- see {PATCH v6 00/25} Fixing record/replay and adding reverse debugging
- in-situ analysis
- per instruction
- debate around API/ABI scope
- see {Qemu-devel} {RFC PATCH v2 0/7} QEMU binary instrumentation prototype
- better memory logging
- icount for user-mode
- block execution counts
- ThreadSanitizer?
- Cross-platform
- System Emulation
- Open platform
- Trace points > debug logs
- Active Development
- Use cases
- Questions?
Code to tweak the ticks to match the actual address of the hit.
from mpl_toolkits.axes_grid1 import make_axes_locatable
ylabels=ax.get_yticks().tolist()
addr_labels = []
for l in ylabels:
offset = l * side * bytes_bucket
addr_labels.append("%#x" % ((base_page << 12) + offset))
ax.set_yticklabels(addr_labels)
ax.set_xticks([])
(concat default-directory "generated/" file)
publish_org.sh
<<setup-matplotlib>>
<<analyse-memory-pattern>>
import numpy as np
loads = np.array(analyser.loads)
laddrs = loads[:,1]
counts = np.bincount(laddrs)
mf = counts.argmax()
print("Most frequent addr = %#x (%d hits)" % (mf, counts[mf]))
#plt.savefig(plot)
#return plot