The challenge requires players to write shellcode implementing MD5. The limitations are:
- One basic block, which means no `jmp`/`call`-like instructions.
- At most 0x233 instructions.
- The executed code is 0x2000 bytes long.
You can find an assembly implementation that meets these demands: https://www.nayuki.io/page/fast-md5-hash-implementation-in-x86-assembly
Then use an arbitrarily long x86 instruction to bypass the other two limitations. You can easily find one here: https://stackoverflow.com/questions/14698350/x86-64-asm-maximum-bytes-for-an-instruction/18972014
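To make this concrete, here is a Python sketch of building such an instruction from redundant operand-size prefixes. The 0x1001-byte length is an arbitrary illustration; a real CPU rejects anything longer than 15 bytes with #GP, but the affected translator decodes the prefixes one by one.

```python
# Build an x86 instruction far beyond the architectural 15-byte limit
# by piling redundant operand-size (0x66) prefixes onto a NOP.
# Real hardware raises #GP for instructions longer than 15 bytes;
# the old QEMU/Unicorn x86 decoder consumes prefixes without that check.
def long_nop(total_len: int) -> bytes:
    assert total_len >= 1
    return b"\x66" * (total_len - 1) + b"\x90"  # prefixes + NOP

insn = long_nop(0x1001)   # long enough to cross a 0x1000-byte page
assert len(insn) > 15     # far beyond what a real CPU would accept
```

Because the whole thing decodes as a single instruction, it also counts as only one instruction against the 0x233 limit.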
Both links are among the first results if you Google for them. I deliberately made this challenge easy because I want to introduce a bug Unicorn has. Don't treat this as a feature, it's a bug! It's fixed upstream by a QEMU commit. However, Unicorn uses an old version of QEMU, so it's still affected.
A simple logic bug.
The bug mentioned above can lead to EoP inside a VM running in TCG mode, which is explained in detail by this Project Zero issue. When it comes to Unicorn, that attack doesn't work anymore: in Unicorn you have only one CPU, and physical addresses and virtual addresses are mapped one-to-one. You can't create the situation the issue describes, which needs two different processes.
Then how does the bug affect Unicorn at all? Let's dig deeper into TCG internals. All the code below is taken from Unicorn v1.0.3.
TCG generates JIT code per basic block as a TB (Translation Block), and caches TBs for efficiency. When execution reaches a new block, it first calls `tb_find_fast` to look for a cached TB in `cpu->tb_jmp_cache`. If not found, `tb_find_slow` looks it up in `tcg_ctx->tb_ctx.tb_phys_hash`. If still not found, a new TB will be generated via `tb_gen_code`.
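A toy Python model of this two-level lookup may help; the dict-based caches and the `("TB", pc)` stub are simplifications for illustration, not the real structures.

```python
# Toy model of TCG's two-level TB lookup. The names mirror the
# QEMU/Unicorn structures, but the logic is deliberately simplified.
tb_jmp_cache = {}    # fast path: virtual PC -> TB
tb_phys_hash = {}    # slow path: physical PC -> TB

def tb_find(pc, phys_pc):
    tb = tb_jmp_cache.get(pc)            # tb_find_fast
    if tb is None:
        tb = tb_phys_hash.get(phys_pc)   # tb_find_slow
        if tb is None:
            tb = ("TB", pc)              # tb_gen_code (stub)
            tb_phys_hash[phys_pc] = tb
        tb_jmp_cache[pc] = tb
    return tb

first = tb_find(0x1000, 0x1000)
again = tb_find(0x1000, 0x1000)
assert first is again    # second execution hits the cache
```

The important property for this challenge is that once a TB is cached, the same translation is reused until something explicitly invalidates it.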
In `tb_gen_code`, it handles the case where a basic block spans two pages.
```c
    tcg_ctx->code_gen_ptr = (void *)(((uintptr_t)tcg_ctx->code_gen_ptr +
            code_gen_size + CODE_GEN_ALIGN - 1) & ~(CODE_GEN_ALIGN - 1));

    phys_page2 = -1;
    /* check next page if needed */
    if (tb->size) {
        target_ulong virt_page2 = (pc + tb->size - 1) & TARGET_PAGE_MASK;
        if ((pc & TARGET_PAGE_MASK) != virt_page2) {
            phys_page2 = get_page_addr_code(env, virt_page2);
        }
    }
    tb_link_page(cpu->uc, tb, phys_pc, phys_page2);
    return tb;
```
The second page's address is derived from `pc + tb->size - 1`, masked to the page boundary. This is correct as long as a basic block is smaller than a page, which is ensured by the following generation logic: it stops translation if the block grows too large.
```c
        /* if too long translation, stop generation too */
        if (tcg_ctx->gen_opc_ptr >= gen_opc_end ||
            (pc_ptr - pc_start) >= (TARGET_PAGE_SIZE - 32) ||
            num_insns >= max_insns) {
            gen_jmp_im(dc, pc_ptr - dc->cs_base);
            gen_eob(dc);
            block_full = true;
            break;
        }
```
However, the bug breaks this logic: the size check runs only between instructions, so a single over-long instruction can make the basic block span three pages. The page in the middle will not be recorded by TCG. From TCG's point of view, the translated TB covers only two pages; the middle page becomes invisible to TCG. No matter what you write to that page, the TB cache is not affected.
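The bookkeeping gap can be sketched with the page arithmetic from `tb_gen_code`; the `pc` and `size` values here are hypothetical, chosen so the block crosses three pages.

```python
# tb_gen_code records only the first page (from pc) and the last page
# (from pc + tb->size - 1). With a basic block larger than a page,
# the page in between is executed from but never recorded.
TARGET_PAGE_SIZE = 0x1000
TARGET_PAGE_MASK = ~(TARGET_PAGE_SIZE - 1)

pc, size = 0x400FF0, 0x1100   # hypothetical block crossing 3 pages

first_page = pc & TARGET_PAGE_MASK
last_page = (pc + size - 1) & TARGET_PAGE_MASK
recorded = {first_page, last_page}
spanned = set(range(first_page, last_page + 1, TARGET_PAGE_SIZE))

assert spanned - recorded == {0x401000}   # the invisible middle page
```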
But you may still not be able to solve the challenge due to another issue.
> Self-modifying code is a special challenge in x86 emulation because no instruction cache invalidation is signaled by the application when code is modified.
>
> User-mode emulation marks a host page as write-protected (if it is not already read-only) every time translated code is generated for a basic block. Then, if a write access is done to the page, Linux raises a SEGV signal. QEMU then invalidates all the translated code in the page and enables write accesses to the page. For system emulation, write protection is achieved through the software MMU.
If any TB is generated on the middle page, the page will be protected. In `tb_alloc_page`:
```c
#ifdef DEBUG_TB_INVALIDATE
        printf("protecting code page: 0x" TARGET_FMT_lx "\n",
               page_addr);
#endif
    }
#else
    /* if some code is already present, then the pages are already
       protected. So we handle the case where only the first TB is
       allocated in a physical page */
    if (!page_already_protected) {
        tlb_protect_code(uc, page_addr);
    }
#endif
```
I add a `jmp` instruction at the end of the user shellcode to stop the basic block from growing too long. You shouldn't patch the code with your own shellcode, or the page will be protected and any write to it will always flush the relevant TB caches.
So the intended solution is as follows:
- Send shellcode that invokes the backdoor.
- Use the provided patch functionality to patch the `jmp` into the instruction prefix.
- Execute the user shellcode to trigger TB cache generation.
- Execute the admin code; the cached TB will be executed.
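The steps above can be modeled with a toy Python cache. Every address, size, and helper name here is invented for illustration; only the invalidation logic mirrors the mechanism described above.

```python
# Toy model of the attack: a TB cache that write-protects only the
# pages it recorded for a TB. Patching the unrecorded middle page
# leaves the stale TB in place, so the old code keeps "executing".
PAGE = 0x1000
mem = bytearray(3 * PAGE)
tb_cache = {}          # pc -> snapshot of the translated code
protected = set()      # pages whose modification flushes TBs

def execute(pc, size):
    if pc not in tb_cache:
        first, last = pc // PAGE, (pc + size - 1) // PAGE
        protected.update({first, last})     # middle pages are missed
        tb_cache[pc] = bytes(mem[pc:pc + size])
    return tb_cache[pc]                     # cached translation

def patch(addr, data):
    mem[addr:addr + len(data)] = data
    if addr // PAGE in protected:           # SMC detection
        tb_cache.clear()                    # flush translations

mem[0:0x2100] = b"\x66" * 0x20FF + b"\x90"  # 3-page basic block
old = execute(0x0, 0x2100)
patch(0x1800, b"\xCC" * 4)                  # write to the invisible page
assert execute(0x0, 0x2100) == old          # stale TB still runs
```

Writing to page 1 changes memory but not the cached translation, which is exactly why the patched admin code still runs the old, cached TB.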
It's really a shame that I made the mistake of leaving the middle page writable, so most teams solved it by modifying the admin shellcode. But GG to Shellphish, they solved it in my intended way. I'm glad at least one team did it that way.