[EMNLP 2023] CP-BCS: Binary Code Summarization Guided by Pseudo Code

Sentosa, Singapore - EMNLP 2023

Abstract

Automatically generating function summaries for binaries is an extremely valuable but challenging task, since it involves translating the execution behavior and semantics of the low-level language (assembly code) into human-readable natural language. However, most current works on understanding assembly code are oriented towards generating function names, which involve numerous abbreviations that make them still confusing. To bridge this gap, we focus on generating complete summaries for binary functions, especially for stripped binary (no symbol table and debug information in reality). To fully exploit the semantics of assembly code, we present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS. CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics. We evaluate CP-BCS on 3 different binary optimization levels (O1, O2, and O3) for 3 different computer architectures (X86, X64, and ARM). The evaluation results demonstrate CP-BCS is superior and significantly improves the efficiency of reverse engineering.
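
To make the idea of a bidirectional instruction-level control flow graph more concrete, here is a minimal sketch in Python. It is not the CP-BCS implementation. It assumes that "bidirectional" means every control-flow edge is also stored in the reverse direction, which the abstract does not spell out. The instruction listing, the mnemonic sets, and the function name build_bidirectional_cfg are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy listing: (address, mnemonic, jump target or None).
# Real disassembly would come from a disassembler; this is illustrative only.
INSTRUCTIONS = [
    (0, "cmp", None),
    (1, "jle", 4),     # conditional jump: may go to 4 or fall through to 2
    (2, "mov", None),
    (3, "jmp", 5),     # unconditional jump: never falls through
    (4, "xor", None),
    (5, "ret", None),  # function exit: no successors
]

# Assumed mnemonic classes for this toy example.
UNCONDITIONAL = {"jmp"}
TERMINATORS = {"ret"}


def build_bidirectional_cfg(instructions):
    """Return (forward, backward) adjacency maps keyed by instruction address.

    Each node is a single instruction rather than a basic block. The
    backward map stores every forward edge reversed, so a model can pass
    information in both directions along the control flow.
    """
    forward = defaultdict(set)
    backward = defaultdict(set)
    addresses = [addr for addr, _, _ in instructions]

    for i, (addr, mnemonic, target) in enumerate(instructions):
        successors = set()
        if target is not None:
            successors.add(target)  # explicit jump edge
        falls_through = mnemonic not in UNCONDITIONAL and mnemonic not in TERMINATORS
        if falls_through and i + 1 < len(addresses):
            successors.add(addresses[i + 1])  # sequential edge to next instruction
        for succ in successors:
            forward[addr].add(succ)
            backward[succ].add(addr)

    return dict(forward), dict(backward)


forward, backward = build_bidirectional_cfg(INSTRUCTIONS)
for addr in sorted(forward):
    print(addr, "->", sorted(forward[addr]))
```

In this sketch the conditional jump at address 1 has two successors, while the return at address 5 has none but is reachable from both branches through the backward map.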

Publication
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
