Terminology
Most decompiler terminology is familiar to compiler researchers. However, a few terms are not, and not all readers may be familiar with compiler technology.
Terms
- "Original source code", as opposed to the decompiled output (which, although an output, is still referred to as "source" code). This is the original, usually high level, source code that the program was written in.
- "Input program", for the program that the decompiler reads. Terms such as "source program" just create confusion.
- "Decompiled output", which as noted above is a form of source code.
- "Executable file" refers to a general class of files that could be decompiled. Here, the term includes both machine code programs and programs compiled to a virtual machine form. The term "native" distinguishes machine code executables from others. Most people understand "machine code" to mean an executable that is executed directly by the processor, so "machine code" means the same as "native executable" but is clearer.
- "Source" is very firmly associated with a high level program representation, and the terms "Target" and "retargeting" are firmly associated with machines. The only solution seems to be to avoid these terms altogether.
- "Binary" program. In reality, all program representations are combinations of ones and zeroes, and so in a sense are binary programs.
- NJMC TK: The New Jersey Machine Code Toolkit. See Norman Ramsey's page.
- UQBT: University of Queensland Binary Translator. Boomerang is based in part on code from UQBT. See the UQBT page and the alpha code release here.
- SSL: Semantic Specification Language. This is the language used in SSL files, which specify the semantics (meaning) of instructions.
- IR: Internal Representation; a representation of the input program in a form that is convenient for the current analysis or transformation.
- RTL: Register Transfer List (sometimes Register Transfer Language). The term "register transfer" comes from hardware design, where registers are arrays of single bit storage elements, but in software engineering it has come to mean a style of program representation at the register and memory level. Every transfer (assignment) is explicit, including transfers to flags registers.
- DFA: Data Flow Analysis.
- SSA: Static Single Assignment. A representation variation that makes certain kinds of DFA easier to perform.
- CFG: Control Flow Graph. Nodes in the CFG are Basic Blocks, and edges represent possible control flow (execution paths that the program could take). For example, a basic block ending in a conditional branch would have two out-edges, one each for the case where the branch is taken and not taken.
- BB: Basic Block. Usually a list of statements or RTLs which are always executed together. A basic block is terminated by a conditional or unconditional branch or call, an indirect branch or call (including n-way branches or switch statements), a return instruction, or a label (where other control flow enters). If the label is not explicit, a "fall through" basic block could terminate in an ordinary (non control flow altering) instruction.
- TA: Type Analysis.
- HLL: High Level Language; in a compiler, typically the output language.
- AST: Abstract Syntax Tree. This is an IR close to the HLL, typically at the statement level. For example, a node of the AST might be labeled as a pretested-while node, and the children of that node could be the loop condition expression and a block node representing the statements in the loop. In a compiler, an AST typically results from parsing the input HLL program.
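
The CFG and BB entries above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the tuple-based instruction format, the names BasicBlock, build_cfg and PROGRAM, and the choice of Python are all invented here and are not taken from Boomerang or UQBT. Calls and indirect (n-way) branches are left out to keep it short. The sketch cuts a statement list into basic blocks at labels and after branches or returns, then adds one out-edge per possible path.

```python
from dataclasses import dataclass, field

# Hypothetical toy instruction format, invented for this illustration:
#   ("label", name)      a point where other control flow may enter
#   ("assign", text)     an ordinary statement (does not alter control flow)
#   ("branch", target)   unconditional branch to a label
#   ("cbranch", target)  conditional branch: taken -> target, else fall through
#   ("ret",)             return


@dataclass
class BasicBlock:
    index: int
    stmts: list = field(default_factory=list)
    succs: list = field(default_factory=list)  # out-edges, as block indices


def build_cfg(program):
    """Split a flat statement list into basic blocks and link them."""
    if not program:
        return []

    # 1. A new block starts at the first statement, at every label, and
    #    at the statement following any branch or return.
    leaders = {0}
    for i, stmt in enumerate(program):
        if stmt[0] == "label":
            leaders.add(i)
        elif stmt[0] in ("branch", "cbranch", "ret") and i + 1 < len(program):
            leaders.add(i + 1)

    # 2. Cut the program at the leaders and remember which block each
    #    label begins.
    starts = sorted(leaders)
    blocks, label_to_block = [], {}
    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(program)
        bb = BasicBlock(index=n, stmts=program[start:end])
        if bb.stmts[0][0] == "label":
            label_to_block[bb.stmts[0][1]] = n
        blocks.append(bb)

    # 3. Add out-edges according to the last statement of each block.
    for bb in blocks:
        last = bb.stmts[-1]
        has_next = bb.index + 1 < len(blocks)
        if last[0] == "branch":
            bb.succs.append(label_to_block[last[1]])
        elif last[0] == "cbranch":
            bb.succs.append(label_to_block[last[1]])  # branch taken
            if has_next:
                bb.succs.append(bb.index + 1)         # branch not taken
        elif last[0] == "ret":
            pass                                      # no out-edges
        elif has_next:
            bb.succs.append(bb.index + 1)             # "fall through" block
    return blocks


# A simple counting loop: i = 0; while (i < 10) i = i + 1; return
PROGRAM = [
    ("assign", "i = 0"),
    ("label", "loop"),
    ("cbranch", "done"),       # if i >= 10 goto done
    ("assign", "i = i + 1"),
    ("branch", "loop"),
    ("label", "done"),
    ("ret",),
]

if __name__ == "__main__":
    for bb in build_cfg(PROGRAM):
        print(f"BB{bb.index}: {bb.stmts} -> {bb.succs}")
```

In this example the first block ends in an ordinary assignment because the next statement is a label, so it is a "fall through" block with a single out-edge. The block ending in the conditional branch has two out-edges, one for the taken case and one for the not-taken case.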
Last modified: 27/Aug/2005: Minor spelling changes (thanks, Mohsen!).