Archfinch - rate things and get recommendations based on what you like.

Your account is temporary. Sign up to save your data (it takes less than a minute).

Hex-Rays decompiler

Tag it C decompilers reverseengineering

Hex-Rays decompiler

11 months ago by circaee
6 comments

Rate: 1 2 3 4 5

Summary:

Hex-Rays Decompiler is a C decompiler
Decompilers turn machine code into code in the original language, which helps readability
Hex-Rays Decompiler requires IDA Pro 5.5 to work

(edit)

Comments

Thank you! You will be redirected to your new comment shortly.

Tag it

11 months ago by Br3akTh3W1nd | reply | link

Rate: 1 2 3 4 5

Thank you! You will be redirected to your new comment shortly.

I wonder how decompilers work. I've written a disassembler as a side project a long time, but that was straight-forward. Decompilers seem much harder.

Tag it informative

11 months ago by drx | reply | link

Rate: 1 2 3 4 5

Thank you! You will be redirected to your new comment shortly.

I've written a C decompiler with alistra (it's on github if you're curious), so I can shed some light on this.

Decompilation is a tricky process. In the general case, it is undecidable, meaning that in general, you cannot write an algorithm that would decompile every possible piece of code. It remains undecidable even if you disallow self-modifying code.

However, there are still many patterns in compiled machine code which you can use to try to construct a partial decompilation.

Our approach is best shown using a flowchart (yes):

Stages of decompilation

I will describe briefly the stages:

Object dump

Usually, executable programs are in a specific format (ELF and a.out in Unix, PE on Windows, etc.). Object dump is a process which takes an object file and extracts information from it, such as: the machine code, the entry point (where the code starts when you run the executable), list of symbols (function names, data, etc. with their associated addresses), and so on.

For that, you can use the objdump utility available on most GNU systems as part of GNU Binutils.

An example object dump (to get a list of symbols).
```
drx:~/ocd$ objdump -t tests/test_ack | awk '$2 == "g"{print $0}' | grep -v _
000000000040054f g     F .text  0000000000000059              ack
0000000000400524 g     F .text  000000000000002b              main
```
For reference, tests/test_ack.c (I will be using this example throughout the post):
```
int main()
{
    return printf("%d\n", ack(3, 4));
}

int ack(int m, int n)
{
    if (m == 0) return n + 1;}
    else
    {
        if (n == 0) return ack(m - 1, 1);
        else return ack(m - 1, ack(m, n - 1));
    }
}
```
(The code implements the Ackermann function)

Disassembly

Compiled executable code is usually stored on disk as machine code (i.e. a set of binary instructions that is understood by CPUs directly). There is usually a human-readable way to transcribe this machine code, called assembly language.

Disassembly consists of turning machine code into assembly language. For example, here's a snippet of the disassembly of test_ack into x64 code:

...

0000000000400524 <main>:
  400524:   55                      push   %rbp
  400525:   48 89 e5                mov    %rsp,%rbp
  400528:   be 04 00 00 00          mov    $0x4,%esi
  40052d:   bf 03 00 00 00          mov    $0x3,%edi
  400532:   b8 00 00 00 00          mov    $0x0,%eax
  400537:   e8 13 00 00 00          callq  40054f <ack>
  40053c:   89 c6                   mov    %eax,%esi
  40053e:   bf 9c 06 40 00          mov    $0x40069c,%edi
  400543:   b8 00 00 00 00          mov    $0x0,%eax
  400548:   e8 cb fe ff ff          callq  400418 <printf@plt>
  40054d:   c9                      leaveq 
  40054e:   c3                      retq

000000000040054f <ack>:
  40054f:   55                      push   %rbp
  400550:   48 89 e5                mov    %rsp,%rbp
  400553:   48 83 ec 10             sub    $0x10,%rsp
  400557:   89 7d fc                mov    %edi,-0x4(%rbp)
  40055a:   89 75 f8                mov    %esi,-0x8(%rbp)
  40055d:   83 7d fc 00             cmpl   $0x0,-0x4(%rbp)
  400561:   75 08                   jne    40056b <ack+0x1c>

 ...

ocd disassembled into an intermediate language, so that more machine code types could be plugged in in the future.

Control flow analysis

A control flow graph is generated, which encodes information about the different paths a program execution might take, depending on control statements of the language (jumps, conditional statemenents, loops).

Such control flow graphs have distinct patterns for certain control statments, here's a sample table:

if if/else while pass

Those patterns can be then rewritten (using a graph rewriting system, not unlike a grammar rewriting system) into single nodes until no more patterns can be found. With this process the original code patterns emerge and we get some insight into the original program structure.

Here is an example of such rewriting. This is the control flow graph for test_ack as it is rewritten into ultimately one node.
Data flow analysis

First, it makes sense to convert things like register operands, etc. into fresh variables. For example, this:
```
mov  eax, 0
```
might be converted to
```
var_13 = 0
```
Then, the program's computation flow is analyzed. Dead instructions are removed and others are folded into more succinct expressions. This
```
a = a + 1
b = a * 2
n = 2
n = b - a
```
might be converted into
```
n = (a+1)*2-a+1
```
Often, compilers have idioms. For example, instead of writing
```
a = 0
```
this might be used:
```
a = a xor a
```
to save cycles. A decompiler might keep a "dictionary" of such idioms to simplify the output.

Program output

Finally, the program can be output.

For example, this is what ocd outputs for test_ack:

int ack()
{
    var_0 = temp_2;
    var_1 = temp_3;
    var_2 = var_0 - 0;
    if (var2)
    {
        var_2 = var_1 - 0;
        if (var2)
        {
            temp_4 = ack(temp_5 - 1, ack(var_0, temp_4 - 1));
        }
        else
        {
            temp_4 = ack(temp_4 - 1, 1);
        }
    }
    else
    {
        temp_4 = temp_4 + 1;
    }
    return temp_4;
}
int main()
{
    return unknown_function("%d\n", ack(3, 4));
}

As you can see, it's not perfect, but you can see the general structure.

For a more in-depth analysis of the problem, you can read Cristina Cifuentes' PhD thesis, Reverse Compilation Techniques.

Tag it

11 months ago by anonymous user | reply | link

Rate: 1 2 3 4 5

Thank you! You will be redirected to your new comment shortly.

Do you have any background in reverse enginnering? Do you know if decompilers are used in practice?

Tag it

11 months ago by drx | reply | link

Rate: 1 2 3 4 5

Thank you! You will be redirected to your new comment shortly.

I used to reverse engineer games a lot as a hobby. A decompiler would have been useful with old games written in C/C++, but only slightly.