Archfinch - rate things and get recommendations based on what you like.

Your account is temporary.   Sign up   to save your data (it takes less than a minute).

Hex-Rays decompiler

Summary:

  • Hex-Rays Decompiler is a C decompiler
  • Decompilers turn machine code into code in the original language, which helps readability
  • Hex-Rays Decompiler requires IDA Pro 5.5 to work

(edit)

Comments

Thank you! You will be redirected to your new comment shortly.


Rate: 1 2 3 4 5
Thank you! You will be redirected to your new comment shortly.

I wonder how decompilers work. I've written a disassembler as a side project a long time, but that was straight-forward. Decompilers seem much harder.

Rate: 1 2 3 4 5
Thank you! You will be redirected to your new comment shortly.

I've written a C decompiler with alistra (it's on github if you're curious), so I can shed some light on this.

Decompilation is a tricky process. In the general case, it is undecidable, meaning that in general, you cannot write an algorithm that would decompile every possible piece of code. It remains undecidable even if you disallow self-modifying code.

However, there are still many patterns in compiled machine code which you can use to try to construct a partial decompilation.

Our approach is best shown using a flowchart (yes):

Stages of decompilation

I will describe briefly the stages:

  1. Object dump

    Usually, executable programs are in a specific format (ELF and a.out in Unix, PE on Windows, etc.). Object dump is a process which takes an object file and extracts information from it, such as: the machine code, the entry point (where the code starts when you run the executable), list of symbols (function names, data, etc. with their associated addresses), and so on.

    For that, you can use the objdump utility available on most GNU systems as part of GNU Binutils.

    An example object dump (to get a list of symbols).

    drx:~/ocd$ objdump -t tests/test_ack | awk '$2 == "g"{print $0}' | grep -v _
    000000000040054f g     F .text  0000000000000059              ack
    0000000000400524 g     F .text  000000000000002b              main
    

    For reference, tests/test_ack.c (I will be using this example throughout the post):

    int main()
    {
        return printf("%d\n", ack(3, 4));
    }
    
    int ack(int m, int n)
    {
        if (m == 0) return n + 1;}
        else
        {
            if (n == 0) return ack(m - 1, 1);
            else return ack(m - 1, ack(m, n - 1));
        }
    }
    

    (The code implements the Ackermann function)

  2. Disassembly

    Compiled executable code is usually stored on disk as machine code (i.e. a set of binary instructions that is understood by CPUs directly). There is usually a human-readable way to transcribe this machine code, called assembly language.

    Disassembly consists of turning machine code into assembly language. For example, here's a snippet of the disassembly of test_ack into x64 code:

    ...
    
    0000000000400524 <main>:
      400524:   55                      push   %rbp
      400525:   48 89 e5                mov    %rsp,%rbp
      400528:   be 04 00 00 00          mov    $0x4,%esi
      40052d:   bf 03 00 00 00          mov    $0x3,%edi
      400532:   b8 00 00 00 00          mov    $0x0,%eax
      400537:   e8 13 00 00 00          callq  40054f <ack>
      40053c:   89 c6                   mov    %eax,%esi
      40053e:   bf 9c 06 40 00          mov    $0x40069c,%edi
      400543:   b8 00 00 00 00          mov    $0x0,%eax
      400548:   e8 cb fe ff ff          callq  400418 <printf@plt>
      40054d:   c9                      leaveq 
      40054e:   c3                      retq
    
    000000000040054f <ack>:
      40054f:   55                      push   %rbp
      400550:   48 89 e5                mov    %rsp,%rbp
      400553:   48 83 ec 10             sub    $0x10,%rsp
      400557:   89 7d fc                mov    %edi,-0x4(%rbp)
      40055a:   89 75 f8                mov    %esi,-0x8(%rbp)
      40055d:   83 7d fc 00             cmpl   $0x0,-0x4(%rbp)
      400561:   75 08                   jne    40056b <ack+0x1c>
    
     ...
    

    ocd disassembled into an intermediate language, so that more machine code types could be plugged in in the future.

  3. Control flow analysis

    A control flow graph is generated, which encodes information about the different paths a program execution might take, depending on control statements of the language (jumps, conditional statemenents, loops).

    Such control flow graphs have distinct patterns for certain control statments, here's a sample table:

    ifif/elsewhilepass
    ifif/elsewhilepass

    Those patterns can be then rewritten (using a graph rewriting system, not unlike a grammar rewriting system) into single nodes until no more patterns can be found. With this process the original code patterns emerge and we get some insight into the original program structure.

    Here is an example of such rewriting. This is the control flow graph for test_ack as it is rewritten into ultimately one node.

    Control flow graph example

  4. Data flow analysis

    First, it makes sense to convert things like register operands, etc. into fresh variables. For example, this:

    mov  eax, 0
    

    might be converted to

    var_13 = 0
    

    Then, the program's computation flow is analyzed. Dead instructions are removed and others are folded into more succinct expressions. This

    a = a + 1
    b = a * 2
    n = 2
    n = b - a
    

    might be converted into

    n = (a+1)*2-a+1
    

    Often, compilers have idioms. For example, instead of writing

    a = 0
    

    this might be used:

    a = a xor a
    

    to save cycles. A decompiler might keep a "dictionary" of such idioms to simplify the output.

  5. Program output

    Finally, the program can be output.

    For example, this is what ocd outputs for test_ack:

    int ack()
    {
        var_0 = temp_2;
        var_1 = temp_3;
        var_2 = var_0 - 0;
        if (var2)
        {
            var_2 = var_1 - 0;
            if (var2)
            {
                temp_4 = ack(temp_5 - 1, ack(var_0, temp_4 - 1));
            }
            else
            {
                temp_4 = ack(temp_4 - 1, 1);
            }
        }
        else
        {
            temp_4 = temp_4 + 1;
        }
        return temp_4;
    }
    int main()
    {
        return unknown_function("%d\n", ack(3, 4));
    }
    

    As you can see, it's not perfect, but you can see the general structure.

For a more in-depth analysis of the problem, you can read Cristina Cifuentes' PhD thesis, Reverse Compilation Techniques.

Rate: 1 2 3 4 5
Thank you! You will be redirected to your new comment shortly.

Do you have any background in reverse enginnering? Do you know if decompilers are used in practice?

Rate: 1 2 3 4 5
Thank you! You will be redirected to your new comment shortly.

I used to reverse engineer games a lot as a hobby. A decompiler would have been useful with old games written in C/C++, but only slightly.

Decompilers aren't used widely, they don't give enough value to bother trying to get them to work. Perhaps if the interface was better?

I wrote the decompiler for fun, to see what could be done, rather than to be useful.

Rate: 1 2 3 4 5
Thank you! You will be redirected to your new comment shortly.

The Hex-Rays decompiler requires IDA Pro 5.5 to work. According to this price listing, the bundle costs $2558. Pricey.

Rate: 1 2 3 4 5
Thank you! You will be redirected to your new comment shortly.

Houses are not very cheap and not everyone is able to buy it. However, personal loans was created to aid different people in such kind of cases.



Similar

People who this link also

Submitted by circaee

Short url: