Random thoughts about native code generation, which will be compatible
with the already existing (non-host-specific) dyntrans core.


How to keep track of the number of times a basic block is executed?
(Perhaps needed, since unnecessary native code generation may slow things
down. Only the blocks that are executed often need to be natively
translated.)

Perhaps having a small additional array per page is a solution?
        unsigned char count[NR_OF_IC_ENTRIES_PER_PAGE];
For a typical MIPS cpu (4 KB pages, 4-byte instructions), that would be
1024 bytes extra per page.
The main loop could be changed to increase count, and if count goes beyond
a certain threshold, the block is natively translated. Hm.

Or perhaps the overhead of implementing this counter check is more than it
is worth? After all, most of the time will be spent executing (some of)
the translated loops.
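
A minimal sketch of the counter idea, to make the cost visible. Only the
count[] array and NR_OF_IC_ENTRIES_PER_PAGE come from the notes above;
struct page_counters, NATIVE_THRESHOLD (and its value) and
block_became_hot() are hypothetical names made up for illustration. The
counter saturates at the threshold so that an unsigned char never wraps.

```c
#include <assert.h>
#include <stddef.h>

#define NR_OF_IC_ENTRIES_PER_PAGE 1024  /* 4 KB page / 4-byte instructions */
#define NATIVE_THRESHOLD          200   /* hypothetical value, must be <= 255 */

/* Hypothetical per-page side array, one counter per IC entry. */
struct page_counters {
    unsigned char count[NR_OF_IC_ENTRIES_PER_PAGE];
};

/*
 * Called when the block starting at IC entry 'index' is about to run.
 * Returns 1 exactly once, at the moment the counter reaches the
 * threshold; after that the counter stays put and 0 is returned.
 */
int block_became_hot(struct page_counters *p, size_t index)
{
    assert(index < NR_OF_IC_ENTRIES_PER_PAGE);
    if (p->count[index] >= NATIVE_THRESHOLD)
        return 0;               /* already reported as hot */
    p->count[index]++;
    return p->count[index] == NATIVE_THRESHOLD;
}
```

The check is one load, one compare and (usually) one increment per block
entry, which is the overhead the question above is worried about.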

-------------------------------------

At most one [basic] block is ever translated at any given time.
A small array can hold the INR (intermediate native representation)
entries, and a small memory area can hold a (doubly-linked) list of
native instruction entries.

Simple instructions:

    32-bit MIPS:
        andi    $5,$5,0xff00
        ori     $5,$5,0x0011

    Intermediate native representation (INR):
        AND_REG32PTR_REG32PTR_IMM16 (offset to reg 5, offset to reg 5, 0xff00)
        OR_REG32PTR_REG32PTR_IMM16  (offset to reg 5, offset to reg 5, 0x0011)
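
One possible shape for the INR array, as a sketch only. The two opcode
names are taken from the example above; everything else (struct
inr_entry, struct inr, INR_MAX_ENTRIES, inr_add(), struct cpu_sketch) is
hypothetical. It also assumes that the first offset is the destination
register and the second the source, which the notes do not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INR_MAX_ENTRIES 64      /* hypothetical size of the small array */

enum inr_opcode {
    INR_AND_REG32PTR_REG32PTR_IMM16,
    INR_OR_REG32PTR_REG32PTR_IMM16
};

struct inr_entry {
    enum inr_opcode opcode;
    size_t   dst_ofs;           /* offset of destination reg in the cpu struct */
    size_t   src_ofs;           /* offset of source reg in the cpu struct */
    uint16_t imm;
};

struct inr {
    struct inr_entry entry[INR_MAX_ENTRIES];
    int n;                      /* number of entries in use */
};

/* Stand-in for the real cpu struct: only the 32 GPRs. */
struct cpu_sketch {
    uint32_t gpr[32];
};

/* Append one entry. Returns 0 on success, -1 if the array is full. */
int inr_add(struct inr *inr, enum inr_opcode opcode,
    size_t dst_ofs, size_t src_ofs, uint16_t imm)
{
    assert(inr->n >= 0 && inr->n <= INR_MAX_ENTRIES);
    if (inr->n == INR_MAX_ENTRIES)
        return -1;
    inr->entry[inr->n].opcode = opcode;
    inr->entry[inr->n].dst_ofs = dst_ofs;
    inr->entry[inr->n].src_ofs = src_ofs;
    inr->entry[inr->n].imm = imm;
    inr->n++;
    return 0;
}
```

A full array would be a natural point to end the block being translated.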

    Non-peephole-optimized x86[_64] code:  (esi = struct cpu *)
        mov     eax, [esi + offset_to_source_reg]
        and     eax, 0xff00
        mov     [esi + offset_to_destination_reg], eax          (#1)
        mov     eax, [esi + offset_to_source_reg]               (#2)
        or      eax, 0x0011
        mov     [esi + offset_to_destination_reg], eax

    Peephole-optimized x86[_64] code:
        (On the first pass, #2 is removed, since it loads back the value
        that #1 just wrote. Source and destination are both reg 5 here,
        so the value is already in eax!)
        (On the second pass, the store at #1 is removed, since a later
        store overwrites the same register before anything reads it.)
        mov     eax, [esi + offset_to_source_reg]
        and     eax, 0xff00
        or      eax, 0x0011
        mov     [esi + offset_to_destination_reg], eax
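
The two passes can be sketched on a doubly-linked list of native ops.
This is not the real optimizer: struct native_op, the OP_* kinds and all
function names are hypothetical, and the sketch assumes straight-line
code with no exit in the middle of the block (an early exit would make
removing store #1 unsafe).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum native_kind { OP_LOAD, OP_STORE, OP_AND_IMM, OP_OR_IMM };

struct native_op {
    enum native_kind kind;
    int      reg;       /* host register number, 0 = eax */
    size_t   ofs;       /* offset into struct cpu (LOAD/STORE only) */
    uint32_t imm;       /* immediate (AND_IMM/OR_IMM only) */
    struct native_op *prev, *next;
};

/* Link ops[0..n-1] into a doubly-linked list and return its head. */
struct native_op *link_ops(struct native_op *ops, int n)
{
    int i;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        ops[i].prev = (i > 0) ? &ops[i - 1] : NULL;
        ops[i].next = (i < n - 1) ? &ops[i + 1] : NULL;
    }
    return &ops[0];
}

void unlink_op(struct native_op **head, struct native_op *op)
{
    if (op->prev != NULL)
        op->prev->next = op->next;
    else
        *head = op->next;
    if (op->next != NULL)
        op->next->prev = op->prev;
    op->prev = op->next = NULL;
}

/* Pass 1: drop a load that directly follows a store of the same host
   register to the same offset; the value is still in the register. */
int remove_redundant_loads(struct native_op **head)
{
    struct native_op *op = *head, *next;
    int removed = 0;
    while (op != NULL) {
        next = op->next;
        if (op->kind == OP_LOAD && op->prev != NULL &&
            op->prev->kind == OP_STORE &&
            op->prev->reg == op->reg && op->prev->ofs == op->ofs) {
            unlink_op(head, op);
            removed++;
        }
        op = next;
    }
    return removed;
}

/* Pass 2: drop a store if a later store hits the same offset and no
   load from that offset comes in between. */
int remove_dead_stores(struct native_op **head)
{
    struct native_op *op = *head, *next, *later;
    int removed = 0;
    while (op != NULL) {
        next = op->next;
        if (op->kind == OP_STORE) {
            for (later = op->next; later != NULL; later = later->next) {
                if (later->kind == OP_LOAD && later->ofs == op->ofs)
                    break;      /* value is read back: store is live */
                if (later->kind == OP_STORE && later->ofs == op->ofs) {
                    unlink_op(head, op);
                    removed++;
                    break;
                }
            }
        }
        op = next;
    }
    return removed;
}
```

The order of the passes matters: as long as load #2 is still in the
list, store #1 looks live to the second pass.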

    Native code entry:
        (none on x86_64)

    Native code exit:
        ret[q]

---------------------------

Update of nr-of-executed-instructions and the IC pointer:

All possible return paths need to update the following:

        x)  The nr-of-executed-instructions count (one less than the
            number of instructions in the translated block, since an
            implicit count of 1 is already included).
        x)  The next_ic pointer, and also the cur_page if we have
            switched page.
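
The bookkeeping above, written out as plain C to show what every return
path has to do. next_ic and cur_page are the names used in the notes;
struct cpu_sketch, struct ic_entry, the n_executed field and
native_block_exit() are hypothetical stand-ins. In generated code this
would be a few inlined native instructions rather than a call.

```c
#include <assert.h>
#include <stddef.h>

struct ic_entry {
    int dummy;                  /* stand-in for an instruction call entry */
};

struct cpu_sketch {
    long long        n_executed;    /* hypothetical instruction counter */
    struct ic_entry *cur_page;      /* first IC entry of the current page */
    struct ic_entry *next_ic;       /* next IC entry to execute */
};

/*
 * Bookkeeping for one return path of a translated block.
 * n_instrs_in_block is the number of emulated instructions that were
 * executed; the caller is assumed to have counted 1 of them already.
 */
void native_block_exit(struct cpu_sketch *cpu, int n_instrs_in_block,
    struct ic_entry *new_page, struct ic_entry *new_next_ic)
{
    assert(n_instrs_in_block >= 1);
    cpu->n_executed += n_instrs_in_block - 1;
    if (new_page != cpu->cur_page)
        cpu->cur_page = new_page;   /* we have switched page */
    cpu->next_ic = new_next_ic;
}
```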

-----------------------------

Stages during translation:

    Stage 1:
        Emulated ISA (e.g. MIPS) to INR instructions.
        Each emulated instruction may be turned into 0 or
        more INR instructions.
        This is done in e.g. src/cpus/cpu_mips_instr.c
        using semi-magic macros.
        The INR array is a small fixed-size array, pointed
        to by the cpu struct.

    Stage 2:
        INR -> native operations (e.g. x86).
        This is done in src/native/native_x86.c.
        One thing to think about is round-robin use of
        temporary registers.
        native_inr_to_native_ops() takes a cpu as input and
        translates the current INR entries into native
        pseudo-opcodes.
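
Round-robin use of temporary registers could look roughly like this. The
choice of eax/ecx/edx as temporaries, and the names struct temp_alloc,
temp_alloc_next() and temp_reg_name(), are assumptions for illustration.
The sketch does not check whether the register it hands out still holds
a live value; a real allocator would have to.

```c
#include <assert.h>
#include <stddef.h>

#define N_TEMP_REGS 3           /* assumption: eax, ecx, edx are free */

struct temp_alloc {
    int next;                   /* index of the next register to hand out */
};

/* Return the next temporary register index, cycling 0, 1, 2, 0, ... */
int temp_alloc_next(struct temp_alloc *a)
{
    int r;
    assert(a->next >= 0 && a->next < N_TEMP_REGS);
    r = a->next;
    a->next = (a->next + 1) % N_TEMP_REGS;
    return r;
}

/* Name of a temporary register, for debug dumps of the native ops. */
const char *temp_reg_name(int r)
{
    static const char *const names[N_TEMP_REGS] = { "eax", "ecx", "edx" };
    assert(r >= 0 && r < N_TEMP_REGS);
    return names[r];
}
```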

    Stage 3:
        Optimization, native ops -> native ops.
        This is done in src/native/native_x86_optim.c,
        and is an optional step. It should be possible
        to turn this step off, for debugging.
        If e.g. a value is in a register, and it is stored
        to memory, then the same memory position does not
        have to be read back; the value is already in a
        register.

    Stage 4:
        Code generation, native ops -> native machine code.
        Done in src/native/native_x86_gen.c.
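
A sketch of what stage 4 boils down to for the andi/ori example: writing
instruction bytes into a buffer. The encodings are the standard 32-bit
x86 ones (in 64-bit mode, addressing via esi instead of rsi would need
an extra address-size prefix). struct code_buf and the emit/gen function
names are hypothetical. The bytes are only written here, not executed;
running them would require executable memory, which is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct code_buf {
    uint8_t *p;                 /* start of the output buffer */
    size_t   len;               /* bytes emitted so far */
    size_t   cap;               /* capacity of the buffer */
};

void emit8(struct code_buf *b, uint8_t v)
{
    assert(b->len < b->cap);
    b->p[b->len++] = v;
}

/* Emit a 32-bit value in little-endian byte order. */
void emit32(struct code_buf *b, uint32_t v)
{
    int i;
    for (i = 0; i < 4; i++)
        emit8(b, (uint8_t)(v >> (8 * i)));
}

/* mov eax, [esi + disp32]      8B 86 disp32 */
void gen_load_eax(struct code_buf *b, uint32_t ofs)
{
    emit8(b, 0x8B); emit8(b, 0x86); emit32(b, ofs);
}

/* and eax, imm32               25 imm32 */
void gen_and_eax_imm32(struct code_buf *b, uint32_t imm)
{
    emit8(b, 0x25); emit32(b, imm);
}

/* or eax, imm32                0D imm32 */
void gen_or_eax_imm32(struct code_buf *b, uint32_t imm)
{
    emit8(b, 0x0D); emit32(b, imm);
}

/* mov [esi + disp32], eax      89 86 disp32 */
void gen_store_eax(struct code_buf *b, uint32_t ofs)
{
    emit8(b, 0x89); emit8(b, 0x86); emit32(b, ofs);
}

/* ret                          C3 */
void gen_ret(struct code_buf *b)
{
    emit8(b, 0xC3);
}
```

With these helpers, the peephole-optimized sequence plus the exit is
load, and, or, store, ret: 23 bytes in total.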

    Stage 5:
        Patch _older_ code chunks so that they can branch
        directly to the new chunk, if possible.
        An optional step.
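
One way such a patch could work on x86 is to overwrite the exit of the
older chunk with a 5-byte near jump (E9 rel32), where rel32 is relative
to the end of the jump instruction. patch_jmp_rel32() is a hypothetical
helper, and it assumes the older chunk has 5 patchable bytes at its exit
and that the target is within +/- 2 GB ("if possible" above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Overwrite the 5 bytes at 'site' with "jmp target".
 * Returns 0 on success, -1 if the target is out of rel32 range.
 */
int patch_jmp_rel32(uint8_t *site, const uint8_t *target)
{
    intptr_t rel;
    uint32_t u;
    int i;

    assert(site != NULL && target != NULL);
    /* The displacement counts from the byte after the 5-byte jump. */
    rel = (intptr_t)target - ((intptr_t)site + 5);
    if (rel < INT32_MIN || rel > INT32_MAX)
        return -1;
    u = (uint32_t)(int32_t)rel;
    site[0] = 0xE9;
    for (i = 0; i < 4; i++)
        site[1 + i] = (uint8_t)(u >> (8 * i));
    return 0;
}
```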

    Stage 6:
        Enter the newly generated native code chunk into
        the physpage's ic->f.
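
Stage 6 amounts to replacing a function pointer, so that the existing
main loop calls the native chunk without knowing it is native. A sketch,
assuming ic->f has the shape "function taking the cpu and the ic entry";
struct cpu_sketch, struct ic_entry and the two stub functions are
hypothetical, and an ordinary C function stands in for the generated
machine code.

```c
#include <assert.h>
#include <stddef.h>

struct cpu_sketch {
    int interpreted_runs;
    int native_runs;
};

struct ic_entry {
    /* Assumed shape of ic->f: called by the main loop for this entry. */
    void (*f)(struct cpu_sketch *, struct ic_entry *);
};

/* Stand-in for the normal (non-native) instruction implementation. */
void interpreted_instr(struct cpu_sketch *cpu, struct ic_entry *ic)
{
    (void)ic;
    cpu->interpreted_runs++;
}

/* Stand-in for a generated native chunk. */
void native_chunk_stub(struct cpu_sketch *cpu, struct ic_entry *ic)
{
    (void)ic;
    cpu->native_runs++;
}

/* Stage 6: from now on, executing this entry runs the native chunk. */
void install_native_chunk(struct ic_entry *ic,
    void (*chunk)(struct cpu_sketch *, struct ic_entry *))
{
    assert(ic != NULL && chunk != NULL);
    ic->f = chunk;
}
```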