Annotation of /trunk/src/native/THOUGHTS_AND_IDEAS

Random thoughts about native code generation, which will be compatible 
with the already existing (non-host-specific) dyntrans core.


How to keep track of the number of times a basic block is executed? 
(Perhaps needed, since unnecessary native code generation may slow things 
down. Only the blocks that are really common need to be natively 
translated.)

Perhaps having a small additional array per page is a solution?
        unsigned char count[NR_OF_IC_ENTRIES_PER_PAGE];
For a typical MIPS cpu, that would be 1024 bytes extra per page.
The main loop could be changed to increase count, and if count goes beyond
a certain threshold, the block is natively translated. Hm.
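
A minimal sketch of the per-page counter idea. Only the count[] array
comes from the notes above; the struct name, the threshold value and the
helper function are hypothetical. The counter saturates at the threshold,
so an unsigned char never wraps around and a block is reported as "hot"
only once.

```c
#define NR_OF_IC_ENTRIES_PER_PAGE 1024  /* 4 KB page / 4-byte instructions */
#define NATIVE_THRESHOLD          200   /* hypothetical; must be <= 255 */

struct page_counters {                  /* hypothetical per-page structure */
        unsigned char count[NR_OF_IC_ENTRIES_PER_PAGE];
};

/*
 *  Called by the main loop each time the block starting at ic_index is
 *  entered. Returns 1 exactly once, when the counter reaches the
 *  threshold; returns 0 otherwise.
 */
static int count_block_execution(struct page_counters *p, int ic_index)
{
        unsigned char *c = &p->count[ic_index];

        if (*c >= NATIVE_THRESHOLD)
                return 0;       /* already hot; stay saturated */

        (*c)++;
        return *c == NATIVE_THRESHOLD;
}
```

The cost per block entry is one load, one compare and (until the block
becomes hot) one increment and store, which is the overhead the next
paragraph worries about.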

Or perhaps the overhead of implementing this counter check is more than it 
is worth? After all, most of the time will be spent executing (some of) 
the translated loops.

-------------------------------------

At most one [basic] block is ever translated at any given time.
A small array can hold the INR entries, and a small memory area can
hold a doubly-linked list of native instruction entries.

Simple instructions:

32-bit MIPS:
        andi $5,$5,0xff00
        ori $5,$5,0x0011

Intermediate native representation:
        AND_REG32PTR_REG32PTR_IMM16 (offset to reg 5, offset to reg 5, 0xff00)
        OR_REG32PTR_REG32PTR_IMM16 (offset to reg 5, offset to reg 5, 0x0011)

Non-peephole-optimized x86[_64] code:  (esi = struct cpu *)
        mov eax, [esi + offset_to_source_reg]
        and eax, 0xff00
        mov [esi + offset_to_destination_reg], eax      (#1)
        mov eax, [esi + offset_to_source_reg]           (#2)
        or eax, 0x0011
        mov [esi + offset_to_destination_reg], eax

Peephole-optimized x86[_64] code:
(On the first pass, #2 is removed, since it loads back a value which was
just stored; the value is already in eax.)
(On the second pass, the store at #1 is removed, since a later store
overwrites the same emulated register.)
        mov eax, [esi + offset_to_source_reg]
        and eax, 0xff00
        or eax, 0x0011
        mov [esi + offset_to_destination_reg], eax
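
The two passes could be sketched roughly like this. Everything here is
an assumption made for illustration: the native_op layout, the op type
names, and the use of a plain array instead of the doubly-linked list
mentioned earlier. The dead-store pass is only safe as long as nothing
between the two stores can leave the block or fault.

```c
enum op_type { OP_LOAD, OP_STORE, OP_AND_IMM, OP_OR_IMM };

struct native_op {              /* hypothetical native pseudo-op */
        enum op_type type;
        int reg;                /* host register number (0 = eax) */
        int offset;             /* offset into struct cpu (load/store) */
        unsigned int imm;       /* immediate (and/or) */
};

/*  Remove ops[i] by shifting the tail down; returns the new count.  */
static int remove_op(struct native_op *ops, int n, int i)
{
        for (; i < n - 1; i++)
                ops[i] = ops[i + 1];
        return n - 1;
}

/*  Pass 1: drop a load that directly follows a store of the same
    host register to the same offset.  */
static int pass_remove_reloads(struct native_op *ops, int n)
{
        int i = 1;

        while (i < n) {
                if (ops[i].type == OP_LOAD &&
                    ops[i - 1].type == OP_STORE &&
                    ops[i].reg == ops[i - 1].reg &&
                    ops[i].offset == ops[i - 1].offset)
                        n = remove_op(ops, n, i);
                else
                        i++;
        }
        return n;
}

/*  Pass 2: drop a store if a later store writes the same offset and
    no load from that offset occurs in between.  */
static int pass_remove_dead_stores(struct native_op *ops, int n)
{
        int i = 0, j, dead;

        while (i < n) {
                dead = 0;
                if (ops[i].type == OP_STORE) {
                        for (j = i + 1; j < n; j++) {
                                if (ops[j].offset != ops[i].offset)
                                        continue;
                                if (ops[j].type == OP_LOAD)
                                        break;
                                if (ops[j].type == OP_STORE) {
                                        dead = 1;
                                        break;
                                }
                        }
                }
                if (dead)
                        n = remove_op(ops, n, i);
                else
                        i++;
        }
        return n;
}
```

Running pass 1 and then pass 2 over the six ops of the non-optimized
sequence leaves the four ops of the optimized one: load, and, or, store.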

Native code entry:
        (none on x86_64)

Native code exit:
        ret[q]

---------------------------

Update of nr-of-executed-instructions and the IC pointer:

        All possible return paths need to update the following:

        x) The nr-of-executed-instructions count (one less than the
           number of instructions in the translated block, since an
           implicit count of 1 is already included).
        x) The next_ic pointer, and also the cur_page if we have
           switched page.
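
        A rough sketch of what each return path has to do, written as
        C rather than as generated native code. The struct and field
        names (other than next_ic and cur_page, which are named above)
        and the helper itself are hypothetical.

```c
struct ic_entry {               /* stand-in for one instruction call slot */
        int dummy;
};

struct cpu_state {              /* hypothetical subset of struct cpu */
        long n_executed_instrs;
        struct ic_entry *cur_page;
        struct ic_entry *next_ic;
};

/*
 *  Bookkeeping for one exit path of a translated block which contained
 *  n_instrs_in_block emulated instructions, continuing at entry
 *  target_index of target_page.
 */
static void native_block_exit(struct cpu_state *cpu, int n_instrs_in_block,
        struct ic_entry *target_page, int target_index)
{
        /*  The caller already counts 1 for the call itself.  */
        cpu->n_executed_instrs += n_instrs_in_block - 1;

        /*  Only touch cur_page if the block left the current page.  */
        if (target_page != cpu->cur_page)
                cpu->cur_page = target_page;

        cpu->next_ic = &target_page[target_index];
}
```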

-----------------------------

Stages during translation:

        Stage 1:
                Emulated ISA (e.g. MIPS) to INR instructions.
                Each emulated instruction may be turned into 0 or
                more INR instructions.
                This is done in e.g. src/cpus/cpu_mips_instr.c
                using semi-magic macros.
                The INR array is a fixed size small array, pointed
                to by the cpu struct.

        Stage 2:
                INR -> native operations (e.g. x86).
                This is done in src/native/native_x86.c.
                Things to think about are round-robin use of
                temporary registers.
                native_inr_to_native_ops() takes a cpu as input and
                translates the current INR entries into native
                pseudo-opcodes.

        Stage 3:
                Optimization, native ops -> native ops.
                This is done in src/native/native_x86_optim.c,
                and is an optional step. It should be possible
                to turn this step off, for debugging.
                If e.g. a value is in a register, and it is stored
                to memory, then the same memory position does not
                have to be read back; the value is already in a
                register.

        Stage 4:
                Code generation, native ops -> native machine code.
                Done in src/native/native_x86_gen.c.

        Stage 5:
                Patch _older_ code chunks so that they can branch
                directly to the new chunk, if possible.
                An optional step.

        Stage 6:
                Enter the newly generated native code chunk into
                the physpage's ic->f.
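
For the round-robin use of temporary registers mentioned under Stage 2,
the simplest possible scheme might look like the sketch below. The set
of registers, the struct and the function name are all assumptions; a
real allocator would also need to know which temporaries still hold live
values before handing one out again.

```c
#define N_TEMP_REGS 3

/*  Hypothetical set of x86 scratch registers used as temporaries.  */
static const char *temp_reg_names[N_TEMP_REGS] = { "eax", "ecx", "edx" };

struct temp_reg_state {
        int next;               /* index of the next register to hand out */
};

/*  Returns the index of the next temporary register, cycling through
    all of them before reusing the first one.  */
static int alloc_temp_reg(struct temp_reg_state *s)
{
        int r = s->next;

        s->next = (s->next + 1) % N_TEMP_REGS;
        return r;
}
```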