SquirrelJME: 2016/04/20

08:57

So for the single pass compilation, I will also need an existing run-time state so that final variables may potentially be optimized globally and so that I actually get global optimizations. I can have the interpreter be similar. When compiling the memory space will be similar to an actual running system which essentially sees everything as a bunch of integers. The SSA operations would essentially be the required operations so that it is easily translated to the target system.

09:16

I suppose for native systems, the SSA will need personalities. For example if I target a 16-bit CPU it will not have 32-bit add operations so they will need to be split up if specific values need to be calculated. That is, I will need to keep a potential value store. If an if is done and a value is say below 100 but greater than zero and a value of 7 is added, I can just keep it as a 8-bit add instead of requiring a full 32-bits. This would be better for 8-bit and 16-bit systems which I do plan to target. Otherwise even simple math would take lots of memory and far too many cycles to complete. The personality will provide details such as the native integer size, if it can do Java compatible floating point, the memory model (flat, segmented, split data/code, and banked).

09:40

With all of the stack operations and such, and copies I can have a basic high speed to SSA first which is rather ugly. Then once all of it is loaded in then optimization can be performed to generate a new program which is faster and potentially smaller. For the program state, instead of having what I have before with a set of variables with values I can have instead a global register file which operations perform work on. So instead of operations having state such as "variable 2 is an int with value 42." the operations would at on unique variables given by their IDs (so for example add $17 = $14 + $12). Then in the register file there would be slots for 17, 14, and 12. Another possible kind of SSA thing would be to only care about side effects. The register file will contain operations and their derived values from other registers. Then if a method returns variable 12 then it goes backwards and performs the required operations to compute the given value. However this would be easily done if there were no method calls for example. A bunch of methods could be called which change the state of a global field which to the method itself it never set in that method, but the one that is called. One major hurdle is getting the SSA representation to be clean and the ability for optimization to update variable data and such. At least with a register file all of the used variables would be in a single location. Each variable could have direct dependencies. If an operation adds 14 and 12 and puts it into 17 then the variable 17 will have a dependency on that operation.

09:50

However, even with copy operations in the SSA code, the native code generation pass which takes input SSA could determine if any actual work was performed.

10:14

However, a very limited system such as ancient 16-bit systems with about only 64K of RAM will be unlikely to load the entire program into SSA form, optimize it, then generate native code very fast. So, I propose instead a straight generation pass from Java byte code to machine code with stack caching in a linear sense. However, in specific cases I cannot cache some items because they could for example be the start of a for loop. However, if I do not scan the method before hand for jumps going back I can do a rewind. Since everything is done in a single pass so to speak, if there is a goto into the back and it goes for example to the start of a loop then I can just drop everything I performed and flag a specific state. This would only be handled for a single operation however. Once its jump back point has been calculated it is dropped and not done again. However any jump backs will require recalculation. The next thing to consider is jump forwards (such as tableswitch and such). For future operations I can just set a marker to say that there is a given state. I will need to store the variable types so that verification is possible and the stack map can be used. I will do the same thing as before by using a linear passing of the stack map. However, jump backs which revert state will also force the stack map to go back also so that it has to be handled again. So if a jump back occurs, then all known state following the target instruction is dropped as if it never occured except for that flagging bit. Then for basic optimization of values instead of having value ranges I can have a sign extended mask. That is the highest bit of the mask represents the sign bit so a value of 127 will be 0xFF along with -127 using the same even though 127 is 0x7F, the extra bit set will represent the sign bit to extend. This would only work for integers however. Then to remove multiple null checks I can have a null flagging bit which says that something is known to be null or not null. For arrays I can have a specifier which gives the array length. For example, an unknown array would have length n. If a request is made to access index 8 in the array, then a check is made against the size. Following that access or a similar check against the length of an array will result in the array having the indices up 8 values be unchecked. The important optimization would be removing array bounds checks and null pointer checks since those can be costly. So an operation which performs lots of work on a single array would eventually get all of its checks removed depending on how the method works.

10:30

The next thing would be exceptions. I will do as I thought of before, have the exceptions be ran first so that the tables are generated at the start. I can also drop some temporaries because when it comes to exceptions anything that is on the stack is destroyed. The only thing on the stack is the exception which is to be handled. So as before, a dedicated register and an index which specifies the exception table entry which should be handled. With that, there would only be a need to have a single lookup for exception handling. One thing however I need to determine when exceptions and gotos exist is block barriers. Say an exception handler is between a jump forward and another kind of block and there is no potential jump in. A jump back should not remove the exception handler code at all. However, it is possible for an exception handler to jump back into the normal program control flow (an exception handler in a for loop for example). So exceptions complicate this. So if an exception performs a goto or similar then it will perform a jump back and the exception potentially rehandled since it could modify the state of the program. Then as before, jump forwards would set a future flag that an instruction later is a jump target. So the most important thing is reading the exception table first to determine the potential future jumps since past jumps would require recalculation.

11:34

Took a walk, have some ideas. Instead of CFMethod returning an InputStream of a byte array containing the code attribute it instead has a CFCodeAttribute which can provide an InputStream for CPProgram usage. It would also handle exceptions. This way when I write the processor I do not have to have another copy of the all of the data such as exceptions, the byte code, and the attributes. The class will just provide a view of the class.

Then something similar could be done with the deflate processor. I can have the sliding window provide the InputStream data and it would support limited mark and such. However, it could only be done once. So instead of an output queue, it would just use the window directly. I also thought about having a system property which can set the maximum size of the sliding window. So for example on very limited systems with barely any memory, a sliding window of 32KiB can be very large and not all of it might be needed at all.

15:56

So I will need an output program which contains the parsing state which is set by an input program. The input program parser will read and handle the byte codes. Actually one property of the byte code is that all operations must have the same variable type state. So jump backs, if their value states differ then they have to be calculated. However types which are removed do nothing at all.

16:05

For simplicity the stack caching idea would have a fast code generator although it would not produce the fastest code. So basically only local variables would count for the most part. The stack operations would really do nothing. The locals stay the same during exception handlers, so they are assigned actual registers. The stack entries would just be virtual temporaries which are assigned no registers. So essentially local variables will be permanent. When a local variable is written to in the byte code the stack is consumed and the code to be generated is executed. Other points such as returning values and loading and storing from fields and arrays, and would trigger events to occur. So essentially there would be synchronization events. Stack entries would be pure virtual. If there is an event where a method is called, then its return value would be placed in a real register (since it has to go somewhere). That variable will be like a local except that it is dropped during exception handlers. So I would flow through the program quite normally, and in the case of these synchronizations, depending on the operations I would not need to redo operations if I jump back unless a synchronization occurs. I do however have to consider that there could be an entry on the stack which has a different value when it is jumped to. So jump targets which are not exception handlers would have to have all their stack entries on a barrier of sorts. So that the stack entries become temporary real registers. One of the good things is that I never have to worry about jsr and ret since no code will ever have those instructions.

16:16

So I may have to go through all the instructions to determine jump targets. If an instruction has stack entries then I will do an implicit generate of the entries and real register allocation on them. There would just be real registers contianing the actual values and the calculations dropped so that they just become values. Then normal virtual stack registers would be used while the temporary real ones will just contain some values. Then when that jump back point is reached, all the input values will be calculated between the slices so that it can jump back. Doing it this way, although not as optimized will be faster and simpler to implement. So I only perform stack caching when an instruction is not a target of a jump. If I jump to an instruction in the future, it would be the same. There would be temporary register sets. Thus, the basic blocks of a method will start with the instruction which is the start of an exception handler, the method, or the target of a jump. The end of basic blocks would be return values, method calls, load/store fields/arrays, jumps to a target, or instructions which have jumps to it. Each basic block has its own set of stack temporaries if required. At the end of a basic block, all values are synchronized into these temporaries. During the transition of one block to another, temporaries may be moved and shuffled as needed so the values fit.

I should note that as before, locals are always up to date regardless of whether they are just set, and never used again.

16:25

For basic blocks though, should I end a block at remote call or should it end on the instruction before it? Before a method is called however, registers need to be saved and the stack and registers setup so that arguments can be passed to the given method. If I end before it, I would have all of the stack items saved accordingly since they are all popped from the stack. That would be good because then a method call or similar would start a basic block and it would have none of the stack temporaries used as method arguments in it. Then the same thing can be done for field get and set, throwing, and returning of values (and some others). So operations such as getfield will be the start of a basic block. The same would be done for all the synchronizing operations.

16:31

This would not produce the best code, however it would be simple to perform and stuff such as method invocation would be simplified because methods being at the start would mean that all stack temporaries are synced. So after setting up the exception jump table, I determine the bounds of basic blocks. If there are two method calls in a row or similar then they would be in their own basic block since there can only be a method call at the start of the block.

17:46

To start the virtual machine faster, there can be an initial program state which has already been calculated in memory. For example, all of the static variables and instances would for the most part already exist. This could be a bonus on systems with a small amount of memory where initialization and loading of classes could be quite filling.

17:58

If I can get the run-time to run in under 8KiB then it could run on an NES. This would be an achievement. Including the 8KiB save chip, the NES comes with 2KiB of space. So there is about 10KiB of RAM available. With the right ROM size and enough bank switching, it would be possible to have a system on a large ROM. However I would want it to be as close as possible to 2KiB so that it does not require "pirate" ROM carts.

18:18

One thing though, are conditionals at the bounds of a basic block?

18:39

For JIT simplicity, the entire program on start will either be compiled completely at the start and potentially cached or just compiled all at once. This would permit for some whole program optimization regarding fields. If say a final field with an int contains the value 42, a getstatic or getfield of that field would just result in 42 with no actual memory operation required.

18:45

The compiler's class library lookup will need a separated detail. I will require this so that I do need to actually have to depend on javame-class-file since for example the JIT when cross compiling will use class files, however on the target system when it is running all of the classes in the main library are built into the executable. Thus, they cannot be in class file (they can but that would waste much space).

19:01

Having a bridge over the class file code will permit as stated before. Although there would be some duplication it would be better than hacking the class file code to support built in classes, including the class files in the ROM, or including the CLDC JARs somewhere.