17:04
Ok, I should have a token parse location which specifies the zone a token is in. That way a token carries more than just a basic high-level context.
17:51
Well this appeared in my code for parts of identifiers:
(__c >= 0x0001 && __c < 0x0009) ||
(__c >= 0x000E && __c < 0x001C) ||
These are ASCII control codes. I never would have suspected these would be valid.
17:55
It even includes NUL. So the valid characters are: NUL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS, SO, SI, DLE, DC1, DC2, DC3, DC4, NAK, SYN, ETB, CAN, EM, SUB, ESC, and DEL. This excludes: HT, LF, VT, FF, CR, FS, GS, RS, and US.
19:28
I am going to tier the tokenizer. First have a basic tokenizer which is really simple, then have one that is a bit more complex and knows more about the program state (handled as a queue), which can create better, context-sensitive tokens.
19:30
So that means the token zone goes away and things get simpler. The Tokenizer will just care about tokens.
19:58
Ok, so the first-stage reader gets simplified and is given position information and such.
20:13
One thing with peeking is that the line and column information will end up being a bit off, so if I want something accurate I will need to record it. So I suppose there will be a bunch of slots, although a bunch of arrays could do. Maybe instead of the Tokenizer doing the queueing, the LogicalReader can, so that it is all in one spot. It has to de-escape Unicode sequences anyway, so it will have bytes in a queue for the most part.
20:52
I do not actually need a queue, I can take multiple routes based on the first read character.
22:06
Ok, so Unicode escape sequences are nicely handled and it works quite well.