| Age | Commit message (Collapse) | Author |
|
There are some new keywords. I opted to modify the java-8 grammar to
use the new names, even though they will never match anything there.
This makes the tokenizer easier to write.
|
|
Only parses Java 8 tokens for now.
|
|
Will be used by the tokenizer for short lists of strings.
|
|
|
|
|
|
Reads UTF-8 and UTF-16 into UTF-8 or UTF-16 strings.
If strict is true, reading fails at the first invalid character.
If strict is false, invalid characters are replaced with U+FFFD.
For the replacement, I changed the behavior of uN::read_replace to only
jump one byte. Otherwise, in the common invalid case where ISO-8859-1 or
WIN-1252 text is read as UTF-8, many characters would be skipped.
If skip_bom is true, any BOM at the start of the stream is ignored.
If skip_bom is false, any BOM is included in the output.
The input format can be forced; if it is not, detect is used, which
tries to guess the format and then falls back to UTF-8.
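
A minimal Python sketch of the behavior described above, for illustration
only. It is not the project's code: the names decode_utf8 and
detect_encoding are hypothetical, only the UTF-8 input path is shown, and
guessing the format from a BOM alone is an assumption (the commit does not
say how detect guesses). The point is the replacement rule: on an invalid
sequence, emit U+FFFD and advance exactly one byte, so the valid bytes that
follow a stray ISO-8859-1 byte are kept.

```python
REPLACEMENT = "\ufffd"
UTF8_BOM = b"\xef\xbb\xbf"


def detect_encoding(data: bytes) -> str:
    """Guess the encoding from a BOM (assumed heuristic), else fall back to UTF-8."""
    if data.startswith(UTF8_BOM):
        return "utf-8"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return "utf-8"  # fallback when nothing could be guessed


def _sequence_length(lead: int) -> int:
    """Length a UTF-8 sequence claims from its lead byte, or 0 if the lead is invalid."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_utf8(data: bytes, strict: bool = False, skip_bom: bool = True) -> str:
    """Decode UTF-8; on invalid input either fail (strict) or replace one byte."""
    if skip_bom and data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    out = []
    i = 0
    while i < len(data):
        n = _sequence_length(data[i])
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("invalid lead byte or truncated sequence")
            # Python's strict decoder rejects bad continuation bytes,
            # overlong forms, and surrogates (UnicodeDecodeError is a ValueError).
            out.append(chunk.decode("utf-8"))
            i += n
        except ValueError:
            if strict:
                raise ValueError(f"invalid UTF-8 at byte offset {i}") from None
            out.append(REPLACEMENT)
            i += 1  # jump only one byte, not the length the lead byte claimed
    return "".join(out)
```

For example, decode_utf8(b"caf\xe9 au lait") yields "caf\ufffd au lait".
A decoder that skipped the three bytes claimed by the 0xE9 lead byte would
also swallow the following " a".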
|
|
|
|
Generate the lookup tables from UnicodeData.txt. To do that,
add gen_ugc, which uses the csv, buffers, line, io and other modules
to do the job.
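
A rough Python sketch of what such a generator might do, assuming the
tables map code points to general categories (a guess from the name
gen_ugc). The function names and the range-table layout are hypothetical
and are not taken from the project. The facts relied on are the
UnicodeData.txt format itself: fields are separated by semicolons, field 0
is the code point in hex, field 2 is the general category, and large
blocks are given as a pair of "<..., First>" and "<..., Last>" lines.

```python
def parse_general_categories(lines):
    """Build sorted (first, last, category) ranges, merging adjacent equal ones."""
    ranges = []
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp  # remember the start, wait for the Last line
            continue
        first = cp
        if name.endswith(", Last>") and range_start is not None:
            first = range_start
        range_start = None
        prev = ranges[-1] if ranges else None
        if prev and prev[2] == category and prev[1] + 1 == first:
            ranges[-1] = (prev[0], cp, category)  # extend the previous range
        else:
            ranges.append((first, cp, category))
    return ranges


def lookup(ranges, cp, default="Cn"):
    """Binary search the range table; unlisted code points are Cn (unassigned)."""
    lo, hi = 0, len(ranges) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last, category = ranges[mid]
        if cp < first:
            hi = mid - 1
        elif cp > last:
            lo = mid + 1
        else:
            return category
    return default


# A few illustrative lines in UnicodeData.txt format (not a full file).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
""".splitlines()

TABLE = parse_general_categories(SAMPLE)
```

Merging adjacent code points with the same category keeps the table small,
and a real generator would then print the ranges as a static array in the
target language.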
|
|
We are not going to use them.
|
|
|
|
Only a basic argument parser to start with.
|