Changes in the validation of UTF-8

All UTF-8 encoding functionality (including the escape
sequence '\u') accepts all values from the original UTF-8
specification (with sequences of up to six bytes).
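
For illustration, the extended encoding range can be sketched as below. This is a minimal sketch, not the code from this commit: the name utf8_encode and its signature are hypothetical, and it only assumes the layout of the original UTF-8 specification (one to six bytes for values up to 0x7FFFFFFF).

```c
#include <stddef.h>

/* Hypothetical helper (not from the Lua sources): encodes code point x
   into buf, which must hold at least 6 bytes. Returns the number of
   bytes written (1 to 6), or 0 if x is above 0x7FFFFFFF. */
static size_t utf8_encode(unsigned long x, unsigned char *buf) {
  unsigned char tmp[6];     /* bytes are produced last-to-first */
  size_t n = 0;             /* bytes produced so far */
  unsigned int mfb = 0x3f;  /* maximum value that fits in the first byte */
  size_t i;
  if (x > 0x7FFFFFFFul) return 0;
  if (x < 0x80) {  /* ASCII: a single byte */
    buf[0] = (unsigned char)x;
    return 1;
  }
  do {
    tmp[n++] = (unsigned char)(0x80 | (x & 0x3f));  /* low 6 bits */
    x >>= 6;
    mfb >>= 1;  /* each extra byte leaves one less bit in the first byte */
  } while (x > mfb);
  /* first byte: high bits mark the length, low bits carry what is left */
  tmp[n++] = (unsigned char)(((~mfb << 1) | x) & 0xFF);
  for (i = 0; i < n; i++)  /* reverse into the output buffer */
    buf[i] = tmp[n - 1 - i];
  return n;
}
```

Under these assumptions, 0x10FFFF takes four bytes, while 0x7FFFFFFF takes the full six (FD BF BF BF BF BF).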

By default, the decoding functions in the UTF-8 library do not
accept invalid Unicode code points, such as surrogates. A new
parameter 'nonstrict' makes them accept all code points up to
(2^31)-1, as in the original UTF-8 specification.
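
The difference between the two decoding modes can be sketched as below. This is an illustrative sketch under stated assumptions, not the library's implementation: the name utf8_decode, its signature, and the min_for_len table are hypothetical, and the 'strict' flag here is the inverse of the 'nonstrict' parameter described above.

```c
#include <stddef.h>

/* Hypothetical helper (not from the Lua sources): decodes one UTF-8
   sequence from s, where len bytes are available. Returns the number of
   bytes consumed and stores the code point in *out, or returns 0 on
   invalid input. With strict != 0, surrogates and values above 0x10FFFF
   are rejected; otherwise anything up to 0x7FFFFFFF is accepted. */
static size_t utf8_decode(const unsigned char *s, size_t len, int strict,
                          unsigned long *out) {
  /* smallest code point that needs n bytes; rejects overlong forms */
  static const unsigned long min_for_len[] =
      {0, 0, 0x80ul, 0x800ul, 0x10000ul, 0x200000ul, 0x4000000ul};
  unsigned long res;
  unsigned int c;
  size_t n, i;
  if (len == 0) return 0;
  c = s[0];
  if (c < 0x80) { n = 1; res = c; }
  else if (c < 0xC0) return 0;  /* stray continuation byte */
  else if (c < 0xE0) { n = 2; res = c & 0x1F; }
  else if (c < 0xF0) { n = 3; res = c & 0x0F; }
  else if (c < 0xF8) { n = 4; res = c & 0x07; }
  else if (c < 0xFC) { n = 5; res = c & 0x03; }
  else if (c < 0xFE) { n = 6; res = c & 0x01; }
  else return 0;  /* 0xFE and 0xFF never start a sequence */
  if (n > len) return 0;  /* truncated sequence */
  for (i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
    res = (res << 6) | (s[i] & 0x3Fu);    /* append its low 6 bits */
  }
  if (res < min_for_len[n]) return 0;  /* overlong encoding */
  if (strict && (res > 0x10FFFFul || (res >= 0xD800ul && res <= 0xDFFFul)))
    return 0;  /* not a valid Unicode code point */
  *out = res;
  return n;
}
```

With this sketch, the bytes ED A0 80 (the surrogate 0xD800) are rejected when strict is set and decoded when it is not. Overlong forms such as C0 80 are rejected in both modes.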

Author: Roberto Ierusalimschy
Date: 2019-03-15 13:14:17 -03:00
Parent: 8fa4f1380b
Commit: 1e0c73d5b6
6 changed files with 164 additions and 72 deletions

llex.c

@@ -335,7 +335,7 @@ static unsigned long readutf8esc (LexState *ls) {
   while ((save_and_next(ls), lisxdigit(ls->current))) {
     i++;
     r = (r << 4) + luaO_hexavalue(ls->current);
-    esccheck(ls, r <= 0x10FFFF, "UTF-8 value too large");
+    esccheck(ls, r <= 0x7FFFFFFFu, "UTF-8 value too large");
   }
   esccheck(ls, ls->current == '}', "missing '}'");
   next(ls);  /* skip '}' */