Changes in the validation of UTF-8

All UTF-8 encoding functionality (including the escape
sequence '\u') accepts all values from the original UTF-8
specification (with sequences of up to six bytes).

By default, the decoding functions in the UTF-8 library do not
accept invalid Unicode code points, such as surrogates. A new
parameter 'nonstrict' makes them accept all code points up to
(2^31)-1, as in the original UTF-8 specification.
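
A minimal sketch of the behavior described above, assuming a Lua build
that includes this change (for example Lua 5.4). The variable names are
illustrative only, and the fourth argument is passed positionally, so
the sketch does not depend on the parameter's name.

```lua
-- Assumption: Lua 5.4 or any build that contains this commit.
local surrogate = "\u{D800}"        -- a surrogate, encoded in three bytes
local big = utf8.char(0x7FFFFFFF)   -- largest value, encoded in six bytes

-- Strict (default) decoding rejects both values.
local strict_len, bad_pos = utf8.len(surrogate)   -- false value plus position
local strict_ok = pcall(utf8.codepoint, big)      -- utf8.codepoint raises an error

-- Non-strict decoding accepts every value up to 0x7FFFFFFF.
local lax_len = utf8.len(surrogate, 1, -1, true)
local lax_cp = utf8.codepoint(big, 1, 1, true)

-- Overlong sequences are still rejected, even in non-strict mode.
local overlong_len = utf8.len("\xC0\x80", 1, -1, true)

print(#surrogate, #big, strict_len, bad_pos, strict_ok)
print(lax_len, lax_cp == 0x7FFFFFFF, overlong_len)
```

Encoding never fails for values in range; only the decoding functions
distinguish between strict and non-strict input.
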
This commit is contained in:
Roberto Ierusalimschy
2019-03-15 13:14:17 -03:00
parent 8fa4f1380b
commit 1e0c73d5b6
6 changed files with 164 additions and 72 deletions

@@ -1004,6 +1004,8 @@ the escape sequence @T{\u{@rep{XXX}}}
(note the mandatory enclosing brackets),
where @rep{XXX} is a sequence of one or more hexadecimal digits
representing the character code point.
This code point can be any value smaller than @M{2@sp{31}}.
(Lua uses the original UTF-8 specification here.)
Literal strings can also be defined using a long format
enclosed by @def{long brackets}.
@@ -6899,6 +6901,7 @@ x = string.gsub("$name-$version.tar.gz", "%$(%w+)", t)
}
@LibEntry{string.len (s)|
Receives a string and returns its length.
The empty string @T{""} has length 0.
Embedded zeros are counted,
@@ -6907,6 +6910,7 @@ so @T{"a\000bc\000"} has length 5.
}
@LibEntry{string.lower (s)|
Receives a string and returns a copy of this string with all
uppercase letters changed to lowercase.
All other characters are left unchanged.
@@ -6915,6 +6919,7 @@ The definition of what an uppercase letter is depends on the current locale.
}
@LibEntry{string.match (s, pattern [, init])|
Looks for the first @emph{match} of
@id{pattern} @see{pm} in the string @id{s}.
If it finds one, then @id{match} returns
@@ -6946,6 +6951,7 @@ The format string cannot have the variable-length options
}
@LibEntry{string.rep (s, n [, sep])|
Returns a string that is the concatenation of @id{n} copies of
the string @id{s} separated by the string @id{sep}.
The default value for @id{sep} is the empty string
@@ -6958,11 +6964,13 @@ with a single call to this function.)
}
@LibEntry{string.reverse (s)|
Returns a string that is the string @id{s} reversed.
}
@LibEntry{string.sub (s, i [, j])|
Returns the substring of @id{s} that
starts at @id{i} and continues until @id{j};
@id{i} and @id{j} can be negative.
@@ -6998,6 +7006,7 @@ this function also returns the index of the first unread byte in @id{s}.
}
@LibEntry{string.upper (s)|
Receives a string and returns a copy of this string with all
lowercase letters changed to uppercase.
All other characters are left unchanged.
@@ -7318,8 +7327,24 @@ or one plus the length of the subject string.
As in the string library,
negative indices count from the end of the string.
Functions that create byte sequences
accept all values up to @T{0x7FFFFFFF},
as defined in the original UTF-8 specification;
that implies byte sequences of up to six bytes.
Functions that interpret byte sequences only accept
valid sequences (well formed and not overlong).
By default, they only accept byte sequences
that result in valid Unicode code points,
rejecting values larger than @T{10FFFF} and surrogates.
A boolean argument @id{nonstrict}, when available,
lifts these checks,
so that all values up to @T{0x7FFFFFFF} are accepted.
(Not well formed and overlong sequences are still rejected.)
@LibEntry{utf8.char (@Cdots)|
Receives zero or more integers,
converts each one to its corresponding UTF-8 byte sequence
and returns a string with the concatenation of all these sequences.
@@ -7327,14 +7352,15 @@ and returns a string with the concatenation of all these sequences.
}
@LibEntry{utf8.charpattern|
-The pattern (a string, not a function) @St{[\0-\x7F\xC2-\xF4][\x80-\xBF]*}
+The pattern (a string, not a function) @St{[\0-\x7F\xC2-\xFD][\x80-\xBF]*}
@see{pm},
which matches exactly one UTF-8 byte sequence,
assuming that the subject is a valid UTF-8 string.
}
-@LibEntry{utf8.codes (s)|
+@LibEntry{utf8.codes (s [, nonstrict])|
Returns values so that the construction
@verbatim{
@@ -7347,7 +7373,8 @@ It raises an error if it meets any invalid byte sequence.
}
-@LibEntry{utf8.codepoint (s [, i [, j]])|
+@LibEntry{utf8.codepoint (s [, i [, j [, nonstrict]]])|
Returns the codepoints (as integers) from all characters in @id{s}
that start between byte position @id{i} and @id{j} (both included).
The default for @id{i} is 1 and for @id{j} is @id{i}.
@@ -7355,7 +7382,8 @@ It raises an error if it meets any invalid byte sequence.
}
-@LibEntry{utf8.len (s [, i [, j]])|
+@LibEntry{utf8.len (s [, i [, j [, nonstrict]]])|
Returns the number of UTF-8 characters in string @id{s}
that start between positions @id{i} and @id{j} (both inclusive).
The default for @id{i} is @num{1} and for @id{j} is @num{-1}.
@@ -7365,6 +7393,7 @@ returns a false value plus the position of the first invalid byte.
}
@LibEntry{utf8.offset (s, n [, i])|
Returns the position (in bytes) where the encoding of the
@id{n}-th character of @id{s}
(counting from position @id{i}) starts.
@@ -8755,6 +8784,12 @@ You can enclose the call in parentheses if you need to
discard these extra results.
}
@item{
By default, the decoding functions in the @Lid{utf8} library
do not accept surrogates as valid code points.
An extra parameter in these functions makes them more permissive.
}
}
}