CSS: Fix inconsistent UTF-8 scrubbing in decode_string_or_url slow path#229
CSS: Fix inconsistent UTF-8 scrubbing in decode_string_or_url slow path#229
decode_string_or_url slow path#229Conversation
Agent-Logs-Url: https://github.com/WordPress/php-toolkit/sessions/c82f4270-56cb-4261-bee7-92e7971aba49 Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
decode_string_or_url slow path
Agent-Logs-Url: https://github.com/WordPress/php-toolkit/sessions/892f6cca-d936-43dc-9f63-27739478ddbd Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
Compared both patches — the code fix is identical in both. Replaced my 3 URL/string-token tests with the 2 cleaner ident-based tests from #231 (same structure as the existing invalid-UTF-8 tests, showing both Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
|
The two additional tests fail on trunk: I think that the test failures are poorly rendered because the invalid UTF-8 is replaced with |
|
@copilot Why doesn't the entire method just rely on the already scrubbed |
... Those failures are expected — they're the tests being run against the unfixed trunk branch, which is the intended regression-test behavior. In our branch the CI "PHPUnit Tests" run confirms all tests pass (commit 70fb724). The confusing visual diff is inherent: both the expected value (
The slow path works with absolute byte offsets into Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
The slow path of
decode_string_or_urlpulled raw bytes from$this->csswithout scrubbing, while the fast path always appliedwp_scrub_utf8(). This meant a CSS value containing an invalid UTF-8 byte alongside a backslash escape could leak invalid bytes throughget_token_value().Two places were affected:
substr($this->css, $at, $normal_len)was appended directly; any invalid UTF-8 bytes in non-special-char runs were not replaced with U+FFFDdecode_escape_at()returns raw bytes in its "anything else" branch (the code comment there even noted replacement should happen in the caller, but never did)Fix: Wrap both appends with
wp_scrub_utf8(), consistent with the fast path:Tests added:
test_invalid_utf8_in_normal_segment_combined_with_escape— ident token.test\xF1\\41name: invalid byte (0xF1) in the normal segment before a hex escape; slow path triggered by the backslash, but the preceding segment was not previously scrubbedtest_invalid_utf8_as_escaped_character— ident token.a\\\xF1b: invalid byte as the escaped character itself;decode_escape_at()returns the raw byte which must be scrubbed at the call site