Fix IO::Encoder#write when operating on long strings#16797
Fix IO::Encoder#write when operating on long strings#16797jgaskins wants to merge 6 commits intocrystal-lang:masterfrom
IO::Encoder#write when operating on long strings#16797Conversation
We accomplish this by ignoring `Errno::E2BIG` errors.
|
This pull request has been mentioned on Crystal Forum. There might be relevant details there: |
We need to test single-byte ASCII characters but also multibyte Unicode characters to ensure we're encoding the characters correctly when that multibyte character lands on a buffer boundary.
ysbaddaden
left a comment
There was a problem hiding this comment.
The error should be handled by Crystal::Iconv#convert directly. That would fix all usages at once (IO::Encoding, String.encode, ...).
src/io/encoding.cr
Outdated
| if err == Crystal::Iconv::ERROR | ||
| @iconv.handle_invalid(pointerof(inbuf_ptr), pointerof(inbytesleft)) | ||
| end |
There was a problem hiding this comment.
Shouldn't this path be handled by Iconv#convert as well?
There was a problem hiding this comment.
This is a great question. Seems like that would address the concern raised here.
There was a problem hiding this comment.
But, to be clear, I'm not confident enough to make that decision. There may be a reason it's handled there that I don't have context for.
There was a problem hiding this comment.
Good point. Every call site does just the same. That smells like copy-paste. Since the #convert method already handles invalids on FreeBSD and DragonflyBSD, let's encapsulate the whole behavior into #convert 👍
3463c2f to
2eb7781
Compare
| # Using both ASCII characters and a 26-byte Unicode characters to | ||
| # ensure we hit as many byte boundaries inside the Unicode characters | ||
| # as we can to get sufficient confidence in this test. | ||
| text = "test string 👩🏾🤝👨🏻" * 10240 |
There was a problem hiding this comment.
question: Do we really need specific single-/multi-byte characters at all to test this properly?
The original example only uses single-byte characters to reproduce the bug.
There was a problem hiding this comment.
It depends on how iconv works and how familiar someone is with it. I don’t know anything at all about it, so I needed a test case that gives me sufficient confidence that the behavior introduced in this PR doesn’t count multi-byte characters that cross the 1024-byte boundary (for example: starts at byte 1022 and ends at byte 1030) as invalid.
I have no idea how iconv handles that scenario (it may very well protect against it, but again, I don’t know) and this test case shows that it handles it as expected. Without the multi-byte character, I couldn’t say for sure. Since I didn’t find any tests that exercised this scenario and it was easy to test, I added it in.
This PR fixes encoding operations on long strings by ignoring
Errno::E2BIGerrors.FWIW, this will need additional testing to ensure that strings with non-ASCII characters still work, but the test included with this PR fails without this patch and passes with it.
Fixes #16796