
Fix IO::Encoder#write when operating on long strings #16797

Open
jgaskins wants to merge 6 commits into crystal-lang:master from jgaskins:fix-io-encoder-with-long-strings


@jgaskins
Contributor

This PR fixes encoding operations on long strings by ignoring Errno::E2BIG errors.

FWIW, this will need additional testing to ensure that strings with non-ASCII characters still work, but the test included with this PR fails without this patch and passes with it.

Fixes #16796
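
A minimal sketch of the kind of round trip this affects, assuming the problem can be observed through any IO with a non-UTF-8 encoding. The linked report used `File#print` after `File#set_encoding`; `IO::Memory`, the string length, and the `UTF-16LE` encoding here are illustrative choices, not taken from the PR's spec. Whether the comparison succeeds may depend on the Crystal version in use.

```crystal
# Write a long ASCII string through an IO that has a non-UTF-8 encoding,
# then decode the resulting bytes and compare them with the input.
text = "a" * 100_000

io = IO::Memory.new
io.set_encoding("UTF-16LE") # illustrative encoding, an assumption
io.print text

# Decode the raw bytes that were written to the in-memory buffer.
decoded = String.new(io.to_slice, "UTF-16LE")

# A lossless encoder should report the same size and the same content.
puts "input bytes:   #{text.bytesize}"
puts "decoded bytes: #{decoded.bytesize}"
puts "round trip ok: #{decoded == text}"
```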

We accomplish this by ignoring `Errno::E2BIG` errors.
@crysbot
Collaborator

crysbot commented Mar 30, 2026

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/data-loss-when-writing-long-lines-with-file-print-after-file-set-encoding/8830/4

@ysbaddaden added the kind:bug and topic:stdlib:text labels Mar 30, 2026
We need to test single-byte ASCII characters but also multibyte Unicode
characters to ensure we're encoding the characters correctly when that
multibyte character lands on a buffer boundary.
Collaborator

@ysbaddaden left a comment


The error should be handled by Crystal::Iconv#convert directly. That would fix all usages at once (IO::Encoding, String.encode, ...).

@ysbaddaden added this to the 1.20.0 milestone Apr 2, 2026
Comment on lines 43 to 45
if err == Crystal::Iconv::ERROR
  @iconv.handle_invalid(pointerof(inbuf_ptr), pointerof(inbytesleft))
end
Contributor


Shouldn't this path be handled by Iconv#convert as well?

Contributor Author


This is a great question. Seems like that would address the concern raised here.

Contributor Author


But, to be clear, I'm not confident enough to make that decision. There may be a reason it's handled there that I don't have context for.

Collaborator


Good point. Every call site does just the same. That smells like copy-paste. Since the #convert method already handles invalids on FreeBSD and DragonflyBSD, let's encapsulate the whole behavior into #convert 👍

@jgaskins force-pushed the fix-io-encoder-with-long-strings branch from 3463c2f to 2eb7781 Compare April 4, 2026 17:09
Comment on lines +800 to +803
# Using both ASCII characters and a 26-byte Unicode character to
# ensure we hit as many byte boundaries inside the Unicode character
# as we can to get sufficient confidence in this test.
text = "test string 👩🏾‍🤝‍👨🏻" * 10240
Member


question: Do we really need specific single-/multi-byte characters at all to test this properly?
The original example only uses single-byte characters to reproduce the bug.

Contributor Author


It depends on how iconv works and how familiar someone is with it. I don’t know anything at all about it, so I needed a test case that gives me sufficient confidence that the behavior introduced in this PR doesn’t count multi-byte characters that cross the 1024-byte boundary (for example: starts at byte 1022 and ends at byte 1030) as invalid.

I have no idea how iconv handles that scenario (it may very well protect against it, but again, I don’t know) and this test case shows that it handles it as expected. Without the multi-byte character, I couldn’t say for sure. Since I didn’t find any tests that exercised this scenario and it was easy to test, I added it in.
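
The boundary concern described above can be probed with a small round-trip check. This is only a sketch under stated assumptions: it goes through `String#encode` rather than `IO::Encoder#write`, the `UTF-16LE` target and the repeat count are illustrative, and the 1024-byte figure is taken from the comment above. Shifting the text by a varying number of leading ASCII bytes moves the multibyte sequence across different offsets, so that at least some shifts place it on a buffer boundary.

```crystal
# The emoji sequence used in the spec under review (26 bytes in UTF-8,
# according to the spec comment).
cluster = "👩🏾‍🤝‍👨🏻"

# Collect every shift for which the round trip does not reproduce the input.
mismatches = [] of Int32

(0...cluster.bytesize).each do |shift|
  # Leading ASCII padding moves the multibyte sequences to new offsets.
  text = ("x" * shift) + ("test string " + cluster) * 1024

  # Encode to UTF-16LE (illustrative choice) and decode back to UTF-8.
  round_trip = String.new(text.encode("UTF-16LE"), "UTF-16LE")

  mismatches << shift unless round_trip == text
end

puts "shifts tested: #{cluster.bytesize}"
puts "mismatching shifts: #{mismatches}"
```

An empty list of mismatching shifts would suggest that no sequence was treated as invalid at any of the tested alignments. It does not prove anything about encodings or buffer sizes other than the ones exercised here.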


Labels

kind:bug, topic:stdlib:text


Development

Successfully merging this pull request may close these issues.

Data loss when writing long lines with File#print after File#set_encoding

5 participants