String operations don't seem unicode-aware

What kind of device are you using?:
iPad Pro (4th gen) (HS says “iPad8,11”)
iOS 14.5.1
HS version 3.46.6
Player version: 1.5.14
1 sentence description of the problem (I was doing _________, and then __________ happened):
I created some text objects containing Unicode characters, and then operations like
“length” and “character in … at” seem to be operating on the underlying bytes representing the string, rather than the logical characters: the length of what should appear to be a single Unicode character can be 2 instead of 1, and “character in … at” can produce garbage, which confuses the text display, if not the viewer.
Steps that the Hopscotch team can take to reproduce my problem every time:
Put strings with Unicode inside Text objects, and try “length” and “character in … at” operations.
I’ve made a project that demonstrates the surprises: 11oo9i2mhq
I expected this to happen:
I think I expect HS to count codepoints, not bytes, but I’m not a Unicode expert.
But instead this happened:
When you play this project 11oo9i2mhq on the web or on the iPad, some of the lengths are too long (single Unicode characters counting as two characters), and indexing into the string is offset (note the two “$20” examples, one with Unicode $, one with ASCII $). The example on the lower right (next to Monkey) has a bunch of Unicode country flags in the string. On the web, the dimensions of the text display are totally wacky; on my iPad, nothing from that string shows up at all (hence the Monkey’s question).

Can someone with greater powers than me post a link to project 11oo9i2mhq ?

11 Likes

Can you add a link, then space it out, then I’ll take the spaces out

  • First -
1 Like

Here you are! :)

2 Likes

ht tps : / /c.get hop scotch . com/p/ 11oo9i2mhq

2 Likes

hey, just read your OP, it’s not a bug (i think), because:

AE is coming with more detailed explanation :eyes:

2 Likes

Many characters, particularly emojis, are stored as two or more code units (and some are even built from several code points), which can be very confusing as most people would interpret each one as a single character, not two or three or even sometimes more.

As far as I know, Hopscotch uses the standard JavaScript length and substring functions, which don’t have any way of avoiding this issue. This could be solved by developing a different string-reading method, however I don’t know if that’s something THT is even looking into.
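
To illustrate in plain JavaScript, here is a minimal sketch. I'm assuming the Hopscotch blocks map more or less directly onto these built-ins, which I haven't confirmed; the variable names are just for illustration.

```javascript
// "length" and "charAt" count UTF-16 code units, not visible characters.
const plainText = "$20";            // three ASCII characters
const emojiText = "\u{1F600}20";    // grinning face emoji followed by "20"

const plainLength = plainText.length;   // 3
const emojiLength = emojiText.length;   // 4, because the emoji takes two code units

// Index 0 only reaches the first half of the emoji (a lone surrogate),
// which is why it tends to display as garbage or as nothing at all.
const firstUnit = emojiText.charAt(0);  // "\uD83D"

// Everything after the emoji is shifted by one position.
const shifted = emojiText.charAt(2);    // "2"
```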


Also, a faster way to post links for now is just to paste the link, select it, then tap the </> button (a few to the right of the bold and italics).

8 Likes

Ok, thanks @NTh3R for highlighting that line in the documentation. And thanks @Awesome_E for the further info.

I love Hopscotch, and I don’t envy the development work of growing it in a backward compatible way, but if the behavior reveals some implementation detail about how Unicode works, by breaking the abstraction of a “string” as a sequence of “characters” (each of which appears atomically as I enter the string), then that really feels like a bug, no matter the warnings in the documentation. Especially if we’re trying to keep things simple for people first learning about programming?

One problem (on the side of “this is something that merits fixing”) is that I don’t see a way to program around this if I needed to: what operations can I do on a string to know if “length” is going to behave as expected, and if “character in … at” is going to return only the characters that I entered?

I lack now the permissions to post images or links (including to projects) but when I web search for:
javascript unicode-aware string length
and for:
javascript unicode-aware string indexing
I see that others have already tackled these same issues, so at least THT does not need to reinvent the wheel here, right? otoh I have done little Unicode-aware programming myself so I might be naive.

5 Likes

Also, in my program (linked above), when run on my iPad, the Monkey is looking for a Text display that just never shows up, and on the web player that Text is displayed with bizarro dimensions, so that also seems more like a bug than documented behavior.

4 Likes

Here it is

2 Likes

already did it :)

2 Likes

Oops, I didn’t see it… sorry

2 Likes

it’s okay lol

2 Likes

I’m happy u forgive me:)

3 Likes

While I haven’t been doing my homework on the difference between encodings, I think that I understand the broader perspective of the bug report.

Yep, the Unicode implementation is all about that. We learned about it in class this year, and it’s this “sequence division” of characters that lets Unicode support so many characters and outperform most other encodings on this point.

I do agree with this. Even though the technical aspect might be difficult for THT to implement, I understand that this can easily become confusing.

I assume with this, you mean “the characters that I entered” such as all characters that have the length of 1?

Hopscotch supports string regex checks. If user input is what is concerning you, maybe you can use a regex to only allow characters that would count as length 1 to be inputted?

2 Likes

I really appreciate how constructive the discussion on this forum is, and thank you for letting me be part of it; I’m trying to respect it even if I write too many words!

Part of the genius of Hopscotch is using JavaScript for program representation and execution (so it plays well with Apple’s rules on compiling code within an iOS app), but the downside is that sometimes the JavaScript shows through. Others have already been surprised by using “<” to compare one number and one string (thanks to JavaScript’s dynamic typing). I think this “string” confusion is another moment of Hopscotch unfortunately exposing JavaScript design choices. I bet @Awesome_E could supply other examples.
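
For anyone who hasn't run into the “<” surprise, this is roughly what it looks like in plain JavaScript. It is only a sketch of the language rule; I'm assuming Hopscotch passes values through to the same operator, and the constant names are mine.

```javascript
// With one number and one string, the string is converted to a number first.
const numberVsString = 9 < "10";    // true: compares 9 with 10

// With two strings, the comparison goes character by character,
// so "9" is compared with "1" and comes out larger.
const stringVsString = "9" < "10";  // false
```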

Since I first wrote my post I’ve learned a little more about how JavaScript does Unicode (but I’m still learning…): strings in JavaScript are encoded with UTF-16, which means the string is a sequence of 16-bit numbers called “code units”. JavaScript’s string.length property and Hopscotch’s “length()” are counting code units. However, the idea of a letter or symbol is connected to a “code point”, a Unicode-specified numeric identifier, which is actually represented in the computer with (encoding-specific) code units. Many symbols (those in the “basic multilingual plane”) require only a single 16-bit code unit of representation, but other symbols, like many emoji, require multiple code units of representation.
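
Here is a small sketch of that difference using only built-in JavaScript, as far as I understand it. The sample strings and constant names are my own choices.

```javascript
// A letter from the basic multilingual plane: one code point, one code unit.
const letter = "A";
const letterUnits = letter.length;               // 1
const letterPoints = Array.from(letter).length;  // 1

// An emoji outside that plane: one code point, two code units.
const smiley = "\u{1F600}";
const smileyUnits = smiley.length;               // 2
const smileyPoints = Array.from(smiley).length;  // 1 (Array.from iterates by code point)

// codePointAt reads the whole code point starting at a code-unit index.
const smileyCodePoint = smiley.codePointAt(0);   // 0x1F600
```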

A big improvement in Hopscotch’s string handling would be to ensure that operations on strings are in terms of code points, rather than code units. For the symbols that are available in the emoji keyboard in HS, I think that would maintain the abstraction of “strings” as sequences of “characters” that can be typed into a string and extracted from a string individually.

Here’s what I mean by “character in … at” returning only the characters that I entered. When I look at a keyboard, I see one symbol per key, whether that’s an “A” or a smiley face emoji. I hit a key once, and one symbol shows up. A string is a sequence of those symbols. The length of the string should be the number of symbols. If I type three symbols, saved in a string in variable “Label”, then “character in (Label) at (0)” should be the first symbol I typed, and “… at (1)” to be the 2nd symbol, and “… (2)” to be the 3rd. I shouldn’t have to care about the difference between simple letters and emojis.
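
The behavior I'm describing could be sketched like this in JavaScript. The helper names `symbolCount` and `symbolAt` are hypothetical, not anything from Hopscotch, and this is only one possible approach under the assumption that counting code points is enough.

```javascript
// Count code points instead of code units.
function symbolCount(text) {
  return Array.from(text).length;
}

// Return the symbol at a code-point index, or "" when out of range.
function symbolAt(text, index) {
  const symbols = Array.from(text);
  return index >= 0 && index < symbols.length ? symbols[index] : "";
}

// Three symbols typed into "Label": smiley, the letter A, rocket.
const label = "\u{1F600}A\u{1F680}";
const labelSymbols = symbolCount(label);  // 3
const labelUnits = label.length;          // 5, what a code-unit count reports
const firstSymbol = symbolAt(label, 0);   // the smiley, whole
const secondSymbol = symbolAt(label, 1);  // "A"
const thirdSymbol = symbolAt(label, 2);   // the rocket, whole
```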

That’s not what happens now (as my example project shows): when you use “character in … at” on multi-code-unit symbols you get garbage output, or sometimes mysteriously no output at all. We here understand what’s going on, because we know something about how JavaScript’s length and substring functions are in terms of code units, but is the idea of Hopscotch to introduce people to the complexities of Unicode and design choices of JavaScript? No, right?

My original title “String operations don’t seem unicode-aware” was actually ignorant: it should have been more like “HS string operations expose limits of JS string functions”. JavaScript’s choice of UTF-16 for string representation is very Unicode-aware, but JS’s choice to make basic string functions be about the numeric representation of strings (the code units), versus the sequence of symbols represented (the code points), is something that HS also exposes, which really surprised me (the “principle of least surprise” is one strategy for software design).

Since HS makes it so easy to enter emojis into strings, and since there are now good JavaScript libraries for iterating through the code points of a string (instead of the code units, like now), I’m hoping that that can be part of the evolution of how Hopscotch handles strings. What do others think?

There may be regex for detecting multi-code-unit characters, but that is its own steep learning curve.
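
For what it's worth, in plain JavaScript such a check might look like the sketch below. I don't know whether Hopscotch's regex blocks accept the `u` flag or this escape syntax, so treat that as an assumption; `hasMultiUnitSymbol` is a made-up name.

```javascript
// With the "u" flag the character class matches whole code points,
// so this range covers every code point outside the basic multilingual plane.
const beyondBasicPlane = /[\u{10000}-\u{10FFFF}]/u;

// True when the text contains at least one symbol needing two code units.
function hasMultiUnitSymbol(text) {
  return beyondBasicPlane.test(text);
}

const plainCheck = hasMultiUnitSymbol("hello");            // false
const emojiCheck = hasMultiUnitSymbol("hi \u{1F600}");     // true
```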

PS

There is another layer of Unicode complexity, which I’m not expecting to be part of Hopscotch’s string handling anytime soon. Sometimes the thing you’d look at and call a single symbol is more complicated, like a letter with a special accent. The letter and the accent might each be one code point, so now the symbol (a “grapheme cluster”) is composed of multiple code points, which in turn need multiple code units. So, even more sophisticated string parsing could be in terms of grapheme clusters, rather than code points. But since this is stepping outside what you can easily access from the HS keyboard it seems more complicated than necessary.
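
A small sketch of that extra layer, assuming a JavaScript runtime that provides `Intl.Segmenter` (newer browsers and Node versions do; the helper returns null otherwise). The helper name `graphemeCount` is hypothetical.

```javascript
// "e" followed by a combining acute accent: looks like one symbol.
const accented = "e\u0301";
const accentedUnits = accented.length;               // 2 code units
const accentedPoints = Array.from(accented).length;  // 2 code points

// A country flag is two "regional indicator" code points, four code units.
const flag = "\u{1F1EB}\u{1F1F7}";
const flagUnits = flag.length;                       // 4
const flagPoints = Array.from(flag).length;          // 2

// Count grapheme clusters when the runtime supports it, otherwise return null.
function graphemeCount(text) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null;
  }
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}
```

Where the segmenter is available, I'd expect both `graphemeCount(accented)` and `graphemeCount(flag)` to come out as one symbol each.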

sorry for the essay. I’m old.

6 Likes

Don’t worry, it’s so amazing to see you contribute with detailed replies and thoughts related to how Hopscotch works, and also how some functions can be less confusing for beginners or people who don’t know how JavaScript works. Actually, I know the basics of JavaScript, but I had no idea about these implementations.

I feel like this certainly would be a good improvement for everyone wanting to implement string lengths in some way in their projects. Also, thanks for clarifying and correcting my assumption - I see what you mean.

Yeah, like I said in the previous sentences, I think that it would be a great improvement! Thanks a lot for taking the time to suggest this and great find also!

Totally correct.

…maybe you’re old, but also passionate about Hopscotch and using your experience to suggest improvements to the app and the language! And we’re really thankful for that :slight_smile:

5 Likes

It might be worth using a regular expression to count the number of characters, but that would be very cumbersome and would probably lead to a lot more, and less predictable, issues, because regular expressions for matching emoji are not perfect.


Another weakness would be the floating point errors, but that’s not really specific to just JavaScript.

5 Likes

Thanks for your supportive words. Besides hoping others read and like this, is there a process for getting the attention and buy-in of THT?

3 Likes

Floating point (FP) representation can be surprising, but it is really an incredible engineering feat: approximating the entire real number line with a finite number of bits. As you may know, sometimes the problem with FP operations is not the fact of FP, but the sequence of operations that reduces the precision, and there can be more or less accurate ways of getting an answer even though the two ways seem mathematically identical. If you have time, are there examples of FP problems in HS that you can share?
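
This isn't an HS example, just a plain JavaScript sketch of what I mean by the order of operations mattering. The numbers are picked to make the effect obvious.

```javascript
// Near 1e16, neighbouring doubles are 2 apart, so adding 1 gets rounded away.
const big = 1e16;

// Mathematically both expressions equal 1.
const lossyOrder = (big + 1) - big;   // 0: the 1 is lost before the subtraction
const carefulOrder = (big - big) + 1; // 1: the big values cancel first

// The classic small-number surprise.
const sumIsExact = (0.1 + 0.2) === 0.3; // false
```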

FP is sometimes called a “leaky abstraction” of the real numbers, and in this sense maybe FP is like the Unicode issue: an implementation detail leaking through and spoiling an otherwise clean abstraction. But I think improving code point handling in strings is a lot easier than changing FP.

But one thing I wish HS had: arctan2(y,x), instead of just arctan(y/x). I have never actually wanted arctan alone; I’m always using arctan2 as a way of recovering an angle from an (x,y) pair. Being forced to use arctan(y/x) can introduce a lot of FP rounding error.
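
To show the extra work arctan alone needs, here is a rough JavaScript sketch of rebuilding arctan2 from arctan. It works in radians like JavaScript's `Math` functions (HS blocks may well use degrees), and `angleFromAtan` is a hypothetical helper, not how I'd claim anyone implements it.

```javascript
// Recover the angle of the point (x, y) using only atan plus quadrant checks.
function angleFromAtan(y, x) {
  if (x > 0) return Math.atan(y / x);           // right half-plane: atan is enough
  if (x < 0) {
    // left half-plane: shift by half a turn, keeping the sign of y
    return Math.atan(y / x) + (y >= 0 ? Math.PI : -Math.PI);
  }
  if (y > 0) return Math.PI / 2;                // straight up
  if (y < 0) return -Math.PI / 2;               // straight down
  return 0;                                     // origin: angle undefined, pick 0
}

// The built-in handles all of this in one call.
const builtIn = Math.atan2(1, -1);
const rebuilt = angleFromAtan(1, -1);
```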

PS

If you search HS for “TapTapTapDraw” you’ll see an example of me using arctan but really wanting arctan2

2 Likes

Yep, it’s definitely quite interesting, and I know it’s much harder than it seems on the surface, but it still causes a problem.

3 Likes