
Charlotte lotteheartplural/Cinny cinny_heart_plural thetadelta ursaminor treblesand

random idea for a bitwise encoding of unicode

00<7 bits>: one shot ASCII
01<4 bits>: 4 bit delta
10<4 bits>: 5 bit delta with leading 1 bit
1100<5 bits>: 6 bit delta with leading 1 bit
1101<10 bits>: 10 bit delta
11100<12 bits>: 12 bit delta
11110<14 bits>: 14 bit delta
11111<21 bits>: jump to specific 21 bit unicode codepoint


@charlotte have you seen unishox?

@charlotte That's basically BOCU-1 minus normalization, I'm pretty sure.
@charlotte Encoding each codepoint as an offset from the last?

@chjara bocu-1 doesn’t do that,

This is a raccoon: 🦝

is encoded to
[a4, b8, b9, c3, 20, b9, c3, 20, b1, 20, c2, b1, b3, b3, bf, bf, be, 8a, 20, fd, 04, 2e]

it is more block based though: it stores the difference to the middle of the 128 codepoint block that the previous character was in. it's also fundamentally byte oriented, since bitwise manipulation is something computers are quite bad at
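
A minimal sketch of the single-byte case, to show where the first bytes of the example come from. It assumes the BOCU-1 constants as I recall them from the spec: the state starts at 0x40, a difference in -64..63 is stored as `0x90 + diff`, space is stored as itself without touching the state, and after each character the state moves to the middle of that character's 128-codepoint block. The function name `bocu1_single_byte` is made up for illustration, and multi-byte differences (like the raccoon itself) are left out.

```python
def bocu1_single_byte(text):
    """Encode printable ASCII the way BOCU-1's single-byte case does (sketch)."""
    prev = 0x40  # assumed initial state
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x20:
            out.append(0x20)  # space is stored as itself, state unchanged
            continue
        if cp < 0x20 or cp > 0x7F:
            raise ValueError("sketch only covers printable ASCII")
        diff = cp - prev
        if not -64 <= diff <= 63:
            raise ValueError("difference needs more than one byte")
        out.append(0x90 + diff)
        prev = (cp & ~0x7F) + 0x40  # middle of the 128-codepoint block
    return out


print([format(b, "02x") for b in bocu1_single_byte("This is a raccoon: ")])
```

For the ASCII part of the sentence this gives the same bytes as the list above, up to the `20` right before `fd`.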


minor addition:

11101<16 bits>: 16 bit jump within the same plane
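
A quick check of why this addition fits: the eight prefixes in the first list leave exactly one 5-bit pattern unused, and `11101` fills it. The sketch below only looks at the prefix bit patterns, not at what the payloads mean; `kraft_sum` and `is_prefix_free` are illustrative helper names.

```python
from fractions import Fraction

def kraft_sum(prefixes):
    """Sum of 2^-len over all prefixes; 1 means no bit pattern is wasted."""
    return sum(Fraction(1, 2 ** len(p)) for p in prefixes)

def is_prefix_free(prefixes):
    """True if no prefix is the start of another one."""
    return not any(a != b and b.startswith(a) for a in prefixes for b in prefixes)

ORIGINAL = ["00", "01", "10", "1100", "1101", "11100", "11110", "11111"]
EXTENDED = ORIGINAL + ["11101"]

print(kraft_sum(ORIGINAL))   # 31/32
print(kraft_sum(EXTENDED))   # 1
print(is_prefix_free(EXTENDED))
```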


add an extra bit to the delta things for direction and these schemes are unironically extremely good for CJK. getting very close to 2 bytes per raccodepoint in a paragraph from a japanese newspaper article.

……which is also where shift-jis and utf-16 are at lmao


Charlotte lotteheartplural/Cinny cinny_heart_plural thetadelta ursaminor treblesand


okay update:

new encoding:

00<6 bit>: one-shot U+00xx
01<6 bit>: jump to U+00xx | U+0040
10<5 bit>: 5 bit delta
110<7 bit>: 7 bit delta
11100<9 bit>: 9 bit delta
11101<12 bit>: 12 bit delta
111100<13 bit>: 13 bit delta
111110<15 bit>: 15 bit delta
111101<16 bit>: jump to codepoint in same plane
111111<21 bit>: jump to specific codepoint

this now gets it to less than 8 bits per character in english text
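
A rough way to see where the "less than 8 bits" comes from is to count bits per character. This is only a cost estimate, not an encoder, and it rests on assumptions the post does not spell out: deltas are signed and relative to the current state, the one-shot code leaves the state alone, every other code moves the state to the new codepoint, and the state starts at 0. `cheapest_code` and `total_bits` are made-up names, and the choice of code is greedy.

```python
# (prefix length, payload bits) for the delta codes, shortest first
DELTA_CODES = [(2, 5), (3, 7), (5, 9), (5, 12), (6, 13), (6, 15)]

def cheapest_code(state, cp):
    """Return (bits, new_state) for the shortest code that can reach cp."""
    options = []
    if cp < 0x40:
        options.append((2 + 6, state))  # 00: one-shot, state unchanged (assumed)
    elif cp < 0x80:
        options.append((2 + 6, cp))     # 01: jump into U+0040..U+007F
    delta = cp - state
    for prefix_len, payload_bits in DELTA_CODES:
        half = 1 << (payload_bits - 1)  # signed range is an assumption
        if -half <= delta < half:
            options.append((prefix_len + payload_bits, cp))
            break
    if cp >> 16 == state >> 16:
        options.append((6 + 16, cp))    # 111101: jump within the same plane
    options.append((6 + 21, cp))        # 111111: jump to any codepoint
    return min(options, key=lambda o: o[0])

def total_bits(text, state=0):
    """Total bits for text when each character greedily takes the shortest code."""
    bits = 0
    for ch in text:
        cost, state = cheapest_code(state, ord(ch))
        bits += cost
    return bits

text = "hello world"
print(total_bits(text), "bits for", len(text), "characters")
```

Under these assumptions every ASCII character costs at most 8 bits (the `00` and `01` codes), and a letter within 16 codepoints of the previous one drops to 7 bits via the 5 bit delta, which is what pulls the average below 8.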
