Raccoon Noises

Charlotte /Cinny

random idea for a bitwise encoding of unicode

00<7 bits>: one shot ASCII
01<4 bits>: 4 bit delta
10<4 bits>: 5 bit delta with leading 1 bit
1100<5 bits>: 6 bit delta with leading 1 bit
1101<10 bits>: 10 bit delta
11100<12 bits>: 12 bit delta
11110<14 bits>: 14 bit delta
11111<21 bits>: jump to specific 21 bit unicode codepoint
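
The "leading 1 bit" entries can be read as storing a value whose top bit is implied rather than transmitted. A minimal sketch of that reading follows. The function names are hypothetical, and the thread does not confirm that this is exactly how the implicit bit is meant to work.

```python
def encode_implicit_one(value: int, stored_bits: int) -> str:
    """Drop the leading 1 of a value that has exactly stored_bits + 1 bits."""
    if value.bit_length() != stored_bits + 1:
        raise ValueError("value must have exactly stored_bits + 1 significant bits")
    # Clear the top bit and emit the remaining bits as a fixed-width string.
    return format(value - (1 << stored_bits), f"0{stored_bits}b")


def decode_implicit_one(bits: str) -> int:
    """Restore the implied leading 1 in front of the stored bits."""
    return (1 << len(bits)) | int(bits, 2)


# Under this reading, 10<4 bits> would cover the values 16..31.
print(encode_implicit_one(22, 4))
print(decode_implicit_one("0110"))
```

Under this reading the 4-bit form covers 0..15 and the 5-bit form covers 16..31, so no bit pattern is spent on a value that a shorter form already handles.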

Ryan Castellucci (they/them)

ryanc@infosec.exchange

5 months ago

Reply to @charlotte

@charlotte have you seen unishox?

chjara ☭

chjara@akko.tuxcrafting.xyz

5 months ago

Reply to @charlotte

@charlotte That's basically BOCU-1 minus normalization, I'm pretty sure.

Charlotte /Cinny

charlotte

5 months ago

Reply to @chjara@akko.tuxcrafting.xyz

@chjara no

chjara ☭

chjara@akko.tuxcrafting.xyz

5 months ago

Reply to @charlotte

@charlotte Encoding each codepoint as an offset from the last?

Charlotte /Cinny

charlotte

5 months ago

Reply to @chjara@akko.tuxcrafting.xyz

Edited 5 months ago

@chjara bocu-1 doesn’t do that.

This is a raccoon: 🦝

is encoded to
[a4, b8, b9, c3, 20, b9, c3, 20, b1, 20, c2, b1, b3, b3, bf, bf, be, 8a, 20, fd, 04, 2e]

it is more block based where it stores the difference to the middle of a 256 codepoint block however. it’s also fundamentally byte oriented since bitwise manipulation is something computers are quite bad at
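
The ASCII part of the byte list above can be reproduced with a very small sketch. It assumes constants as best recalled from the BOCU-1 description: a starting base of 0x40, single-byte output of 0x90 plus the difference, and a base that stays at 0x40 while the text remains in U+0000..U+007F. The function name is hypothetical, and multi-byte differences (such as the three bytes fd 04 2e for the raccoon emoji) are deliberately left out.

```python
def bocu1_ascii_sketch(text: str) -> bytes:
    """Encode ASCII-only text the way the single-byte BOCU-1 path appears to work."""
    base = 0x40  # assumed base for text inside U+0000..U+007F
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0x7F:
            raise ValueError("this sketch only handles ASCII")
        if cp <= 0x20:
            # Space and control characters are emitted as themselves.
            out.append(cp)
            continue
        # Difference to the base, shifted into the single-byte range.
        out.append(0x90 + (cp - base))
    return bytes(out)


print(bocu1_ascii_sketch("This is a raccoon: ").hex(" "))
```

The output matches the first 19 bytes of the list in the post, which suggests that for ASCII the difference is taken against 0x40 rather than against the previous character.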

Charlotte /Cinny

charlotte

5 months ago

Reply to @ryanc@infosec.exchange

@ryanc i haven’t

Charlotte /Cinny

charlotte

5 months ago

Reply to @charlotte

minor addition:

11101<16 bits>: 16 bit jump within the same plane
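
With 11101 added, the prefixes of the first scheme appear to use up the whole code space. A quick check, assuming the prefixes are exactly the bit strings listed in the posts (the helper names are hypothetical):

```python
from fractions import Fraction

# Prefixes of the first scheme, including the 11101 addition.
FIRST_SCHEME = ["00", "01", "10", "1100", "1101",
                "11100", "11101", "11110", "11111"]


def is_prefix_free(prefixes):
    """True if no prefix is the start of a different prefix."""
    return not any(a != b and b.startswith(a) for a in prefixes for b in prefixes)


def kraft_sum(prefixes):
    """Sum of 2^-length over all prefixes; 1 means the code space is full."""
    return sum(Fraction(1, 2 ** len(p)) for p in prefixes)


print(is_prefix_free(FIRST_SCHEME))
print(kraft_sum(FIRST_SCHEME))
print(kraft_sum([p for p in FIRST_SCHEME if p != "11101"]))
```

Without 11101 the sum is 31/32, so that prefix was the only unused slot in the original list.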

Charlotte /Cinny

charlotte

5 months ago

Reply to @charlotte

add an extra bit to the delta things for direction and these schemes are unironically extremely good for CJK. getting very close to 2 bytes per raccodepoint in a paragraph from a japanese newspaper article.

……which is also where shift-jis and utf-16 are at lmao
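
A minimal sketch of the "extra bit for direction" idea, assuming a plain sign-and-magnitude layout (one direction bit followed by the magnitude). The layout and the function names are assumptions for illustration. The thread does not say how the direction bit is actually placed.

```python
def encode_delta(delta: int, magnitude_bits: int) -> str:
    """Encode a signed delta as one direction bit plus a fixed-width magnitude."""
    magnitude = abs(delta)
    if magnitude >= 1 << magnitude_bits:
        raise ValueError("delta does not fit in the given number of bits")
    direction = "1" if delta < 0 else "0"  # 1 = backwards, 0 = forwards
    return direction + format(magnitude, f"0{magnitude_bits}b")


def decode_delta(bits: str) -> int:
    """Inverse of encode_delta."""
    magnitude = int(bits[1:], 2)
    return -magnitude if bits[0] == "1" else magnitude


print(encode_delta(-3, 4))
print(decode_delta("10011"))
```

This layout has two encodings of zero, so a real design might prefer two's complement or a zigzag mapping to reclaim that pattern.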

Charlotte /Cinny

charlotte

5 months ago

Reply to @charlotte

Edited 5 months ago

okay update:

new encoding:

00<6 bit>: one-shot U+00xx
01<6 bit>: jump to U+00xx | U+0040
10<5 bit>: 5 bit delta
110<7 bit>: 7 bit delta
11100<9 bit>: 9 bit delta
11101<12 bit>: 12 bit delta
111100<13 bit>: 13 bit delta
111110<15 bit>: 15 bit delta
111101<16 bit>: jump to codepoint in same plane
111111<21 bit>: jump to specific codepoint

this now gets it to less than 8 bits per character in english text
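
One way to sanity-check the size claim is a rough bit-cost estimator for the updated table. This is a sketch under several assumptions that the thread does not confirm: deltas are signed two's-complement values relative to the previous code point, the one-shot form leaves that state unchanged, the state starts at 0, and the encoder always picks the cheapest form. All names are hypothetical.

```python
# (prefix bits, payload bits) for each delta form in the updated table.
DELTA_FORMS = [(2, 5), (3, 7), (5, 9), (5, 12), (6, 13), (6, 15)]


def fits_signed(value, bits):
    """True if value fits in a two's-complement field of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def code_cost(cp, prev):
    """Return (bits, new_prev) for the cheapest form that can encode cp."""
    if cp < 0x40:
        return 8, prev  # 00<6 bit>: one-shot, state assumed unchanged
    options = [6 + 21]  # 111111<21 bit>: always possible
    if cp < 0x80:
        options.append(8)  # 01<6 bit>: jump into U+0040..U+007F
    for prefix, payload in DELTA_FORMS:
        if fits_signed(cp - prev, payload):
            options.append(prefix + payload)
    if cp >> 16 == prev >> 16:
        options.append(6 + 16)  # 111101<16 bit>: jump within the same plane
    return min(options), cp


def estimate_bits(text, start=0):
    """Total estimated size of text in bits under the assumptions above."""
    prev, total = start, 0
    for ch in text:
        bits, prev = code_cost(ord(ch), prev)
        total += bits
    return total


sample = "hello world"
print(estimate_bits(sample), estimate_bits(sample) / len(sample))
```

For "hello world" this model gives 79 bits, roughly 7.2 bits per character. That is consistent with the claim above, although real text with capitals and punctuation will land closer to 8.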

@dingens

dingens@troet.cafe

5 months ago

Reply to @charlotte

@charlotte nice thing to think about! :)

for this to be efficient you'd need the deltas to be signed ints however (not sure you got this cause it's not mentioned)

Another thought: in most non-English Latin texts you'll have jumps from the ASCII range to Latin extended (where äéîøù live), might make sense to optimize for that as well.

Also, the 15bit delta and the 16bit jump seem a little redundant, as does the 13bit delta (you'll always have the first bit 1 there, else you'd use the 12bits)

Charlotte /Cinny

charlotte

5 months ago

Reply to @dingens@troet.cafe

@dingens at some point i was just tweaking the numbers until size reductions came out

i have implemented it already and tested it on a mix of single-language and multilingual text, the only text where it lost over utf-8 was in a synthetic pattern of U+10FFFF U+A0 repeated

the 9 bit one can jump from ascii into latin extended a and then back, the 12 bit one jumps into diacritics. 12/13 bits is meant for CJK mostly and yeah i could omit a bit in the 13 bit one. i could probably optimize it further with a proper dataset instead of back-of-the-napkin calculations based on looking at the unicode block sizes for a few languages. the encoding gets better the further away you get from the start of encoding space however
