Conversation

Björkus "No time_t to Die" Dorkus

How.... DO I convert from the internal wchar_t literal encoding to a file name on *nix platforms...?

I could try a conversion to UTF-8, and if that fails I.... guess do something else?? I wonder if there's wopen primitives on *nixen....

None that are widely available. I guess I convert to UTF-8, and then convert from wide-to-narrow, and if both of those conversions fail simply junk the whole file into the trash.

@thephd i don’t think you can do that without making environmental assumptions that probably won’t hold outside of macos and some linux file systems

@charlotte @thephd
I've been under the impression that wchar_t was considered a well-intentioned mistake, and that one should either use a Unicode type internally, or a custom type internally.

Did I get that wrong??

@dougmerritt @charlotte This is unfortunately about taking the data that's inside of a L"foo" literal at compile-time and shoveling it into a compile-time, compiler-internal function. So I have to treat it as I would for whatever the wchar_t encoding is supposed to be.

@thephd What are you using wchar_t for? That's pretty unorthodox on *nix, we typically just use char and assume that the bytes in a char array are "mostly UTF-8". In general, system interfaces just accept and return byte arrays which are typically roughly UTF-8-shaped

If I *did* have a wchar_t* for some reason, I would probably assume that it's UTF-32 and write a UTF-32 -> UTF-8 transformer...
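
A minimal sketch of such a UTF-32 -> UTF-8 transformer, assuming the wide data really is UTF-32 (which, as the replies below point out, a compiler cannot take for granted). The names `utf8_encode_one` and `utf32_to_utf8` are hypothetical, and code point 0 is rejected here on the assumption that the result is meant to be a file name:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[0..3].
   Returns the number of bytes written, or 0 if cp is not a Unicode
   scalar value (a surrogate, or above U+10FFFF). */
size_t utf8_encode_one(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0; /* surrogates are not valid in UTF-32 */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: convert n UTF-32 code units into a NUL-terminated
   UTF-8 string in dst (capacity cap bytes). Returns the byte count
   excluding the terminator, or (size_t)-1 if the input is invalid,
   contains code point 0, or does not fit. */
size_t utf32_to_utf8(const uint32_t *src, size_t n, char *dst, size_t cap)
{
    size_t len = 0;
    if (cap == 0)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        unsigned char tmp[4];
        size_t k = src[i] == 0 ? 0 : utf8_encode_one(src[i], tmp);
        if (k == 0 || len + k >= cap) /* keep one byte for the NUL */
            return (size_t)-1;
        for (size_t j = 0; j < k; j++)
            dst[len++] = (char)tmp[j];
    }
    dst[len] = '\0';
    return len;
}
```

Failing on the first bad code unit, rather than substituting a replacement character, seems the safer choice for file names, where a silently altered name would open the wrong file.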

@mort Sorry, I should've specified I'm working on the INSIDE of a compiler and I'm faced with an L"..." string literal whose contents are governed by -fwide-exec-charset=UTF-EBCDIC32 or something equally terrifying.

I have to convert whatever that data is into something usable, inside of the compiler, as a const char* file name. I can't just use the L"" string raw: each code unit is usually wider than char, so the data would contain embedded 0x00 bytes (every other byte for ASCII text in a 2-byte encoding) that would cut the name short if the stream were passed directly to a file-opening primitive.
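
A small sketch of that truncation problem, in C. The helper name `bytes_seen_as_narrow` is hypothetical; it only measures how much of a wide string's storage a `char*`-based API would read before hitting the first 0x00 byte:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: reinterpret the wide string's storage as a narrow
   string and report how many bytes precede the first 0x00. Inspecting
   object bytes through a char pointer is permitted in C. */
size_t bytes_seen_as_narrow(const wchar_t *ws)
{
    return strlen((const char *)ws);
}
```

For `L"foo"` with a wchar_t wider than one byte, this would be expected to report 1 on a little-endian target and 0 on a big-endian one, so a file-opening primitive handed the raw bytes would see at most "f".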

@thephd I'm pretty sure you know more about this than me, but care to explain why `wcstombs` wouldn't work (with the default locale active)?

Björkus "No time_t to Die" Dorkus

@ljrk First part of the explanation here: https://pony.social/@thephd/116790194673690682

The second part of the explanation is that wcstombs is naturally a lossy conversion that fails much of the time, because L"" can be Unicode while, somewhat frequently, plain "" is NOT UTF-8 compatible. This paper discusses it in detail: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3265.htm#intro-problem-roundtrip
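
A minimal sketch of that failure mode using only standard C. The helper name `narrows_in_current_locale` is hypothetical, and the exact behavior depends on the libc and the active locale:

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if ws can be converted to the narrow
   (multibyte) encoding of the currently active locale, 0 if wcstombs
   reports an unrepresentable character by returning (size_t)-1. */
int narrows_in_current_locale(const wchar_t *ws)
{
    char buf[64];
    return wcstombs(buf, ws, sizeof buf) != (size_t)-1;
}
```

A program starts in the "C" locale unless it calls `setlocale`, and in that locale common libc implementations accept `L"foo"` but reject something like `L"\u20AC"`, since the narrow encoding has no representation for it.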

C2y (hopefully C29) has functions I added to make that less of a problem, so one could try mwcsntoc8sn first and, only if that fails, try mwcsntomcsn to get around this issue: https://thephd.dev/cuneicode-and-the-future-of-text-in-c

But for my problem specifically, even my functions aren't helpful because of the encoding of "wide characters", as mentioned in that first linked reply. For other people not writing compilers, these functions would do the job.

@thephd maybe as a practical partial solution only support UTF8 locales because screw others?

@thephd I saw you mentioning compile time. I think wchar_t being UCS4 is universal outside Windows.

@ldvsoft It is unfortunately not universal thanks to -fwide-exec-charset=....

@ldvsoft But my current GCC patch does just use sorry_at (loc, "we don't support wchar_t strings at this time") or something. I'm trying to fix that, though.

@thephd
If only somebody had made a library specifically for converting between text representations easily and efficiently...

@ami alas, I don't think GCC is going to put cuneicode into itself anytime soon

@thephd I don't really understand this problem. Filenames on unix are sequences of bytes. There's no encoding. Are you implementing a UI? It's only in the UI that filenames need to be displayed or entered based on some encoding.
