22:13 < fraggle> are you going to support UTF-32 middle endian?
22:13 < fraggle> what about UTF-9 and UTF-18
22:22 <@GhostlyDeath> screw em
22:29 < fraggle> what about all your PDP-11 users
22:29 < fraggle> they'll be completely without unicode support
22:30 <@GhostlyDeath> I don't think they can run ReMooD
22:31 <@GhostlyDeath> also fraggle
22:31 <@GhostlyDeath> ReMooD is still ASCII
22:31 <@GhostlyDeath> it just uses Unicode on the inside
22:31 <@GhostlyDeath> so it will still run on 98 without unicows
22:31 < fraggle> unicows?
22:32 < fraggle> oh i see
22:32 < fraggle> my approach would just be to use UTF-8
22:32 < fraggle> UTF-8 is great, it's like magic
22:33 <@GhostlyDeath> but then I'd have to handle multibyte
22:33 <@GhostlyDeath> while parsing text files
22:33 <@GhostlyDeath> if (*x == MULTIBYTEINDICATOR)
22:33 < fraggle> you just pretend everything is ASCII and fix things up in a few places
22:33 < fraggle> magic
22:34 <@GhostlyDeath> would be anything reading and writing files, parsing text, drawing text
22:34 < fraggle> that's the point really
22:34 <@GhostlyDeath> would require the same code everywhere really
22:35 < fraggle> you only need to change the text drawing code
22:35 < fraggle> and you're done
22:35 <@GhostlyDeath> what about scripts?
22:35 < fraggle> what about them?
22:35 <@GhostlyDeath> parsing them
22:35 < fraggle> you just parse them like they're ascii
22:35 <@GhostlyDeath> and if they want to internationalize them?
22:36 <@GhostlyDeath> There would be multibyte
22:36 < fraggle> how do you mean
22:36 <@GhostlyDeath> but then I would have to interpret those also in the parser
22:36 <@GhostlyDeath> What if they want to print out japanese characters?
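A minimal sketch in C of the "parse them like they're ascii" idea, assuming a hypothetical `key=value` script line. The helper name `split_assignment` and the line format are illustrative assumptions, not taken from ReMooD or any other port. It relies on one property of UTF-8: every byte of a multibyte sequence has its high bit set, so searching for an ASCII delimiter such as `=` can never match in the middle of a Japanese character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split "key=value" at the first '='.
 * Only the ASCII byte '=' is searched for. Bytes belonging to a
 * multibyte UTF-8 sequence are all >= 0x80, so they are copied
 * through untouched and never mistaken for the delimiter.
 * Returns 1 on success, 0 if there is no '=' or a buffer is too small. */
int split_assignment(const char *line, char *key, size_t key_size,
                     char *value, size_t value_size)
{
    const char *eq = strchr(line, '=');
    size_t key_len, value_len;

    if (eq == NULL)
        return 0;

    key_len = (size_t)(eq - line);
    value_len = strlen(eq + 1);
    if (key_len >= key_size || value_len >= value_size)
        return 0;

    memcpy(key, line, key_len);
    key[key_len] = '\0';
    memcpy(value, eq + 1, value_len + 1); /* includes the terminating NUL */
    return 1;
}
```

Calling it on a line whose value is two Japanese characters encoded in UTF-8 yields the key unchanged and the value as the same six bytes that went in; the parser never needs to know they form two characters.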
22:36 < fraggle> sorry i don't understand
22:36 < fraggle> the point i'm making is that you don't need to interpret UTF-8 any different from ASCII
22:37 <@GhostlyDeath> except for multibyte
22:37 < fraggle> no
22:37 < fraggle> at all
22:37 < fraggle> that's what i mean
22:37 <@GhostlyDeath> So I just leave them in as invalid characters?
22:37 < fraggle> in what situation would you have to parse them differently?
22:38 <@GhostlyDeath> if someone decides to type in some japanese
22:38 < fraggle> right
22:38 <@GhostlyDeath> or parse some files
22:38 < fraggle> well let's find a specific example
22:38 < fraggle> like a script
22:38 <@GhostlyDeath> Going for UTF-16 will simplify everything for me
22:38 < fraggle> maybe a fragglescript?
22:38 <@GhostlyDeath> as I won't have to worry about multibyte UTF-8
22:38 < fraggle> well that's my point really
22:38 < fraggle> i'm saying i think UTF-8 will make everything simpler for you
22:39 < fraggle> because you'll hardly have to change anything
22:39 <@GhostlyDeath> scripts and lumps will be converted to UTF-16
22:39 < fraggle> if you take the example of a script
22:39 <@GhostlyDeath> strlen won't work on multibyte strings
22:39 < fraggle> like a variable definition for example in a script
22:39 <@GhostlyDeath> I would have to create a function to go through each character
22:39 <@GhostlyDeath> and treat multibyte chars as single chars
22:40 < fraggle> yeah
22:40 <@GhostlyDeath> same goes for everything else
22:40 <@GhostlyDeath> wcs* uses wchar_t
22:40 < fraggle> but you have to think about when you actually use strlen and why
22:40 < fraggle> in what situation is that difference a problem?
22:40 <@GhostlyDeath> drawing strings mostly
22:40 < fraggle> the exactly
22:41 < fraggle> the point is that you can shift all the unicode support to the drawing stage
22:41 < fraggle> you only need to change your text drawing functions
22:41 < fraggle> that's the only place where it makes any difference
22:41 <@GhostlyDeath> but that would only be for drawing text
22:41 < fraggle> yes
22:41 <@GhostlyDeath> nothing in the input for Unicode
22:42 < fraggle> if you use UTF-8, you just hide non-ASCII characters within strings
22:42 <@GhostlyDeath> but the Unicode isn't just for drawing strings
22:42 < fraggle> it doesn't matter anywhere except the drawing functiosn
22:42 < fraggle> see?
22:43 < fraggle> so overhauling your entire program to replace everything with UTF-16 is completely unnecessary
22:43 <@GhostlyDeath> shit happens
22:43 < fraggle> ?
22:44 < fraggle> using UTF-16 is also a classic mistake because it doesn't allow you to represent every unicode symbol
22:44 <@GhostlyDeath> I know
22:45 < fraggle> thus defeating the entire point of adding unicode support in the first place
22:45 <@GhostlyDeath> switching to UTF-32 would be alot simpler now
22:45 <@GhostlyDeath> since wchar_t is only 32 bits
22:45 <@GhostlyDeath> I would only have to change the drawing functions
22:45 <@GhostlyDeath> wchar_t is 32-bits on Linux though
22:46 <@GhostlyDeath> MSVC might make it 16-bits
22:46 < fraggle> or just switch to UTF-8 and save yourself a lot of unnecessary trouble :)
22:48 <@GhostlyDeath> By using UTF-16, I don't have to worry about multibyte strings
22:48 <@GhostlyDeath> and I can use the wcs* functions with wchar_t
22:49 < fraggle> what do you mean by "don't have to worry" about them?
22:49 <@GhostlyDeath> if those functions require wchar_t
22:49 <@GhostlyDeath> I would have to translate everything in the game to be wchar_ts
22:49 < fraggle> ?
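To make the "shift all the unicode support to the drawing stage" suggestion concrete, here is a rough sketch in C of what a text drawing routine might call to walk a UTF-8 string one code point at a time. The function names `utf8_next` and `utf8_count` are hypothetical, and the decoder is deliberately simplified: it substitutes U+FFFD for malformed input but does not reject overlong encodings or encoded surrogates, which a production decoder should.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from a NUL-terminated UTF-8 string and advance
 * *s past the bytes consumed. Returns U+FFFD for a malformed or
 * truncated sequence. Simplified: overlong forms are not rejected. */
uint32_t utf8_next(const char **s)
{
    const unsigned char *p = (const unsigned char *)*s;
    uint32_t cp;
    int extra, i;

    if (p[0] < 0x80)                { cp = p[0];        extra = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else { *s += 1; return 0xFFFD; } /* stray continuation or invalid lead byte */

    for (i = 1; i <= extra; i++) {
        /* A NUL terminator also fails this check, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80) {
            *s += i;
            return 0xFFFD;
        }
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }
    *s += extra + 1;
    return cp;
}

/* Number of code points in the string, i.e. how many glyphs a drawing
 * loop would look up; strlen() on the same string returns bytes. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        utf8_next(&s);
        n++;
    }
    return n;
}
```

For a string such as "A", "é", "日" in sequence, `strlen` reports 6 bytes while `utf8_count` reports 3 code points. Only code that positions or looks up glyphs needs the second number; everything else can keep treating the string as bytes.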
22:50 <@GhostlyDeath> The drawing functions only go up to 16-bits currently
22:50 <@GhostlyDeath> at my choice, I can choose to switch to UTF-32 anytime
22:50 <@GhostlyDeath> since UTF-16 would already be in place
22:50 < fraggle> ok..
22:50 <@GhostlyDeath> every function will use wchar_ts instead of char
22:51 < fraggle> so you're saying that you're going to change every piece of code in the entire source base that uses a string?
22:51 < fraggle> to use wchar_t strings?
22:51 <@GhostlyDeath> Scripts that are UTF-32 will just be snipped down for now
22:51 < fraggle> is that correct?
22:51 <@GhostlyDeath> but the parsing code will only depend on wchar_t
22:51 <@GhostlyDeath> regardless if it's 8, 16 or 32 bits
22:51 <@GhostlyDeath> it's just the drawing code mostly
22:51 <@GhostlyDeath> fraggle: most of the strings
22:52 < fraggle> this is going to be rather tedious don't you think?
22:53 <@GhostlyDeath> I have tons of Free time
22:54 < fraggle> point i've been trying to make is that the "problem" with multibyte strings only really occurs when you're trying to render them
22:54 < fraggle> it's only something you need to deal with in the rendering stage
22:55 < fraggle> wouldn't it be easier to just make the rendering stage support UTF-8 instead of reworking the entire source code base
22:55 < fraggle> in addition to being less work it also leads to more readable code!
22:56 <@GhostlyDeath> and what if someone decides to use extended ASCII characters reserved by UTF-8?
22:56 <@GhostlyDeath> to maybe draw some special characters they made themselves
22:56 < fraggle> the only time you need to care about the difference between a character and a byte is when you're rendering characters, or doing something like looking up an individual character in a string (usually because you're rendering it)
22:56 <@GhostlyDeath> and added a corresponding STCFNxxx?
22:57 <@GhostlyDeath> many parts of the code depend on an individual character
22:57 < fraggle> has anyone done that?
22:57 <@GhostlyDeath> I believe so
22:57 < fraggle> example?
22:57 < fraggle> are you talking about vanilla wads or something? i'm confused
22:58 < fraggle> because vanilla doesn't do extended ascii
22:59 <@GhostlyDeath> who says you have to follow vanilla rules?
22:59 < fraggle> well i don't understand the point you're making
22:59 < fraggle> are you saying you think someone might come along and mistakenly think it supports extended ascii?
23:00 < fraggle> or that there are old wads that do this that you need to support?
23:01 < fraggle> and if there aren't any examples of such older wads, why bother?
23:02 <@GhostlyDeath> Legacy only allows '!' to '_' to be defined by characters
23:02 < fraggle> yes
23:02 < fraggle> same as vanilla
23:02 <@GhostlyDeath> lowercase is automatically uppercase
23:02 < fraggle> this is for strings rendered in the STCFN** font you mean
23:02 <@GhostlyDeath> yes
23:03 < fraggle> so basically there can't be any existing mods that would pose a problem then
23:03 <@GhostlyDeath> Lowercase STCFNs work in legacy as long as they exist
23:03 < fraggle> interesting
23:04 <@GhostlyDeath> actually, that might have been ZDoom
23:04 <@GhostlyDeath> or a ZDoom based port
23:04 < fraggle> maybe
23:05 <@GhostlyDeath> but ReMooD only maps lowercase letters to capital if the lowercase version does not exist but the uppercase one does
23:05 < fraggle> right
23:05 < fraggle> a useful extension for modders
23:06 < fraggle> so, any sensible reason left to overhaul everything to UTF-16/32?
23:06 <@GhostlyDeath> Entire documents in Japanese will be smaller
23:07 <@GhostlyDeath> if UTF-8 causes japanese characters to be 3 bytes
23:07 <@GhostlyDeath> in UTF-16 they are only 2 bytes
23:07 < fraggle> yeah well there's no reason that you can't support parsing UTF-16/32 *files*
23:07 <@GhostlyDeath> if the Japanese characters outweigh the normal ASCII characters
23:07 <@GhostlyDeath> same goes when representing text in game
23:07 < fraggle> my point was about internal representation
23:08 <@GhostlyDeath> When someone talks over the network
23:08 <@GhostlyDeath> 2 characters will be 6 bytes instead of 4 bytes
23:08 < fraggle> are most doomers japanese? or english?
23:08 <@GhostlyDeath> There are foreign doomers
23:08 < fraggle> yes
23:08 <@GhostlyDeath> I see them all the time on ZDaemon
23:08 < fraggle> but most doomers playing netgames are probably writing in english, let's be honest
23:09 <@GhostlyDeath> every port only allows ASCII to be input
23:09 < fraggle> so for the common case you'd be doubling the size of every string transmitted
23:09 <@GhostlyDeath> yes
23:09 < fraggle> quadrupling if you use UTF-32
23:09 < fraggle> so not really a saving
23:09 <@GhostlyDeath> Network packets will be compressed
23:09 <@GhostlyDeath> (if server allows compression)
23:10 <@GhostlyDeath> the server could accept only uncompressed packets which would force clients to not use compression
23:10 <@GhostlyDeath> or the server could force compression which would force clients to use compression
23:10 < fraggle> i bet a compressed ascii string is still smaller than the same compressed UTF-32 string
23:10 < fraggle> so not really relevant
23:11 < fraggle> if i can be honest it really seems like you're just clutching at straws
23:12 <@GhostlyDeath> Using gz -9
23:12 <@GhostlyDeath> 23404 >> 4485 and 11707 >> 3706
23:13 < fraggle> ?
23:13 <@GhostlyDeath> Only ASCII Characters
23:13 < fraggle> what do those numbers mean
23:13 <@GhostlyDeath> bytes
23:13 < fraggle> you've given me no context
23:15 < fraggle> 34234 >> 463 and 123781 >> 11111
23:15 < fraggle> 43434 << 333532 and 545 + 3
23:15 < fraggle> meaningless, see?
23:15 <@GhostlyDeath> Not bitshifting
23:16 <@GhostlyDeath> The doubled number is obviously UTF-16
23:16 <@GhostlyDeath> each file compressed down to a certain size
23:16 < fraggle> doubled number? what are you talking about?
23:16 <@GhostlyDeath> well, double + 2
23:16 < fraggle> can you just go back and present these values again in some clear form?
23:16 < fraggle> what is 23404 representing?
23:16 <@GhostlyDeath> number bytes of the UTF-16 file
23:16 < fraggle> ok
23:17 < fraggle> are there two files?
23:17 <@GhostlyDeath> yes
23:17 < fraggle> what are the contents of these two files?
23:17 <@GhostlyDeath> lzma for 16 produces 3646 and 8 produces 3506
23:17 <@GhostlyDeath> EnUS translation file
23:17 <@GhostlyDeath> Only ASCII characters
23:19 < fraggle> and the other file?
23:19 <@GhostlyDeath> Only ASCII characters
23:19 < fraggle> ...
23:19 < fraggle> so you have two ascii files.. they contain the same data?
23:19 <@GhostlyDeath> yes
23:19 < fraggle> right
23:19 <@GhostlyDeath> except one is in UTF-16
23:19 <@GhostlyDeath> containing characters found in ASCII
23:20 <@GhostlyDeath> hence ASCII characters
23:20 < fraggle> ok, i understand
23:20 < fraggle> and the other?
23:20 <@GhostlyDeath> Why the hell are we debating what in files anyway
23:20 <@GhostlyDeath> I already told you
23:20 < fraggle> told me what?
23:20 <@GhostlyDeath> What the contents of the file are
23:20 < fraggle> they both contain the same data, which is ascii
23:20 < fraggle> yes?
23:21 < fraggle> are you trying to say that they have different encodings?
23:21 < fraggle> one is encoded in UTF-16, the other? UTF-32?
23:21 <@GhostlyDeath> one is UTF-16 the other is ASCII
23:21 < fraggle> ok
23:22 <@GhostlyDeath> If the entire file were characters that could only be represented as 3 bytes in UTF-8, the UTF-16 one would be smaller
23:22 < fraggle> so if i try to guess what your values mean now
23:22 < fraggle> the ASCII version is 11707 bytes long
23:22 < fraggle> and compresses to 3706 bytes
23:22 < fraggle> is that correct?
23:23 <@GhostlyDeath> using gzip -9
23:23 < fraggle> while the UTF-16 version is 23404 bytes long, compresses to 4485 bytes
23:23 <@GhostlyDeath> lzma -9 produces better compression at 3509 bytes
23:23 < Freedoomer> bzip2 > gzip
23:23 <@GhostlyDeath> with lzma -9 it compresses down to 3646
23:23 < Freedoomer> Try bzip2
23:24 < fraggle> so compressed ASCII is smaller than compressed UTF-16?
23:24 <@GhostlyDeath> bzip2 for UTF-16 is 3662
23:24 <@GhostlyDeath> for ASCII it's 3668
23:24 <@GhostlyDeath> 6 bytes smaller!
23:24 <@GhostlyDeath> bzip2 WINS!
23:24 < Freedoomer> :)
23:24 <@GhostlyDeath> UTF-16 prevails!
23:25 < fraggle> i don't know GhostlyDeath, it seems to me like you're just dogmatically trying to stick to your original decision instead of rationally examining the arguments presented to you
23:25 < fraggle> what can i say, i'm only trying to help
23:25 <@GhostlyDeath> I've rationally examined them
23:25 < fraggle> if you want to waste your time, it's your project
23:25 <@GhostlyDeath> exactly
23:25 <@GhostlyDeath> Why do you care if I'm using wchar_ts?
23:26 <@GhostlyDeath> it would only mean that supporting UTF-32 would be alot easier
23:26 < fraggle> i'm just trying to provide some advice
23:26 <@GhostlyDeath> in the far far future
23:26 <@GhostlyDeath> however
23:26 <@GhostlyDeath> for text over the network
23:26 <@GhostlyDeath> I can have a field to specify encoding
23:26 <@GhostlyDeath> so English speakers saying "Hello" use 6 bytes
23:26 <@GhostlyDeath> and Japanese people can use UTF-16
23:26 < fraggle> because it seems like you've rushed into making the decision without doing the proper research or taking the time to understand the options
23:28 <@GhostlyDeath> I thought about UTF-8 but had to deal with multibyte characters
23:28 <@GhostlyDeath> In complete non-English environments it would not be viable
23:28 <@GhostlyDeath> as it would waste much more space than UTF-16
23:29 <@GhostlyDeath> but wchar_ts are 32-bits anyway in Linux
23:30 < fraggle> it's your time to waste
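
The size arguments traded in the log (Japanese text at 3 bytes per character in UTF-8 against 2 in UTF-16, English text doubling in UTF-16 and quadrupling in UTF-32) can be checked with a small sketch in C. The function names `utf8_size`, `utf16_size`, `utf32_size` and `encoded_size` are hypothetical, and the figures count payload bytes only, with no byte order mark, terminator, or encoding field.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_size(uint32_t cp)
{
    if (cp < 0x80)    return 1; /* ASCII */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3; /* rest of the BMP, including kana and kanji */
    return 4;
}

/* UTF-16 uses one 16-bit unit inside the BMP and a surrogate pair above it. */
size_t utf16_size(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

/* UTF-32 always uses four bytes per code point. */
size_t utf32_size(uint32_t cp)
{
    (void)cp;
    return 4;
}

/* Total payload size of n code points under the given per-code-point rule. */
size_t encoded_size(const uint32_t *cps, size_t n, size_t (*unit)(uint32_t))
{
    size_t total = 0, i;
    for (i = 0; i < n; i++)
        total += unit(cps[i]);
    return total;
}
```

For the five code points of "Hello" this gives 5, 10 and 20 bytes for UTF-8, UTF-16 and UTF-32 respectively. For the two code points U+65E5 and U+672C (日本) it gives 6 bytes in UTF-8 and 4 in UTF-16, matching the "6 bytes instead of 4 bytes" figure quoted in the discussion.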