22:13 < fraggle> are you going to support UTF-32 middle endian?
22:13 < fraggle> what about UTF-9 and UTF-18
22:22 <@GhostlyDeath> screw em
22:29 < fraggle> what about all your PDP-11 users
22:29 < fraggle> they'll be completely without unicode support
22:30 <@GhostlyDeath> I don't think they can run ReMooD
22:31 <@GhostlyDeath> also fraggle
22:31 <@GhostlyDeath> ReMooD is still ASCII
22:31 <@GhostlyDeath> it just uses Unicode on the inside
22:31 <@GhostlyDeath> so it will still run on 98 without unicows
22:31 < fraggle> unicows?
22:32 < fraggle> oh i see
22:32 < fraggle> my approach would just be to use UTF-8
22:32 < fraggle> UTF-8 is great, it's like magic
22:33 <@GhostlyDeath> but then I'd have to handle multibyte
22:33 <@GhostlyDeath> while parsing text files
22:33 <@GhostlyDeath> if (*x == MULTIBYTEINDICATOR)
22:33 < fraggle> you just pretend everything is ASCII and fix things up in a few places
22:33 < fraggle> magic
22:34 <@GhostlyDeath> would be anything reading and writing files, parsing text, drawing text
22:34 < fraggle> that's the point really
22:34 <@GhostlyDeath> would require the same code everywhere really
22:35 < fraggle> you only need to change the text drawing code
22:35 < fraggle> and you're done
22:35 <@GhostlyDeath> what about scripts?
22:35 < fraggle> what about them?
22:35 <@GhostlyDeath> parsing them
22:35 < fraggle> you just parse them like they're ascii
22:35 <@GhostlyDeath> and if they want to internationalize them?
22:36 <@GhostlyDeath> There would be multibyte
22:36 < fraggle> how do you mean
22:36 <@GhostlyDeath> but then I would have to interpret those also in the parser
22:36 <@GhostlyDeath> What if they want to print out japanese characters?
22:36 < fraggle> sorry i don't understand
22:36 < fraggle> the point i'm making is that you don't need to interpret UTF-8 any different from ASCII
22:37 <@GhostlyDeath> except for multibyte
22:37 < fraggle> no
22:37 < fraggle> at all
22:37 < fraggle> that's what i mean
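A minimal sketch of the point fraggle is making, assuming a hypothetical script line of the form `name=value`. Every byte of a multibyte UTF-8 sequence has its high bit set, so an ASCII delimiter such as `=` can never appear inside one, and plain byte-oriented parsing leaves the non-ASCII text intact. The function name `script_value` is invented for illustration and is not part of ReMooD.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the value part of a
 * "name=value" line, or NULL if the line has no '='.
 * The search is byte-oriented; it never needs to know whether the
 * value contains multibyte UTF-8 sequences, because all bytes of such
 * sequences are >= 0x80 and so cannot be mistaken for '='. */
const char *script_value(const char *line)
{
    const char *eq = strchr(line, '=');
    return eq ? eq + 1 : NULL;
}
```

The value is handed on untouched; only the code that eventually draws it has to decode it.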
22:37 <@GhostlyDeath> So I just leave them in as invalid characters?
22:37 < fraggle> in what situation would you have to parse them differently?
22:38 <@GhostlyDeath> if someone decides to type in some japanese
22:38 < fraggle> right
22:38 <@GhostlyDeath> or parse some files
22:38 < fraggle> well let's find a specific example
22:38 < fraggle> like a script
22:38 <@GhostlyDeath> Going for UTF-16 will simplify everything for me
22:38 < fraggle> maybe a fragglescript?
22:38 <@GhostlyDeath> as I won't have to worry about multibyte UTF-8
22:38 < fraggle> well that's my point really
22:38 < fraggle> i'm saying i think UTF-8 will make everything simpler for you
22:39 < fraggle> because you'll hardly have to change anything
22:39 <@GhostlyDeath> scripts and lumps will be converted to UTF-16
22:39 < fraggle> if you take the example of a script
22:39 <@GhostlyDeath> strlen won't work on multibyte strings
22:39 < fraggle> like a variable definition for example in a script
22:39 <@GhostlyDeath> I would have to create a function to go through each character
22:39 <@GhostlyDeath> and treat multibyte chars as single chars
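The function GhostlyDeath describes here can be quite small. The sketch below counts characters (code points) in a UTF-8 string by skipping continuation bytes, which always have the bit pattern `10xxxxxx`. It assumes the input is well-formed UTF-8 and does no validation; the name `utf8_count` is made up for this example.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
 * Continuation bytes look like 10xxxxxx, so every byte that does NOT
 * match that pattern starts a new character.
 * Assumption: the input is well-formed UTF-8 (no validation here). */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a string holding two three-byte Japanese characters, `strlen` reports 6 (bytes) while this reports 2 (characters).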
22:40 < fraggle> yeah
22:40 <@GhostlyDeath> same goes for everything else
22:40 <@GhostlyDeath> wcs* uses wchar_t
22:40 < fraggle> but you have to think about when you actually use strlen and why
22:40 < fraggle> in what situation is that difference a problem?
22:40 <@GhostlyDeath> drawing strings mostly
22:40 < fraggle> exactly
22:41 < fraggle> the point is that you can shift all the unicode support to the drawing stage
22:41 < fraggle> you only need to change your text drawing functions
22:41 < fraggle> that's the only place where it makes any difference
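As a rough illustration of what "changing the text drawing functions" could involve, here is a sketch of a decoder that a drawing loop might call to get one code point at a time. The name `utf8_decode` and its return convention are assumptions for this example, not anything from the ReMooD source. It checks lead and continuation byte patterns but does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s into *cp.
 * Returns the number of bytes consumed (1..4), or 0 if the lead byte
 * is invalid or a continuation byte is missing.
 * Simplification: overlong forms and surrogates are not rejected. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;

    if (p[0] < 0x80) {                 /* plain ASCII */
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {
        *cp = p[0] & 0x1F; len = 2;
    } else if ((p[0] & 0xF0) == 0xE0) {
        *cp = p[0] & 0x0F; len = 3;
    } else if ((p[0] & 0xF8) == 0xF0) {
        *cp = p[0] & 0x07; len = 4;
    } else {
        return 0;                      /* stray continuation or bad lead */
    }

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)     /* also stops at the NUL terminator */
            return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return len;
}
```

A drawing routine would call this in a loop, advance by the returned length, and look up a glyph for each `*cp`; the rest of the program keeps passing `char *` strings around unchanged.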
22:41 <@GhostlyDeath> but that would only be for drawing text
22:41 < fraggle> yes
22:41 <@GhostlyDeath> nothing in the input for Unicode
22:42 < fraggle> if you use UTF-8, you just hide non-ASCII characters within strings
22:42 <@GhostlyDeath> but the Unicode isn't just for drawing strings
22:42 < fraggle> it doesn't matter anywhere except the drawing functions
22:42 < fraggle> see?
22:43 < fraggle> so overhauling your entire program to replace everything with UTF-16 is completely unnecessary
22:43 <@GhostlyDeath> shit happens
22:43 < fraggle> ?
22:44 < fraggle> using UTF-16 is also a classic mistake because it doesn't allow you to represent every unicode symbol
22:44 <@GhostlyDeath> I know
22:45 < fraggle> thus defeating the entire point of adding unicode support in the first place
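For context on this exchange: a single 16-bit unit can only hold code points up to U+FFFF, which is the limitation being described if each `wchar_t` is treated as one character. UTF-16 proper reaches the remaining code points with surrogate pairs, at the cost of no longer being fixed-width. The sketch below shows that encoding step; the name `utf16_encode` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written: 1 for the Basic
 * Multilingual Plane, 2 (a surrogate pair) for U+10000..U+10FFFF,
 * or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogates are not characters */
    if (cp > 0x10FFFF)
        return 0;                      /* beyond the Unicode range */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                     /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Code that indexes a UTF-16 string one unit at a time therefore has the same "multi-unit character" issue that UTF-8 has, just less often.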
22:45 <@GhostlyDeath> switching to UTF-32 would be a lot simpler now
22:45 <@GhostlyDeath> since wchar_t is only 32 bits
22:45 <@GhostlyDeath> I would only have to change the drawing functions
22:45 <@GhostlyDeath> wchar_t is 32-bits on Linux though
22:46 <@GhostlyDeath> MSVC might make it 16-bits
22:46 < fraggle> or just switch to UTF-8 and save yourself a lot of unnecessary trouble :)
22:48 <@GhostlyDeath> By using UTF-16, I don't have to worry about multibyte strings
22:48 <@GhostlyDeath> and I can use the wcs* functions with wchar_t
22:49 < fraggle> what do you mean by "don't have to worry" about them?
22:49 <@GhostlyDeath> if those functions require wchar_t
22:49 <@GhostlyDeath> I would have to translate everything in the game to be wchar_ts
22:49 < fraggle> ?
22:50 <@GhostlyDeath> The drawing functions only go up to 16-bits currently
22:50 <@GhostlyDeath> at my choice, I can choose to switch to UTF-32 anytime
22:50 <@GhostlyDeath> since UTF-16 would already be in place
22:50 < fraggle> ok..
22:50 <@GhostlyDeath> every function will use wchar_ts instead of char
22:51 < fraggle> so you're saying that you're going to change every piece of code in the entire source base that uses a string?
22:51 < fraggle> to use wchar_t strings?
22:51 <@GhostlyDeath> Scripts that are UTF-32 will just be snipped down for now
22:51 < fraggle> is that correct?
22:51 <@GhostlyDeath> but the parsing code will only depend on wchar_t
22:51 <@GhostlyDeath> regardless if it's 8, 16 or 32 bits
22:51 <@GhostlyDeath> it's just the drawing code mostly
22:51 <@GhostlyDeath> fraggle: most of the strings
22:52 < fraggle> this is going to be rather tedious don't you think?
22:53 <@GhostlyDeath> I have tons of Free time
22:54 < fraggle> point i've been trying to make is that the "problem" with multibyte strings only really occurs when you're trying to render them
22:54 < fraggle> it's only something you need to deal with in the rendering stage
22:55 < fraggle> wouldn't it be easier to just make the rendering stage support UTF-8 instead of reworking the entire source code base
22:55 < fraggle> in addition to being less work it also leads to more readable code!
22:56 <@GhostlyDeath> and what if someone decides to use extended ASCII characters reserved by UTF-8?
22:56 <@GhostlyDeath> to maybe draw some special characters they made themselves
22:56 < fraggle> the only time you need to care about the difference between a character and a byte is when you're rendering characters, or doing something like looking up an individual character in a string (usually because you're rendering it)
22:56 <@GhostlyDeath> and added a corresponding STCFNxxx?
22:57 <@GhostlyDeath> many parts of the code depend on an individual character
22:57 < fraggle> has anyone done that?
22:57 <@GhostlyDeath> I believe so
22:57 < fraggle> example?
22:57 < fraggle> are you talking about vanilla wads or something? i'm confused
22:58 < fraggle> because vanilla doesn't do extended ascii
22:59 <@GhostlyDeath> who says you have to follow vanilla rules?
22:59 < fraggle> well i don't understand the point you're making
22:59 < fraggle> are you saying you think someone might come along and mistakenly think it supports extended ascii?
23:00 < fraggle> or that there are old wads that do this that you need to support?
23:01 < fraggle> and if there aren't any examples of such older wads, why bother?
23:02 <@GhostlyDeath> Legacy only allows '!' to '_' to be defined by characters
23:02 < fraggle> yes
23:02 < fraggle> same as vanilla
23:02 <@GhostlyDeath> lowercase is automatically uppercase
23:02 < fraggle> this is for strings rendered in the STCFN** font you mean
23:02 <@GhostlyDeath> yes
23:03 < fraggle> so basically there can't be any existing mods that would pose a problem then
23:03 <@GhostlyDeath> Lowercase STCFNs work in legacy as long as they exist
23:03 < fraggle> interesting
23:04 <@GhostlyDeath> actually, that might have been ZDoom
23:04 <@GhostlyDeath> or a ZDoom based port
23:04 < fraggle> maybe
23:05 <@GhostlyDeath> but ReMooD only maps lowercase letters to capital if the lowercase version does not exist but the uppercase one does
23:05 < fraggle> right
23:05 < fraggle> a useful extension for modders
23:06 < fraggle> so, any sensible reason left to overhaul everything to UTF-16/32?
23:06 <@GhostlyDeath> Entire documents in Japanese will be smaller
23:07 <@GhostlyDeath> if UTF-8 causes japanese characters to be 3 bytes
23:07 <@GhostlyDeath> in UTF-16 they are only 2 bytes
23:07 < fraggle> yeah well there's no reason that you can't support parsing UTF-16/32 *files*
23:07 <@GhostlyDeath> if the Japanese characters outweigh the normal ASCII characters
23:07 <@GhostlyDeath> same goes when representing text in game
23:07 < fraggle> my point was about internal representation
23:08 <@GhostlyDeath> When someone talks over the network
23:08 <@GhostlyDeath> 2 characters will be 6 bytes instead of 4 bytes
23:08 < fraggle> are most doomers japanese? or english?
23:08 <@GhostlyDeath> There are foreign doomers
23:08 < fraggle> yes
23:08 <@GhostlyDeath> I see them all the time on ZDaemon
23:08 < fraggle> but most doomers playing netgames are probably writing in english, let's be honest
23:09 <@GhostlyDeath> every port only allows ASCII to be input
23:09 < fraggle> so for the common case you'd be doubling the size of every string transmitted
23:09 <@GhostlyDeath> yes
23:09 < fraggle> quadrupling if you use UTF-32
23:09 < fraggle> so not really a saving
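The byte counts both sides are quoting follow directly from the encoding rules. A small sketch, assuming valid code points as input and using made-up function names, that computes the encoded size of one character in each form:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in each encoding form.
 * Assumption: cp is a valid Unicode scalar value. */
size_t utf8_size(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;   /* includes kana and most kanji */
    return 4;
}

size_t utf16_size(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;  /* 4 means a surrogate pair */
}

size_t utf32_size(uint32_t cp)
{
    (void)cp;
    return 4;                     /* always one 32-bit unit */
}
```

Two hiragana characters come to 6 bytes in UTF-8 and 4 in UTF-16, as GhostlyDeath says; an ASCII letter is 1 byte in UTF-8, 2 in UTF-16 and 4 in UTF-32, which is the doubling and quadrupling fraggle refers to.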
23:09 <@GhostlyDeath> Network packets will be compressed
23:09 <@GhostlyDeath> (if server allows compression)
23:10 <@GhostlyDeath> the server could accept only uncompressed packets which would force clients to not use compression
23:10 <@GhostlyDeath> or the server could force compression which would force clients to use compression
23:10 < fraggle> i bet a compressed ascii string is still smaller than the same compressed UTF-32 string
23:10 < fraggle> so not really relevant
23:11 < fraggle> if i can be honest it really seems like you're just clutching at straws
23:12 <@GhostlyDeath> Using gz -9
23:12 <@GhostlyDeath> 23404 >> 4485 and 11707 >> 3706
23:13 < fraggle> ?
23:13 <@GhostlyDeath> Only ASCII Characters
23:13 < fraggle> what do those numbers mean
23:13 <@GhostlyDeath> bytes
23:13 < fraggle> you've given me no context
23:15 < fraggle> 34234 >> 463 and 123781 >> 11111
23:15 < fraggle> 43434 << 333532 and 545 + 3
23:15 < fraggle> meaningless, see?
23:15 <@GhostlyDeath> Not bitshifting
23:16 <@GhostlyDeath> The doubled number is obviously UTF-16
23:16 <@GhostlyDeath> each file compressed down to a certain size
23:16 < fraggle> doubled number? what are you talking about?
23:16 <@GhostlyDeath> well, double + 2
23:16 < fraggle> can you just go back and present these values again in some clear form?
23:16 < fraggle> what is 23404 representing?
23:16 <@GhostlyDeath> number of bytes of the UTF-16 file
23:16 < fraggle> ok
23:17 < fraggle> are there two files?
23:17 <@GhostlyDeath> yes
23:17 < fraggle> what are the contents of these two files?
23:17 <@GhostlyDeath> lzma for 16 produces 3646 and 8 produces 3506
23:17 <@GhostlyDeath> EnUS translation file
23:17 <@GhostlyDeath> Only ASCII characters
23:19 < fraggle> and the other file?
23:19 <@GhostlyDeath> Only ASCII characters
23:19 < fraggle> ...
23:19 < fraggle> so you have two ascii files.. they contain the same data?
23:19 <@GhostlyDeath> yes
23:19 < fraggle> right
23:19 <@GhostlyDeath> except one is in UTF-16
23:19 <@GhostlyDeath> containing characters found in ASCII
23:20 <@GhostlyDeath> hence ASCII characters
23:20 < fraggle> ok, i understand
23:20 < fraggle> and the other?
23:20 <@GhostlyDeath> Why the hell are we debating what's in the files anyway
23:20 <@GhostlyDeath> I already told you
23:20 < fraggle> told me what?
23:20 <@GhostlyDeath> What the contents of the file are
23:20 < fraggle> they both contain the same data, which is ascii
23:20 < fraggle> yes?
23:21 < fraggle> are you trying to say that they have different encodings?
23:21 < fraggle> one is encoded in UTF-16, the other? UTF-32?
23:21 <@GhostlyDeath> one is UTF-16 the other is ASCII
23:21 < fraggle> ok
23:22 <@GhostlyDeath> If the entire file were characters that could only be represented as 3 bytes in UTF-8, the UTF-16 one would be smaller
23:22 < fraggle> so if i try to guess what your values mean now
23:22 < fraggle> the ASCII version is 11707 bytes long
23:22 < fraggle> and compresses to 3706 bytes
23:22 < fraggle> is that correct?
23:23 <@GhostlyDeath> using gzip -9
23:23 < fraggle> while the UTF-16 version is 23404 bytes long, compresses to 4485 bytes
23:23 <@GhostlyDeath> lzma -9 produces better compression at 3509 bytes
23:23 < Freedoomer> bzip2 > gzip
23:23 <@GhostlyDeath> with lzma -9 it compresses down to 3646
23:23 < Freedoomer> Try bzip2
23:24 < fraggle> so compressed ASCII is smaller than compressed UTF-16?
23:24 <@GhostlyDeath> bzip2 for UTF-16 is 3662
23:24 <@GhostlyDeath> for ASCII it's 3668
23:24 <@GhostlyDeath> 6 bytes smaller!
23:24 <@GhostlyDeath> bzip2 WINS!
23:24 < Freedoomer> :)
23:24 <@GhostlyDeath> UTF-16 prevails!
23:25 < fraggle> i don't know GhostlyDeath, it seems to me like you're just dogmatically trying to stick to your original decision instead of rationally examining the arguments presented to you
23:25 < fraggle> what can i say, i'm only trying to help
23:25 <@GhostlyDeath> I've rationally examined them
23:25 < fraggle> if you want to waste your time, it's your project
23:25 <@GhostlyDeath> exactly
23:25 <@GhostlyDeath> Why do you care if I'm using wchar_ts?
23:26 <@GhostlyDeath> it would only mean that supporting UTF-32 would be a lot easier
23:26 < fraggle> i'm just trying to provide some advice
23:26 <@GhostlyDeath> in the far far future
23:26 <@GhostlyDeath> however
23:26 <@GhostlyDeath> for text over the network
23:26 <@GhostlyDeath> I can have a field to specify encoding
23:26 <@GhostlyDeath> so English speakers saying "Hello" use 6 bytes
23:26 <@GhostlyDeath> and Japanese people can use UTF-16
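One way to read the "6 bytes" figure is a one-byte encoding tag followed by the five ASCII bytes of "Hello". The sketch below packs a chat message that way. The wire format, the `chat_encoding` values and the name `pack_chat` are all invented for illustration; the log does not describe ReMooD's actual packet layout.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding tags for a chat payload. */
enum chat_encoding { ENC_ASCII = 0, ENC_UTF16 = 1 };

/* Write a one-byte encoding tag followed by the raw payload bytes.
 * Returns the total number of bytes written, or 0 if the output
 * buffer is too small to hold the tag plus the payload. */
size_t pack_chat(enum chat_encoding enc, const void *payload, size_t len,
                 unsigned char *out, size_t cap)
{
    if (len >= cap)
        return 0;                 /* need len + 1 bytes in total */
    out[0] = (unsigned char)enc;
    memcpy(out + 1, payload, len);
    return len + 1;
}
```

With this layout "Hello" costs 6 bytes, and two Japanese characters sent as UTF-16 cost 5; the receiver reads the tag first and picks a decoder accordingly.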
23:26 < fraggle> because it seems like you've rushed into making the decision without doing the proper research or taking the time to understand the options
23:28 <@GhostlyDeath> I thought about UTF-8 but had to deal with multibyte characters
23:28 <@GhostlyDeath> In complete non-English environments it would not be viable
23:28 <@GhostlyDeath> as it would waste much more space than UTF-16
23:29 <@GhostlyDeath> but wchar_ts are 32-bits anyway in Linux
23:30 < fraggle> it's your time to waste