c - How to detect the character encoding of command line arguments in MinGW


Is it safe to assume that they are ISO-8859-15 (Windows-1252?), or is there some function I can call to query this? The ultimate goal is to convert them to UTF-8.


Background:

The problem arises because XMLStarlet assumes that its command line arguments are UTF-8. Under Windows it seems that they are actually ISO-8859-15 (Windows-1252?), or at least adding the following to the beginning of main makes things work:

    char **utf8argv = malloc(sizeof(char *) * (argc + 1));
    utf8argv[argc] = NULL;

    {
        iconv_t windows2utf8 = iconv_open("UTF-8", "ISO-8859-15");
        int i;
        for (i = 0; i < argc; i++) {
            char *arg = argv[i];
            size_t len = strlen(arg);
            size_t outlen = len * 2 + 1;
            char *utfarg = malloc(outlen);
            char *out = utfarg;
            size_t ret = iconv(windows2utf8, &arg, &len, &out, &outlen);

            /* iconv reports failure as (size_t)-1; "ret < 0" is never true for a size_t */
            if (ret == (size_t)-1) {
                perror("iconv");
                utf8argv[i] = NULL;
                continue;
            }

            out[0] = '\0';
            utf8argv[i] = utfarg;
        }

        argv = utf8argv;
    }

Testing the encoding

The following program prints the bytes of its first argument in decimal:

    #include <string.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2)
            return 1;
        /* print each byte of the first argument as an unsigned decimal */
        for (size_t i = 0; i < strlen(argv[1]); i++) {
            printf("%d ", (unsigned char) argv[1][i]);
        }
        printf("\n");
        return 0;
    }

chcp reports code page 850, so the characters æ and Æ should be 145 and 146, respectively.

    C:\Users\npostavs\tmp>chcp
    Active code page: 850

But we see 230 and 198, which match Windows-1252:

    C:\Users\npostavs\tmp>cmd-chars æÆ
    230 198
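The reading of 230 and 198 as Windows-1252 bytes can be cross-checked with iconv. The following is a minimal sketch, not XMLStarlet code: `to_utf8` is a hypothetical helper name, and it assumes an iconv implementation that uses the POSIX `char **` signature and recognises the charset name passed in (for example `"CP1252"`).

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated string from `charset`
   to UTF-8. Returns a malloc'd string, or NULL on failure. */
char *to_utf8(const char *charset, const char *s)
{
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return NULL;                    /* charset not supported */

    size_t inleft = strlen(s);
    size_t outleft = inleft * 4;        /* UTF-8 needs at most 4 bytes per character */
    char *buf = malloc(outleft + 1);    /* +1 leaves room for the terminator */
    if (buf == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)s;               /* iconv takes char **, but only reads the input */
    char *out = buf;
    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1) {             /* errors are reported as (size_t)-1 */
        free(buf);
        return NULL;
    }
    *out = '\0';
    return buf;
}
```

With this helper, `to_utf8("CP1252", "\xE6\xC6")` would be expected to produce the bytes C3 A6 C3 86, the UTF-8 encoding of æÆ, which is consistent with the arguments arriving as Windows-1252 rather than code page 850.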

Passing characters outside the code page causes a lossy conversion

Making a shortcut to cmd-chars.exe with the arguments αβγ (these are not present in code page 1252) gives

    C:\Users\npostavs\tmp>shortcut-cmd-chars.lnk
    97 223 63

Which is aß?.
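Both observations fit the assumption that the narrow argv is produced in the system's ANSI code page, while chcp reports the console code page. A small sketch for printing the candidates side by side follows; `print_code_pages` is a hypothetical name, and the block is guarded so it only compiles on Windows.

```c
#ifdef _WIN32
#include <windows.h>
#include <stdio.h>

/* Print the code pages that are easy to confuse with one another. */
void print_code_pages(void)
{
    /* ANSI code page: assumed here to be the one used for the narrow argv */
    printf("GetACP:             %u\n", GetACP());
    /* OEM code page: the usual default for console windows */
    printf("GetOEMCP:           %u\n", GetOEMCP());
    /* Console code pages: what chcp displays and changes */
    printf("GetConsoleCP:       %u\n", GetConsoleCP());
    printf("GetConsoleOutputCP: %u\n", GetConsoleOutputCP());
}
#endif /* _WIN32 */
```

Knowing the ANSI code page does not help with the lossy case above, since characters outside it are already gone by the time main sees them.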

You can call CommandLineToArgvW, with a call to GetCommandLineW as the first argument, to get the command line arguments in an argv-style array of wide strings. This is the only portable Windows way, especially given the code page mess; for example, Japanese characters can be passed through a Windows shortcut. After that, you can use WideCharToMultiByte with a code page argument of CP_UTF8 to convert each wide-character argv element to UTF-8.

Note that calling WideCharToMultiByte with an output buffer size (byte count) of 0 lets you determine the number of UTF-8 bytes required for the number of wide characters specified (or for the entire wide string including the null terminator, if you pass -1 as the number of wide characters to simplify your code). You can then allocate the required number of bytes using malloc et al. and call WideCharToMultiByte again with the correct number of bytes instead of 0. If this were performance-critical, a different solution would probably be best, but since fetching the command line arguments is a one-time operation, I would say any decrease in performance is negligible.

Of course, do not forget to free all of your memory, which includes calling LocalFree with the pointer returned by CommandLineToArgvW as the argument.
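The steps above can be sketched roughly as follows. This is an illustration under assumptions rather than a drop-in patch: `get_utf8_argv` is a hypothetical helper name, error handling is kept simple, and the block is guarded with `_WIN32` so that it only compiles on Windows.

```c
#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW */
#include <stdlib.h>

/* Hypothetical helper: build a UTF-8 copy of the process's arguments.
   On success returns a NULL-terminated, malloc'd array and stores the
   argument count in *argc_out; returns NULL on failure. */
char **get_utf8_argv(int *argc_out)
{
    int argc = 0;
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (wargv == NULL)
        return NULL;

    /* calloc zeroes the array, so argv8[argc] is already NULL */
    char **argv8 = calloc((size_t)argc + 1, sizeof *argv8);
    if (argv8 != NULL) {
        for (int i = 0; i < argc; i++) {
            /* First call: output size 0 asks for the required byte count;
               -1 as the input length includes the NUL terminator. */
            int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                        NULL, 0, NULL, NULL);
            if (n > 0)
                argv8[i] = malloc((size_t)n);
            /* Second call: convert into the correctly sized buffer. */
            if (n <= 0 || argv8[i] == NULL ||
                WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                    argv8[i], n, NULL, NULL) == 0) {
                /* Roll back everything allocated so far. */
                for (int j = 0; j <= i; j++)
                    free(argv8[j]);
                free(argv8);
                argv8 = NULL;
                break;
            }
        }
    }

    LocalFree(wargv);   /* the array from CommandLineToArgvW is freed with LocalFree */
    if (argv8 != NULL)
        *argc_out = argc;
    return argv8;
}
#endif /* _WIN32 */
```

A caller could replace the narrow arguments at the start of main with something like `char **u = get_utf8_argv(&argc); if (u) argv = u;` and later free each element and the array with free.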

For more information about these functions and how to use them, see the MSDN documentation.
