c - How to detect the character encoding of command line arguments in mingw
Is it safe to assume that they are ISO-8859-15 (windows-1252?), or is there some function I can call to query this? The ultimate goal is to convert them to UTF-8.
Background:
The problem arises because XMLStarlet assumes that its command line arguments are UTF-8. Under Windows it seems that they are actually ISO-8859-15 (windows-1252?), or at least adding the following to the beginning of main makes things work:
    char **utf8argv = malloc(sizeof(char *) * (argc + 1));
    utf8argv[argc] = NULL;
    {
        iconv_t windows2utf8 = iconv_open("UTF-8", "ISO-8859-15");
        int i;
        for (i = 0; i < argc; i++) {
            const char *arg = argv[i];
            size_t len = strlen(arg);
            /* up to 3 UTF-8 bytes per input byte (e.g. the euro sign), plus the terminator */
            size_t outlen = len * 3 + 1;
            char *utfarg = malloc(outlen);
            char *out = utfarg;
            size_t ret = iconv(windows2utf8, &arg, &len, &out, &outlen);
            /* iconv reports failure as (size_t)-1; size_t is unsigned, so "ret < 0" is never true */
            if (ret == (size_t)-1) {
                perror("iconv");
                utf8argv[i] = NULL;
                continue;
            }
            out[0] = '\0';
            utf8argv[i] = utfarg;
        }
        argv = utf8argv;
    }
Testing encoding
The following program prints the bytes of its first argument in decimal:
    #include <string.h>
    #include <stdio.h>
    int main(int argc, char *argv[])
    {
        for (size_t i = 0; i < strlen(argv[1]); i++) {
            printf("%d ", (unsigned char) argv[1][i]);
        }
        printf("\n");
        return 0;
    }
chcp reports code page 850, so the characters æ and Æ should be 145 and 146, respectively.
    C:\Users\npostavs\tmp>chcp
    Active code page: 850
But we see 230 and 198, which match code page 1252:
    C:\Users\npostavs\tmp>cmd-chars æÆ
    230 198
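To make the byte values above concrete, here is a minimal sketch (not part of the original post) of how a single byte such as 230 or 198 turns into UTF-8. The helper name latin1_byte_to_utf8 is hypothetical, and it assumes the byte is in the range where windows-1252 and ISO-8859-1 agree (0x00-0x7F and 0xA0-0xFF); bytes 0x80-0x9F differ in windows-1252 and would need a lookup table that is not shown.

```c
#include <stddef.h>

/* Hypothetical helper: encode one byte, interpreted as ISO-8859-1, as UTF-8.
 * Writes 1 or 2 bytes into out and returns how many were written.
 * Assumption: for windows-1252 input this is only right for 0x00-0x7F and
 * 0xA0-0xFF; 0x80-0x9F map elsewhere in windows-1252 and are not handled. */
size_t latin1_byte_to_utf8(unsigned char b, unsigned char out[2])
{
    if (b < 0x80) {            /* ASCII is unchanged in UTF-8 */
        out[0] = b;
        return 1;
    }
    /* U+0080..U+00FF use the two-byte form 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (b >> 6));
    out[1] = (unsigned char)(0x80 | (b & 0x3F));
    return 2;
}
```

With this sketch, byte 230 (æ) becomes the two bytes 0xC3 0xA6 and byte 198 (Æ) becomes 0xC3 0x86, which is the kind of output a UTF-8 consumer such as XMLStarlet expects.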
Passing characters outside the code page leads to a lossy conversion
Making a shortcut to cmd-chars.exe with the arguments αβγ (these are not present in code page 1252) gives
    C:\Users\npostavs\tmp>shortcut-cmd-chars.lnk
    97 223 63
which is aß?.
You can call CommandLineToArgvW with a call to GetCommandLineW as the first argument to get the command line arguments in an argv-style array of wide strings. This is the only portable Windows way, especially with the code page mess; for example, Japanese characters can be passed through a Windows shortcut. After that, you can use WideCharToMultiByte with a code page argument of CP_UTF8 to convert each wide-character argv element to UTF-8.
Note that calling WideCharToMultiByte with an output buffer size (byte count) of 0 lets you determine the number of UTF-8 bytes required for the number of characters specified (or for the entire wide string including the null terminator, if you pass -1 as the number of wide characters to simplify your code). You can then allocate the required number of bytes using malloc et al. and call WideCharToMultiByte again with the correct number of bytes instead of 0. If this were performance-critical, a different solution would probably be best, but since this is a one-time call to get the command line arguments, I would say any decrease in performance is negligible.
Of course, do not forget to free all of your memory, which includes calling LocalFree with the pointer returned by CommandLineToArgvW as the argument.
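The steps described in this answer could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names get_utf8_argv and free_utf8_argv are made up for this example, the code is Windows-only (hence the _WIN32 guard), and with MinGW it would typically need to be linked with -lshell32.

```c
/* Windows/MinGW-only sketch; compiles to nothing on other platforms. */
#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW */
#include <stdlib.h>

/* Hypothetical helper: free an array returned by get_utf8_argv(). */
void free_utf8_argv(char **utf8argv)
{
    if (utf8argv == NULL)
        return;
    for (char **p = utf8argv; *p != NULL; p++)
        free(*p);
    free(utf8argv);
}

/* Hypothetical helper: returns a NULL-terminated, malloc'd array of UTF-8
 * strings built from the process command line, or NULL on failure. */
char **get_utf8_argv(int *argc_out)
{
    int argc = 0;
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (wargv == NULL)
        return NULL;

    /* calloc zero-fills, so the array is always NULL-terminated */
    char **utf8argv = calloc((size_t)argc + 1, sizeof *utf8argv);
    for (int i = 0; utf8argv != NULL && i < argc; i++) {
        /* First call: buffer size 0 asks for the required byte count.
         * Passing -1 as the length means the count includes the terminator. */
        int bytes = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                        NULL, 0, NULL, NULL);
        if (bytes > 0)
            utf8argv[i] = malloc((size_t)bytes);
        /* Second call: do the actual conversion into the new buffer. */
        if (bytes <= 0 || utf8argv[i] == NULL ||
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                utf8argv[i], bytes, NULL, NULL) == 0) {
            free_utf8_argv(utf8argv);
            utf8argv = NULL;
        }
    }

    LocalFree(wargv);   /* the block from CommandLineToArgvW is freed with LocalFree */
    if (utf8argv != NULL)
        *argc_out = argc;
    return utf8argv;
}
#endif
```

In main, one would call get_utf8_argv(&argc) near the start, use the returned array in place of argv, and call free_utf8_argv on it before exiting.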
For more information about these functions and how to use them, see the MSDN documentation.