java - Decoding Base64 and encoding as UTF-8 still leaves encoded characters


I have a program that has to decode a Base64 string and then re-encode it as UTF-8. The program pulls the text from a .doc file that it downloads locally from Dropbox (using Temboo). After decoding, there are still strange characters before and after the document text. This is how Microsoft Word 2011 displays a section of the page:

[Image: the characters as displayed in Word]

I tried pasting the text into online decoders, but I could not work out what encoding the characters above are in. This is how I am currently decoding the text:

  // Strip line breaks from the Base64 text
  encoded = encoded.replaceAll("\r\n", "");
  encoded = encoded.replaceAll("\n", "");
  encoded = encoded.replaceAll("\r", "");
  // Decoding the response
  decoded = StringUtils.newStringUtf8(Base64.decodeBase64(encoded));

In TextEdit.app it looks like this:

[Image: the same text as displayed in TextEdit]

Does anyone know what the encoding is and how I can decode these characters?

Here is the first part of a .docx file, in hex:

  50 4B 03 04 14 00 06 00 08 00 00 00 21 00 E1 0F
  8E BF 8D 01 00 00 29 06 00 00 13 00 08 02 5B 43
  6F 6E 74 65 6E 74 5F 54 79 70 65 73 5D 2E 78 6D
  6C 20 A2 04 02 28 A0 00 02 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Note that every 2-digit value above is one "character" (one byte). The first two values, 50 and 4B, are the ASCII characters P and K. (Google "ASCII table" and you will see what I mean.)

But not all of this is character data. If you look at the hex values, nothing with a value above 0x7F is a valid ASCII character, and such a byte on its own is not a valid UTF-8 character either (in UTF-8, bytes above 0x7F only occur inside well-formed multi-byte sequences).

When data like this is transmitted over the Internet through certain protocols, it is liable to be corrupted (because the protocol expects ASCII characters) unless it is first encoded as ASCII. That is the purpose of Base64.

Here is the above data encoded as Base64:

  UEsDBBQABgAIAAAAIQDhD46/jQEAACkGAAATAAgCW0NvbnRlbn
  RfVHlwZXNdLnhtbCCiBAIooAACAAAAAAAAAAAAAAAAAAAAAAAA
  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This can be transmitted safely, because all of the values are ordinary ASCII characters (their numeric values are all below 0x7F).
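To see the relationship between the two forms, here is a minimal sketch that decodes the start of the Base64 text back into bytes and prints them as hex. It uses the JDK's java.util.Base64 (Java 8 or later) rather than the library in the question; the class name, the toHex helper and the choice to decode only the first 76 characters (with whitespace removed) are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64HexDemo {
    // First 76 characters of the Base64 text (a multiple of 4, so it decodes cleanly).
    static final String PREFIX =
        "UEsDBBQABgAIAAAAIQDhD46/jQEAACkGAAATAAgCW0NvbnRlbn"
      + "RfVHlwZXNdLnhtbCCiBAIooAAC";

    // Render bytes as space-separated, two-digit, uppercase hex values.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = Base64.getDecoder().decode(PREFIX);
        System.out.println(toHex(raw));
        // Bytes 30..48 happen to be plain ASCII: the name of the first entry in the file.
        System.out.println(new String(raw, 30, 19, StandardCharsets.US_ASCII));
    }
}
```

The hex output begins with 50 4B, the "P" and "K" mentioned earlier, and the only readable stretch is the embedded file name. Everything around it is numeric header data.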

When you decode the Base64 and write the resulting data to a file, you will have "reconstructed" the original .docx file.
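A minimal sketch of that step, assuming the JDK's java.util.Base64 instead of the library used in the question. The class name, method name and temp-file target are hypothetical, and the sample input is only the first few bytes of the header, so the file it writes is not a complete document.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

class Base64ToFile {
    // Decode Base64 text and write the raw bytes to a file, never going through a String.
    static byte[] decodeToFile(String encoded, Path target) throws IOException {
        // The MIME decoder ignores \r, \n and other characters outside the Base64
        // alphabet, so the replaceAll calls are not needed.
        byte[] raw = Base64.getMimeDecoder().decode(encoded);
        Files.write(target, raw); // bytes are written untouched; no charset is involved
        return raw;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input: the start of the Base64 text, with line breaks left in.
        String encoded = "UEsDBBQABgAIAAAAIQDhD46/\r\njQEAACkG\n";
        Path target = Files.createTempFile("decoded-", ".docx");
        byte[] raw = decodeToFile(encoded, target);
        System.out.println(raw.length + " bytes written to " + target);
    }
}
```

In a real program the whole Base64 response would be passed to decodeToFile, and the resulting file could then be opened in Word.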

If, on the other hand, you pass the decoded data (or data that was never encoded) through a byte-to-String converter (such as newStringUtf8), then the bytes larger than 0x7F are interpreted as UTF-8 sequences and translated into the corresponding UTF-16 or UTF-32 characters. But "binary" data (e.g. the header data in a .doc or .docx file) is just numbers, not character data. Converting those binary values to UTF characters produces nothing meaningful, and besides, some values cannot be converted at all and will not convert back correctly.
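The loss can be demonstrated with a short sketch. It assumes that newStringUtf8 behaves like the JDK's new String(bytes, StandardCharsets.UTF_8); the class name and the sample bytes are chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BinaryAsTextDemo {
    // Push bytes through a UTF-8 String and back again.
    static byte[] roundTrip(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Binary values of the kind found in a file header; three are above 0x7F
        // and do not form a valid UTF-8 sequence.
        byte[] binary = { (byte) 0xE1, 0x0F, (byte) 0x8E, (byte) 0xBF };
        // Plain ASCII bytes for comparison.
        byte[] ascii = { 0x50, 0x4B, 0x03, 0x04 };

        System.out.println(Arrays.equals(binary, roundTrip(binary))); // false: bytes were altered
        System.out.println(Arrays.equals(ascii, roundTrip(ascii)));   // true: ASCII survives
    }
}
```

The invalid sequences are replaced during decoding, so the original byte values cannot be recovered from the String afterwards.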

The way to deal with this file is to decode the Base64 into its "binary" form, write that data out as a "binary" file, and then use software that understands how to read its headers and interpret its contents. That will be either Word or some API designed specifically to access the innards of Word files.
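As one small illustration of reading the structure instead of treating it as text: the 50 4B ("PK") signature at the start of the header is the ZIP signature, and a .docx file is a ZIP container, so the JDK's java.util.zip can at least list what is inside. This is only a sketch with a hypothetical class and method name. Getting the document text out still needs Word or a Word-aware library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class DocxEntryLister {
    // List the entry names inside decoded .docx bytes, treating them as a ZIP archive.
    static List<String> entryNames(byte[] docxBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxEntryLister.java path/to/decoded.docx
        byte[] docxBytes = Files.readAllBytes(Paths.get(args[0]));
        entryNames(docxBytes).forEach(System.out::println);
    }
}
```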

