Difference between revisions of "MC-Basic Strings"


Revision as of 13:31, 26 May 2014

Types of Strings

8-bit ASCII Strings

Each character is encoded in a single 8-bit byte, which covers the 256 characters of Unicode range U+0000 to U+00FF (the 128 standard ASCII characters plus 128 extended characters). The most significant bit is 0 for the first 128 characters (U+0000 to U+007F) and 1 for the next 128 characters (U+0080 to U+00FF).

Character	 Decimal    Unicode	Binary display
A (capital A)	 65	    U+0041	01000001	
Á (A with acute) 193	    U+00C1	11000001	
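
As an illustration only, the rows of this table can be reproduced with a short Python sketch (Python is used here for demonstration and is not MC-Basic). It assumes the standard `latin-1` codec, which maps U+0000 to U+00FF one-to-one onto byte values; the helper name `describe_8bit` is hypothetical.

```python
def describe_8bit(ch):
    """Return (decimal, Unicode label, binary display) for one character.

    Assumes the character lies in U+0000..U+00FF; anything above that
    range cannot be stored in a single byte and raises UnicodeEncodeError.
    """
    value = ch.encode("latin-1")[0]  # the single byte that holds the character
    return value, "U+%04X" % ord(ch), format(value, "08b")


# "\u00c1" is the precomposed character A with acute
for ch in ("A", "\u00c1"):
    print(ch, *describe_8bit(ch))
```

The leftmost digit of the binary display is the most significant bit: 0 for `A`, 1 for `Á`.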

UTF-8 Strings

UTF-8 encodes each character in one to four 8-bit bytes and covers the full Unicode range U+0000 to U+10FFFF. When two to four bytes are used, the most significant bit of every byte is 1, which prevents confusion with 7-bit ASCII characters.

- One byte is needed to encode the first 128 ASCII characters (Unicode range U+0000 to U+007F). The byte always begins with 0, which makes it compatible with 7-bit ASCII.

Unicode range		Binary display
U+0000 – U+007F  		0zzzzzzz

- Two bytes are needed for Unicode range U+0080 to U+07FF, which includes Latin letters with diacritics and characters from the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets. The first byte always begins with 110, while the second byte always begins with 10. The Unicode value of the character is represented by the remaining 11 bits.

Unicode range		Binary display
U+0080 – U+07FF 		110yyyyy 10zzzzzz

- Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). The first byte always begins with 1110, while the second and third bytes always begin with 10. The Unicode value of the character is represented by the remaining 16 bits.

Unicode range		Binary display
U+0800 – U+FFFF  		1110xxxx 10yyyyyy 10zzzzzz

- Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice. The first byte always begins with 11110, while the next three bytes always begin with 10. The Unicode value of the character is represented by the remaining 21 bits.

Unicode range		Binary display
U+10000 – U+10FFFF		11110www 10xxxxxx 10yyyyyy 10zzzzzz
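
The four bit layouts above can be sketched as a small encoder. This is a minimal Python illustration, not MC-Basic code and not the SoftMC implementation; the function names `utf8_encode` and `to_binary` are hypothetical. It also rejects the surrogate range U+D800 to U+DFFF, which standard UTF-8 does not allow.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as a list of UTF-8 byte values."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:        # 0zzzzzzz
        return [code_point]
    if code_point <= 0x7FF:       # 110yyyyy 10zzzzzz
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point <= 0xFFFF:      # 1110xxxx 10yyyyyy 10zzzzzz
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
    return [0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]


def to_binary(byte_values):
    """Format byte values as space-separated 8-bit binary strings."""
    return " ".join(format(b, "08b") for b in byte_values)


print(to_binary(utf8_encode(0x00C1)))  # A with acute, a two-byte sequence
```

Each branch fills the free bits of its layout from the most significant end of the code point downward, six bits per continuation byte.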

The first byte of a multi-byte sequence starts with as many 1 bits as there are bytes in the sequence, and this prefix is terminated by a 0 bit. Every following byte begins with 10.

First byte		Following bytes
110x xxxx		one byte follows
1110 xxxx		two bytes follow
1111 0xxx		three bytes follow
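
A reader of a UTF-8 stream can use these first-byte prefixes to work out how long each sequence is. The following Python sketch shows one way to do this with bit masks; it is an illustration only, and the function name `utf8_sequence_length` is hypothetical.

```python
def utf8_sequence_length(first_byte):
    """Return the total number of bytes in the sequence that starts with first_byte."""
    if (first_byte & 0x80) == 0x00:   # 0xxx xxxx: single byte, 7-bit ASCII
        return 1
    if (first_byte & 0xE0) == 0xC0:   # 110x xxxx: one byte follows
        return 2
    if (first_byte & 0xF0) == 0xE0:   # 1110 xxxx: two bytes follow
        return 3
    if (first_byte & 0xF8) == 0xF0:   # 1111 0xxx: three bytes follow
        return 4
    # 10xx xxxx is a continuation byte; 1111 1xxx is never used in UTF-8
    raise ValueError("not a valid UTF-8 first byte")


print(utf8_sequence_length(0xC3))  # first byte of the two-byte sequence for U+00C1
```

A byte beginning with 10 raises an error here because it can only appear after a first byte, never at the start of a character.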

Character	     Decimal   Unicode	 Binary display	
A (capital A)	  65	    U+0041	 01000001	
Á (A with acute)	 193	    U+00C1	 11000011 10000001
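
In the last row the Decimal column shows the Unicode value 193, while the two stored bytes are 195 and 129. The Python sketch below (an illustration, not MC-Basic; the function name `utf8_decode_two_bytes` is hypothetical) strips the 110 and 10 prefixes and joins the remaining 11 bits to recover the Unicode value.

```python
def utf8_decode_two_bytes(first, second):
    """Recover the code point from a two-byte UTF-8 sequence (110yyyyy 10zzzzzz)."""
    if (first & 0xE0) != 0xC0:
        raise ValueError("first byte does not begin with 110")
    if (second & 0xC0) != 0x80:
        raise ValueError("second byte does not begin with 10")
    # five payload bits from the first byte, six from the second
    code_point = ((first & 0x1F) << 6) | (second & 0x3F)
    if code_point < 0x80:
        # values below U+0080 must use the one-byte form
        raise ValueError("overlong encoding")
    return code_point


value = utf8_decode_two_bytes(0b11000011, 0b10000001)
print(value, "U+%04X" % value)
```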