Difference between revisions of "MC-Basic Strings"


Revision as of 13:38, 26 May 2014

Types of Strings

8-bit ASCII Strings

Only a single 8-bit byte is used to encode the 256 ASCII characters within Unicode range U+0000 to U+00FF. The most significant bit is 0 for the first 128 characters (U+0000 to U+007F range), and 1 for the next 128 characters, ranging between U+0080 and U+00FF.

Character	 Decimal    Unicode	Binary display
A (capital A)	 65	    U+0041	01000001	
Á (A with acute) 193	    U+00C1	11000001	
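The table rows above can be reproduced with a short Python sketch (Python is used only for illustration here; it is not MC-Basic). It assumes that the "8-bit ASCII" encoding described in this section matches what Python calls `latin-1`; the helper name `eight_bit_row` is hypothetical.

```python
# Assumption: the document's "8-bit ASCII" corresponds to Latin-1 (ISO 8859-1).
def eight_bit_row(ch):
    """Return (decimal, Unicode label, binary display) for a one-byte character."""
    raw = ch.encode("latin-1")   # exactly one byte for U+0000..U+00FF
    value = raw[0]
    return value, f"U+{ord(ch):04X}", f"{value:08b}"

for ch in ("A", "\u00c1"):       # "\u00c1" is A with acute
    print(ch, *eight_bit_row(ch))
```

Characters above U+00FF cannot be encoded this way; in Python the `encode` call raises `UnicodeEncodeError` for them.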

UTF-8 Strings

UTF-8 encodes each character in one to four 8-bit bytes and covers the full Unicode range, U+0000 to U+10FFFF. When two to four bytes are used, the most significant bit of each of these bytes is always 1, to prevent confusion with 7-bit ASCII characters.
One byte is needed to encode the first 128 ASCII characters (Unicode range U+0000 to U+007F). This byte always begins with 0 and is thus compatible with 7-bit ASCII.

Unicode range		Binary display
U+0000 – U+007F  	0zzzzzzz


Two bytes are needed for Unicode range U+0080 to U+07FF, which includes Latin letters with diacritics and characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets.
The first byte always begins with 110, while the second byte always begins with 10. The Unicode value of the character is represented by the remaining 11 bits.

Unicode range		Binary display
U+0080 – U+07FF 	110yyyyy 10zzzzzz


Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). The first byte always begins with 1110, while the second and third bytes always begin with 10. The Unicode value of the character is represented by the remaining 16 bits.

Unicode range		Binary display
U+0800 – U+FFFF  	1110xxxx 10yyyyyy 10zzzzzz


Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice. The first byte always begins with 11110, while the next three bytes always begin with 10. The Unicode value of the character is represented by the remaining 21 bits.

Unicode range	        Binary display
U+10000 - U+10FFFF      11110www 10xxxxxx 10yyyyyy 10zzzzzz


In a multi-byte sequence, the first byte always starts with 1, followed by one further 1 for each byte that follows. This signaling prefix is terminated by a 0.
110x xxxx -> one byte follows
1110 xxxx -> two bytes follow
1111 0xxx -> three bytes follow

Character	 Decimal    Unicode	 Binary display	
A (capital A)	 65	    U+0041	 01000001	
Á (A with acute) 193	    U+00C1	 11000011 10000001
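The bit layouts above can be sketched in Python (for illustration only; this is not MC-Basic code). The function names `utf8_encode` and `sequence_length` are hypothetical, and the sketch deliberately ignores the surrogate range U+D800 to U+DFFF, which a strict encoder would reject.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 bytes using the layouts above."""
    if code_point < 0x80:                       # 0zzzzzzz
        return bytes([code_point])
    if code_point < 0x800:                      # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the Unicode range")


def sequence_length(first_byte):
    """Return the total number of bytes in a sequence, judged by its first byte."""
    if (first_byte & 0x80) == 0x00:
        return 1
    if (first_byte & 0xE0) == 0xC0:
        return 2
    if (first_byte & 0xF0) == 0xE0:
        return 3
    if (first_byte & 0xF8) == 0xF0:
        return 4
    raise ValueError("not a valid first byte (continuation byte or invalid)")


print(" ".join(f"{b:08b}" for b in utf8_encode(0x00C1)))  # A with acute
```

For code points outside the surrogate range, the result can be cross-checked against Python's built-in `chr(code_point).encode("utf-8")`.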



Using Strings

Compatibility

UTF-8 is incompatible with 8-bit ASCII in the U+0080 to U+00FF Unicode range, since the characters within this range are now encoded by two bytes instead of one.

Different encoding of the ÷ (division sign) character, Unicode value U+00F7 (ASCII number 247):
Encoding method    Hexadecimal display    Binary display
8-bit ASCII        0xF7                   11110111
UTF-8              0xC3 0xB7              11000011 10110111
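The two encodings of the division sign can be checked with a small Python sketch (illustration only, not MC-Basic). It assumes that the 8-bit encoding corresponds to Python's `latin-1` codec; the helper `show` is a hypothetical name.

```python
def show(raw):
    """Return the hexadecimal and binary display of a byte string."""
    hex_display = " ".join(f"0x{b:02X}" for b in raw)
    bin_display = " ".join(f"{b:08b}" for b in raw)
    return hex_display, bin_display

division_sign = "\u00f7"
eight_bit = division_sign.encode("latin-1")   # assumed equivalent of 8-bit ASCII
utf8 = division_sign.encode("utf-8")

print("8-bit ASCII", *show(eight_bit))
print("UTF-8      ", *show(utf8))
```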

This incompatibility requires separation of string variables into two distinct types, i.e., 8-bit ASCII strings and UTF-8 strings. Distinction of string types already takes place at the translation phase, through different declaration statements.

Declaration of ASCII-8 strings:
{Common Shared | Dim Shared | Dim} <name>{[size]…} As String 
Declaration of UTF-8 strings:
{Common Shared | Dim Shared | Dim} <name>{[size]…} As String Of UTF8
Declaration of ASCII-8 and UTF-8 fields:
Type <struct_name>
<name>{[size]…} As String
<name>{[size]…} As String Of UTF8
End Type

Conversion through Assignment

During assignment, the accepting string assumes the type of the input string if the input string is of a “higher” type. However, the content is copied as is and is not converted to the higher type.

Common shared ASCIIStr as String
Dim shared UTFStr as String of UTF8
ASCIIStr = UTFStr  /* ASCIIStr becomes UTF-8 (content of UTFStr is copied as is, without implicit conversion to ASCII-8) */
UTFStr = ASCIIStr  /* UTFStr stays UTF-8 (content of ASCIIStr is copied as is, without implicit conversion to UTF-8) */

Background: A given string can already be a UTF-8 string. During some manipulations the programmer may pass it through an existing function in which the string variable is defined “as string”. If the type information were lost at that point, further manipulations could apply an implicit conversion to UTF-8 (see 3.2), thus re-encoding a string that is already encoded as UTF-8.

As demonstrated in the following example, the type of a variable might already change within its declaration, through assignment:

Common shared UTFStr as String of UTF8
Dim shared ASCIIStr as String = UTFStr
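The problem that type propagation avoids, converting a string to UTF-8 when it is already UTF-8, can be illustrated with a Python sketch (not MC-Basic). It assumes that the 8-bit encoding behaves like Python's `latin-1` codec; `to_utf8_from_8bit` is a hypothetical helper standing in for an implicit conversion applied after the type information was lost.

```python
def to_utf8_from_8bit(raw):
    """Re-encode bytes that are assumed to be 8-bit (Latin-1) as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")

original = "\u00f7".encode("utf-8")      # division sign, already UTF-8: C3 B7
twice = to_utf8_from_8bit(original)      # wrongly treated as 8-bit and converted again

print(original.hex(" "))   # c3 b7
print(twice.hex(" "))      # c3 83 c2 b7
```

Each of the two original bytes is treated as a separate 8-bit character and expanded to two bytes, so the string no longer decodes to the division sign.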

Conversion to UTF-8 during String Concatenation and Comparison

String concatenation and comparison can be performed only between strings of identical type. Therefore, ASCII-8 strings that are concatenated with or compared to UTF-8 strings are implicitly converted to the “higher” encoding method, i.e., UTF-8. The implicit conversion is performed at run time by creating a temporary copy of the ASCII-8 string, encoded according to UTF-8 rules, whereas the sources (input strings) stay untouched, which avoids changes to constant input strings. As a result, when a UTF-8 string and an ASCII-8 string are concatenated, the resulting string is also of UTF-8 type.

Common shared ASCIIStr as String
Dim shared UTFStr as String of UTF8
/* Concatenation */
? ASCIIStr + UTFStr  /* ASCIIStr converted to UTF-8. Result is UTF-8.
                        ASCIIStr source string is still ASCII-8 */
/* Comparison */
? UTFStr >= ASCIIStr /* ASCIIStr converted to UTF-8. ASCIIStr source is not touched */
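The rules above can be modeled with a minimal Python sketch (not MC-Basic and not the interpreter's actual implementation). The `TypedStr` class and the helpers `as_utf8`, `concat` and `greater_equal` are hypothetical; the sketch assumes that the 8-bit encoding behaves like `latin-1` and that comparison is done byte by byte, neither of which is stated in the text.

```python
from dataclasses import dataclass

ASCII8, UTF8 = "ASCII-8", "UTF-8"

@dataclass(frozen=True)
class TypedStr:
    kind: str    # ASCII8 or UTF8
    raw: bytes   # encoded content

def as_utf8(s):
    """Return a temporary UTF-8 copy; the source object is never modified."""
    if s.kind == UTF8:
        return s
    return TypedStr(UTF8, s.raw.decode("latin-1").encode("utf-8"))

def concat(a, b):
    """Concatenate; mixed types are promoted to UTF-8 first."""
    if a.kind == b.kind:
        return TypedStr(a.kind, a.raw + b.raw)
    return TypedStr(UTF8, as_utf8(a).raw + as_utf8(b).raw)

def greater_equal(a, b):
    """Compare; mixed types are promoted to UTF-8 first (byte-wise, by assumption)."""
    if a.kind != b.kind:
        a, b = as_utf8(a), as_utf8(b)
    return a.raw >= b.raw

ascii_str = TypedStr(ASCII8, b"\xf7")                   # division sign, 8-bit
utf_str = TypedStr(UTF8, "\u00f7".encode("utf-8"))      # division sign, UTF-8

result = concat(ascii_str, utf_str)
print(result.kind, result.raw.hex(" "))
print(ascii_str.kind, ascii_str.raw.hex(" "))           # source is unchanged
```

Making `TypedStr` immutable mirrors the statement that input strings stay untouched: the promotion always produces a new object.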

String Values

String values, delimited by double quotes (" "), are type-less, i.e., they are neither of ASCII-8 type nor of UTF-8 type. Therefore, assigning a string value to a variable does not affect the variable’s type. Likewise, in concatenation with a “typed” string, the type of the resulting string is determined by the “typed” string. The code of a type-less string is also implicitly converted to the UTF-8 form during concatenation and comparison with a UTF-8 string.

Common shared ASCIIStr as String
Common shared UTFStr as String of UTF8
ASCIIStr = "…"   /* Assignment does not affect ASCIIStr type */
UTFStr = "…"     /* Assignment does not affect UTFStr type */
? "…" + UTFStr   /* String value is converted to UTF-8. Result is UTF-8 */
? ASCIIStr + "…" /* Result is ASCII-8 */
? UTFStr >= "…"  /* String value is converted to UTF-8 */
? "…" < ASCIIStr /* String value is handled as ASCII-8 */

String Constant Variables

Distinction of string types is also applied for constants, through different declaration statements.

As demonstrated in the following example, type of constant might already change within declaration, through assignment:

Common shared ASCIIStr as Const String = "…"
Common shared UTFStrConst as Const String of UTF8 = ASCIIStr

The type of a constant string, as determined within its declaration, is fixed, since constants cannot be assigned after declaration.

Parameters and Returned Values

Prototypes of subroutines and functions are always written according to the ASCII-8 syntax, but are able to accept both string types.

Common shared ASCIIStr as String
Dim shared UTFStr as String of UTF8
Sub StrSub (StrPar1 as String)
End Sub
Call StrSub ( UTFStr )
Call StrSub ( ASCIIStr )
Function StrFunc1 (ByVal StrPar As String) As String
  StrFunc1 = StrPar
End Function
Print StrFunc1( ASCIIStr )   /* Returns an ASCII-8 string */
Print StrFunc1( UTFStr )     /* Returns a UTF-8 string */
Function StrFunc2 As String
  /* Returned value not assigned */
End Function
Print StrFunc2               /* Returns a No-Type (type-less) string */

Multiple Null Characters in String Code

In both string types, an unlimited number of null characters (U+0000) can be inserted into the string’s code without cutting it off.
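What it means for null characters not to cut a string off can be illustrated with a short Python sketch (not MC-Basic). Python byte strings carry an explicit length, so embedded nulls are ordinary content; the helper `null_survives` is a hypothetical name, and `latin-1` is assumed as the 8-bit encoding.

```python
text = "ab\u0000cd\u0000"        # two embedded null characters (U+0000)

def null_survives(codec):
    """Return (encoded length, number of null bytes) for the sample text."""
    raw = text.encode(codec)
    return len(raw), raw.count(b"\x00")

for codec in ("latin-1", "utf-8"):
    print(codec, null_survives(codec))
```

In both encodings U+0000 is a single zero byte, so the full six bytes are kept rather than being truncated at the first null, as would happen with a null-terminated representation.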