MC-Basic Strings
Revision as of 13:07, 27 May 2014
Types of Strings
8-bit ASCII Strings
Each character is encoded in a single 8-bit byte, which covers the 256 characters in Unicode range U+0000 to U+00FF. The most significant bit is 0 for the first 128 characters (U+0000 to U+007F) and 1 for the next 128 characters (U+0080 to U+00FF).
Character          Decimal   Unicode   Binary display
---------          -------   -------   --------------
A (capital A)      65        U+0041    01000001
Á (A with acute)   193       U+00C1    11000001
UTF-8 Strings
UTF-8 encodes each character in one to four 8-bit bytes and covers the whole Unicode range, U+0000 to U+10FFFF. When two to four bytes are used, the most significant bit of each of these bytes is always 1, to prevent confusion with 7-bit ASCII characters.
One byte is needed to encode the first 128 ASCII characters (Unicode range U+0000 to U+007F). This byte always begins with 0, which makes it compatible with 7-bit ASCII.
Unicode range      Binary display
-------------      --------------
U+0000 – U+007F    0zzzzzzz          /* z stands for 0 or 1 */
Two bytes are needed for Unicode range U+0080 to U+07FF, which includes Latin letters with diacritics and characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets.
The first byte always begins with 110, and the second byte always begins with 10. The Unicode value of the character is represented by the remaining 11 bits.
Unicode range      Binary display
-------------      --------------
U+0080 – U+07FF    110yyyyy 10zzzzzz          /* y and z stand for 0 or 1 */
Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use).
The first byte always begins with 1110, and the second and third bytes always begin with 10. The Unicode value of the character is represented by the remaining 16 bits.
Unicode range      Binary display
-------------      --------------
U+0800 – U+FFFF    1110xxxx 10yyyyyy 10zzzzzz          /* x, y and z stand for 0 or 1 */
Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice.
The first byte always begins with 11110, and the next three bytes always begin with 10. The Unicode value of the character is represented by the remaining 21 bits.
Unicode range        Binary display
-------------        --------------
U+10000 – U+10FFFF   11110www 10xxxxxx 10yyyyyy 10zzzzzz          /* w, x, y and z stand for 0 or 1 */
In a multi-byte sequence, the first byte starts with a run of 1 bits whose length equals the total number of bytes in the sequence. This run is terminated by a 0 bit.
110x xxxx -> one byte follows
1110 xxxx -> two bytes follow
1111 0xxx -> three bytes follow
Character          Decimal   Unicode   Binary display
---------          -------   -------   --------------
A (capital A)      65        U+0041    01000001
Á (A with acute)   193       U+00C1    11000011 10000001
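The bit layouts above can be checked with a short sketch. It is written in Python rather than MC-Basic, purely as an illustration of the encoding rules; the function names `utf8_encode` and `as_bits` are hypothetical and are not part of MC-Basic.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the bit layouts in the tables above.

    Illustrative sketch only: it does not reject surrogate values
    (U+D800 to U+DFFF), which a strict UTF-8 encoder would refuse.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if code_point <= 0x7F:
        # 0zzzzzzz
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyyy 10zzzzzz
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


def as_bits(data: bytes) -> str:
    """Format bytes as space-separated 8-bit binary groups."""
    return " ".join(f"{b:08b}" for b in data)


print(as_bits(utf8_encode(0x41)))  # 01000001
print(as_bits(utf8_encode(0xC1)))  # 11000011 10000001
```

The two printed lines reproduce the binary display of A and Á in the table above.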
Using Strings
Compatibility
UTF-8 is incompatible with 8-bit ASCII in the U+0080 to U+00FF Unicode range, since the characters within this range are now encoded by two bytes instead of one.
Different encoding of the ÷ (division sign) character, Unicode value U+00F7 (ASCII number 247):
Encoding method   Hexadecimal display   Binary display
8-bit ASCII       0xF7                  11110111
UTF-8             0xC3 0xB7             11000011 10110111
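The same comparison can be reproduced outside the controller. The sketch below uses Python, with the `latin-1` codec as a stand-in for 8-bit ASCII. This is an assumption based on the one-byte mapping of U+0000 to U+00FF described above, not a statement about how MC-Basic is implemented.

```python
division_sign = "\u00F7"  # ÷

# Assumption: Python's "latin-1" codec models the document's 8-bit ASCII.
ascii8_bytes = division_sign.encode("latin-1")
utf8_bytes = division_sign.encode("utf-8")

print(ascii8_bytes.hex(" "))  # f7
print(utf8_bytes.hex(" "))    # c3 b7

# Reading UTF-8 bytes as if they were 8-bit ASCII yields two wrong characters.
misread = utf8_bytes.decode("latin-1")
print(len(misread))  # 2

# Reading the single 8-bit ASCII byte as UTF-8 fails: 0xF7 is not valid UTF-8.
try:
    ascii8_bytes.decode("utf-8")
    ascii8_is_valid_utf8 = True
except UnicodeDecodeError:
    ascii8_is_valid_utf8 = False
print(ascii8_is_valid_utf8)  # False
```

Neither byte sequence can be read correctly with the other encoding's rules, which is why the two encodings cannot share one string type.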
This incompatibility requires string variables to be separated into two distinct types: 8-bit ASCII strings and UTF-8 strings. The string type is already determined at the translation phase, through different declaration statements.
Declaration of ASCII-8 strings:
{Common Shared | Dim Shared | Dim} <name>{[size]…} As String

Declaration of UTF-8 strings:
{Common Shared | Dim Shared | Dim} <name>{[size]…} As String Of UTF8

Declaration of ASCII-8 and UTF-8 fields:
Type <struct_name>
  <name>{[size]…} As String
  <name>{[size]…} As String Of UTF8
End Type
Conversion through Assignment
During assignment, the target string assumes the type of the input string if the input string is of a “higher” type. The content of the assigned string is copied as is, however, and is not converted to the higher type.
Common shared ASCIIStr as String
Dim shared UTFStr as String of UTF8
ASCIIStr = UTFStr → ASCIIStr becomes UTF-8 (content of UTFStr is copied as is, without implicit conversion to ASCII-8).
UTFStr = ASCIIStr → UTFStr stays UTF-8 (content of ASCIIStr is copied as is, without implicit conversion to UTF-8).
As the following example shows, the type of a variable may change as early as its declaration, through an initializing assignment:
Common shared UTFStr as String of UTF8
Dim shared ASCIIStr as String = UTFStr
Conversion to UTF-8 during String Concatenation and Comparison
String concatenation and comparison can be performed only between strings of identical type. Therefore, ASCII-8 strings that are concatenated with or compared to UTF-8 strings are implicitly converted to the “higher” encoding method, i.e., UTF-8. The implicit conversion is performed at run time by creating a temporary copy of the ASCII-8 string, encoded according to UTF-8 rules. The source (input) strings stay untouched, which avoids changes to constant input strings. As a result, concatenating a UTF-8 string and an ASCII-8 string yields a UTF-8 string.
Common shared ASCIIStr as String
Dim shared UTFStr as String of UTF8

Concatenation:
-------------
? ASCIIStr + UTFStr /* ASCIIStr converted to UTF-8. Result is UTF-8. ASCIIStr source string is still ASCII-8 */

Comparison:
----------
? UTFStr >= ASCIIStr /* ASCIIStr converted to UTF-8. ASCIIStr source is not touched */
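The following Python sketch models the implicit conversion described above. It is not MC-Basic and does not claim to mirror the controller's internals. The helper names `to_utf8_copy` and `concat` are hypothetical, each string is modeled as raw bytes plus a flag for its type, and `latin-1` is assumed as the stand-in for ASCII-8.

```python
def to_utf8_copy(ascii8_bytes: bytes) -> bytes:
    """Return a temporary UTF-8 encoded copy of ASCII-8 bytes (the input is not modified)."""
    return ascii8_bytes.decode("latin-1").encode("utf-8")


def concat(left: bytes, left_is_utf8: bool, right: bytes, right_is_utf8: bool):
    """Concatenate two typed strings; the result is UTF-8 if either operand is UTF-8."""
    result_is_utf8 = left_is_utf8 or right_is_utf8
    if result_is_utf8:
        # Only the ASCII-8 operand is converted, and only as a temporary copy.
        if not left_is_utf8:
            left = to_utf8_copy(left)
        if not right_is_utf8:
            right = to_utf8_copy(right)
    return left + right, result_is_utf8


ascii_str = b"\xC4"                   # Ä as one ASCII-8 byte
utf_str = "\u00C4".encode("utf-8")    # Ä as two UTF-8 bytes (0xC3 0x84)

result, is_utf8 = concat(ascii_str, False, utf_str, True)
print(result.hex(" "), is_utf8)  # c3 84 c3 84 True
print(ascii_str.hex(" "))        # c4 (source string is unchanged)
```

In this model the ASCII-8 operand contributes two bytes to the result while the original one-byte string stays as it was.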
String Values
String values, delimited by double quotes (“ ”), are type-less, i.e., they are neither of ASCII-8 type nor of UTF-8 type. Therefore, assigning a string value to a variable does not affect the variable’s type. Likewise, in concatenation with a “typed” string, the type of the resulting string is determined by the “typed” string. The code of a type-less string is also implicitly converted to UTF-8 during concatenation and comparison with a UTF-8 string.
Common shared ASCIIStr as String
Common shared UTFStr as String of UTF8
ASCIIStr = “…” /* Assignment does not affect ASCIIStr type */
UTFStr = “…” /* Assignment does not affect UTFStr type */
?“…” + UTFStr /* String value is converted to UTF-8. Result is UTF-8 */
?ASCIIStr + “…” /* Result is ASCII-8 */
?UTFStr >= “…” /* String value is converted to UTF-8 */
?“…” < ASCIIStr /* String value is handled as ASCII-8 */
String Constant Variables
The distinction between string types also applies to constants, through different declaration statements.
As the following example shows, the type of a constant may change as early as its declaration, through the initializing assignment:
Common shared ASCIIStr as Const String = “…”
Common shared UTFStrConst as Const String of UTF8 = ASCIIStr
The type of a constant string is fixed at declaration, since constants cannot be assigned after declaration.
Parameters and Returned Values
Prototypes of subroutines and functions are always written according to the ASCII-8 syntax, but are able to accept both string types.
Common shared ASCIIStr as String
Dim shared UTFStr as String of UTF8

Sub StrSub (StrPar1 as String)
End Sub

Call StrSub ( UTFStr )
Call StrSub ( ASCIIStr )

Function StrFunc1 (ByVal StrPar As String) As String
  StrFunc1 = StrPar /* Returned value is assigned */
End Function

Print StrFunc1( ASCIIStr ) → Returns an ASCII-8 string
Print StrFunc1( UTFStr ) → Returns a UTF-8 string

Function StrFunc2 As String
  /* Returned value is NOT assigned */
End Function

Print StrFunc2 → Returns a No-Type string
Multiple Null Characters in String Code
In both string types, an unlimited number of null characters (U+0000) can be inserted into the string’s code without cutting it off.
String Manipulating Functions
Function | ASCII-8 | UTF-8
CHR$(<char_value>): Returns a string corresponding to a given character value. | CHR$ returns an ASCII-8 string. <char_value> must be within ASCII-8 range. | UTF$ returns a UTF-8 string.
STRING$(<long>,<string>), STRING$(<long>,<char_value>): Returns a string with the specified number of characters. | STRING$ returns an ASCII-8 string. Input <string> must be of ASCII-8 type. <char_value> must be within ASCII-8 range. | UTFSTRING$ returns a UTF-8 string. Input <string> must be of UTF-8 type.
STR$(<long>): Returns the string representation of a number in decimal format. | STR$ returns an ASCII-8 string. | N/A
HEX$(<long>): Returns the string representation of a number in hexadecimal format. | HEX$ returns an ASCII-8 string. | N/A
BIN$(<long>): Returns the string representation of a number in binary format. | BIN$ returns an ASCII-8 string. | N/A
SPACE$(<long>): Returns a string consisting of the specified number of blank spaces. | SPACE$ returns an ASCII-8 string. | N/A
MID$(<string>,<long1>,<long2>): Returns a string consisting of <long2> characters from the string, starting at the character at position <long1>. | MID$ returns an ASCII-8 string if input <string> is of ASCII-8 type. | MID$ returns a UTF-8 string if input <string> is of UTF-8 type.
LEFT$(<string>,<long>): Returns the specified number of characters from the left-hand side of the string. | LEFT$ returns an ASCII-8 string if input <string> is of ASCII-8 type. | LEFT$ returns a UTF-8 string if input <string> is of UTF-8 type.
RIGHT$(<string>,<long>): Returns the specified number of characters from the right-hand side of the string. | RIGHT$ returns an ASCII-8 string if input <string> is of ASCII-8 type. | RIGHT$ returns a UTF-8 string if input <string> is of UTF-8 type.
LTRIM$(<string>,<long>): Returns the right-hand part of a string. | LTRIM$ returns an ASCII-8 string if input <string> is of ASCII-8 type. | LTRIM$ returns a UTF-8 string if input <string> is of UTF-8 type.
RTRIM$(<string>,<long>): Returns the left-hand part of a string. | RTRIM$ returns an ASCII-8 string if input <string> is of ASCII-8 type. | RTRIM$ returns a UTF-8 string if input <string> is of UTF-8 type.
UCASE$(<string>): Returns a string with all the lowercase letters of basic Latin converted to uppercase. | UCASE$ returns an ASCII-8 string if input <string> is of ASCII-8 type. | UCASE$ returns a UTF-8 string if input <string> is of UTF-8 type.
LCASE$(<string>): Returns a string with all the uppercase letters of basic Latin converted to lowercase. | LCASE$ returns an ASCII-8 string if input <string> is of ASCII-8 type. | LCASE$ returns a UTF-8 string if input <string> is of UTF-8 type.
ASC(<string>, {<long>}): Returns an ASCII character value from within a string. | ASC returns the ASCII-8 value if input <string> is of ASCII-8 type. | ASC returns the Unicode value if input <string> is of UTF-8 type.
VAL(<string>): Returns the numeric value of the input string. | VAL returns the numeric value of an ASCII-8 <string>. | VAL returns the numeric value of a UTF-8 <string>.
INSTR({<long>},<string1>,<string2>): Returns the position of <string2> in <string1>. Note: If one of the input strings is of UTF-8 type, a temporary copy of the code of the other input string (ASCII-8 or no-type) is implicitly converted to the UTF-8 coding method. | INSTR returns the position of an ASCII-8 <string2> in an ASCII-8 <string1>. | INSTR returns the position of a UTF-8 <string2> in a UTF-8 <string1>.
LEN(<string>): Returns the number of characters in the string. | LEN returns the number of characters in an ASCII-8 <string>. | LEN returns the number of symbols in a UTF-8 <string>.
SIZE(<string>): Returns the number of allocated bytes in the string. | SIZE returns the number of bytes in an ASCII-8 <string>. | SIZE returns the number of bytes in a UTF-8 <string>.
TYPEOF(<string>): Returns a number representing the type of the string (No-Type, ASCII-8 or UTF-8). | TYPEOF returns ASCII-8 type. | TYPEOF returns UTF-8 type.
TOUTF8$(<string>): Returns a temporary copy of the input string, converted to the UTF-8 coding method. | TOUTF8$ converts a copy of the ASCII-8 input code into a UTF-8 code, and returns a UTF-8 string. | TOUTF8$ does not change the UTF-8 input code, and just returns a copy of the input UTF-8 string.
TOASCII8$(<string>): Returns a temporary copy of the input string, converted to the ASCII-8 coding method. Any symbol located outside the ASCII-8 range (Unicode value higher than 0xFF) is replaced by a question mark (?). | TOASCII8$ does not change the ASCII-8 input code, and just returns a copy of the input ASCII-8 string. | TOASCII8$ converts a copy of the UTF-8 input code into an ASCII-8 code, and returns an ASCII-8 string.
Common shared ASCIIStr as String
Dim shared UTFStr as String of UTF8
Dim UTFSubStr as String of UTF8
Dim ASCIISubStr as String

/* Number of symbols (characters) is equal to number of bytes */
? LEN(ASCIIStr) = SIZE(ASCIIStr) → 1

/* Number of symbols might differ from number of bytes if UTFStr contains symbols higher than 0x7F (at least two bytes per symbol) */
? LEN(UTFStr) = SIZE(UTFStr) → 0

/* String values are handled as ASCII-8 within INSTR function */
? INSTR(ASCIIStr, “…”)
? INSTR( “…”, ASCIISubStr)

/* Code of ASCII-8 strings is implicitly converted to UTF-8 */
? INSTR(UTFStr, ASCIISubStr)
? INSTR(ASCIIStr, UTFSubStr)

/* Code of type-less string values is implicitly converted to UTF-8 */
? INSTR(UTFStr, “…”)
? INSTR(“…”, UTFSubStr)

ASCIIStr = CHR$(0xC4) /* Ä */
UTFStr = UTF$(0xC4) /* Ä */

/* A single byte (0xC4) in ASCII-8. Two bytes (0xC3 0x84) in UTF-8 */
? UTFStr = ASCIIStr → 0

/* The “No-type” string value is handled as ASCII-8 */
? UTFStr = “Ä” → 0
? ASCIIStr = “Ä” → 1

/* Explicit conversion to UTF-8 */
? UTFStr = TOUTF8$(ASCIIStr) → 1
? UTFStr = TOUTF8$(“Ä”) → 1

/* Explicit conversion to ASCII-8 */
? TOASCII8$(UTFStr) = ASCIIStr → 1
? TOASCII8$(UTFStr) = “Ä” → 1
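Two of the behaviors above can be illustrated with a short Python sketch: the difference between symbol count and byte count, and the replacement of out-of-range symbols by a question mark. This is an analogy, not MC-Basic code. It assumes that Python's `latin-1` codec with `errors="replace"` behaves like the TOASCII8$ rule described in the table, and the sample text is made up for the illustration.

```python
text = "\u00C4\u20AC"  # Ä (U+00C4, inside ASCII-8 range) and € (U+20AC, outside it)

utf8_bytes = text.encode("utf-8")

# Analogous to LEN versus SIZE on a UTF-8 string.
symbol_count = len(text)       # number of symbols
byte_count = len(utf8_bytes)   # Ä takes two bytes, € takes three
print(symbol_count, byte_count)  # 2 5

# Analogous to TOASCII8$: symbols above 0xFF are replaced by "?".
ascii8_bytes = text.encode("latin-1", errors="replace")
print(ascii8_bytes)  # b'\xc4?'
```

After the conversion the symbol count and the byte count are equal again, at the cost of losing the symbol that does not fit into one byte.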
Operation | ASCII-8 | UTF-8 | No Type
Assignment | Converts target’s type to ASCII-8. | Converts target’s type to UTF-8. | Does not change target’s type.
Binary operators (concatenation, comparison) | With ASCII-8 → returns ASCII-8. With UTF-8 → returns UTF-8. With No-Type → returns ASCII-8. | With ASCII-8 → returns UTF-8. With UTF-8 → returns UTF-8. With No-Type → returns UTF-8. | With ASCII-8 → returns ASCII-8. With UTF-8 → returns UTF-8. With No-Type → returns No-Type.
String Manipulating functions | Code of input string is analyzed as ASCII-8. | Code of input string is analyzed as UTF-8. | Code of input string is analyzed as default type ASCII-8.
Printing Strings
Print, PrintUsing
Prints strings to the user-interface output window. Strings may be of ASCII-8 type or UTF-8 type, and a single print statement may mix strings of both types. The user interface should be able to identify the types of the printed strings and display them correctly.
Print #<DeviceHandle>, PrintUsing #<DeviceHandle>
Prints strings to a serial port. Strings may be of ASCII-8 type or UTF-8 type, and a single print statement may mix strings of both types. The target device should be able to identify the types of the printed strings and encode their characters correctly.
Print #1, CHR$(0x80) → sends one byte to device
Print #1, UTF$(0x80) → sends two bytes to device
PrintToBuff #<DeviceHandle>, PrintUsingToBuff #<DeviceHandle>
Prints strings to a buffer. Strings may be of ASCII-8 type or UTF-8 type, and a single print statement may mix strings of both types. The target device should be able to identify the types of the printed strings and encode their characters correctly.
PrintUsing$
Prints strings into a string-type variable. Strings may be of ASCII-8 type, UTF-8 type or No-Type, and a single print statement may mix strings of all types. The types of the input strings do not affect the type of the target string.
Reading Strings from Files
Like string values, the value returned by Input$ is type-less (see section 2.4), i.e., Input$ returns a string that is neither of ASCII-8 type nor of UTF-8 type. Therefore, assigning the string returned by Input$ does not affect the type of the target variable. As a result, it is the target’s type that determines the type of the data read from a file, socket, or other source of data. On the other hand, strings returned by Input$ are handled as ASCII-8 by string manipulating functions (see section 3). Therefore, to ensure that the data is handled as UTF-8 code, the value returned by Input$ should first be assigned to a UTF-8 variable:
Common shared UTFStr as String of UTF8

? UTF$(0x80) + Input$(10, #1) /* Returned string of Input$ is handled as ASCII-8, and its code is implicitly converted to UTF-8 */

In case file data should be handled as UTF-8:
--------------------------------------------
/* Input$ should be assigned first into a UTF-8 variable */
UTFStr = Input$(10,#1) /* Assignment does not affect UTFStr type */
? UTF$(0x80) + UTFStr /* Returned string of Input$ is handled as UTF-8, without implicit conversion to UTF-8 */
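The practical consequence of the two paths above can be sketched in Python. This is a model of the described behavior, not MC-Basic code. It assumes the file holds UTF-8 data, uses `latin-1` as the stand-in for ASCII-8, and the sample content is made up for the illustration.

```python
# Assumption: the file contains the UTF-8 encoding of Ä (bytes 0xC3 0x84).
file_bytes = "\u00C4".encode("utf-8")
prefix = chr(0x80).encode("utf-8")  # models UTF$(0x80): bytes 0xC2 0x80

# Path 1: the data is handled as ASCII-8 and implicitly converted to UTF-8.
# Each byte is treated as its own character and encoded again.
handled_as_ascii8 = file_bytes.decode("latin-1").encode("utf-8")
print((prefix + handled_as_ascii8).hex(" "))  # c2 80 c3 83 c2 84

# Path 2: the data sits in a UTF-8 variable, so the bytes are used as they are.
handled_as_utf8 = file_bytes
print((prefix + handled_as_utf8).hex(" "))    # c2 80 c3 84
```

In the first path the two bytes of Ä are encoded a second time and become four bytes, so the result no longer represents the character that was stored in the file.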
User Error Messages
The string messages of user errors and notes may be of either string type. The string type of a message is set when the user error or note is declared, and cannot be changed afterwards.
Declaration of user errors and notes with ASCII-8 messages:
----------------------------------------------------------
Common Shared | Dim Shared <name> As Error “<message>” {<number>}
Common Shared | Dim Shared <name> As Note “<message>” {<number>}

Declaration of user errors and notes with UTF-8 messages:
--------------------------------------------------------
Common Shared | Dim Shared <name> As Error “<message>” of UTF8 {<number>}
Common Shared | Dim Shared <name> As Note “<message>” of UTF8 {<number>}
Therefore, the type of the string message returned by the MSG property corresponds to its declaration type.
Common shared Err1 as Error “Error” 20001
Common shared Err2 as Error “Error” of UTF8 20002
? Err1.Msg → Returns an ASCII-8 string
? Err2.Msg → Returns a UTF-8 string
ASCII-8 Only Strings
All strings in the system that are not directly involved in handling string data types accept and/or return only ASCII-8 strings.
String-type element properties:
<axis>.ACTIVECAM
<axis>.FIRSTCAM.NAME
<axis>.ALTERNATIVECAM.NAME
<axis | group>.ATTACHEDTO
<group>.MASTERFRAMENAME
<axis | group | conveyer>.ELEMENTNAME
<conveyer>.MOVINGFRAMENAME
<cam>.NEXT.NAME
<cam>.PREVIOUS.NAME
String-type system properties:
System.DATE
System.TIME
System.NAME
System.SERIALNUMBER
System.USERAUTHORIZATIONCODE
System.IPADDRESSMASK
System.CPUTYPE
System.ERROR
VERSION
String-type task properties:
<task>.STATUS
{<task>}.MAINFILENAME
{<task>}.ERROR
PROGRAMNAME
SCOPE
System functions accepting and/or returning strings:
<long> = PING (<string>)
<long> = TASKSTATE (<string>)
<string> = TASKERROR (<string>)
<string> = VESEXECUTE (<string>)
<string> = VESMESSAGE