How to convert VBA/VB6 Unicode strings to UTF-8
VBA/VB6 stores its strings internally in what Microsoft documentation used to call "Unicode" but should more accurately be called UTF-16. This means that each character is stored in two bytes (well, actually, some obscure characters can use more).
This page explains how to pass the information to a cryptographic operation that requires the string to be encoded in UTF-8.
2019-12-11:
Added VBA code basFileString.bas to read and write binary files.
2018-08-17:
Added changes required to run on 64-bit Office.
2018-08-15:
Added function
Utf8BytesToString()
to do the reverse and convert from UTF-8-encoded byte array to a VB string.
Go to Downloads.
Important
Rule 1. Always convert VB strings to a byte array before trying to do cryptographic operations.
Simple ASCII strings
A simple ASCII string can be converted to a byte array using the internal StrConv()
function
Dim abData() As Byte abData = StrConv(strInput, vbFromUnicode)
This stores the ASCII characters one per byte in the byte array abData
. Because ASCII is a subset of UTF-8
this array is also UTF-8 encoded.
For example the 3-character ASCII string "abc"
is represented by the three bytes
0x61 0x62 0x63
.
Dim abData() As Byte abData = StrConv("abc", vbFromUnicode) Dim i As Integer For i = 0 To UBound(abData) Debug.Print Hex(abData(i)) & " "; Nextshould output
61 62 63
Problems with StrConv
If you pass a string with, say, an accented Latin character like
á
(U+00E1) the StrConv function will
convert it using Latin-1 encoding (ISO-8859-1) to just the one byte 0xE1
.
This result is not UTF-8 encoded (it should be the two bytes 0xC3 0xA1
).
Furthermore, if you pass, say, a Chinese character which requires more than one byte to store in UTF-16, StrConv will silently fail and just output the character as a question mark '?' (U+003F).
Converting to UTF-8
In a VBA/VB6 application, use the following code to convert a "Unicode" string to an array of bytes encoded in UTF-8.
''' WinApi function that maps a UTF-16 (wide character) string to a new character string Private Declare Function WideCharToMultiByte Lib "kernel32" ( _ ByVal CodePage As Long, _ ByVal dwFlags As Long, _ ByVal lpWideCharStr As Long, _ ByVal cchWideChar As Long, _ ByVal lpMultiByteStr As Long, _ ByVal cbMultiByte As Long, _ ByVal lpDefaultChar As Long, _ ByVal lpUsedDefaultChar As Long) As Long ' CodePage constant for UTF-8 Private Const CP_UTF8 = 65001 ''' Return byte array with VBA "Unicode" string encoded in UTF-8 Public Function Utf8BytesFromString(strInput As String) As Byte() Dim nBytes As Long Dim abBuffer() As Byte ' Catch empty or null input string Utf8BytesFromString = vbNullString If Len(strInput) < 1 Then Exit Function ' Get length in bytes *including* terminating null nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(strInput), -1, 0&, 0&, 0&, 0&) ' We don't want the terminating null in our byte array, so ask for `nBytes-1` bytes ReDim abBuffer(nBytes - 2) ' NB ReDim with one less byte than you need nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(strInput), -1, ByVal VarPtr(abBuffer(0)), nBytes - 1, 0&, 0&) Utf8BytesFromString = abBuffer End Function
2018-07-27: Thanks to Guillermo R. Flook for pointing out that the original code did not accept an empty or null string and for suggesting a correction.
2018-08-15: Added Utf8BytesToString()
function to do the reverse and
convert from UTF-8-encoded byte array to a VB string.
2018-11-06: Replaced "vbNull" in 5th argument 7th line in function above with "0&".
Examples
The Excel spreadsheet utf8-tests.xls
(zipped, 18.1 kB)
has samples of international characters in Spanish, Japanese, Chinese and Hebrew, and some ASCII characters.
It contains two Visual basic code modules UtfTests.bas which carries out tests on the strings using the
Utf8BytesFromString()
function in the module basUtf8FromString.bas
(see Downloads).
The results of running the tests are shown here.
Results
String | # characters | # bytes UTF-8 | UTF-8 bytes | Note |
---|---|---|---|---|
"abc123" | 6 | 6 | 61 62 63 31 32 33 | 1 |
"áéíóñ" | 5 | 10 | C3 A1 C3 A9 C3 AD C3 B3 C3 B1 | 2 |
Japanese | 5 | 15 | E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF | 3 |
Chinese | 15 | 25 | 4F 55 3D E7 B8 BD E5 B1 80 2C 43 3D E4 B8 AD E5 9C 8B 2C 43 4E 3D E6 9C AC | 4 |
Hebrew | 12 | 15 | 61 62 63 20 D7 9B D7 A9 D7 A8 20 66 31 32 33 | 5 |
Notes
- Each ASCII character is encoded in one byte, e.g.
LATIN SMALL LETTER A (U+0061) => 61 DIGIT THREE (U+0033) => 33
- The accented latin characters will print in the VB immediate window and are encoded in two bytes, e.g.
LATIN SMALL LETTER A WITH ACUTE (U+00E1) => C3 A1 LATIN SMALL LETTER N WITH TILDE (U+00F1) => C3 B1
- The Japanese Hiragana characters print as '?' and are encoded in three bytes, e.g.
HIRAGANA LETTER KO (U+3053) => E3 81 93
- The Chinese characters print as '?' and are encoded in three bytes, e.g.
Han character 'ben' (U+672C) => E6 9C AC
- The Hebrew characters print as '?' and are encoded in two bytes.
The characters are displayed left-to-right as
RESH-SHIN-KAF
but are stored in the correct right-to-left order:HEBREW LETTER KAF (U+05DB) => D7 9B HEBREW LETTER SHIN (U+05E9) => D7 A9 HEBREW LETTER RESH (U+05E8) => D7 A8
Converting from UTF-8-encoded byte array to a VB string
Use the Utf8BytesToString()
function to do the reverse.
''' Maps a character string to a UTF-16 (wide character) string Private Declare Function MultiByteToWideChar Lib "kernel32" ( _ ByVal CodePage As Long, _ ByVal dwFlags As Long, _ ByVal lpMultiByteStr As Long, _ ByVal cchMultiByte As Long, _ ByVal lpWideCharStr As Long, _ ByVal cchWideChar As Long _ ) As Long ' CodePage constant for UTF-8 Private Const CP_UTF8 = 65001 ''' Return length of byte array or zero if uninitialized Private Function BytesLength(abBytes() As Byte) As Long ' Trap error if array is uninitialized On Error Resume Next BytesLength = UBound(abBytes) - LBound(abBytes) + 1 End Function ''' Return VBA "Unicode" string from byte array encoded in UTF-8 Public Function Utf8BytesToString(abUtf8Array() As Byte) As String Dim nBytes As Long Dim nChars As Long Dim strOut As String Utf8BytesToString = "" ' Catch uninitialized input array nBytes = BytesLength(abUtf8Array) If nBytes <= 0 Then Exit Function ' Get number of characters in output string nChars = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(abUtf8Array(0)), nBytes, 0&, 0&) ' Dimension output buffer to receive string strOut = String(nChars, 0) nChars = MultiByteToWideChar(CP_UTF8, 0&, VarPtr(abUtf8Array(0)), nBytes, StrPtr(strOut), nChars) Utf8BytesToString = Left$(strOut, nChars) End Function
Here's a quick test for some accented Spanish characters. You can just type these into your code.
Public Sub Test_Utf8String() Dim b() As Byte Dim s As String b = Utf8BytesFromString("áéíóñ") s = Utf8BytesToString(b) Debug.Print "[" & s & "]" End SubThis should result in the output
[áéíóñ]
Changes required to run on 64-bit Office
The only changes required to run on a 64-bit version of Office are in the WinApi declarations and are as follows:
Private Declare PtrSafe Function WideCharToMultiByte Lib "kernel32" ( _ ByVal CodePage As Long, _ ByVal dwFlags As Long, _ ByVal lpWideCharStr As LongPtr, _ ByVal cchWideChar As Long, _ ByVal lpMultiByteStr As LongPtr, _ ByVal cbMultiByte As Long, _ ByVal lpDefaultChar As Long, _ ByVal lpUsedDefaultChar As Long) As Long Private Declare PtrSafe Function MultiByteToWideChar Lib "kernel32" ( _ ByVal CodePage As Long, _ ByVal dwFlags As Long, _ ByVal lpMultiByteStr As LongPtr, _ ByVal cchMultiByte As Long, _ ByVal lpWideCharStr As LongPtr, _ ByVal cchWideChar As Long _ ) As LongYou can use the
VBA7
compiler constant to write code that will work on both 32-bit and 64-bit versions of Office.
See the updated code basUtf8FromString64.bas.
Note that this may display one set of code in red in your IDE:
#If VBA7 Then
Private Declare PtrSafe Function WideCharToMultiByte Lib "kernel32" ( _
ByVal CodePage As Long, _
ByVal dwFlags As Long, _
ByVal lpWideCharStr As LongPtr, _
ByVal cchWideChar As Long, _
ByVal lpMultiByteStr As LongPtr, _
ByVal cbMultiByte As Long, _
ByVal lpDefaultChar As Long, _
ByVal lpUsedDefaultChar As Long _
) As Long
#Else
Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
ByVal CodePage As Long, _
ByVal dwFlags As Long, _
ByVal lpWideCharStr As Long, _
ByVal cchWideChar As Long, _
ByVal lpMultiByteStr As Long, _
ByVal cbMultiByte As Long, _
ByVal lpDefaultChar As Long, _
ByVal lpUsedDefaultChar As Long _
) As Long
#End If
but this will be ignored by the compiler when it runs. It just looks unsightly.
Note also that we are referring to the 64-bit version of Office, not the Windows platform you are running on.
You do not need the Win64
complier constant and you definitely do not use the LongLong
data type in your code.
See also the Microsoft page 64-Bit Visual Basic for Applications Overview.
Downloads
basUtf8FromString64.bas (zipped, last updated 2017-08-17)basFileString.bas (zipped, last updated 2019-12-11)
Related Topics
See also our pages on:- Cryptography with International Character Sets
- Cross-Platform Encryption
- Storing and representing ciphertext
- Using Byte Arrays in Visual Basic VB6
- Binary and byte operations in Visual Basic VB6
Contact
For more information or to comment on this page, please send us a message.
This page first published 30 June 2015. Last updated 6 April 2021