ā†

September 24, 2021

ascii and unicode

Data Types

Character String

  • usually hold text data
  • text data consist of ASCII characters (characters whose code point <= 127)
  • example: "hello world!" belongs to ASCII subset

Binary String

  • usually hold binary data (pictures, voice, mixed mediaa)
  • unit of length is OCTETS
  • 2 types of varying-length binary string

    • VARBINARY value up to 32,672 bytes
    • BLOB (Binary Large Object) value up to 2GB - 1byte

ASCII

What is it lol

Stands for American Standard Code for Information Interchange - an early standardized encoding system for text. Encoding is converting characters in human language to binary sequences.

256 different ways to group eight 1s and 0s -> 256 ways to represent a character in ASCII.

For example

Character ASCII Byte
A 065 01000001
a 097 01100001
0 048 00110000
9 057 00111001
! 033 00100001

Problem

Not enough ways to store every symbol, ever.

Unicode

What is it lol?

  • Unicode Standard is an international encoding standard for use with different languages and scripts where each letter, digit or symbol is assigned a unique numeric value (code point)
  • allow representation of all languages: abc, language symbols, hiragana, katakana, emoji - in a unified way, up to 32 bits wide (~4 billion unique values)

For example

Character Code Point
A U+0041 or \u0041
a U+0061 or \u0061
0 U+0030
9 U+0039
恂 U+3042
šŸ˜ U+1F601

U stand for "Unicode"

UTF is what leh

UTF stands for Unicode Transformation Format - 8 bits. It is an encoding system for Unicode (translate Unicode to binary string, string back to Unicode)

There are multiple encoding systems for Unicode:

  • UTF-8 (8 bits - 1 byte) seen as more efficient as if a character can be represented in 1 byte, that is what it will use. Also, first 128 code points are exactly the same as ASCII
  • UTF-16 (16 bits - 2 bytes)

UTF-8 saves space. In UTF-8, common characters like ā€œCā€ take 8 bits, while rare characters like ā€œšŸ’©ā€ take 32 bits. Other characters take 16 or 24 bits. A blog post like this one takes about four times less space in UTF-8 than it would in UTF-32. So it loads four times faster. -Adam Hooper

Character Code Point Bytes Taken
A U+0041 1
a U+0061 1
0 U+0030 1
! U+0021 1
šŸ˜ U+1F601 4

Javascript Engine uses UTF-16

> console.log("haha".length) 4 > console.log("šŸ˜".length) 2

Each character, 'h','a' uses 16 bits while fancy ones will use 32 bits. UTF-16 code units means using two code points to reference a character. Smallest binary representation of a character is 2 bytes, 16 bits.

CHARACTER UTF-8 BINARY ENCODING UTF-16 BINARY ENCODING
A 01000001 01000001 11011000 00001110 11011111
š œŽ 11110000 10100000 10011100 10001110 01000001 11011000 00001110 11011111
Read More: