An Introduction to Character Encoding and UTF-8

A look at how text is represented inside computers.

Computers only really deal in numbers, so before a computer can store or process text, every character has to be tied to a numeric value through some agreed-upon standard. That mapping is what we call character encoding. In the sections below we'll trace how it developed, from the early days through to UTF-8, and why it still matters today.

ASCII

One of the first encodings to catch on widely was ASCII, which assigns numbers to a set of 128 characters — the English letters, the digits, common punctuation, and a handful of control characters. For English text it did the job. But it left out the countless characters used in other languages and writing systems, and once computing went global, that gap turned into a real problem.
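To make that limit concrete, here is a minimal Python sketch using the built-in "ascii" codec. The helper names ascii_codes and is_ascii_only are made up for this illustration; they are not part of any standard.

```python
def ascii_codes(text: str) -> list:
    """Return the ASCII number for each character.

    Raises UnicodeEncodeError if any character falls outside ASCII's
    128 values (0 to 127).
    """
    return list(text.encode("ascii"))


def is_ascii_only(text: str) -> bool:
    """Report whether every character in text has an ASCII number."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(ascii_codes("Hi!"))            # [72, 105, 33]
print(is_ascii_only("cafe"))         # True
print(is_ascii_only("caf\u00e9"))    # False: U+00E9 (e with acute) is not in ASCII
```

The last line is the gap in miniature: a single accented letter is enough to fall outside what ASCII can represent.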

Unicode

Unicode came along to fix exactly that. The goal is ambitious but straightforward: give a unique number, called a code point, to every character in every writing system in the world, plus a great many symbols besides. Code points are conventionally written in hexadecimal with a U+ prefix, so the letter A is U+0041. Note that a code point is an abstract number: assigning one to a character says nothing about how that number is stored as bytes. That job falls to an encoding form such as UTF-8, UTF-16, or UTF-32, and the most common one by far is UTF-8.
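As a rough illustration, Python's built-in ord() and chr() convert between a character and its code point. The code_points helper below is a hypothetical name used only for this sketch, and the sample string is written with escapes so the example does not depend on how the file itself is saved.

```python
def code_points(text: str) -> list:
    """Return each character's Unicode code point in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# A, e with acute, the euro sign, and a grinning-face emoji.
sample = "A\u00e9\u20ac\U0001F600"

print(code_points(sample))
# ['U+0041', 'U+00E9', 'U+20AC', 'U+1F600']

# chr() goes the other way: from a code point back to the character.
print(chr(0x20AC) == "\u20ac")  # True
```

Nothing here involves bytes yet. These are just the numbers Unicode assigns; turning them into bytes is a separate step.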

UTF-8

UTF-8 is the scheme that turns Unicode code points into actual sequences of bytes. It's clever about it, using a variable number of bytes per character: anything in the original ASCII range (U+0000 to U+007F) fits in a single byte, code points up to U+07FF take two bytes, those up to U+FFFF take three, and everything beyond that, up to the Unicode limit of U+10FFFF, takes four. The part that really earned it its popularity is that it's backward-compatible with ASCII: text made only of ASCII characters is byte-for-byte identical in both. That smooth overlap is a big reason it spread so far.
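The variable-length scheme can be sketched by hand. The function below, utf8_encode_code_point, is a hypothetical helper written for this article rather than a library API; in real code you would simply call str.encode("utf-8"), which the sketch uses as a cross-check.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, choosing 1 to 4 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogates (U+D800 to U+DFFF) are not valid in UTF-8.
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# A, e with acute, the euro sign, and a grinning-face emoji.
for ch in "A\u00e9\u20ac\U0001F600":
    encoded = utf8_encode_code_point(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with Python's own encoder
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): c3 a9
# U+20AC -> 3 byte(s): e2 82 ac
# U+1F600 -> 4 byte(s): f0 9f 98 80

# ASCII-only text produces the same bytes under both encodings.
print("plain text".encode("ascii") == "plain text".encode("utf-8"))  # True
```

The leading bits of the first byte tell a decoder how many bytes the character occupies, and every continuation byte starts with the bits 10, which is what keeps multi-byte sequences from being mistaken for ASCII.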

Why encoding matters

Here's where it bites you in practice: write text in one encoding, read it back in another, and the bytes get interpreted as the wrong characters. That's the garbled gibberish (often called mojibake) you sometimes see standing in for the characters you expected. The fix is for everyone in the chain to agree on the same encoding. On the web that agreement has largely settled on UTF-8, and HTML documents routinely declare it, for example with a <meta charset="utf-8"> tag, so browsers render the text correctly.
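The mismatch is easy to reproduce. The sketch below writes a word as UTF-8 and reads it back as Latin-1 (ISO 8859-1), an older single-byte encoding picked here purely as an illustrative "wrong" choice; the article's point holds for any mismatched pair.

```python
# "cafe" with an acute accent on the e, written with an escape for safety.
original = "caf\u00e9"

# Writing: the accented letter becomes two bytes in UTF-8.
data = original.encode("utf-8")
print(data)  # b'caf\xc3\xa9'

# Reading with the wrong encoding: each byte is taken as its own character.
wrong = data.decode("latin-1")
print(wrong)       # the accented letter shows up as two unrelated characters
print(len(wrong))  # 5, although the original had 4 characters

# Reading with the agreed encoding restores the text.
right = data.decode("utf-8")
print(right == original)  # True

# The reverse mismatch usually fails loudly instead of garbling:
# Latin-1 bytes are often not valid UTF-8 at all.
try:
    original.encode("latin-1").decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode as UTF-8:", exc.reason)
```

Notice that the wrong decode does not raise an error. Latin-1 assigns a character to every possible byte, so the mistake passes silently, which is a big part of why garbled text goes unnoticed until a person looks at it.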

Summary

Character encoding ties numeric values to text characters so computers can store and work with them. ASCII handled a small set; Unicode reaches across the world's writing systems. UTF-8 is the widely used way of storing Unicode text as bytes — it stays compatible with ASCII and has become the standard encoding of the web.
