*by Carl Burch, Hendrix College, September 2012*

Representing data by Carl Burch is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.

1. Numeral systems

1.1. Binary numbers

1.2. Hexadecimal and octal

2. Integers

2.1. Unsigned integer representation

2.2. Sign-magnitude representation

2.3. Two's-complement representation

2.4. Adding two's-complement numbers

3. Characters

3.1. ASCII

3.2. 8-bit encodings

3.3. Unicode

3.4. Unicode encodings

Internally, computers represent all data using **bits**:
Each bit is an individual atom of memory that can be
either off or on, which we interpret as 0 (off) or 1 (on).
In this document, we'll study how computers use bits to
represent **integers** — numbers with no fractional
part, like 2, 105, or −38 — and how they use bits to
represent the **characters** that make up text.

Before we discuss how computers represent integers, we must first examine our basic numeral systems.

You're already familiar with the **decimal numeral
system**.
You may remember the following sort of diagram from grade
school.

| 1 | 0 | 2 | 4 |
|------|-----|----|---|
| 1000 | 100 | 10 | 1 |

This diagram has a line underneath each digit of the number
1024, and underneath each line is a reminder of how much that
place is worth. In representing the number 1024,
we have a 4 in the ones place, a 2 in the tens place,
a 0 in the hundreds place, and a 1 in the thousands place.
This system is also called **base 10** because it
is based on the number 10: There are 10 possible symbols for each place
(0 through 9), and the place values go up by factors of 10
(1, 10, 100, 1000,…).

We call the 10 symbols 0 through 9 *digits*
based on the Latin word for *finger*,
because counting on fingers is the origin of our counting system.
Of course, computers aren't endowed with such fingers,
and so it shouldn't be surprising that this numeral system is
less convenient for computers.
Instead, computers count with bits, which have two possible states,
and so we use a 2-based system — the **binary numeral
system**.

In the binary numeral system, we have only two symbols 0 and 1,
which we call *bits* based on contracting
the phrase *binary digit*.
Also, the place values go up by factors of 2,
so our places are worth 1, 2, 4, 8, 16, and so on.
The following diagram shows a number written in binary notation.

| 1 | 0 | 1 | 1 |
|---|---|---|---|
| 8 | 4 | 2 | 1 |

This value, 1011_{(2)}, represents a number with
1 eight, 0 fours, 1 two, and 1 one:
We perform the addition
1 ⋅ 8 + 0 ⋅ 4 + 1 ⋅ 2 + 1 ⋅ 1 = 11_{(10)},
and we conclude that this binary number 1011_{(2)}
is an alternative representation for the number we know as
eleven, and which we write as 11 in decimal notation.
(The parenthesized subscripts indicate whether the
number is in binary notation or decimal notation.)

We'll often want to convert numbers between their binary and decimal
representations. With 1011_{(2)}, we already saw one example of
converting in the direction from binary to decimal.
But here's another example: Suppose we want to
identify what 100100_{(2)} represents.
We first determine what places contain the one bits.

| 1 | 0 | 0 | 1 | 0 | 0 |
|----|----|---|---|---|---|
| 32 | 16 | 8 | 4 | 2 | 1 |

We then add up the values of these places to get a base-10
value: The 32's place and the 4's place are filled with one
bits,
so we compute 32 + 4 = 36_{(10)}.
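
Here is a small Python sketch of this binary-to-decimal procedure; the function name is just illustrative, and Python's built-in `int(s, 2)` performs the same conversion.

```python
# Add up the place value of each position that holds a 1 bit.
def binary_to_decimal(bits):
    total = 0
    place = 1                        # the rightmost place is worth 1
    for bit in reversed(bits):       # walk from the low end upward
        if bit == '1':
            total += place
        place *= 2                   # each place is worth twice the previous
    return total

print(binary_to_decimal('100100'))   # 36
print(int('100100', 2))              # the built-in agrees: 36
```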

To convert a number from decimal to binary,
we repeatedly determine the largest power of two
that fits into the number
and subtract it,
until we reach zero; the binary representation has a 1 bit in
each place whose value we subtracted, and a 0 bit in the
remaining places. Suppose, as an example,
we want to convert 88_{(10)} to binary.
We observe the largest power of 2 less than 88 is 64, so we decide that
the binary expansion of 88 has a 1 in the 64's place, and we subtract
64 to get 88 − 64 = 24. Then we see that the largest power of 2 less than 24 is
16, so we decide to put a 1 in the 16's place and subtract 16 from 24 to
get 8. Now 8 is the largest power of 2 that fits into 8, so we put
a 1 in the 8's place and subtract to get 0.
Once we reach 0, we write down which places we filled with
1's.

| 1 |  | 1 | 1 |  |  |  |
|----|----|----|---|---|---|---|
| 64 | 32 | 16 | 8 | 4 | 2 | 1 |

We put a zero in each empty place and conclude that the binary
representation of 88_{(10)} is 1011000_{(2)}.
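
The same subtract-the-largest-power procedure can be sketched in Python; again the function name is illustrative, and the built-in `format(n, 'b')` gives the same answer.

```python
# Repeatedly subtract the largest power of two that fits, writing a 1 in
# that place and a 0 in each place that is skipped.
def decimal_to_binary(n):
    if n == 0:
        return '0'
    place = 1
    while place * 2 <= n:            # find the largest power of two <= n
        place *= 2
    bits = ''
    while place >= 1:
        if n >= place:               # this place gets a 1; subtract its value
            bits += '1'
            n -= place
        else:                        # this place gets a 0
            bits += '0'
        place //= 2
    return bits

print(decimal_to_binary(88))   # 1011000
print(format(88, 'b'))         # the built-in agrees: 1011000
```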

For humans, reading and typing binary numbers is a major
hassle: The numbers tend to be long, and we're better with
shorter numbers, even if there are more symbols to choose from.
Computer programmers often find base 16 the most convenient
choice — called the **hexadecimal numeral system**,
often abbreviated as **hex**.

One of the first problems we face in talking about hexadecimal is that we need 16 different symbols for each digit, and we only know the 10 symbols 0 through 9. We fill out the remaining six symbols with letters from the alphabet:

| hexadecimal digit | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| decimal value | 10 | 11 | 12 | 13 | 14 | 15 |

Thus, we can plausibly talk about the number 1B4 in the hexadecimal system. Let's see what 1B4 translates to in decimal: As before, we first insert the place values, which now go up by factors of 16, starting with 1, 16, 256, and 4096.

| 1 | B | 4 |
|-----|----|---|
| 256 | 16 | 1 |

We see that 1B4_{(16)} has one 256, eleven 16's, and
four 1's, so we compute
1 · 256
+ 11 · 16
+ 4 · 1
= 436_{(10)}. We conclude that
1B4_{(16)} = 436_{(10)}.
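
A quick Python check of that arithmetic; `int` with a base argument is the built-in base-16 parser.

```python
print(1*256 + 11*16 + 4*1)   # 436, summing the place values by hand
print(int('1B4', 16))        # 436, letting Python parse the hex digits
```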

But surely, you might object, decimal would be more
convenient than hexadecimal!
The special thing about hexadecimal is that it is based on
a power of 2: 2^{4} = 16.
This leads to a simple technique for converting between
hexadecimal and binary, much quicker than anything available for
decimal: We simply convert each hexadecimal digit into four bits.

Let's look at an example of this conversion using 1B4_{(16)}.
We observe that
the first digit 1 has the four-bit binary equivalent of 0001,
the second digit B has the four-bit binary equivalent of 1011,
and the third digit 4 has the four-bit binary equivalent of 0100.
Put these together, and we arrive at the binary number
0001 1011 0100_{(2)}.

Going the other direction is just as easy.
Let's consider the binary number 1011010110_{(2)}.
We break it up into groups of four, starting from the bottom
four bits, arriving at 10 1101 0110;
and then we translate each group into the corresponding
hexadecimal digit: 10_{(2)} = 2,
1101_{(2)} = 13_{(10)} = D_{(16)},
and 0110_{(2)} = 6.
We therefore conclude that
1011010110_{(2)} = 2D6_{(16)}.
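
Here is a small Python sketch of both directions of this digit-by-digit conversion; the function names are illustrative.

```python
# Hex -> binary: translate each hex digit into a group of four bits.
def hex_to_binary(hex_string):
    return ' '.join(format(int(digit, 16), '04b') for digit in hex_string)

# Binary -> hex: interpret the bits as a number, then print it in base 16.
def binary_to_hex(bit_string):
    return format(int(bit_string, 2), 'X')

print(hex_to_binary('1B4'))          # 0001 1011 0100
print(binary_to_hex('1011010110'))   # 2D6
```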

Programmers also occasionally use base 8, called
**octal**. Here, our digits are 0 through 7, and
the place values are 1, 8, 64, 512, and so on.
Thus 31_{(8)} translates to three 8's and one 1,
which is 3 ⋅ 8 + 1 ⋅ 1 = 25_{(10)}.
(This leads to a simple joke: Why do programmers have a hard time
distinguishing Halloween from Christmas? Because
OCT 31 = DEC 25.)

Octal has the advantage that you can type it on a numeric keypad, but most programmers prefer hexadecimal — in part because groups of four bits conveniently pair into groups of eight, corresponding to the bytes that a computer uses.

These alternative bases are useful enough that C and its
imitators (including Python and Java) have a special technique
for incorporating octal and hexadecimal numbers into programs.
If you prefix a number with the digit `0`, then this indicates an
octal number: `031` is an equivalent way of writing `25`.
Similarly, prefixing a number with the two-character sequence
`0x` indicates a hexadecimal number: `0x1b4` is
equivalent to `436`. When you see this in a program, it
usually indicates that the programmer views the particular
binary pattern as more important than its numeric value.
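
For instance, here is a short Python 3 sketch of such literals; note that Python 3 spells the octal prefix `0o` rather than C's bare leading `0`.

```python
# Integer literals in different bases (Python 3 syntax; C would write the
# octal literal as 031 rather than 0o31).
print(0o31)      # octal 31        -> 25
print(0x1b4)     # hexadecimal 1b4 -> 436
print(0b1011)    # Python also allows binary literals: 1011 -> 11
```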

Now we can look at how computers actually store integers.

Modern computers represent all integers using the same amount of space.
For example, we might decide that each byte represents a number.
(A **byte** is a group of eight bits.)
A byte, however, is very limiting: The largest number we can
fit is 11111111_{(2)} = 255_{(10)}, and we often want to deal with larger
numbers than that.

Thus, computers tend to use groups of bytes called **words**.
Different computers have different word sizes.
In the past, many machines had 16-bit words;
today, most machines use 32-bit words,
though many use 64-bit words.
(The term *word* comes from the fact that four bytes (32 bits) is
equivalent to four ASCII characters, and four letters is the length of
many useful English words.)
Thirty-two bits is plenty for most numbers, as it allows us to represent
any integer from 0 up to 2^{32} − 1 = 4,294,967,295.
But the limitation is becoming increasingly irritating —
primarily because it leads to problems when you have more than
4 gigabytes of memory (4 gigabytes is 2^{32} bytes) —
and so larger systems frequently use 64-bit words.

At first, that looks simple enough, but there's a minor but
significant detail:
In storing the 32-bit number among four bytes, should you store
the lowest-order 8 bits or the highest-order 8 bits in the first
byte (i.e., the byte with the lowest address)?
We write numbers with the highest-order bits first, which leads one to
at first think we should start there;
but we typically do arithmetic (like addition and multiplication) starting
from the lowest-order bits, so arguably it's more intuitive to
start with the low end — though we do comparisons and thus
sorting starting from the highest-order bits, which puts us back
where we started.
The overall consensus is that the case for each is basically neutral,
but established systems have been built using each method.
Systems that start with the highest-order bits are called
**big-endian**, while ones that start with the lowest-order
bits are called **little-endian**.
(This is a reference to the historic feud between Lilliput and Blefuscu
over which end boiled eggs should be broken, as documented in
*Gulliver's Travels*.)
Most Intel-based systems (including nearly all modern PCs) use the little-endian system,
while Internet protocols typically specify the big-endian system.
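
Here is a small Python sketch of the two byte orders; `int.to_bytes` lays the same 32-bit value out both ways, and `sys.byteorder` reports which convention the machine running the code uses.

```python
import sys

# The same 32-bit value laid out in memory low-order byte first
# (little-endian) versus high-order byte first (big-endian).
value = 0x12345678
print(value.to_bytes(4, 'little').hex())   # 78563412
print(value.to_bytes(4, 'big').hex())      # 12345678

print(sys.byteorder)   # 'little' on most Intel-based machines
```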

The representation of an integer using binary representation
in a fixed number of bits is called an **unsigned representation**.
The term comes from the fact that the
only numbers representable in the system have no negative sign.

But what about negative integers? After all, there are some perfectly respectable numbers below 0. We'll examine two techniques for representing integers both negative and positive: sign-magnitude representation and two's-complement representation.

**Sign-magnitude representation** is the more intuitive technique.
Here, we let the first
bit indicate whether the number is positive or negative (the number's
*sign*), and
the rest of the bits tell how far the number is from 0 (its *magnitude*).
Suppose we are working with 8-bit sign-magnitude numbers.

| decimal | 8-bit sign-magnitude |
|---|---|
| 10 | 00001010 |
| −10 | 10001010 |

For −10_{(10)}, we use 1 for the first bit, because the number is
negative, and then we place
10_{(10)} = 1010_{(2)} into the remaining seven
bits.

What's the range of integers we can represent
with an 8-bit sign-magnitude representation?
For the largest number, we'd want 0 for the sign bit and
1 everywhere else, giving us 01111111, or 127_{(10)}.
For the smallest number, we'd want 1 for the sign bit
and 1 everywhere else, giving us −127_{(10)}.
An 8-bit sign-magnitude representation, then, can represent any
integer from −127_{(10)} to 127_{(10)}.

This range of integers includes 255 values. But we've seen that 8 bits can represent up to 256 different values. The discrepancy arises from the fact that the representation includes two representations of the number zero (0 and −0, represented as 00000000 and 10000000).
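
A small Python sketch of 8-bit sign-magnitude encoding (the helper name is illustrative, not a library routine) reproduces the examples above and shows the duplicate zero.

```python
# 8-bit sign-magnitude: a sign bit followed by a 7-bit magnitude.
def sign_magnitude_8bit(n):
    sign = '1' if n < 0 else '0'
    return sign + format(abs(n), '07b')    # magnitude in the other 7 bits

print(sign_magnitude_8bit(10))     # 00001010
print(sign_magnitude_8bit(-10))    # 10001010
print(sign_magnitude_8bit(127))    # 01111111, the largest value
print(sign_magnitude_8bit(0))      # 00000000; 10000000 would read as "-0"
```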

Arithmetic using sign-magnitude representation is somewhat more complicated than we might hope. To build a circuit comparing two numbers, you would need additional circuitry so that −0 is understood as equal to 0. Moreover, adding two numbers requires separate circuitry to handle the cases of when the numbers' signs match and when they don't match. Because of these complications, today's computers don't use sign-magnitude representation for integers. (It is used regularly in floating-point numbers, which are beyond the scope of this document.)

Computers today use the
**two's-complement representation** for integers.
In the two's-complement system, the topmost bit's value is the negation
of its meaning in an unsigned system.
For example, in an 8-bit *unsigned* system,
the topmost bit is the 128's place.

| 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|-----|----|----|----|---|---|---|---|

In an 8-bit *two's-complement* system, then, we negate the meaning of
the topmost bit to be −128 instead.

| −128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|------|----|----|----|---|---|---|---|

If we were working with 32-bit numbers — as is typical
— then we'd still negate the meaning of the topmost bit.
In a 32-bit unsigned representation, the topmost bit is worth
2^{31}, so in a 32-bit two's-complement representation, the topmost
bit is worth −2^{31}.

Let's see how a number is represented using two's-complement
representation; we'll return to an 8-bit system, since it's easier to work
examples there.
We'll follow the same algorithm we used earlier for converting
from decimal to binary.
Let's look at the example of −100_{(10)}.
We first choose a 1 for the −128's place, leaving us with
(−100) − (−128) = 28.
(Since the place value is negative, we subtract a negative number.)
Then we'd choose a 1 for the 16's place, the 8's place, and the 4's
place to reach 0.

| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
|------|----|----|----|---|---|---|---|
| −128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |

Thus, the 8-bit two's-complement representation of −100_{(10)}
would be 10011100.

To see the relationship between positive and negative representations, let's look at representing some numbers and their negations.

| decimal | 8-bit two's-complement |
|---|---|
| 100 | 01100100 |
| −100 | 10011100 |
| 10 | 00001010 |
| −10 | 11110110 |

You see that negating a number involves flipping the first several bits of the number, while the last bits remain the same. For 100, we flip all but the last three bits; for 10, only the last two bits remain the same. The bits that remain the same are the lowest 1 bit and all the 0's below it. That's a general rule for negating any two's-complement number: flip every bit above the lowest 1.
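
In Python we can mimic an 8-bit register by masking with `0xFF`; the following sketch (the helper name is illustrative) reproduces the table above and shows the flip-every-bit-and-add-one view of negation.

```python
# Keep only the low 8 bits, which is how the bit pattern would sit in an
# 8-bit two's-complement register.
def twos_complement_8bit(n):
    return format(n & 0xFF, '08b')

print(twos_complement_8bit(100))    # 01100100
print(twos_complement_8bit(-100))   # 10011100
print(twos_complement_8bit(10))     # 00001010
print(twos_complement_8bit(-10))    # 11110110

# Negation is "flip every bit, then add 1," which is the same as flipping
# all the bits above the lowest 1:
x = 10
print(twos_complement_8bit(~x + 1))  # 11110110, the pattern for -10
```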

What's the range of numbers representable in an
8-bit two's-complement representation?
To arrive at the largest number, we'd want 0 for the −128's bit
and 1 everywhere else, giving us 01111111, or 127_{(10)}.
For the smallest number, we'd want 1 for
the −128's bit and 0 everywhere else, giving −128_{(10)}.
In an 8-bit two's-complement representation, we can represent any
integer from −128_{(10)} up to 127_{(10)}.
This range includes 256 integers; there are no duplicates as with
sign-magnitude representation.

It's instructive to map out the bit patterns (in order of their unsigned value) and their corresponding two's-complement values.

| bits | value |
|---|---|
| 00000000 | 0 |
| 00000001 | 1 |
| 00000010 | 2 |
| ⋮ | ⋮ |
| 01111110 | 126 |
| 01111111 | 127 |
| 10000000 | −128 |
| 10000001 | −127 |
| ⋮ | ⋮ |
| 11111110 | −2 |
| 11111111 | −1 |

Notice that the two's-complement representation wraps around: If you take the largest number, 01111111, and add 1 to it as if it were an unsigned number, you get 10000000, the smallest number. This wrap-around behavior can lead to some interesting behavior. In one game I played as a child (back when 16-bit computers were popular), the score would go up progressively as you guided a monster through a maze. I wasn't very good at the game, but my little brother mastered it enough that the score would hit its maximum value and then wrap around to a very negative value! Trying to get the largest possible score — without wrapping around — was an interesting challenge.

One of the nice things about two's-complement numbers is that you can add them just as you add regular numbers. Suppose we want to add −3 and 5.

| | −128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|---|------|----|----|----|---|---|---|---|
| −3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| + 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |

We can attempt to do this using regular addition, akin to the
technique we traditionally use in adding base-10 numbers, except
that here adding 1 and 1 gives us 0 with a carry of 1, since
1 + 1 = 10_{(2)}.

| | extra | −128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|---|---|------|----|----|----|---|---|---|---|
| carries | 1 | 1 | 1 | 1 | 1 | 1 | | 1 | |
| −3 | | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| + 5 | | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| sum | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

We get an extra 1 in the ninth bit of the answer, but if we ignore this
ninth bit, we get the correct answer of 2_{(10)}.

We can reason that ignoring this ninth bit is correct as follows: Say one of the two numbers is negative and the other is positive. That is, one of the two numbers has a 1 in the −128's place, and the other has 0 there. If there is no carry into the −128's place, then the answer is OK, because that means we got the correct sum in the lower 7 bits, and then when we add the −128's place, we'll maintain the −128 represented by the 1 in that location in the negative number.

If there is a carry into the −128's place, then this represents a carry of 128 produced by summing the 64's column. This carry of 128 (represented by a carry of 1), added to the −128 in that column for the negative number (represented by a 1 in that column), should give us 0. This is exactly what we get when we add the carry of 1 into the leftmost column to the 1 of the negative number in that column and then throw away the carry.

A similar sort of analysis will also work when we are adding two negative numbers or two positive numbers. Of course, the addition only works if you end up with something in the representable range. There's no hope of arriving at a good answer if you add 120 and 120 with 8-bit two's-complement numbers, since the number 240 simply isn't representable in that system. (The answer turns out to be −16!)

The upshot of all this is that when building a circuit for two's-complement numbers, we don't need to worry about different cases needing different circuits: The circuit simply needs to do regular unsigned addition and then ignore the final carry (if any).
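
A short Python sketch of this behavior (the helper name is illustrative): add as if unsigned, keep only the low 8 bits, and reinterpret the top bit as −128.

```python
# Add two 8-bit two's-complement values the way the circuit does:
# regular addition, then discard any carry out of the top bit.
def add_8bit(a, b):
    result = (a + b) & 0xFF      # keep only the low 8 bits
    if result >= 128:            # the top bit is worth -128, not +128
        result -= 256
    return result

print(add_8bit(-3, 5))       # 2
print(add_8bit(120, 120))    # -16, since 240 is outside the range
```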

Written text is among the most important data processed by a computer, and it too merits some study.

Early computers did not have a standard way of encoding characters
into binary, which soon proved unsatisfactory once data began to be transferred between computers. So in the early 1960's, an American standards organization now known as ANSI took up the charge of designing a standard encoding. They named their encoding the American Standard Code for Information Interchange, though the full name was soon forgotten as people called it by its acronym, **ASCII**; it became the baseline standard with which later encodings would strive to remain compatible.

ASCII uses seven bits to encode each character, allowing for
2^{7} = 128 different encodings. This basically includes every symbol that you can find on an English keyboard, plus a few **control characters** reserved for encoding non-printable information like an instruction to ignore the previous character. Most of these control characters are basically obsolete; those still in common use include the following.

| code | name | notes |
|---|---|---|
| 0x00 | NUL | used to terminate strings in some systems |
| 0x08 | BS | sent by the backspace key to remove the character before the cursor |
| 0x09 | HT | sent by the tab key |
| 0x0A | LF | used to separate lines in a file |
| 0x0D | CR | often required to precede LF for legacy reasons |
| 0x1B | ESC | sent by the escape key |

Representing line breaks is somewhat interesting: Early systems based on typewriters performed two actions when the user completed a line: first a “carriage return” to move the paper or print head so that the next typed character would go on the left side of the page, then a “line feed” to scroll the paper a bit so that the next typed character would go on the following line. ASCII had two separate characters representing these two actions, named CR and LF; files would be stored with both characters for each line break. The carriage return would be sent first because moving all the way across the page horizontally took longer than scrolling the paper to the next line, so you would want to start the horizontal motion first.

In order to preserve compatibility with older systems, many systems copied this same convention of breaking lines using a CR character followed by an LF character. Indeed, many systems today still use this convention, including Microsoft Windows and most Internet protocols. However, other systems were designed to use just one character to save space; this includes Unix and its descendants, such as Mac OS X and Linux, which use only the LF character to separate lines.

Beyond the control characters are the printable characters, which are represented as given in the table below. Each row of the table represents ASCII characters that start with the same hexadecimal digit; the column headers indicate the second hexadecimal digit.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0x2? | † | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
| 0x3? | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
| 0x4? | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
| 0x5? | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
| 0x6? | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
| 0x7? | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | ‡ |

† space character (typed using the space bar)

‡ DEL control character (rarely used)

As you can see, ASCII places the digits in sequential order, followed by the capital letters in sequential order followed by the lower-case letters. Punctuation marks are interspersed in between so that the digits and letters start toward the beginning of their respective rows.
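
Python's built-in `ord` reports a character's code, which makes it easy to verify this sequential layout.

```python
# ord gives a character's code; ASCII's layout puts digits, capitals, and
# lower-case letters in consecutive runs.
for ch in (' ', '0', 'A', 'a'):
    print(repr(ch), hex(ord(ch)))
# ' ' -> 0x20, '0' -> 0x30, 'A' -> 0x41, 'a' -> 0x61

# Because the digit codes are consecutive, a digit character's numeric
# value is just a subtraction:
print(ord('7') - ord('0'))   # 7
```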

Since modern computers use eight bits in each byte, a very natural way to use ASCII is to allocate one byte for each character. Of course, that leaves an extra bit, which could be used for a variety of purposes. During transmission of information, some systems have used this additional bit to signify whether an odd number of the other bits are 1, in the hope of identifying the occasional mistransmitted bit. Some systems have used this additional bit to represent whether the text should be highlighted (probably by using inverted text). But the most common technique has been to define it so that the additional 128 bit sequences represent other characters not included in ASCII.

The most common such extension in use is the Latin1 encoding, which adds
the accented letters and a few other characters needed to support
a wide variety of
European languages like Spanish and German, as well as additional punctuation, including currency symbols beyond the dollar sign $.

But there are many others, supporting different alphabets such as Cyrillic (Russian), Hebrew, and Greek. In fact, two organizations collaborated on a set of 15 such standards, of which Latin1 is the first; collectively, they are called ISO/IEC 8859.

Having many alternative encoding standards is confusing, and in any case it doesn't address East Asian writing systems, such as those for Chinese and Japanese, which can use tens of thousands of symbols. For this reason, a group got together to define a 16-bit encoding, which they called Unicode. Sixteen bits allows for 65,536 different characters, which covers nearly all characters in modern use. They published their standard in 1991–92.

While this was nearly sufficient, they eventually decided that
they needed more room, so they extended the encoding to go up to the
hexadecimal number 10FFFF_{(16)} ≈ 1.1 million. This
whole space is unlikely ever to be exhausted.

Each number in the space is called a **code point** and is typically
referenced using a hexadecimal number of four to six digits prefixed by the
characters “U+” as in the example U+03C0.
The Unicode standard includes an English name for the character
(in the case of U+03C0, it is “Greek small letter pi”)
and an illustration of the relevant character (π).
An example from the extended space is U+1F0AD, with the English name
“playing card queen of spades”,
which renders as 🂭 in the browser's font (if it includes
that character) and is pictured in the standard.
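
Python exposes these code points and names directly through the standard library's unicodedata module, so we can check both examples.

```python
import unicodedata

print(hex(ord('π')))                   # 0x3c0, i.e. U+03C0
print(unicodedata.name('π'))           # GREEK SMALL LETTER PI
print(unicodedata.name(chr(0x1F0AD)))  # PLAYING CARD QUEEN OF SPADES
print(chr(0x1F0AD))                    # the character itself, if the font has it
```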

Today's Unicode standard includes all the common alphabets as well as mathematical symbols, many alphabets only of historic interest (e.g., Egyptian hieroglyphs), and other occasionally useful symbols. However, enumerating all Chinese characters is basically impossible, so it inevitably omits several.

While ASCII and its 8-bit extensions were the dominant encoding through the 1990's, most systems today are migrating toward using Unicode. This is a complex process, both because Unicode is itself complex and because so much software has been written on the basis that each character is exactly one byte long.

Unicode becomes more complex when we talk about how the characters
are encoded into binary. In the initial standard, it wasn't so difficult:
All characters fit into 16 bits, so designers suggested simply that
a computer should reserve 16 bits for each character. This is called the **UCS-2** encoding, which some systems still use though the extension of Unicode to beyond 16 bits renders it obsolete.

The natural extension to UCS-2 is UTF-16. In UTF-16, characters whose
code points fit into 16 bits are simply written as is. For the characters
beyond U+FFFF (which includes the rarer Chinese characters,
alphabets of primarily historical interest, and unusual symbols
like playing cards), we subtract 10000_{(16)} from the
code point to arrive at a 20-bit number and divide it into two
groups of 10 bits. We add the upper 10-bit number to D800_{(16)} and encode the sum as a 16-bit number (which will be between
D800_{(16)} and DBFF_{(16)});
then we follow that with the 16-bit result of adding DC00_{(16)} to the
second group of 10 bits (the sum will be between DC00_{(16)} and DFFF_{(16)}).

This system works because the standard has never defined the code points
between U+D800 and U+DFFF, so a number between those
two code points indicates that we are looking at an extended-Unicode character. Moreover, we can tell whether we're looking at the upper
or lower 16 bits by seeing whether the code point is above or
below DC00_{(16)}.
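
Here is the surrogate-pair computation worked out in Python for U+1F0AD, checked against Python's own UTF-16 encoder.

```python
# Encode a code point beyond U+FFFF as a UTF-16 surrogate pair.
code_point = 0x1F0AD
offset = code_point - 0x10000        # a 20-bit number
high = 0xD800 + (offset >> 10)       # upper 10 bits -> D800..DBFF
low  = 0xDC00 + (offset & 0x3FF)     # lower 10 bits -> DC00..DFFF
print(hex(high), hex(low))           # 0xd83c 0xdcad

# Python's built-in encoder produces the same two 16-bit units:
print('\U0001F0AD'.encode('utf-16-be').hex())   # d83cdcad
```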

UTF-8 is an alternative encoding scheme that has the advantage of being backward compatible with ASCII-encoded files. It categorizes the code points into four categories.

- U+0000 to U+007F, encoded as 0xxxxxxx: Represent these with a single byte. Characters in this range match the ASCII standard, and so a file stored entirely using ASCII reads exactly the same in UTF-8.
- U+0080 to U+07FF, encoded as 110xxxxx 10xxxxxx: These code points fit into 11 bits and are encoded using two bytes. The first byte starts with the bits 110 followed by the code point's upper five bits, and it is followed by a byte starting with 10 followed by the code point's lower six bits.
- U+0800 to U+FFFF, encoded as 1110xxxx 10xxxxxx 10xxxxxx: These code points fit into 16 bits and are encoded using three bytes. The first byte starts with the bits 1110 followed by the code point's upper four bits, followed by a byte starting with 10 and the code point's next six bits, followed by a byte starting with 10 and the code point's lowest six bits.
- U+10000 and up, encoded as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: These code points fit into 21 bits and are encoded using four bytes. The first byte starts with the bits 11110 followed by the code point's upper three bits, followed by three bytes each starting with 10 and carrying, in order, the code point's next six bits, the six bits after those, and the lowest six bits.
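
The following Python sketch applies these four rules by hand (the helper name utf8_encode is illustrative, not a library function) and checks each result against Python's built-in encoder.

```python
# A hand-rolled version of the four UTF-8 rules above, checked against
# Python's built-in encoder.
def utf8_encode(code_point):
    if code_point <= 0x7F:
        return bytes([code_point])
    elif code_point <= 0x7FF:
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0x3F)])
    elif code_point <= 0xFFFF:
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    else:
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0x3F),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])

for cp in (0x41, 0x3C0, 0x20AC, 0x1F0AD):   # A, pi, euro sign, queen of spades
    assert utf8_encode(cp) == chr(cp).encode('utf-8')
    print(hex(cp), utf8_encode(cp).hex())
```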

Because of its basic compatibility with ASCII, and because it generally stores information in compact form, UTF-8 is used heavily for communication between computers and for disk storage. It is very difficult to use for representing text in memory, though, where programs often want to process text as a random-access array of characters. Providing random-access character access is most convenient using a sequence of 32-bit values (UTF-32), though UTF-16 more often offers the preferred tradeoff between performance and space utilization.