byte

The byte is a unit of digital information in computing and telecommunications that most commonly consists of eight bits. Historically, a byte was the number of bits used to encode a single character of text in a computer,[1][2] and it is for this reason the basic addressable element in many computer architectures.

The size of the byte has historically been hardware dependent, and no definitive standards exist that mandate the size. The de facto standard of eight bits is a convenient power of two permitting the values 0 through 255 for one byte. Many types of applications use variables representable in eight or fewer bits, and processor designers optimize for this common usage. The byte size and byte addressing are often used in place of longer integers for size or speed optimizations in microcontrollers and CPUs. Floating-point processors and signal-processing applications tend to operate on larger values, and some digital signal processors have 16 to 40 bits as the smallest unit of addressable storage. On such processors a byte may be defined to contain this number of bits. The popularity of major commercial computing architectures has aided in the ubiquitous acceptance of the 8-bit size.

The term octet was defined to explicitly denote a sequence of 8 bits because of the ambiguity associated with the term byte.

History

The term byte was coined by Dr. Werner Buchholz in July 1956, during the early design phase for the IBM Stretch computer.[3][4] It is a respelling of bite to avoid accidental mutation to bit.[1]

The size of a byte was at first selected to be a multiple of existing teletypewriter codes, particularly the 6-bit codes used by the U.S. Army (Fieldata) and Navy. A number of early computers were designed for 6-bit codes, including SAGE, the CDC 1604, IBM 1401, and PDP-8. Early network protocol documents cite varying byte sizes: RFC 608 (1974), for example, specifies that the byte size used for FTP host-name data should be the one that is most computationally efficient on a given hardware platform.[5]

In 1963, to end the use of incompatible teleprinter codes by different branches of the U.S. government, ASCII, a 7-bit code, was adopted as a Federal Information Processing Standard, making 6-bit bytes commercially obsolete. In the early 1960s, AT&T introduced digital telephony first on long-distance trunk lines. These used the 8-bit µ-law encoding. This large investment promised to reduce transmission costs for 8-bit data. IBM at that time extended its 6-bit "BCD" code to an 8-bit character code, "Extended BCD" in the System/360. The use of 8-bit codes for digital telephony also caused 8-bit data "octets" to be adopted as the basic data unit of the early Internet.

Since then, general-purpose computer designs have used eight bits in order to use standard memory parts, and communicate well, even though modern character sets have grown to use as many as 32 bits per character.

In the early 1970s, microprocessors such as the Intel 8008 (the direct predecessor of the 8080 and, later, the 8086 used in early personal computers) included features for operating on four-bit quantities, such as the DAA (decimal adjust) instruction and the half-carry flag, which were used to implement decimal arithmetic routines. These four-bit quantities were called nibbles, in homage to the then-common 8-bit bytes.
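As a rough illustration of how nibble-level adjustment implements decimal arithmetic, the following C sketch models the correction that a DAA-style instruction applies after a binary add. It is a simplified model, not any particular processor's exact semantics, and the bcd_add helper is purely illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative model only: add two packed-BCD bytes (two decimal digits
       each). After the binary add, if a nibble exceeded 9 (i.e. a half- or
       full-carry situation), add 6 to push it back into the decimal range. */
    static uint8_t bcd_add(uint8_t a, uint8_t b)
    {
        unsigned sum = a + b;
        if (((a & 0x0F) + (b & 0x0F)) > 9)      /* low nibble overflowed decimal range */
            sum += 0x06;
        if ((sum & 0xF0) > 0x90 || sum > 0xFF)  /* high nibble overflowed decimal range */
            sum += 0x60;
        return (uint8_t)sum;                    /* carry out of the byte is discarded here */
    }

    int main(void)
    {
        /* 0x38 + 0x45 represents decimal 38 + 45 = 83, i.e. packed BCD 0x83. */
        printf("0x%02X\n", bcd_add(0x38, 0x45));
        return 0;
    }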

Size

Architectures that did not have eight-bit bytes include the CDC 6000 series scientific mainframes, which divided their 60-bit floating-point words into 10 six-bit bytes. These bytes conveniently held character data from 12-bit punched Hollerith cards, typically the upper-case alphabet and decimal digits. CDC also often referred to 12-bit quantities as bytes, each holding two 6-bit display code characters, due to the 12-bit I/O architecture of the machine. The PDP-10 used assembly instructions LDB and DPB to load and deposit bytes of any width from 1 to 36 bits; these operations survive today in Common Lisp. Bytes of six, seven, or nine bits were used on some computers, for example within the 36-bit word of the PDP-10. The UNIVAC 1100/2200 series computers (now Unisys) addressed memory in both six-bit (Fieldata) and nine-bit (ASCII) modes within their 36-bit word. Telex machines used five bits to encode a character.
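The idea of loading and depositing a byte of arbitrary width can be sketched in C with ordinary shifts and masks. This is only a sketch of the concept; load_byte and deposit_byte are illustrative names and do not reproduce the PDP-10 byte-pointer semantics.

    #include <stdint.h>
    #include <stdio.h>

    /* Load an n-bit field starting at bit position pos (0 = least significant). */
    static uint64_t load_byte(uint64_t word, unsigned pos, unsigned n)
    {
        uint64_t mask = (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
        return (word >> pos) & mask;
    }

    /* Deposit value into the n-bit field at bit position pos. */
    static uint64_t deposit_byte(uint64_t word, unsigned pos, unsigned n, uint64_t value)
    {
        uint64_t mask = (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
        return (word & ~(mask << pos)) | ((value & mask) << pos);
    }

    int main(void)
    {
        uint64_t word = 0;
        /* Pack six 6-bit "bytes" into the low 36 bits of a 64-bit word. */
        for (unsigned i = 0; i < 6; i++)
            word = deposit_byte(word, i * 6, 6, i + 1);
        for (unsigned i = 0; i < 6; i++)
            printf("field %u = %llu\n", i,
                   (unsigned long long)load_byte(word, i * 6, 6));
        return 0;
    }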

8-bit bytes

Factors behind the ubiquity of the eight-bit byte include the popularity of the IBM System/360 architecture, introduced in the 1960s, and the 8-bit microprocessors, introduced in the 1970s.

The term octet is used to unambiguously specify a size of eight bits, and is used extensively in protocol definitions, for example.

Unit symbol

Prefixes for bit and byte multiples

Decimal (SI)
  Value     Symbol  Prefix
  1000      k       kilo
  1000^2    M       mega
  1000^3    G       giga
  1000^4    T       tera
  1000^5    P       peta
  1000^6    E       exa
  1000^7    Z       zetta
  1000^8    Y       yotta

Binary
  Value     IEC           JEDEC
  1024      Ki (kibi)     K (kilo)
  1024^2    Mi (mebi)     M (mega)
  1024^3    Gi (gibi)     G (giga)
  1024^4    Ti (tebi)
  1024^5    Pi (pebi)
  1024^6    Ei (exbi)
  1024^7    Zi (zebi)
  1024^8    Yi (yobi)

The unit symbol for the byte is specified in IEEE 1541 and the Metric Interchange Format[6] as the upper-case character B, while other standards, such as the International Electrotechnical Commission (IEC) standard IEC 60027, appear silent on the subject.

In the International System of Units (SI), B is the symbol of the bel, a unit of logarithmic power ratios named after Alexander Graham Bell. The usage of B for byte therefore conflicts with this definition. It is also not consistent with the SI convention that only units named after persons should be capitalized. However, there is little danger of confusion because the bel is a rarely used unit. It is used primarily in its decadic fraction, the decibel (dB), for signal strength and sound pressure level measurements, while a unit for one tenth of a byte, i.e. the decibyte, is never used.

The unit symbol kB is commonly used for kilobyte, but may be confused with the common meaning of kb for kilobit. IEEE 1541 specifies the lower-case character b as the symbol for bit; however, IEC 60027 and the Metric Interchange Format specify bit (e.g., Mbit for megabit) as the symbol, a sufficient disambiguation from byte.

The lowercase letter o for octet is a commonly used symbol in several non-English languages (e.g., French[7] and Romanian), and is also used with metric prefixes (for example, ko and Mo).

The harmonized standard ISO/IEC 80000-13:2008, Quantities and units, Part 13: Information science and technology, cancels and replaces subclauses 3.8 and 3.9 of IEC 60027-2:2005, namely those relating to information theory and prefixes for binary multiples.

Unit multiples

There has been considerable confusion about the meanings of SI (or metric) prefixes used with the unit byte, especially concerning prefixes such as kilo (k or K) and mega (M), as shown in the chart of prefixes for bit and byte multiples. Since computer memory is designed with binary logic, multiples are expressed in powers of 2 rather than powers of 10. The software and computer industries often use binary interpretations of the SI-prefixed quantities, while producers of computer storage devices prefer the decimal SI values. This is why a hard drive marketed as 100 GB (100 × 10^9 bytes) is reported by software using binary units as holding about 93 GiB.

While the numerical difference between the decimal and binary interpretations is comparatively small for the prefixes kilo and mega (2.4% and 4.9%, respectively), it grows to over 20% for the prefix yotta.
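These figures follow from straightforward arithmetic. The following minimal C sketch computes the drive-capacity example above and the relative difference between the binary and decimal interpretation of each prefix:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* A drive sold as "100 GB" holds 100 * 10^9 bytes; software that
           reports capacity in binary units divides by 2^30 (one GiB). */
        double bytes = 100e9;
        printf("100 GB = %.2f GiB\n", bytes / pow(2, 30));   /* about 93.13 GiB */

        /* Relative difference (1024^n - 1000^n) / 1000^n for n = 1..8,
           i.e. kilo/kibi up to yotta/yobi. */
        const char *names[] = {"kilo", "mega", "giga", "tera",
                               "peta", "exa", "zetta", "yotta"};
        for (int n = 1; n <= 8; n++) {
            double diff = pow(1024, n) / pow(1000, n) - 1.0;
            printf("%-5s: +%.1f%%\n", names[n - 1], diff * 100.0);
        }
        return 0;
    }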

Common uses

The byte is also defined as a data type in certain programming languages. The C and C++ programming languages, for example, define a byte as an "addressable unit of data large enough to hold any member of the basic character set of the execution environment" (clause 3.6 of the C standard). The C standard requires that the char integral data type be capable of holding at least 256 different values and be represented by at least 8 bits (clause 5.2.4.2.1). Various implementations of C and C++ define a byte as 8, 9, 16, 32, or 36 bits.[8][9] The actual number of bits in a particular implementation is given by the CHAR_BIT macro defined in the limits.h header. Java's primitive byte data type is always defined as consisting of 8 bits and being a signed data type, holding values from −128 to 127.
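A short C program can report these implementation-defined properties; this is a sketch, and the printed values depend on the implementation it is compiled with.

    #include <stdio.h>
    #include <limits.h>   /* CHAR_BIT, CHAR_MIN, CHAR_MAX, UCHAR_MAX */

    int main(void)
    {
        /* CHAR_BIT is the number of bits in a byte on this implementation;
           the C standard guarantees it is at least 8. */
        printf("bits per byte (CHAR_BIT): %d\n", CHAR_BIT);
        printf("char range:               %d .. %d\n", CHAR_MIN, CHAR_MAX);
        printf("unsigned char maximum:    %u\n", (unsigned)UCHAR_MAX);
        /* sizeof is measured in bytes, so sizeof(char) is 1 by definition. */
        printf("sizeof(char):             %zu\n", sizeof(char));
        return 0;
    }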

In data transmission systems, the byte denotes a contiguous sequence of bits in a serial data stream, such as in modem or satellite communications, representing the smallest meaningful unit of data. Such bytes might include start bits, stop bits, or parity bits, and thus could span 7 to 12 bits while containing a single 7-bit ASCII code.
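As an illustration of such framing, the following C sketch assembles a 10-bit asynchronous frame in the common "7E1" configuration (1 start bit, 7 data bits, 1 even-parity bit, 1 stop bit) carrying one 7-bit ASCII character. The frame_7e1 helper is hypothetical and the configuration is chosen only as an example.

    #include <stdint.h>
    #include <stdio.h>

    /* Build a 7E1 asynchronous serial frame for one ASCII character:
       1 start bit (0), 7 data bits (LSB first), 1 even-parity bit, 1 stop bit (1).
       Returned as a 10-bit pattern with the first-transmitted bit in bit 0. */
    static uint16_t frame_7e1(char c)
    {
        uint16_t data = (uint16_t)(c & 0x7F);
        uint16_t parity = 0;
        for (int i = 0; i < 7; i++)
            parity ^= (data >> i) & 1;        /* even parity over the data bits */
        return (uint16_t)((data << 1)         /* start bit = 0 occupies bit 0 */
                          | (parity << 8)     /* parity bit */
                          | (1u << 9));       /* stop bit = 1 */
    }

    int main(void)
    {
        uint16_t f = frame_7e1('A');          /* 'A' = 0x41 */
        for (int i = 0; i < 10; i++)          /* print in transmission order */
            putchar('0' + ((f >> i) & 1));
        putchar('\n');                        /* 10 bits carry one 7-bit character */
        return 0;
    }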


References

  1. R.W. Bemer and W. Buchholz (1962). "Character Set" (Chapter 6). In Werner Buchholz (ed.), Planning a Computer System: Project Stretch. http://archive.computerhistory.org/resources/text/IBM/Stretch/pdfs/Buchholz_102636426.pdf
  2. R.W. Bemer (1959). "A proposal for a generalized card code of 256 characters". Communications of the ACM 2 (9), pp. 19–23.
  3. Werner Buchholz (July 1956). computerhistory.org. http://archive.computerhistory.org/resources/text/IBM/Stretch/102636400.txt
  4. "byte". The Jargon File. http://catb.org/~esr/jargon/html/B/byte.html
  5. RFC 608, Host Names On-Line, M.D. Kudlick, SRI-ARC (January 10, 1974), Quote: "The first byte size is the one which leads to the least computational overhead (e.g. 36 for PDP-10's, 32 for 360's)."
  6. Metric-Interchange-Format
  7. International Electrotechnical Commission. http://www.iec.ch/zone/si/si_bytes.htm. Retrieved August 30, 2010.
  8. [26] Built-in / intrinsic / primitive data types, C++ FAQ Lite
  9. Integer Types In C and C++