Information, Characters, Unicode, Summaries of Latin

An 8-bit character set is a convenient size and so US ASCII is for the most part replaced by Latin-1 which supports some European languages.

Typology: Summaries

2022/2023

Uploaded on 02/28/2023

sureesh 🇺🇸


Information, Characters, Unicode

Information – Characters

In modern computing, natural-language text is very important information. (“Number-crunching” is less important.) Characters of text are represented in several different ways, and a known character encoding is necessary to exchange text information.

For many years an important encoding standard for characters has been US ASCII, a 7-bit encoding. Since 7 does not divide 32, the ubiquitous word size of computers, 8-bit encodings are more common. Very common is ISO 8859-1, aka “Latin-1,” along with other 8-bit encodings of character sets for languages other than English.

Currently, a very large multilingual character repertoire known as Unicode is gaining importance.
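To make the difference between 7-bit, 8-bit, and Unicode encodings concrete, the following sketch encodes one non-English letter in several encodings and prints the resulting bytes. The class and method names (`EncodingSizes`, `hexBytes`) are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Encode the text in the given charset and show each byte as two hex digits.
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // latin small letter e with acute
        // US ASCII has no code for this letter, so Java substitutes '?' (3F).
        System.out.println("US-ASCII:   " + hexBytes(text, StandardCharsets.US_ASCII));
        // Latin-1 uses one 8-bit code.
        System.out.println("ISO-8859-1: " + hexBytes(text, StandardCharsets.ISO_8859_1));
        // UTF-8, one encoding of Unicode, uses two bytes for this letter.
        System.out.println("UTF-8:      " + hexBytes(text, StandardCharsets.UTF_8));
    }
}
```

The same character yields three different byte sequences, which is why the receiver must know which encoding the sender used.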

Control Characters

Notice that the first two rows are filled with so-called control characters. These characters have no printable representation. Except for various standards for indicating lines of text, these characters have no use today. So nearly one-quarter of the space available for representing characters is wasted.

Of course, the space character does not have a printable representation (no ink is used to print a space), but it is extremely useful.

Some US-ASCII Characters

Each character has a unique bit pattern used to represent it (and a Unicode name as

we shall see later).

binary     oct   dec  char  Unicode
0000 1001  0011    9  HT    U+0009  horizontal tabulation
0010 0000  0040   32        U+0020  space
0010 1110  0056   46  .     U+002E  full stop
0010 1111  0057   47  /     U+002F  solidus
0011 0000  0060   48  0     U+0030  digit zero
0011 0001  0061   49  1     U+0031  digit one
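The numeric columns of such a table are just the same code written in different bases. A minimal sketch that regenerates them for any 8-bit character follows; the names `AsciiRow` and `row` are hypothetical, and the method assumes the code fits in 8 bits.

```java
class AsciiRow {
    // Format the binary, octal, decimal, and U+ columns for one character.
    // Assumes the code is below 256, as in the US-ASCII and Latin-1 tables.
    static String row(char c) {
        int code = c;
        // Left-pad the binary digits to 8 places, then split into two nibbles.
        String bits = String.format("%8s", Integer.toBinaryString(code)).replace(' ', '0');
        return String.format("%s %s %04o %d U+%04X",
                bits.substring(0, 4), bits.substring(4), code, code, code);
    }

    public static void main(String[] args) {
        for (char c : new char[] { ' ', '.', '/', '0', '1' }) {
            System.out.println(row(c));
        }
    }
}
```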

Some Characters

Here are some of the characters in Latin-1 not used in writing English.

binary     oct   dec  char  Unicode
1100 0011  0303  195  Ã     U+00C3  latin capital letter a with tilde
1101 0111  0327  215  ×     U+00D7  multiplication sign
1101 1111  0337  223  ß     U+00DF  latin small letter sharp s
1110 1101  0355  237  í     U+00ED  latin small letter i with acute
1111 1110  0376  254  þ     U+00FE  latin small letter thorn

An 8-bit character set is a convenient size, and so US ASCII is for the most part replaced by Latin-1, which supports some European languages. Microsoft’s CP1252 is somewhat similar.

The newer ISO 8859-15 (Latin-9), nicknamed Latin-0, updates Latin-1 by replacing eight infrequently used characters (¤ ¦ ¨ ´ ¸ ¼ ½ ¾) with left-out French letters (Ÿ, œ) and Finnish and Lithuanian letters (š, ž), and by placing the euro sign € in cell 0xA4, formerly occupied by the (unspecified) currency sign ¤.

¤  U+00A4  currency sign                   →  €  U+20AC  euro sign
¦  U+00A6  broken bar                      →  Š  U+0160  latin capital letter s with caron
¨  U+00A8  diaeresis                       →  š  U+0161  latin small letter s with caron
´  U+00B4  acute accent                    →  Ž  U+017D  latin capital letter z with caron
¸  U+00B8  cedilla                         →  ž  U+017E  latin small letter z with caron
¼  U+00BC  vulgar fraction one quarter     →  Œ  U+0152  latin capital ligature oe
½  U+00BD  vulgar fraction one half        →  œ  U+0153  latin small ligature oe
¾  U+00BE  vulgar fraction three quarters  →  Ÿ  U+0178  latin capital letter y with diaeresis
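The list of replaced cells can be checked mechanically by decoding every 8-bit value under both standards and reporting where the results differ. This is a sketch only: the class name `Latin9Diff` is hypothetical, and it assumes the Java runtime provides the ISO-8859-15 charset (it is not one of the charsets Java guarantees on every platform).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Latin9Diff {
    // List every byte value that Latin-1 and Latin-9 decode to different characters.
    static List<String> differingCells() {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset latin9 = Charset.forName("ISO-8859-15"); // assumed to be available
        List<String> rows = new ArrayList<>();
        for (int b = 0x00; b <= 0xFF; b++) {
            byte[] cell = { (byte) b };
            String inLatin1 = new String(cell, latin1);
            String inLatin9 = new String(cell, latin9);
            if (!inLatin1.equals(inLatin9)) {
                rows.add(String.format("0x%02X %s -> %s", b, inLatin1, inLatin9));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : differingCells()) {
            System.out.println(row);
        }
    }
}
```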

Differences in Character Encodings

binary     oct   dec  hex   MacR  1252  Latin1  Latin9
0111 0011  0163  115  0x73  s     s     s       s
1000 0000  0200  128  0x80  Ä     €     XXX     XXX
1000 0101  0205  133  0x85  Ö     …     NEL     NEL
1000 1010  0212  138  0x8A  ä     Š     VTS     VTS
1010 0100  0244  164  0xA4  §     ¤     ¤       €
1010 0110  0246  166  0xA6  ¶     ¦     ¦       Š
1011 0110  0266  182  0xB6  ∂     ¶     ¶       ¶
1101 1011  0333  219  0xDB  €     Û     Û       Û
1110 0100  0344  228  0xE4  ‰     ä     ä       ä
1111 0011  0363  243  0xF3  Û     ó     ó       ó

Standards help ensure that the bit patterns are understood the same way. But the applicable standard must be clearly known.
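What goes wrong when the applicable standard is not known can be sketched by writing text under one encoding and reading it back under another. The helper name `reinterpret` is invented for this example, and the windows-1252 and ISO-8859-15 charsets are assumed to be present in the Java runtime.

```java
import java.nio.charset.Charset;

class WrongStandard {
    // Encode text with the writer's charset, then decode the same bytes
    // with the (possibly different) charset the reader assumes.
    static String reinterpret(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    // Show a string as a list of Unicode code points, e.g. "U+20AC".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // euro sign
        // CP1252 stores the euro sign in cell 0x80; Latin-1 reads 0x80 as a control character.
        System.out.println(codePoints(reinterpret(euro, "windows-1252", "ISO-8859-1")));
        // Latin-1 stores the currency sign in cell 0xA4; Latin-9 reads 0xA4 as the euro sign.
        System.out.println(codePoints(reinterpret("\u00A4", "ISO-8859-1", "ISO-8859-15")));
    }
}
```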

Java Makes It Easy

Indicate to the Scanner class which character encoding is to be expected, and Java will interpret the bytes correctly. This is because Java uses Unicode internally, which is a superset of all commonly used character sets.

Scanner s = new Scanner(System.in, "ISO-8859-1");

Scanner s = new Scanner(System.in, "Cp1252");

Without a specified character encoding, the computer’s default encoding is used.

Scanner s = new Scanner(System.in);

A program with such a scanner may behave differently on different computers, leading to confusion.
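The effect of the charset argument can be tried without typing at a console by feeding the Scanner a fixed byte array. In this sketch the names `ScannerCharset`, `firstLine`, and `codePoints` are hypothetical helpers, and the windows-1252 charset is assumed to be available.

```java
import java.io.ByteArrayInputStream;
import java.util.Scanner;

class ScannerCharset {
    // Read the first line of the byte data, decoding it with the named charset.
    static String firstLine(byte[] data, String charsetName) {
        try (Scanner s = new Scanner(new ByteArrayInputStream(data), charsetName)) {
            return s.nextLine();
        }
    }

    // Show a string as a list of Unicode code points.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Two 8-bit codes followed by a line feed.
        byte[] data = { (byte) 0xE4, (byte) 0x80, '\n' };
        System.out.println(codePoints(firstLine(data, "ISO-8859-1")));
        System.out.println(codePoints(firstLine(data, "windows-1252")));
    }
}
```

The same three bytes produce different text depending on the charset named, which is the behavior a scanner without an explicit charset inherits from whatever the platform default happens to be.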

What Do Characters Mean?

Some characters, like the tab, have no fixed meaning even though they have agreed-upon code points. Tabs are interpreted differently by different applications, leading to confusion.

Six invisible or white-space characters are legal in a Java program. No other control characters are legal in a Java program. A Java program is permitted to end with the “substitute” character.

binary     oct   dec  Latin1  Unicode
0000 1001  0011    9  HT      U+0009  horizontal tabulation
0000 1010  0012   10  LF      U+000A  line feed
0000 1100  0014   12  FF      U+000C  form feed
0000 1101  0015   13  CR      U+000D  carriage return
0001 1010  0032   26  SUB     U+001A  substitute
0010 0000  0040   32          U+0020  space

There is no advantage to using a horizontal tabulation or a substitute character in a Java program. But there is a risk of breaking some application that uses Java source code for input (pretty-printers, text beautifiers, metric tools, etc.).

Newline indications are necessary for formatting programs, and Java permits all three of the common newline conventions: the line feed character (common in Unix applications), the carriage return (Mac applications), and the carriage return character followed by the line feed (Microsoft applications).
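A source file can be checked for these two risky characters with a few lines of code. This is a minimal sketch; `RiskyWhitespace` and `count` are names chosen for the example, and the source text is assumed to be already loaded into a String.

```java
class RiskyWhitespace {
    // Count horizontal tabulation (U+0009) and substitute (U+001A) characters.
    static int count(String source) {
        int risky = 0;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == 0x09 || c == 0x1A) {
                risky++;
            }
        }
        return risky;
    }

    public static void main(String[] args) {
        String source = "class A {\n\tint x;\n}\n";
        System.out.println(count(source) + " risky character(s) found");
    }
}
```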

MacOS CR "\r"

Unix LF "\n"

Windows CR,LF "\r\n"

Other newline markers are much less common. Next-line (NEL, 0x85) is used to mark end-of-line on some IBM mainframes. Unicode has its own approach to indicating a new line:

Unicode

U+2028 line separator
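A program that must accept text from any of these systems can normalize the newline marks before further processing. The sketch below relies on the `\R` linebreak matcher of `java.util.regex.Pattern` (available since Java 8); the class and method names are hypothetical. Note that `\R` also matches vertical tab, form feed, and the Unicode paragraph separator, which may or may not be wanted.

```java
class NewlineNormalizer {
    // Replace every newline convention with a single line feed.
    // \R matches CR LF as one unit, as well as lone LF, CR, NEL,
    // U+2028, U+2029, vertical tab, and form feed.
    static String normalize(String text) {
        return text.replaceAll("\\R", "\n");
    }

    public static void main(String[] args) {
        String mixed = "unix\nwindows\r\nmac\rmainframe\u0085unicode\u2028end";
        System.out.println(normalize(mixed));
    }
}
```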

Newline

From Wikipedia:

In computing a newline, also known as a line break or end-of-line (EOL) marker, is a special character or sequence of characters signifying the end of a line of text.

There is also some confusion whether newlines terminate or separate lines. If a newline is considered a separator, there will be no newline after the last line of a file. The general convention on most systems is to add a newline even after the last line, i.e. to treat newline as a line terminator. Some programs have problems processing the last line of a file if it is not newline terminated.

Newline

Please consider the newline mark a line terminator.

The number of newline marks in a file is the number of lines in the file.

You can lose points on tests through this type of miscommunication.
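Under the terminator convention, counting lines reduces to counting newline marks. A minimal sketch, with hypothetical names `LineCounter` and `countLines`, and using the `\R` linebreak matcher so that CR LF counts as one mark:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineCounter {
    private static final Pattern NEWLINE = Pattern.compile("\\R");

    // The number of lines is the number of newline marks in the text.
    // Text after the last newline mark is not a complete line and is not counted.
    static int countLines(String text) {
        Matcher m = NEWLINE.matcher(text);
        int lines = 0;
        while (m.find()) {
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(countLines("one\ntwo\n")); // both lines are terminated
        System.out.println(countLines("one\ntwo"));   // the second line is not terminated
    }
}
```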

Unicode Versions

version  date      scripts  characters  bits
1.0      Oct 1991       24       7,161  12.
2.0      Jul 1996       24      38,950  15.
3.0      Sep 1999       38      49,249  15.
4.0      Apr 2003       52      96,447  16.
5.0      Jul 2006       64      99,098  16.
6.0      Oct 2010       93     109,449  16.
7.0      Jun 2014      123     113,021  16.
8.0      Jun 2015      129     120,737  16.
9.0      Jun 2016      135     128,237  16.
10.0     Jun 2017      139     136,690  17.
11.0     Jun 2018      146     137,374  17.
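The bits column appears to be the base-2 logarithm of the character count; that reading is an assumption, since the slides do not define the column. Under that assumption, the sketch below computes both the logarithm and the smallest whole number of bits that can give every character a distinct code. The names `CodeBits`, `log2`, and `bitsNeeded` are invented for the example.

```java
class CodeBits {
    // Base-2 logarithm of the number of characters.
    static double log2(int count) {
        return Math.log(count) / Math.log(2);
    }

    // Smallest number of bits that provides at least `count` distinct bit patterns.
    static int bitsNeeded(int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be positive");
        }
        return 32 - Integer.numberOfLeadingZeros(count - 1);
    }

    public static void main(String[] args) {
        int[] counts = { 128, 256, 7161, 137374 };
        for (int n : counts) {
            System.out.printf("%,d characters: log2 = %.2f, whole bits = %d%n",
                    n, log2(n), bitsNeeded(n));
        }
    }
}
```

By this measure a 16-bit code stopped being enough once the repertoire passed 65,536 characters.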

Unicode 8.0 (2015 June 17) adds 7,716 characters, for a total of 120,737 characters (the figure also given by Wikipedia). These additions include six new scripts (among them Anatolian Hieroglyphs and Old Hungarian) for a total of 129 scripts, and many new symbols, as well as character additions to several existing scripts.

Unicode 9.0 (2016 June 21) adds 7,500 characters, for a total of 128,172 (Wikipedia: +65 = 128,237) characters. These additions include six new scripts (among them Osage and historic Tangut) for a total of 135 scripts, and 72 new emoji characters.

Unicode 10.0 (2017 June 20) adds 8,518 characters, for a total of 136,690 (Wikipedia: +65 = 136,755) characters. These additions include a character for the Bitcoin sign, as well as 56 new emoji characters. Also included are four new scripts (among them Nushu and Soyombo), for a total of 139 scripts.

Unicode 11.0 (2018 June 5) adds 684 characters, for a total of 137,374 (Wikipedia: +65 = 137,439) characters. These additions include 7 new scripts, for a total of 146 scripts, as well as 66 new emoji characters.

Unicode 12.0 (2019 March 5) adds 554 characters, for a total of 137,928 (Wikipedia: +65 = 137,993) characters. These additions include 4 new scripts, for a total of 150 scripts, as well as 61 new emoji characters. Characters: the Marca Registrada sign, fairy chess symbols. Scripts: historic Elymaic, Wancho, Miao.