Murray Sargent III

Oct 24, 2013


Microsoft Corporation

Text Services Group, Word

Tips & Tricks on Editing and Displaying Unicode Text

What’s RichEdit?

RichEdit 3.0 is a set of plain/rich-text, single/multiline,
Unicode/ANSI edit controls in a single worldwide binary

Multilevel undo, message & COM interfaces, Word
compatibility, pretty rich text

Outline view, zoom, font binding, latest in IME support,
and rich complex script support (BiDi, Indic, and Thai)

Next version: pagination, nested tables, tight wrap, 2D
math (maybe!)…

Clients: Office dialogs, WordPad, Outlook RTF editor,
Pocket Word,…


Discuss some problems in manipulating multilingual Unicode text:

Multiple fonts to display Unicode plain text

Neutral characters, deunifying characters that look different in
different scripts

Working with complex scripts, like Arabic

Using keyboards to enter Unicode characters conveniently

Maintaining backward compatibility with previous character sets

Navigating through text that includes “multicharacters”

Implementing glyph variants and surrogate pairs

Font Binding

Most Unicode characters belong to scripts

Associate with each position in a document a “font bundle”

When inserting characters, assign each one to a script

For CJK, check surrounding characters for Kana and Hangul as clues
to use Japanese or Korean fonts instead of Chinese

Assign scripts to neutrals and digits

Keyboard language, especially with IMEs, provides strong binding clues

Format inserted characters with fonts assigned to scripts. Check
current font to see if it supports required script
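
The binding steps above can be sketched roughly as follows. This is a minimal illustration, not RichEdit's code: the range table is deliberately coarse, and the names `SCRIPT_RANGES`, `script_of`, `bind_han`, and `bind_scripts` are hypothetical. Han characters are bound to Japanese or Korean when Kana or Hangul appear nearby, and to Chinese otherwise.

```python
# Coarse, illustrative code point ranges (an assumption, far from complete).
SCRIPT_RANGES = [
    (0x0041, 0x005A, "latin"), (0x0061, 0x007A, "latin"),
    (0x00C0, 0x024F, "latin"),
    (0x0370, 0x03FF, "greek"),
    (0x0400, 0x04FF, "cyrillic"),
    (0x0600, 0x06FF, "arabic"),
    (0x0E00, 0x0E7F, "thai"),
    (0x3040, 0x30FF, "kana"),
    (0x4E00, 0x9FFF, "han"),
    (0xAC00, 0xD7A3, "hangul"),
]


def script_of(ch):
    """Return a coarse script name, or None for neutrals and unknowns."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return None


def bind_han(text, i, window=8):
    """Use nearby Kana or Hangul as a clue for which CJK font to use."""
    nearby = text[max(0, i - window): i + window + 1]
    scripts = {script_of(c) for c in nearby}
    if "kana" in scripts:
        return "japanese"
    if "hangul" in scripts:
        return "korean"
    return "chinese"


def bind_scripts(text):
    """Pair each character with the script whose font should render it."""
    bound = []
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script == "han":
            script = bind_han(text, i)
        elif script == "kana":
            script = "japanese"
        elif script == "hangul":
            script = "korean"
        bound.append((ch, script))
    return bound
```

A real engine would then look up each script in the font bundle for that position and check whether the current font already supports it.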

Font Binding Problems

Character not in any script, e.g., mathematical, arrows,
dingbats: use current font or bind to font with font
signature covering appropriate Unicode range. Or invent
new script ID

Font signature may be zero, i.e., unsupported. Call
EnumFontFamiliesEx() to enumerate all charsets for the font

Font signature may claim support for Unicode ranges, but
miss some characters. cmap reveals support on codewise
basis (slow to access)

Ironically, charset or codepage is a good script ID

Language Detection & Font Binding

Korean and Japanese are often easy to spot because of Hangul and
Kana characters, respectively

For CJK can convert back to codepage and see if errors occur (Ken
Lunde’s suggestion)

For proofing purposes, accurate language identification is needed. For
font binding, script identification is usually sufficient

Typically more than one language corresponds to a script, e.g., Latin
script. Essentially only one uses the Korean script

Natural language processing techniques allow good language
identification if more than a few words are involved, e.g., a sentence
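
The codepage round-trip test mentioned above can be tried with Python's built-in codecs. This is only a sketch: the candidate list and the function name `roundtrip_codepages` are assumptions, and text that fits several code pages still needs another tie-breaker.

```python
# Candidate legacy CJK encodings, named as Python codecs (illustrative choice).
CANDIDATES = ["shift_jis", "gb2312", "big5", "euc_kr"]


def roundtrip_codepages(text, candidates=CANDIDATES):
    """Return the candidate encodings that can represent text without error."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this code page
        usable.append(name)
    return usable
```

For example, Hangul text survives only the Korean conversion, which is the clue used for font binding.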

Big Fonts

Bitstream Cyberbit has most Unicode characters (a “big font”)

Some big fonts have CJK glyph variants for Japanese vs
Simplified Chinese vs Traditional Chinese vs Korean

Font-binding code needs to avoid unnecessary (and
unwanted) font binding with such fonts

Recognize such fonts by using font signature Unicode
ranges and script (codepage) information

Font Sizing

In dialogs, 8-pt Latin characters are commonly used

8-pt Chinese characters are hard to read, so better to use 9
points in combination with 8-pt Latin characters

Latin characters have bigger descenders than Chinese
characters, since latter only need room for underline

Combining 8-pt Latin characters with 9-point Chinese
characters and keeping same baseline increases line height
to 9 pts plus extra height for Latin descender

Result is more like 10 points: shifts text too high in dialog
box originally designed to handle one language
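
The line-height arithmetic can be made concrete with a small sketch. The ascent and descent values below are hypothetical, chosen only to illustrate the effect; real values come from the font metrics.

```python
def line_height(runs):
    """Height of a line whose runs share one baseline.

    runs is a list of (ascent, descent) pairs in points.
    """
    tallest_ascent = max(ascent for ascent, _ in runs)
    deepest_descent = max(descent for _, descent in runs)
    return tallest_ascent + deepest_descent


latin_8pt = (6.0, 2.0)    # hypothetical metrics: larger descender
chinese_9pt = (8.0, 1.0)  # hypothetical metrics: room for underline only
print(line_height([latin_8pt, chinese_9pt]))  # 10.0
```

The Chinese run supplies the ascent and the Latin run supplies the descent, so the mixed line is taller than either run alone.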

Complex Scripts

Unicode covers many complex scripts, e.g., Arabic, Thai

scripts require layout engine that translates
character codes to glyph indices (often referencing

General Unicode text engine has to have access to
script layout engine

At the previous Unicode conference David Brown
discussed such an engine, Uniscribe, which runs on all
Windows platforms and is shipped with recent versions of
Internet Explorer

For performance: only use complex-script (CS) engine if needed
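
One way to decide whether the complex-script engine is needed is a quick range check over a run before handing it to the layout engine. The sketch below assumes a small, incomplete range table; `COMPLEX_RANGES` and `needs_complex_engine` are hypothetical names.

```python
# Illustrative blocks that need shaping or reordering (not a complete list).
COMPLEX_RANGES = [
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
    (0x0900, 0x0DFF),  # Indic blocks, Devanagari through Sinhala
    (0x0E00, 0x0E7F),  # Thai
]


def needs_complex_engine(text):
    """True if any character falls in a range that needs the CS engine."""
    return any(lo <= ord(ch) <= hi
               for ch in text
               for lo, hi in COMPLEX_RANGES)
```

Runs for which this returns False can go straight to the simple text-out path.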


Many characters are neutral or “multiscript” and can be rendered with
many different fonts

E.g., blank, ASCII punctuation, ASCII in general, other punctuation,
and decimal digits

Some scripts render neutrals very differently than others and Unicode’s
occasional “over-unification” has complicated what font to use

E.g., Western ellipsis consists of three dots on baseline, while a
Japanese ellipsis has three raised dots

Unicode Standard gives detailed rules for neutrals in BiDi text

Simple rule: neutrals that are surrounded by nonneutral characters of
same kind should be rendered with font of nonneutrals

Compatibility characters, such as ASCII fullwidth characters, reveal
which script they belong to
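
The simple rule for neutrals can be sketched as a pass over characters that already carry a script label (None for neutrals). This is an illustration under that assumption, not the BiDi algorithm from the Unicode Standard; `resolve_neutrals` is a hypothetical name.

```python
def resolve_neutrals(tagged):
    """Give runs of neutrals the script of their neighbors when both agree.

    tagged is a list of (char, script) pairs, with script None for neutrals.
    """
    result = list(tagged)
    n = len(tagged)
    i = 0
    while i < n:
        if tagged[i][1] is not None:
            i += 1
            continue
        j = i
        while j < n and tagged[j][1] is None:
            j += 1  # j ends just past the run of neutrals
        before = tagged[i - 1][1] if i > 0 else None
        after = tagged[j][1] if j < n else None
        if before is not None and before == after:
            for k in range(i, j):
                result[k] = (tagged[k][0], before)
        i = j
    return result
```

Neutrals at a script boundary or at the ends of the text stay unresolved and need another clue, such as the keyboard language or the current font.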

Backward Compatibility

Unicode text engine has to be able to import and export text in other
standards, which are defined by their codepages

Given non-Unicode plain text, which codepage should one use to
convert to/from Unicode?

On localized systems, system code page is a good bet

In multilingual text, you can enter text using keyboards in a variety of
languages that need either Unicode or multiple code pages

For searching text, best choice seems to be to use the current keyboard
code page

If text begins with a UTF-8 BOM, use UTF-8 conversion

If text begins with a rich-text header, e.g., “{\rtf” or “<html>” or
“<!doctype html”, use appropriate conversion routine
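
The header checks above amount to sniffing the first few bytes of the incoming text. A minimal sketch, in which the function name `sniff_format` and its return labels are assumptions:

```python
import codecs


def sniff_format(data: bytes) -> str:
    """Guess which conversion routine to use from the start of the data."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    head = data[:64].lstrip().lower()
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith(b"<html") or head.startswith(b"<!doctype html"):
        return "html"
    # No signature: fall back to a code page (system or keyboard).
    return "codepage"
```

Only the fallback case needs the code page guess discussed earlier.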

Backward Compatibility (cont)

Need a little rich-text functionality (minimal language tagging) to
display Unicode plain text unambiguously in some CJK scenarios

This functionality handles font choices and language-dependent glyph
variants

There can be a disparity between typed text and set text

When a user types in text using a keyboard charset, edit engine knows
charset and therefore can insert accurate Unicode text including which
CJK glyph variant to use

Client gets text as pure ANSI (or Unicode) text without script clues

Would be handy to have script tags. Language tags also work, but are
a case of overkill unless proofing tools are to be supported

Unicode on Win95/98

Win95/98 supports a limited subset of Unicode text functions

ExtTextOutW() works in most cases. Not on Win95J or with metafiles,
so convert back to ANSI whenever possible

Device drivers may not handle Unicode text

With TrueType it’s possible to force downloading of fonts and use
Unicode more reliably

A number of GDI text APIs aren’t implemented, e.g.,

GetStringTypeExW is stubbed out, so all references to character
property tables have to go through a codepage translation

Text boxes, list boxes, comboboxes are all ANSI; use RichEdit for

Unicode Keyboard Input

National keyboards provide ways to input many Unicode characters.
E.g., Greek, Russian, and all ordinary European text.

IMEs (input method editors) let you type phonetic characters to get a
partially composed character sequence. Then type blank to request
composition. If the composition is reasonably unique, you get a fully
composed character; else you get menu of possible resolutions.

For Unicode hex input, type a Unicode hexadecimal code into the text,
then type a special hot key, e.g., Alt+x, to convert the hex to a
Unicode character

Type Alt+X to replace a character by its hexadecimal number.

Input Sequence Checking. Vietnamese, Thai, and Indic languages
don’t allow all Unicode sequences to be valid and utilize special input
sequence checking code to disallow illegal sequences. For example,
Vietnamese only allows tone marks on vowels.
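
The hex-input hot keys described above can be sketched as two string operations on the text before the caret. This is a simplified illustration: the six-digit limit and the function names `hex_to_char` and `char_to_hex` are assumptions, and a real editor also has to handle a selection.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def hex_to_char(text):
    """Replace the trailing run of hex digits (at most 6) by its character."""
    i = len(text)
    while i > 0 and len(text) - i < 6 and text[i - 1] in HEX_DIGITS:
        i -= 1
    digits = text[i:]
    if not digits:
        return text
    cp = int(digits, 16)
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return text  # not a valid character code; leave text alone
    return text[:i] + chr(cp)


def char_to_hex(text):
    """Replace the last character by its hexadecimal code."""
    if not text:
        return text
    return text[:-1] + format(ord(text[-1]), "X")
```

For instance, `hex_to_char("abc 3b1")` yields "abc α", because the space stops the backward scan for hex digits.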

Unicode Surrogates

Discuss 3 display models that could enable Win9x/WinNTx based
applications to display higher-plane characters (those in the 16 planes
above the BMP). Ideas are still under development...

First uses a plane index and a 16-bit offset

Second uses a flat 32-bit index

Third uses surrogate-pair ligatures

Models aren’t mutually exclusive, since they involve different cmaps
(compressed tables used to convert codepoints to glyphs)

All assume higher-plane characters are stored as standard Unicode
surrogate pairs

Alternative representations include straight 32-bit characters and
UTF-8, but aren’t as practical
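
All three models start from the same arithmetic on surrogate pairs. The sketch below shows the standard conversion between a higher-plane code and its lead/trail pair, plus the plane index and 16-bit offset used by the first model; the function names are illustrative.

```python
def to_surrogates(cp):
    """Split a code in planes 1-16 into (lead, trail) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code is not above the BMP")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(lead, trail):
    """Combine a lead/trail pair back into a flat 32-bit code."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


def plane_and_offset(cp):
    """Return (plane index, 16-bit offset within that plane)."""
    return cp >> 16, cp & 0xFFFF
```

For U+1D400 this gives the pair (0xD835, 0xDC00), or plane 1 with offset 0xD400.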

Unicode Surrogates (cont)

Using two 16-bit surrogates to represent a single character complicates
more than measurement and display of characters:

Key handlers and other methods that change character position
must avoid ending up in between lead and trail surrogates

Input methods need to map to surrogate pair

Case changes, line-breaking rules, sorting, file formats, and backing
store manipulations in general have to recognize and deal with pairs

Surrogate code ranges make them easy to work with relative to
multibyte encoding systems

All three display models assume that GDI remains unchanged (need to
be able to run on OSs already in field)

Also assume that 16-bit glyph indices are sufficient so that TrueType
rasterizer doesn’t need to be revised
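
Keeping the caret out of the middle of a pair can be sketched over a backing store of 16-bit code units. This is a minimal illustration; the helper names are hypothetical, and a real editor would fold this into its general multicode navigation.

```python
import struct


def utf16_units(text):
    """Return text as a list of 16-bit code units (the backing store)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_lead(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    return 0xDC00 <= unit <= 0xDFFF


def move_right(units, pos):
    """Advance the caret one character, skipping over a whole pair."""
    if pos >= len(units):
        return len(units)
    pos += 1
    if pos < len(units) and is_trail(units[pos]) and is_lead(units[pos - 1]):
        pos += 1  # would have landed between lead and trail
    return pos


def move_left(units, pos):
    """Move the caret back one character, skipping over a whole pair."""
    if pos <= 0:
        return 0
    pos -= 1
    if pos > 0 and is_trail(units[pos]) and is_lead(units[pos - 1]):
        pos -= 1
    return pos
```

The range tests are all that is needed, which is why surrogates are easier to handle than multibyte encodings.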

Surrogate Planar Model

Characters in font all belong to a particular plane

No changes required to OS. Applications extend font binding logic to
handle font switches to appropriate planes

Character indices remain 16-bit: allows ExtTextOutW family to be
used directly

Model easy for apps to use today in platform-independent way if no
complex scripts are involved

Complex scripts need layout engine. Then applications can ignore
model issue, since layout engine handles OS/font interactions

Truncated 16-bit code indices may map codes in higher planes to
common control or neutral codes

For surrogate-unaware text-processing code, some ranges would have
to be reserved in upper planes
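
The truncation hazard is easy to demonstrate: dropping the plane index leaves only the low 16 bits, which may coincide with a control code. A small sketch, with hypothetical function names:

```python
import unicodedata


def truncate_to_16_bits(cp):
    """Keep only the 16-bit offset, as surrogate-unaware code would see it."""
    return cp & 0xFFFF


def collides_with_control(cp):
    """True if the truncated code is a control character (category Cc)."""
    return unicodedata.category(chr(truncate_to_16_bits(cp))) == "Cc"
```

For example, U+2000A truncates to 0x000A, a line feed, which is the kind of collision that reserved ranges in the upper planes would avoid.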

Surrogate Flat and Ligature Models

Flat 32-bit model uses 32-bit code to index into a new 32-bit cmap in
font file to translate the codes to 16-bit glyph indices

Glyph indices are used to access TextOut family

Method is too tricky for most applications to handle directly: need
surrogate-aware version of Uniscribe

Font binding is done using font signature

Alternatively, application could use 32-bit character strings with a
32-bit TextOut family housed in platform-independent component

Ligature model requires use of complex-script engine to access ligature

Comparison of Surrogate Models

Ease of implementation: for simple scripts, planar model is easiest. In
binary environment, need Uniscribe, which can handle
OS/font interactions

Performance: Code to glyph mapping has to be done at some point.
Uniscribe is slower and more RAM intensive than planar model or
32-bit TextOut component

Flexibility: flat and ligature models can access chars in all 17 planes
even in same font; planar model one plane per font

Backward compatibility: planar model only needs appropriate fonts
and surrogate-aware apps to work on all Windows platforms

Flat and ligature models require a complex-script engine or a 32-bit
TextOut component to run on all Win9x/WinNTx platforms

Nonspacing Combining Marks

Multicode characters (surrogate pairs, CRLFs, combining-mark and
tag sequences) require special display/navigation handling

Render combining-mark sequences by standard system calls and fonts
that support combining marks. Better display needs layout engine that
talks to OpenType

Simple caret movement across combining-mark sequences prevents
stopping inside a sequence. Backspace key deletes one mark at a time

Cursor hit testing leaves selection at beginning/end of
combining-mark sequence (more elegant model allows selection and
editing of individual marks)

Cool thing: if you can navigate past CRLF combinations, you can
modify corresponding code to handle surrogate pairs and combining
mark sequences quite easily
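
The point about extending CRLF navigation can be sketched as follows. Python strings are indexed by code point, so surrogate pairs do not arise here; the sketch covers only CRLF and nonspacing or enclosing marks (categories Mn and Me), and the function names are hypothetical.

```python
import unicodedata


def next_stop(text, pos):
    """Return the next caret stop after pos, never inside a multicode unit."""
    n = len(text)
    if pos >= n:
        return n
    if text.startswith("\r\n", pos):
        return pos + 2  # the original CRLF case
    pos += 1
    # Same idea, extended: swallow marks that combine with the base.
    while pos < n and unicodedata.category(text[pos]) in ("Mn", "Me"):
        pos += 1
    return pos


def caret_stops(text):
    """List every position at which the caret is allowed to rest."""
    stops = [0]
    while stops[-1] < len(text):
        stops.append(next_stop(text, stops[-1]))
    return stops
```

With the text "e", U+0301, CR, LF, "x", the caret stops are 0, 2, 4, and 5.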

Glyph Variants

Character variant: 1) Different character open to future
coding, 2) Prescribed variant (Mongolian), 3) Systematic
semantic variation (different forms like italic, bold, script,
Fraktur in math expressions)

Glyph variant: 1) Artistic variant: free variation (57 &s in
Poetica font), 2) Context preferred style (CJK language
based variants), 3) Overloaded code points (U+005C),
4) Historical variant: glyph changed over time

Identity variant: 2 external characters map to same
Unicode character

Handling Glyph Variants

Character variant is open to separate encoding. But if already used,
complicates search algorithms (Ş vs Romanian S comma)

Two approaches: inline variant marks and out-of-plane annotations

Inline variant marks need to be ignored in some searches

Out-of-plane annotation is invisible in plain text and requires more
memory than inline variant mark

Semantically different characters, e.g., math italic b and math script b,
need to be distinguishable in searches, so separate encoding or use of
inline variant marks are desirable

Current proposal for inline variant marks defines 256 standard variant
codes in plane 14 as well as 256 codes for user-defined variant codes
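
A search that ignores inline variant marks can be sketched by stripping the marks from both strings before comparing. The code ranges below are the variation selectors of the present-day Unicode Standard (U+FE00..U+FE0F and U+E0100..U+E01EF), used here as a stand-in; they are not the codes of the proposal described above, and the function names are hypothetical.

```python
def is_variation_selector(ch):
    """True for present-day variation selectors (stand-in for variant marks)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variants(text):
    """Remove inline variant marks, leaving the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


def find_ignoring_variants(haystack, needle):
    """Find needle in haystack with variant marks ignored on both sides.

    The returned index refers to the stripped haystack, or is -1.
    """
    return strip_variants(haystack).find(strip_variants(needle))
```

A search that must distinguish variants, such as math italic versus math script, would compare the unstripped strings instead.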


Have addressed issues encountered in creating Unicode
editors. Issues include:

Automatic choice of fonts for Unicode plain text

Handling non-Unicode documents in Unicode text engines

Ways to input Unicode text

Combining-mark sequences, surrogate pairs, navigation in
multicode text, and glyph variants

Some ideas have been implemented in RichEdit 3.0 control
and other text engines

Unicode surrogate pairs and glyph variants need