March 8, 2014
Han unification

Han unification is the process used by the authors of Unicode and the Universal Character Set to map multiple character sets of the CJK languages into a single set of unified characters. Chinese characters are common to Chinese (where they are called hanzi), Japanese (where they are called kanji), and Korean (where they are called hanja). Modern Korean, Chinese and Japanese typefaces may represent a given Han character as somewhat different glyphs. However, in the formulation of Unicode, these differences were folded, so that such variants share a single code point. This unification is referred to as "Han unification", and the resulting character repertoire is sometimes referred to as Unihan.


Rules for Han Unification are given in the East Asian Scripts chapter of the various versions of the Unicode Standard (Chapter 11 in Unicode 4.0). The Ideographic Rapporteur Group (IRG), made up of experts from the Chinese-speaking countries, North and South Korea, Japan, Vietnam, and other countries, is responsible for the process.



The article "The secret life of Unicode" on IBM developerWorks (http://www.ibm.com/developerworks/unicode/library/u-secret.html) explains this issue in a way that illustrates some of the confusion:

The problem stems from the fact that Unicode encodes characters rather than "glyphs," which are the visual representations of the characters. There are four basic traditions for East Asian character shapes: traditional Chinese, simplified Chinese, Japanese, and Korean. While the Han root character may be the same for CJK languages, the glyphs in common use for the same characters may not be, and new characters were invented in each country.


For example, the traditional Chinese glyph for "grass" uses four strokes for the "grass" radical, whereas the simplified Chinese, Japanese, and Korean glyphs use three. But there is only one Unicode point for the grass character (草, U+8349) regardless of writing system. Another example is the ideograph for "one" (壹, 壱, or 一), which is different in Chinese, Japanese, and Korean. Many people think that the three versions should be encoded differently.


In fact, the three ideographs for "one" are encoded separately in Unicode. They are not national variants. The first and second are used on financial instruments to prevent forgery, while the third is the common form in all three countries.
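The claim is easy to check. The following is a minimal sketch, not part of the original article, that uses Python's standard `unicodedata` module to print the code point of each form of "one" alongside the single code point of the grass character; the variable names are illustrative only.

```python
import unicodedata

# The three forms of "one" discussed above, plus the grass character.
forms_of_one = ["壹", "壱", "一"]
grass = "草"

for ch in forms_of_one + [grass]:
    # unicodedata.name() gives the formal Unicode character name.
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")

# Three different code points: the forms of "one" are not unified.
code_points = {ord(ch) for ch in forms_of_one}
print("distinct code points for 'one':", len(code_points))
```

Each form of "one" prints its own code point, while the grass character has only U+8349, whichever regional glyph a font draws for it.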

A slight difference in rendering characters might be considered a serious problem if it changes the meaning or reflects the wrong cultural tradition. Besides a simple nuisance like Japanese text looking like Chinese, names might be displayed with a different glyph: the same character in the sense of encoding, but a different character in the view of the users. This rendering problem is often cited to criticize Westerners as unaware of such subtle distinctions, even though the unification is being carried out by East Asian experts. The display error occurs only when rendering plain text in a single font, and not when rendering language-specific text and names in language-appropriate fonts.

The process of Han Unification was controversial, with most of the opposition coming from Japan. Opponents of Han unification state that it steamrolls over thousands of years of cultural tradition, misses many of the subtleties that are one of the most important features of these languages, and renders serious literature and academic research in these languages impossible. Proponents of Han unification point out that the unification process is in the hands of specialists from China, Korea, and Japan, and that the objections to unification of specific characters are made without regard to their histories. Characters which some Japanese today consider completely distinct were historically the same, and were taught as the same in Japanese schools until the 1950s. As for historical research, Unicode now encodes far more characters than any other standard, and far more than were listed in any dictionary, with many more being processed for inclusion as fast as the scholars can agree on their identities.

Some characters used only in names are not included in Unicode. This is not a form of cultural imperialism, as is sometimes feared. These characters are generally not included in their national character sets either.



Much of the controversy surrounding Han unification is based on confusion between the ideas of characters and glyphs, as defined in Unicode, and the related but distinct idea of graphemes. Unicode defines abstract characters, as opposed to glyphs, which are particular visual representations of a character in a font, or graphemes, which are the basic units of writing in a particular language. One character may be represented by many distinct glyphs: for example, a "g" or an "a" may each be drawn with one loop or two. In Dutch, "ij" is a single letter, and thus a grapheme; in "IJsselmeer", for example, the whole digraph "IJ" is capitalized as the first letter. The same holds for "ch" in some Spanish-speaking countries and "lj" in Croatian. Graphemes present in national character code standards have been added to Unicode, as required by Unicode's Source Separation rule, even where they can be composed of characters already available.
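As an illustration of graphemes that were encoded even though they can be composed from existing characters, the sketch below looks at the single code points Unicode provides for the Dutch "ij" (U+0133) and the Croatian "lj" (U+01C9). These code points are not named in the text above; they are supplied here as examples, and the code relies only on Python's standard `unicodedata` module.

```python
import unicodedata

# Precomposed digraph characters (example code points, not from the article).
digraphs = {
    "Dutch ij": "\u0133",     # LATIN SMALL LIGATURE IJ
    "Croatian lj": "\u01C9",  # LATIN SMALL LETTER LJ
}

for label, ch in digraphs.items():
    # NFKD applies the compatibility decomposition, which splits each
    # digraph character into the two ordinary letters it is made of.
    expanded = unicodedata.normalize("NFKD", ch)
    parts = " + ".join(f"U+{ord(c):04X}" for c in expanded)
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)} -> {parts}")
```

Each digraph is one character on its own, yet compatibility normalization reduces it to two plain Latin letters.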

Unicode publishes charts with pictures for each character, but these are illustrations only and do not mandate the character's shape. References like http://www.debian.or.jp/~kubota/unicode-unihan.html below seem to assume that what the Unicode standard pictures is how each character must be displayed, and protest when it doesn't match the local appearance of the character. The way things are supposed to work is that a Japanese user will have a font with Japanese-style characters, a Chinese user will have a font with Chinese-style characters, etc., and everyone will see the "right" characters for them. Problems are introduced when several languages must be represented in the same text document, and users expect different fonts for the different languages. This can be worked around outside the Unicode standard with higher-level markup defining the language used for each string of characters, although this is cumbersome and may not always work correctly; see the demonstration below.
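The higher-level markup mentioned above can be as simple as an HTML `lang` attribute around each run of text. The following is a rough sketch under that assumption; the helper name `tag_language` and the sample strings are hypothetical, and whether a browser then picks a suitable font depends on the fonts installed.

```python
from html import escape

def tag_language(text: str, lang: str) -> str:
    """Wrap a run of text in a span that declares its language."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'

# The same code point, U+8349, tagged for four different languages.
samples = [("zh-cn", "草"), ("zh-tw", "草"), ("ja", "草"), ("ko", "草")]
for lang, text in samples:
    print(tag_language(text, lang))
```

The encoded text is identical in every span; only the surrounding markup tells the renderer which regional glyph style is wanted.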

Most of the opposition to Han unification appears to come from Japan, owing to a heightened sensitivity to the distinctions between Chinese and Japanese styles of characters. There has been very little opposition from Chinese speakers. Although the Taiwan Big5 character set does not include Simplified characters, the PRC has character set standards with and without them. Unicode is seen as neutral with regard to the politically charged issue of Simplified versus Traditional characters, because it encodes Simplified and Traditional Chinese characters separately (e.g. the ideograph for "discard" is 丟 U+4E1F for Traditional Chinese, Big5 #A5E1, and 丢 U+4E22 for Simplified Chinese, GB #2210). Traditional and Simplified characters must be encoded separately according to the Unicode unification rules because they are distinguished in pre-existing PRC character sets, not just because they have different shapes. Mapping between Traditional and Simplified characters is not one-to-one, which also prevents unification.
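The "discard" example can be explored with Python's built-in `big5` and `gb2312` codecs. This is a sketch only: it assumes those codecs match the Big5 and GB standards the text refers to, it reads the GB number as a two-digit row followed by a two-digit cell, and the helper `gb2312_row_cell` is hypothetical.

```python
def gb2312_row_cell(ch: str) -> tuple:
    """Return the GB 2312 (row, cell) position of a character.

    Python's "gb2312" codec emits two EUC-CN bytes, which are the row
    and cell numbers each offset by 0xA0.
    """
    high, low = ch.encode("gb2312")
    return high - 0xA0, low - 0xA0

traditional = "\u4E1F"  # "discard", Traditional form
simplified = "\u4E22"   # "discard", Simplified form

big5_bytes = traditional.encode("big5").hex().upper()
print(f"U+{ord(traditional):04X} Big5 bytes: {big5_bytes}")
print(f"U+{ord(simplified):04X} GB 2312 row/cell: {gb2312_row_cell(simplified)}")
```

Because the two forms already had separate codes in the national standards, they have separate code points in Unicode as well.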

Specialist character sets developed to address, or regarded by some as not suffering from, these perceived deficiencies include:
  • CNS character set

  • CCCII character set

  • Giga Character Set

  • TRON

  • UTF-2000


However, none of these alternative standards has been as widely adopted as Unicode, which is now the base character set for many new standards and protocols, and is built into the architecture of operating systems (Windows, Mac OS X, and many versions of Unix), programming languages (Perl, Python, Java, Common Lisp, APL), libraries (IBM's International Components for Unicode (ICU), along with the Pango, Graphite and Scribe rendering engines), font formats (TrueType and OpenType), and so on.




The following table contains the same characters in all five rows, but each row is marked (via an HTML attribute) as being in a different language: Chinese (three varieties: unmarked "Chinese", simplified characters, and traditional characters), Japanese, or Korean. Ideally, your browser should select fonts and glyphs that suit each language. See how well it works for you.
  • Chinese (generic)
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Japanese
  • Korean



The following table shows identical graphemes that have multiple glyph variants encoded separately in Unicode:

<table style="font-size: xx-large;">
<tr style="vertical-align: middle; height: 1.5em;" lang="zh" xml:lang="zh">
<td style="font-size: medium;">Chinese (generic)</td>
<td>高</td>
<td>髙</td>
<td>&nbsp;</td>
<td>紅</td>
<td>红</td>
<td>&nbsp;</td>
<td>丟</td>
<td>丢</td>
<td>&nbsp;</td>
<td>乗</td>
<td>乘</td>
<td>&nbsp;</td>
<td>侣</td>
<td>侶</td>
<td>&nbsp;</td>
<td>兌</td>
<td>兑</td>
<td>&nbsp;</td>
<td>內</td>
<td>内</td>
<td>&nbsp;</td>
<td>產</td>
<td>産</td>
<td>&nbsp;</td>
<td>稅</td>
<td>税</td>
<td>&nbsp;</td>
<td>⿔</td>
<td>亀</td>
<td>龜</td>
<td>龟</td>
<td>龜</td>
<td>龜</td>
<td>&nbsp;</td>
<td>別</td>
<td>别</td>
<td>&nbsp;</td>
<td>両</td>
<td>两</td>
<td>兩</td>
<td>兩</td>
</tr>
<tr style="vertical-align: middle; height: 1.5em;" lang="zh-cn" xml:lang="zh-cn">
<td style="font-size: medium;">Chinese (Simplified)</td>
<td>高</td>
<td>髙</td>
<td>&nbsp;</td>
<td>紅</td>
<td>红</td>
<td>&nbsp;</td>
<td>丟</td>
<td>丢</td>
<td>&nbsp;</td>
<td>乗</td>
<td>乘</td>
<td>&nbsp;</td>
<td>侣</td>
<td>侶</td>
<td>&nbsp;</td>
<td>兌</td>
<td>兑</td>
<td>&nbsp;</td>
<td>內</td>
<td>内</td>
<td>&nbsp;</td>
<td>產</td>
<td>産</td>
<td>&nbsp;</td>
<td>稅</td>
<td>税</td>
<td>&nbsp;</td>
<td>⿔</td>
<td>亀</td>
<td>龜</td>
<td>龟</td>
<td>龜</td>
<td>龜</td>
<td>&nbsp;</td>
<td>別</td>
<td>别</td>
<td>&nbsp;</td>
<td>両</td>
<td>两</td>
<td>兩</td>
<td>兩</td>
</tr>
<tr style="vertical-align: middle; height: 1.5em;" lang="zh-tw" xml:lang="zh-tw">
<td style="font-size: medium;">Chinese (Traditional)</td>
<td>高</td>
<td>髙</td>
<td>&nbsp;</td>
<td>紅</td>
<td>红</td>
<td>&nbsp;</td>
<td>丟</td>
<td>丢</td>
<td>&nbsp;</td>
<td>乗</td>
<td>乘</td>
<td>&nbsp;</td>
<td>侣</td>
<td>侶</td>
<td>&nbsp;</td>
<td>兌</td>
<td>兑</td>
<td>&nbsp;</td>
<td>內</td>
<td>内</td>
<td>&nbsp;</td>
<td>產</td>
<td>産</td>
<td>&nbsp;</td>
<td>稅</td>
<td>税</td>
<td>&nbsp;</td>
<td>⿔</td>
<td>亀</td>
<td>龜</td>
<td>龟</td>
<td>龜</td>
<td>龜</td>
<td>&nbsp;</td>
<td>別</td>
<td>别</td>
<td>&nbsp;</td>
<td>両</td>
<td>两</td>
<td>兩</td>
<td>兩</td>
</tr>
<tr style="vertical-align: middle; height: 1.5em;" lang="ja" xml:lang="ja">
<td style="font-size: medium;">Japanese</td>
<td>高</td>
<td>髙</td>
<td>&nbsp;</td>
<td>紅</td>
<td>红</td>
<td>&nbsp;</td>
<td>丟</td>
<td>丢</td>
<td>&nbsp;</td>
<td>乗</td>
<td>乘</td>
<td>&nbsp;</td>
<td>侣</td>
<td>侶</td>
<td>&nbsp;</td>
<td>兌</td>
<td>兑</td>
<td>&nbsp;</td>
<td>內</td>
<td>内</td>
<td>&nbsp;</td>
<td>產</td>
<td>産</td>
<td>&nbsp;</td>
<td>稅</td>
<td>税</td>
<td>&nbsp;</td>
<td>⿔</td>
<td>亀</td>
<td>龜</td>
<td>龟</td>
<td>龜</td>
<td>龜</td>
<td>&nbsp;</td>
<td>別</td>
<td>别</td>
<td>&nbsp;</td>
<td>両</td>
<td>两</td>
<td>兩</td>
<td>兩</td>
</tr>
<tr style="vertical-align: middle; height: 1.5em;" lang="ko" xml:lang="ko">
<td style="font-size: medium;">Korean</td>
<td>高</td>
<td>髙</td>
<td>&nbsp;</td>
<td>紅</td>
<td>红</td>
<td>&nbsp;</td>
<td>丟</td>
<td>丢</td>
<td>&nbsp;</td>
<td>乗</td>
<td>乘</td>
<td>&nbsp;</td>
<td>侣</td>
<td>侶</td>
<td>&nbsp;</td>
<td>兌</td>
<td>兑</td>
<td>&nbsp;</td>
<td>內</td>
<td>内</td>
<td>&nbsp;</td>
<td>產</td>
<td>産</td>
<td>&nbsp;</td>
<td>稅</td>
<td>税</td>
<td>&nbsp;</td>
<td>⿔</td>
<td>亀</td>
<td>龜</td>
<td>龟</td>
<td>龜</td>
<td>龜</td>
<td>&nbsp;</td>
<td>別</td>
<td>别</td>
<td>&nbsp;</td>
<td>両</td>
<td>两</td>
<td>兩</td>
<td>兩</td>
</tr>
<tr style="vertical-align: middle; height: 1.5em; font-size: small;" >
<td style="font-size: medium;">code</td>
<td>U+9ad8</td>
<td>U+9ad9</td>
<td>&nbsp;</td>
<td>U+7d05</td>
<td>U+7ea2</td>
<td>&nbsp;</td>
<td>U+4e1f</td>
<td>U+4e22</td>
<td>&nbsp;</td>
<td>U+4e57</td>
<td>U+4e58</td>
<td>&nbsp;</td>
<td>U+4fa3</td>
<td>U+4fb6</td>
<td>&nbsp;</td>
<td>U+514c</td>
<td>U+5151</td>
<td>&nbsp;</td>
<td>U+5167</td>
<td>U+5185</td>
<td>&nbsp;</td>
<td>U+7522</td>
<td>U+7523</td>
<td>&nbsp;</td>
<td>U+7a05</td>
<td>U+7a0e</td>
<td>&nbsp;</td>
<td>U+2fd4</td>
<td>U+4e80</td>
<td>U+9f9c</td>
<td>U+9f9f</td>
<td>U+f907</td>
<td>U+f908</td>
<td>&nbsp;</td>
<td>U+5225</td>
<td>U+522b</td>
<td>&nbsp;</td>
<td>U+4e21</td>
<td>U+4e24</td>
<td>U+5169</td>
<td>U+f978</td>
</tr>
</table>
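Several code points in the last row (U+2FD4, U+F907, U+F908 and U+F978) are a Kangxi radical and compatibility ideographs rather than ordinary unified ideographs. The sketch below is an addition to the article, not something it describes. It uses Python's standard `unicodedata` module to show what Unicode normalization does to them, which may explain why some cells in the table look identical after text has passed through software that normalizes it.

```python
import unicodedata

# Code points taken from the "code" row of the table above.
variants = ["\u2FD4", "\uF907", "\uF908", "\uF978"]

for ch in variants:
    nfc = unicodedata.normalize("NFC", ch)    # canonical normalization
    nfkc = unicodedata.normalize("NFKC", ch)  # compatibility normalization
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"NFC -> U+{ord(nfc):04X}, NFKC -> U+{ord(nfkc):04X}"
    )
```

The compatibility ideographs collapse to a unified ideograph under canonical normalization, while the Kangxi radical changes only under compatibility normalization.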



  • Chinese character encoding

  • Sinicization

  • Unihan Database




  • http://www.unicode.org/standard/standard.html Unicode standard

  • http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html Han Unification in Unicode by Otfried Cheong

  • http://www.hastingsresearch.com/net/04-unicode-limitations.shtml Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations

  • http://slashdot.org/features/01/06/06/0132203.shtml Why Unicode Will Work On The Internet

  • http://www.unicode.org/charts/unihan.html Unihan Database

  • http://www.debian.or.jp/~kubota/unicode-unihan.html Per-character summary of differences in characters

  • http://www.ibm.com/developerworks/unicode/library/u-secret.html The secret life of Unicode

  • http://www.microsoft.com/downloads/details.aspx?FamilyID=fc02e2e3-14bb-46c1-afee-3732d6249647&DisplayLang=en GB18030 Support Package for Windows 2000/XP, including Chinese, Tibetan, Yi, Mongolian and Thai font by Microsoft

  • http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2326.pdf Proposal to encode additional grass radicals in the UCS - A humorous proposal to encode all possible variants of the grass radical, made as an April Fool's Day joke



This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Han unification".


Last Modified:   2005-04-13


All information on the site is © FamousChinese.com 2002-2005. Last revised: January 2, 2004