=encoding euc-jp

=head1 NAME

perluniintro - Perl Unicode introduction

=head1 DESCRIPTION

This document gives a general idea of Unicode and how to use Unicode
in Perl.  See L</Further Resources> for references to more in-depth
treatments of Unicode.

=head2 Unicode

Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that unify
almost all other modern character set standards,
covering more than 80 writing systems and hundreds of languages,
including all commercially-important modern languages.  All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded.  The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 6.0 in October 2010.

A Unicode I<character> is an abstract entity.  It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text, and it does not generally define fonts or other
graphical layout details.  Unicode operates on characters and on text
built from those characters.

Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
SMALL LETTER ALPHA> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively.  These unique numbers are called
I<code points>.  A code point is essentially the position of the
character within the set of all possible Unicode characters, and thus in
Perl, the term I<ordinal> is often used interchangeably with it.
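
As a minimal sketch of the code point and ordinal relationship described
above, Perl's built-in C<chr> and C<ord> convert between the two.  The
variable names are illustrative only.

```perl
use strict;
use warnings;

# chr() maps a code point (ordinal) to a one-character string;
# ord() maps the first character of a string back to its code point.
my $latin_a = chr(0x0041);    # LATIN CAPITAL LETTER A
my $alpha   = chr(0x03B1);    # GREEK SMALL LETTER ALPHA

# Format the ordinals in hexadecimal.
my $latin_a_hex = sprintf("0x%04X", ord($latin_a));
my $alpha_hex   = sprintf("0x%04X", ord($alpha));

# Encode output as UTF-8 so the Greek letter does not trigger a
# "Wide character" warning.
binmode(STDOUT, ":encoding(UTF-8)");
print "$latin_a is $latin_a_hex\n";
print "$alpha is $alpha_hex\n";
```

Each string holds exactly one character regardless of how many bytes the
character occupies internally.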

The Unicode standard prefers using hexadecimal notation for the code
points.  If numbers like C<0x0041> are unfamiliar to you, take a peek
at a later section, L</"Hexadecimal Notation">.  The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A> to give the
hexadecimal code point and the normative name of the character.
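
The C<U+XXXX NAME> notation can be produced from a code point with the
core C<charnames> module.  This is a sketch only: C<u_notation> is a
hypothetical helper, and the C<< <no name> >> placeholder is an
assumption made for code points that have no name.

```perl
use strict;
use warnings;
use charnames ':full';    # enables \N{NAME} and charnames::viacode()

# Build the "U+XXXX NAME" notation for a code point.
# u_notation() is a hypothetical helper, not part of any module.
sub u_notation {
    my ($code_point) = @_;
    my $name = charnames::viacode($code_point);
    $name = '<no name>' unless defined $name;
    return sprintf("U+%04X %s", $code_point, $name);
}

print u_notation(0x0041), "\n";
print u_notation(ord("\N{GREEK SMALL LETTER ALPHA}")), "\n";
```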

Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.
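
Properties can be tested in Perl with the C<\p{...}> regular expression
construct, and the case operations are available through C<uc> and
C<lc>.  The following is a sketch under the assumption that a handful of
general categories is enough for illustration; C<classify> is a
hypothetical helper.

```perl
use strict;
use warnings;

# Classify a single character by a few Unicode general categories.
# classify() is a hypothetical helper used only for illustration.
sub classify {
    my ($char) = @_;
    return 'uppercase'     if $char =~ /\A\p{Lu}\z/;
    return 'lowercase'     if $char =~ /\A\p{Ll}\z/;
    return 'decimal digit' if $char =~ /\A\p{Nd}\z/;
    return 'punctuation'   if $char =~ /\A\p{P}\z/;
    return 'other';
}

my $alpha       = chr(0x03B1);    # GREEK SMALL LETTER ALPHA
my $upper_alpha = uc($alpha);     # GREEK CAPITAL LETTER ALPHA

print classify("A"), "\n";
print classify($alpha), "\n";
print classify($upper_alpha), "\n";
print classify(chr(0x0663)), "\n";    # ARABIC-INDIC DIGIT THREE
print classify("!"), "\n";
```

The property of a character does not depend on its name: the
Arabic-Indic digit is a decimal digit just as C<7> is.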

A Unicode I<logical> "character" can actually consist of more than one
internal I<actual> "character" or code point.  For Western languages,
this is adequately modelled by a I<base character> (like C<LATIN CAPITAL
LETTER A>) followed by one or more I<modifiers> (like C<COMBINING ACUTE
ACCENT>).  This sequence of base character and modifiers is called a
I<combining character sequence>.  Some non-western languages require
more complicated models, so Unicode created the I<grapheme cluster>
concept, which was later further refined into the I<extended grapheme
cluster>.  For example, a Korean Hangul syllable is considered a single
logical character, but most often consists of three actual
Unicode characters: a leading consonant followed by an interior vowel
followed by a trailing consonant.

Whether to call these extended grapheme clusters "characters" depends on
your point of view.  If you are a programmer, you would probably tend
towards seeing each element in the sequences as one unit, or
"character".  However, from the user's point of view, the whole sequence
could be seen as one "character", since that's probably what it looks
like in the context of the user's language.  In this document, we take
the programmer's point of view: one "character" is one Unicode code
point.
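
The two points of view can be compared directly.  In this sketch,
C<length> counts code points (the programmer's "characters"), while the
regular expression escape C<\X> matches one extended grapheme cluster
(the user's "character").  The helper names are hypothetical.

```perl
use strict;
use warnings;

# Programmer's view: one "character" is one code point.
sub count_code_points {
    my ($string) = @_;
    return length $string;
}

# User's view: \X matches one extended grapheme cluster.
sub count_graphemes {
    my ($string) = @_;
    my @clusters = $string =~ /\X/g;
    return scalar @clusters;
}

# LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT.
my $a_acute = "A" . chr(0x0301);

# A Hangul syllable spelled with three conjoining jamo:
# leading consonant, interior vowel, trailing consonant.
my $hangul = chr(0x1112) . chr(0x1161) . chr(0x11AB);

for my $string ($a_acute, $hangul) {
    printf "%d code point(s), %d grapheme cluster(s)\n",
        count_code_points($string), count_graphemes($string);
}
```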

For some combinations of base character and modifiers, there are
I<precomposed> characters.  There is a single character equivalent, for
example, for the sequence C<LATIN CAPITAL LETTER A> followed by
C<COMBINING ACUTE ACCENT>.  It is called C<LATIN CAPITAL LETTER A WITH
ACUTE>.  These precomposed characters are, however, only available for
some combinations, and are mainly meant to support round-trip
conversions between Unicode and legacy standards (like ISO 8859).  Using
sequences, as Unicode does, means that fewer basic building blocks
(code points) are needed to express many more potential grapheme
clusters.  To support conversion between equivalent forms, various
I<normalization forms> are also defined.  Thus, C<LATIN CAPITAL LETTER A
WITH ACUTE> is in I<Normalization Form Composed> (abbreviated NFC), and
the sequence C<LATIN CAPITAL LETTER A> followed by C<COMBINING ACUTE
ACCENT> represents the same character in I<Normalization Form
Decomposed> (NFD).
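
Conversion between the two forms can be sketched with the core
C<Unicode::Normalize> module, whose C<NFC> and C<NFD> functions return
the composed and decomposed form of a string.  The variable names are
illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# LATIN CAPITAL LETTER A WITH ACUTE, a precomposed character.
my $precomposed = chr(0x00C1);

# LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT.
my $sequence = "A" . chr(0x0301);

# The two strings differ code point by code point...
my $equal_as_is = $precomposed eq $sequence ? 1 : 0;

# ...but each normalizes to the other's form.
my $composed   = NFC($sequence);
my $decomposed = NFD($precomposed);

printf "raw strings equal: %d\n", $equal_as_is;
printf "NFC gives %d code point(s)\n", length $composed;
printf "NFD gives %d code point(s)\n", length $decomposed;
```

Normalizing both sides to the same form before comparing is one way to
treat the equivalent spellings as equal.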

Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character".  The same character could
be represented differently in several legacy encodings.  The
converse is also not true: some code points do not have an assigned
character.  Firstly, there are unallocated code points within
otherwise used blocks.  Secondly, there are special Unicode control
characters that do not represent true characters.
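
Both halves of this can be illustrated with a sketch.  The choice of
examples is an assumption made here, not taken from the text above:
C<ANGSTROM SIGN> (U+212B) duplicates C<LATIN CAPITAL LETTER A WITH RING
ABOVE> (U+00C5), and U+0378 is a code point inside the Greek block that
the Unicode versions known to the editor leave unassigned.  C<has_name>
is a hypothetical helper.

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw(NFC);

# "At least one number for every character": two code points,
# one character.  NFC maps the compatibility duplicate to U+00C5.
my $angstrom_sign  = chr(0x212B);
my $a_with_ring    = chr(0x00C5);
my $same_after_nfc = NFC($angstrom_sign) eq NFC($a_with_ring) ? 1 : 0;

# charnames::viacode() returns undef for a code point without a name,
# which is the case for unassigned code points.
# has_name() is a hypothetical helper.
sub has_name {
    my ($code_point) = @_;
    my $name = charnames::viacode($code_point);
    return defined $name ? 1 : 0;
}

printf "U+212B and U+00C5 equal after NFC: %d\n", $same_after_nfc;
printf "U+0041 has a name: %d\n", has_name(0x0041);
# Assumed unassigned; the result depends on the Unicode version
# that your perl was built with.
printf "U+0378 has a name: %d\n", has_name(0x0378);
```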

When Unicode was first conceived, it was thought that all the world's
characters could be represented using a 16-bit word; that is, a maximum
of C<0x10000> (or 65536) characters from C<0x0000> to C<0xFFFF> would be
needed.  This soon proved to be false, and since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and Unicode 3.1 (March 2001) defined the first characters above C<0xFFFF>.
The first C<0x10000> characters are called I<Plane 0>, or the
I<Basic Multilingual Plane> (BMP).  With Unicode 3.1, 17 (yes,
seventeen) planes in all were defined, but they are still nowhere near
full of defined characters.
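
Since each plane holds C<0x10000> code points, the plane of a code point
is simply its value shifted right by 16 bits.  The following sketch
shows the arithmetic; C<plane_of> and C<in_bmp> are hypothetical helper
names.

```perl
use strict;
use warnings;

# Each plane spans 0x10000 code points, so the plane number is the
# code point with its low 16 bits dropped.
sub plane_of {
    my ($code_point) = @_;
    return $code_point >> 16;
}

# Plane 0 is the Basic Multilingual Plane.
sub in_bmp {
    my ($code_point) = @_;
    return plane_of($code_point) == 0 ? 1 : 0;
}

# The highest code point, 0x10FFFF, lies in the last plane.
my $plane_count = plane_of(0x10FFFF) + 1;

printf "U+%04X is in plane %d\n", $_, plane_of($_)
    for 0x0041, 0xFFFF, 0x10000, 0x10FFFF;
print "planes in all: $plane_count\n";
```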

=begin original

When a new language is being encoded, Unicode generally will choose a
C<block> of consecutive unallocated code points for its characters.  So
far, the number of code points in these blocks has always been evenly
divisible by 16.  Extras in a block, not currently needed, are left
unallocated, for future growth.  But there have been occasions when
a later release needed more code points than the available extras, and a
new block had to be allocated somewhere else, not contiguous to the
initial one, to handle the overflow.  Thus, it became apparent early on
that "block" wasn't an adequate organizing principle, and so the C<Script>