=encoding euc-jp

=head1 NAME

=begin original

perlunicode - Unicode support in Perl

=end original

perlunicode - Perl における Unicode サポート

=head1 DESCRIPTION

=begin original

If you haven't already, before reading this document, you should become
familiar with both L<perlunitut> and L<perluniintro>.

=end original

If you haven't already, before reading this document, you should become
familiar with both L<perlunitut> and L<perluniintro>.
(TBT)

=begin original

Unicode aims to B<Uni>-fy the en-B<code>-ings of all the world's
character sets into a single Standard.  For quite a few of the various
coding standards that existed when Unicode was first created, converting
from each to Unicode essentially meant adding a constant to each code
point in the original standard, and converting back meant just
subtracting that same constant.  For ASCII and ISO-8859-1, the constant
is 0.  For ISO-8859-5, (Cyrillic) the constant is 864; for Hebrew
(ISO-8859-8), it's 1488; Thai (ISO-8859-11), 3424; and so forth.  This
made it easy to do the conversions, and facilitated the adoption of
Unicode.

=end original

Unicode aims to B<Uni>-fy the en-B<code>-ings of all the world's
character sets into a single Standard.  For quite a few of the various
coding standards that existed when Unicode was first created, converting
from each to Unicode essentially meant adding a constant to each code
point in the original standard, and converting back meant just
subtracting that same constant.  For ASCII and ISO-8859-1, the constant
is 0.  For ISO-8859-5, (Cyrillic) the constant is 864; for Hebrew
(ISO-8859-8), it's 1488; Thai (ISO-8859-11), 3424; and so forth.  This
made it easy to do the conversions, and facilitated the adoption of
Unicode.
(TBT)

=begin original

And it worked; nowadays, those legacy standards are rarely used.  Most
everyone uses Unicode.

=end original

And it worked; nowadays, those legacy standards are rarely used.  Most
everyone uses Unicode.
(TBT)

=begin original

Unicode is a comprehensive standard.
It specifies many things outside the scope of Perl, such as how to display sequences of characters. For a full discussion of all aspects of Unicode, see L. =end original Unicode is a comprehensive standard. It specifies many things outside the scope of Perl, such as how to display sequences of characters. For a full discussion of all aspects of Unicode, see L. (TBT) =head2 Important Caveats (重要な警告) =begin original Even though some of this section may not be understandable to you on first reading, we think it's important enough to highlight some of the gotchas before delving further, so here goes: =end original Even though some of this section may not be understandable to you on first reading, we think it's important enough to highlight some of the gotchas before delving further, so here goes: (TBT) =begin original Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. =end original Unicode サポートは大規模な要求です。 Perl は標準 Unicode や付随する技術的なレポートを一つ残らず 実装しているわけではありませんが、多くの Unicode 機能を サポートしています。 =begin original Also, the use of Unicode may present security issues that aren't obvious. Read L. =end original また、Unicode を使うと、明らかではないセキュリティ問題が姿を現すかも 知れません。 L を 読んでください。 =over 4 =item Safest if you C (C とすれば一番安全) =begin original In order to preserve backward compatibility, Perl does not turn on full internal Unicode support unless the pragma L>|feature/The 'unicode_strings' feature> is specified. (This is automatically selected if you S> or higher.) Failure to do this can trigger unexpected surprises. See L below. =end original 後方互換性を維持するために、Perl は L>|feature/The 'unicode_strings' feature> プラグマが指定されない限り 完全な内部 Unicode 対応をオンにしません。 (これは S> 以上を使うと自動的に選択されます。) こうするのに失敗すると予測できない驚きを引き起こすかも知れません。 後述する L を参照してください。 =begin original This pragma doesn't affect I/O. Nor does it change the internal representation of strings, only their interpretation. 
There are still several places where Unicode isn't fully supported, such as in filenames. =end original このプラグマは I/O には影響しません。 また、文字列の内部表現も変更しません; その解釈だけです。 ファイル名のように Unicode に完全に対応していない場所がいくつかあります。 =item Input and Output Layers (入出力層) =begin original Use the C<:encoding(...)> layer to read from and write to filehandles using the specified encoding. (See L.) =end original Use the C<:encoding(...)> layer to read from and write to filehandles using the specified encoding. (See L.) (TBT) =item You should convert your non-ASCII, non-UTF-8 Perl scripts to be UTF-8. =begin original See L. =end original L を参照してください。 =item C still needed to enable L in scripts =begin original If your Perl script is itself encoded in L, the S> pragma must be explicitly included to enable recognition of that (in string or regular expression literals, or in identifier names). B> is needed.> (See L). =end original Perl スクリプト自身が L でエンコードされている場合、 Perl スクリプトそれ自身の 中を(文字列や正規表現リテラル、あるいは変数名で) 認識可能に するために、 C プラグマを明示的に含めなければなりません。 B<これは明示的に C が必要な唯一の場合です。> (L を参照してください。) =item C-marked scripts and L scripts autodetected =begin original However, if a Perl script begins with the Unicode C (UTF-16LE, UTF16-BE, or UTF-8), or if the script looks like non-C-marked UTF-16 of either endianness, Perl will correctly read in the script as the appropriate Unicode encoding. (C-less UTF-8 cannot be effectively recognized or differentiated from ISO 8859-1 or other eight-bit encodings.) =end original しかし、Unicode C (UTF-16LE, UTF16-BE, またはUTF-8)で Perl スクリプトが 始まっていたり、スクリプトが C がついていない UTF-16(BE か LE のいずれか) であった場合、Perl はそのスクリプトを 適切な Unicode エンコーディングとして正しく読み込みます。 (C がない UTF-8 は、効率的に ISO 8859-1 などの 8 ビットエンコーディングと 区別したり認識することができません。) =back =head2 Byte and Character Semantics (バイトと文字のセマンティクス) =begin original Before Unicode, most encodings used 8 bits (a single byte) to encode each character. Thus a character was a byte, and a byte was a character, and there could be only 256 or fewer possible characters. 
"Byte Semantics" in the title of this section refers to this behavior. There was no need to distinguish between "Byte" and "Character". =end original Before Unicode, most encodings used 8 bits (a single byte) to encode each character. Thus a character was a byte, and a byte was a character, and there could be only 256 or fewer possible characters. "Byte Semantics" in the title of this section refers to this behavior. There was no need to distinguish between "Byte" and "Character". (TBT) =begin original Then along comes Unicode which has room for over a million characters (and Perl allows for even more). This means that a character may require more than a single byte to represent it, and so the two terms are no longer equivalent. What matter are the characters as whole entities, and not usually the bytes that comprise them. That's what the term "Character Semantics" in the title of this section refers to. =end original Then along comes Unicode which has room for over a million characters (and Perl allows for even more). This means that a character may require more than a single byte to represent it, and so the two terms are no longer equivalent. What matter are the characters as whole entities, and not usually the bytes that comprise them. That's what the term "Character Semantics" in the title of this section refers to. (TBT) =begin original Perl had to change internally to decouple "bytes" from "characters". It is important that you too change your ideas, if you haven't already, so that "byte" and "character" no longer mean the same thing in your mind. =end original Perl had to change internally to decouple "bytes" from "characters". It is important that you too change your ideas, if you haven't already, so that "byte" and "character" no longer mean the same thing in your mind. (TBT) =begin original The basic building block of Perl strings has always been a "character". 
The changes basically come down to that the implementation no longer thinks that a character is always just a single byte. =end original The basic building block of Perl strings has always been a "character". The changes basically come down to that the implementation no longer thinks that a character is always just a single byte. (TBT) =begin original There are various things to note: =end original 記しておくべき様々なことがあります: =over 4 =item * =begin original String handling functions, for the most part, continue to operate in terms of characters. C, for example, returns the number of characters in a string, just as before. But that number no longer is necessarily the same as the number of bytes in the string (there may be more bytes than characters). The other such functions include C, C, C, C, C, C, C, C, and C. =end original String handling functions, for the most part, continue to operate in terms of characters. C, for example, returns the number of characters in a string, just as before. But that number no longer is necessarily the same as the number of bytes in the string (there may be more bytes than characters). The other such functions include C, C, C, C, C, C, C, C, and C. (TBT) =begin original The exceptions are: =end original 例外は: =over 4 =item * =begin original the bit-oriented C =end original ビット単位の C E =item * =begin original the byte-oriented C/C C<"C"> format =end original バイト単位の C/C C<"C"> フォーマット =begin original However, the C specifier does operate on whole characters, as does the C specifier. =end original However, the C specifier does operate on whole characters, as does the C specifier. (TBT) =item * =begin original some operators that interact with the platform's operating system =end original プラットフォームのオペレーティングシステムと相互作用する一部の演算子 =begin original Operators dealing with filenames are examples. 
=end original 例としてはファイル名を扱う演算子です。 =item * =begin original when the functions are called from within the scope of the S>> pragma =end original 関数が S>> プラグマのスコープ内から呼び出された場合 =begin original Likely, you should use this only for debugging anyway. =end original おそらく、これはデバッグのためだけに行うべきです。 =back =item * =begin original Strings--including hash keys--and regular expression patterns may contain characters that have ordinal values larger than 255. =end original 文字列 -- ハッシュのキーを含め -- と正規表現パターンは序数値として 255 を 超える値を持つ文字を含めることができます。 =begin original If you use a Unicode editor to edit your program, Unicode characters may occur directly within the literal strings in UTF-8 encoding, or UTF-16. (The former requires a C or C, the latter requires a C.) =end original プログラムを編集するのに Unicode エディタを使っているのであれば、Unicode の 文字 UTF-8 か UTF-16 のエンコーディングコーディングでリテラル文字列に 含めることができます。 (前者は C か C を必要とし、後者は C を必要とします。) =begin original L gives other ways to place non-ASCII characters in your strings. =end original L gives other ways to place non-ASCII characters in your strings. (TBT) =item * =begin original The C and C functions work on whole characters. =end original C 関数と C 関数は文字全体に対して働きます。 =item * =begin original Regular expressions match whole characters. For example, C<"."> matches a whole character instead of only a single byte. =end original 正規表現は文字全体にマッチします。 例えば、C<"."> は 1 バイトだけではなく、ひとつの文字全体にマッチします。 =item * =begin original The C operator translates whole characters. (Note that the C functionality has been removed. For similar functionality to that, see C and C). =end original C 演算子は文字全体を変換します。 C は削除されたことに注意してください。 (これと同様のことを行うには C と C を 参照してください。) =item * =begin original C reverses by character rather than by byte. =end original C はバイト単位ではなく文字単位で 反転を行います。 =item * =begin original The bit string operators, C<& | ^ ~> and (starting in v5.22) C<&. |. ^. ~.> can operate on characters that don't fit into a byte. However, the current behavior is likely to change. 
You should not use these operators on strings that are encoded in UTF-8. If you're not sure about the encoding of a string, downgrade it before using any of these operators; you can use L|utf8/Utility functions>. =end original ビット文字列演算子 C<& | ^ ~> および (v5.22 からの) C<&. |. ^. ~.> は 1 バイトに収まらない 文字を操作できます。 しかし、現在の振る舞いは変更される予定です。 UTF-8 でエンコードされた文字列に対してこれらの演算子を 使うべきではありません。 文字列のエンコーディンがはっきりしない場合、 これらの演算子を使う前に降格してください; L|utf8/Utility functions> が使えます。 =back =begin original The bottom line is that Perl has always practiced "Character Semantics", but with the advent of Unicode, that is now different than "Byte Semantics". =end original The bottom line is that Perl has always practiced "Character Semantics", but with the advent of Unicode, that is now different than "Byte Semantics". (TBT) =head2 ASCII Rules versus Unicode Rules (ASCII 規則対 Unicode 規則) =begin original Before Unicode, when a character was a byte was a character, Perl knew only about the 128 characters defined by ASCII, code points 0 through 127 (except for under S>). That left the code points 128 to 255 as unassigned, and available for whatever use a program might want. The only semantics they have is their ordinal numbers, and that they are members of none of the non-negative character classes. None are considered to match C<\w> for example, but all match C<\W>. =end original Before Unicode, when a character was a byte was a character, Perl knew only about the 128 characters defined by ASCII, code points 0 through 127 (except for under S>). That left the code points 128 to 255 as unassigned, and available for whatever use a program might want. The only semantics they have is their ordinal numbers, and that they are members of none of the non-negative character classes. None are considered to match C<\w> for example, but all match C<\W>. (TBT) =begin original Unicode, of course, assigns each of those code points a particular meaning (along with ones above 255). 
To preserve backward compatibility, Perl only uses the Unicode meanings when there is some indication that Unicode is what is intended; otherwise the non-ASCII code points remain treated as if they are unassigned. =end original Unicode, of course, assigns each of those code points a particular meaning (along with ones above 255). To preserve backward compatibility, Perl only uses the Unicode meanings when there is some indication that Unicode is what is intended; otherwise the non-ASCII code points remain treated as if they are unassigned. (TBT) =begin original Here are the ways that Perl knows that a string should be treated as Unicode: =end original Here are the ways that Perl knows that a string should be treated as Unicode: (TBT) =over =item * =begin original Within the scope of S> =end original Within the scope of S> (TBT) =begin original If the whole program is Unicode (signified by using 8-bit Bnicode Bransformation Bormat), then all strings within it must be Unicode. =end original If the whole program is Unicode (signified by using 8-bit Bnicode Bransformation Bormat), then all strings within it must be Unicode. (TBT) =item * =begin original Within the scope of L>|feature/The 'unicode_strings' feature> =end original Within the scope of L>|feature/The 'unicode_strings' feature> (TBT) =begin original This pragma was created so you can explicitly tell Perl that operations executed within its scope are to use Unicode rules. More operations are affected with newer perls. See L. =end original This pragma was created so you can explicitly tell Perl that operations executed within its scope are to use Unicode rules. More operations are affected with newer perls. See L. (TBT) =item * =begin original Within the scope of S> or higher =end original Within the scope of S> or higher (TBT) =begin original This implicitly turns on S>. =end original This implicitly turns on S>. 
(TBT) =item * =begin original Within the scope of L>|perllocale/Unicode and UTF-8>, or L>|perllocale> and the current locale is a UTF-8 locale. =end original Within the scope of L>|perllocale/Unicode and UTF-8>, or L>|perllocale> and the current locale is a UTF-8 locale. (TBT) =begin original The former is defined to imply Unicode handling; and the latter indicates a Unicode locale, hence a Unicode interpretation of all strings within it. =end original The former is defined to imply Unicode handling; and the latter indicates a Unicode locale, hence a Unicode interpretation of all strings within it. (TBT) =item * =begin original When the string contains a Unicode-only code point =end original When the string contains a Unicode-only code point (TBT) =begin original Perl has never accepted code points above 255 without them being Unicode, so their use implies Unicode for the whole string. =end original Perl has never accepted code points above 255 without them being Unicode, so their use implies Unicode for the whole string. (TBT) =item * =begin original When the string contains a Unicode named code point C<\N{...}> =end original When the string contains a Unicode named code point C<\N{...}> (TBT) =begin original The C<\N{...}> construct explicitly refers to a Unicode code point, even if it is one that is also in ASCII. Therefore the string containing it must be Unicode. =end original The C<\N{...}> construct explicitly refers to a Unicode code point, even if it is one that is also in ASCII. Therefore the string containing it must be Unicode. (TBT) =item * =begin original When the string has come from an external source marked as Unicode =end original When the string has come from an external source marked as Unicode (TBT) =begin original The L|perlrun/-C [numberElist]> command line option can specify that certain inputs to the program are Unicode, and the values of this can be read by your Perl code, see L. 
=end original The L|perlrun/-C [numberElist]> command line option can specify that certain inputs to the program are Unicode, and the values of this can be read by your Perl code, see L. (TBT) =item * When the string has been upgraded to UTF-8 =begin original The function L|utf8/Utility functions> can be explicitly used to permanently (unless a subsequent C is called) cause a string to be treated as Unicode. =end original The function L|utf8/Utility functions> can be explicitly used to permanently (unless a subsequent C is called) cause a string to be treated as Unicode. (TBT) =item * There are additional methods for regular expression patterns =begin original A pattern that is compiled with the C<< /u >> or C<< /a >> modifiers is treated as Unicode (though there are some restrictions with C<< /a >>). Under the C<< /d >> and C<< /l >> modifiers, there are several other indications for Unicode; see L. =end original A pattern that is compiled with the C<< /u >> or C<< /a >> modifiers is treated as Unicode (though there are some restrictions with C<< /a >>). Under the C<< /d >> and C<< /l >> modifiers, there are several other indications for Unicode; see L. (TBT) =back =begin original Note that all of the above are overridden within the scope of C>; but you should be using this pragma only for debugging. =end original Note that all of the above are overridden within the scope of C>; but you should be using this pragma only for debugging. (TBT) =begin original Note also that some interactions with the platform's operating system never use Unicode rules. =end original Note also that some interactions with the platform's operating system never use Unicode rules. (TBT) =begin original When Unicode rules are in effect: =end original Unicode の規則が有効の場合: =over 4 =item * =begin original Case translation operators use the Unicode case translation tables. 
=end original 大小文字の変換演算子は Unicode の大小文字変換テーブルを使用します。 =begin original Note that C, or C<\U> in interpolated strings, translates to uppercase, while C, or C<\u> in interpolated strings, translates to titlecase in languages that make the distinction (which is equivalent to uppercase in languages without the distinction). =end original C や展開文字列中の C<\U> は大文字に変換し、C や 展開文字列中の C<\u> はその言語で区別されているときに タイトルケースに変換します (これは、区別がない言語では大文字と等価です)。 =begin original There is a CPAN module, C>, which allows you to define your own mappings to be used in C, C, C, C, and C (or their double-quoted string inlined versions such as C<\U>). (Prior to Perl 5.16, this functionality was partially provided in the Perl core, but suffered from a number of insurmountable drawbacks, so the CPAN module was written instead.) =end original C, C, C, C, C (および C<\U> のような ダブルクォート文字列インライン版) で使える独自のマッピングを定義できる CPAN モジュール C> があります。 (Perl 5.16 以前では、この機能は Perl コアで部分的に提供されていましたが、 多くの克服できない欠点があったため、代わりに CPAN モジュールが書かれました。) =item * =begin original Character classes in regular expressions match based on the character properties specified in the Unicode properties database. =end original Character classes in regular expressions match based on the character properties specified in the Unicode properties database. (TBT) =begin original C<\w> can be used to match a Japanese ideograph, for instance; and C<[[:digit:]]> a Bengali number. =end original C<\w> can be used to match a Japanese ideograph, for instance; and C<[[:digit:]]> a Bengali number. (TBT) =item * =begin original Named Unicode properties, scripts, and block ranges may be used (like bracketed character classes) by using the C<\p{}> "matches property" construct and the C<\P{}> negation, "doesn't match property". =end original 名前付き Unicode 特性、用字、ブロック範囲は、 C<\p{}> 「特性にマッチング」構文および否定である C<\P{}> 「特性にマッチングしない」を使って(大かっこ文字クラスのように)使えます。 =begin original See L for more details. 

=end original

さらなる詳細については L</Unicode Character Properties> を参照してください。

=begin original

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</User-Defined Character Properties> for more details.

=end original

独自の文字特性を定義して、C<\p{}> と C<\P{}> 構文によって
正規表現でそれらを使うことができます。
さらなる詳細については L</User-Defined Character Properties> を
参照してください。

=back

=head2 Extended Grapheme Clusters (Logical characters)

(拡張書記素クラスタ (論理文字))

=begin original

Consider a character, say C<H>.  It could appear with various marks
around it, such as an acute accent, or a circumflex, or various hooks,
circles, arrows, I<etc.>, above, below, to one side or the other,
I<etc>.  There are many possibilities among the world's languages.  The
number of combinations is astronomical, and if there were a character
for each combination, it would soon exhaust Unicode's more than a
million possible characters.  So Unicode took a different approach:
there is a character for the base C<H>, and a character for each of the
possible marks, and these can be variously combined to get a final
logical character.  So a logical character--what appears to be a single
character--can be a sequence of more than one individual character.
The Unicode standard calls these "extended grapheme clusters" (which
is an improved version of the no-longer much used "grapheme cluster");
Perl furnishes the C<\X> regular expression construct to match such
sequences in their entirety.
=end original 一つの文字、例えば C について考えてみます。 これは文字の回りの様々なマークとして現れることがあって、 鋭アクセント、曲折アクセント、フック、円、矢など、上、下、左、右、などです。 世界中の言語の中では多くの可能性があります。 組み合わせの数は天文学的で、 それぞれの組み合わせを一つの文字にすると、Unicode の数百万の可能な文字を すぐに使い切ってしまいます。 それで Unicode は異なる手法を取りました: 基本となる C を一つの文字として、 それぞれの可能なマークのそれぞれを一つの文字として、 最後に論理的な文字でこれらを様々に結合できるようにしました。 それで一つの論理文字--単一の文字として現れるもの--は 複数の独立した文字の並びになることがあります。 Unicode 標準はこれを「拡張書記素クラスタ」("extended grapheme cluster") (もはやあまり使われない「書記素クラスタ」"grapheme cluster" の改良版) と 呼びます; Perl はこのような並び丸ごとにマッチングする C<\X> 正規表現構文を 用意しています。 =begin original But Unicode's intent is to unify the existing character set standards and practices, and several pre-existing standards have single characters that mean the same thing as some of these combinations, like ISO-8859-1, which has quite a few of them. For example, C<"LATIN CAPITAL LETTER E WITH ACUTE"> was already in this standard when Unicode came along. Unicode therefore added it to its repertoire as that single character. But this character is considered by Unicode to be equivalent to the sequence consisting of the character C<"LATIN CAPITAL LETTER E"> followed by the character C<"COMBINING ACUTE ACCENT">. =end original But Unicode's intent is to unify the existing character set standards and practices, and several pre-existing standards have single characters that mean the same thing as some of these combinations, like ISO-8859-1, which has quite a few of them. For example, C<"LATIN CAPITAL LETTER E WITH ACUTE"> was already in this standard when Unicode came along. Unicode therefore added it to its repertoire as that single character. But this character is considered by Unicode to be equivalent to the sequence consisting of the character C<"LATIN CAPITAL LETTER E"> followed by the character C<"COMBINING ACUTE ACCENT">. (TBT) =begin original C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" character, and its equivalence with the "E" and the "COMBINING ACCENT" sequence is called canonical equivalence. 
All pre-composed characters are said to have a decomposition (into the
equivalent sequence), and the decomposition type is also called
canonical.  A string may be comprised as much as possible of
precomposed characters, or it may be comprised of entirely decomposed
characters.  Unicode calls these respectively, "Normalization Form
Composed" (NFC) and "Normalization Form Decomposed" (NFD).  The
C<L<Unicode::Normalize>> module contains functions that convert between
the two.  A string may also have both composed characters and
decomposed characters; this module can be used to make it all one or
the other.

=end original

C<"LATIN CAPITAL LETTER E WITH ACUTE"> は「合成済」(pre-composed) 文字と
呼ばれ、"E" および "COMBINING ACCENT" と等価な並びは正準等価
(canonical equivalence) と呼ばれます。
全ての合成済文字は(等価な並びに)分解でき、分解の種類もまた正準と呼ばれます。
文字列は、可能な限り合成済文字で構成される場合もあれば、
完全に分解された文字で構成される場合もあります。
Unicode では、これらをそれぞれ
「正規化形式 C」("Normalization Form Composed": NFC) と
「正規化形式 D」("Normalization Form Decomposed": NFD) と呼んでいます。
C<L<Unicode::Normalize>> モジュールには、
この二つの形式の間で変換する関数が含まれています。
文字列は、合成された文字と分解された文字の両方を持つこともできます。
このモジュールを使用して、すべてをどちらか一方に揃えることができます。

=begin original

You may be presented with strings in any of these equivalent forms.
There is currently nothing in Perl 5 that ignores the differences.  So
you'll have to specially handle it.  The usual advice is to convert your
inputs to C<NFD> before processing further.

=end original

You may be presented with strings in any of these equivalent forms.
There is currently nothing in Perl 5 that ignores the differences.  So
you'll have to specially handle it.  The usual advice is to convert your
inputs to C<NFD> before processing further.
(TBT)

=begin original

For more detailed information, see L<http://unicode.org/reports/tr15/>.

=end original

さらに詳しい情報については、L<http://unicode.org/reports/tr15/> を
参照してください。

=head2 Unicode Character Properties

(Unicode 文字特性)

=begin original

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above.   Therefore "character" in this discussion means a single
Unicode code point.)
=end original (Perl が個々の符号位置の並びを単一の論理文字として扱う 唯一のタイミングは、既に前述した C<\X> 構文です。 従って、この議論での「文字」は単一の Unicode 符号位置を意味します。) =begin original Very nearly all Unicode character properties are accessible through regular expressions by using the C<\p{}> "matches property" construct and the C<\P{}> "doesn't match property" for its negation. =end original ほぼ全ての Unicode 文字特性は、 C<\p{}> "matches property" 構文とその否定形の C<\P{}> "doesn't match property" を使った正規表現を通してアクセス可能です。 =begin original For instance, C<\p{Uppercase}> matches any single character with the Unicode C<"Uppercase"> property, while C<\p{L}> matches any character with a C of C<"L"> (letter) property (see L below). Brackets are not required for single letter property names, so C<\p{L}> is equivalent to C<\pL>. =end original たとえば、C<\p{Uppercase}> は Unicode の C<"Uppercase"> 特性を持つ任意の 単一の文字にマッチングし、C<\p{L}> は C C<"L"> (letter) 特性を持つ任意の文字にマッチングします (後述する L 参照)。 中かっこは一文字の特性名では省略することができるので、C<\p{L}> は C<\pL> と等価です。 =begin original More formally, C<\p{Uppercase}> matches any single character whose Unicode C property value is C, and C<\P{Uppercase}> matches any character whose C property value is C, and they could have been written as C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively. =end original より正式には、C<\p{Uppercase}> は Unicode の C 特性値 が C である任意の単一の文字とマッチングし、C<\P{UpperCase}> は C 特性値 が C である任意の文字とマッチングします; そしてこれらはそれぞれ C<\p{Uppercase=True}>, C<\p{Uppercase=False}> と書けます。 =begin original This formality is needed when properties are not binary; that is, if they can take on more values than just C and C. For example, the C property (see L below), can take on several different values, such as C, C, C, and others. To match these, one needs to specify both the property name (C), AND the value being matched against (C, C, I). This is done, as in the examples above, by having the two components separated by an equal sign (or interchangeably, a colon), like C<\p{Bidi_Class: Left}>. 
=end original この形式は、特性が 2 値でない場合、つまり、単に C と C より多くの 値を取ることができる場合に必要です。 たとえば、C 特性(L を参照)は、 C, C, C などのさまざまな値を取ることができます。 これらにマッチングするには、特性名(C)と、 マッチングする値 (C, C など) の両方を指定する必要があります。 これは、前述の例のように、二つの要素を等号 (または、C<\p{Biddi_Class:Left}> のように交換可能なコロン)で 区切ることによって、実行されます。 =begin original All Unicode-defined character properties may be written in these compound forms of C<\p{I=I}> or C<\p{I:I}>, but Perl provides some additional properties that are written only in the single form, as well as single-form short-cuts for all binary properties and certain others described below, in which you may omit the property name and the equals or colon separator. =end original すべての Unicode が定義した文字特性は、C<\p{I=I}> や C<\p{I:I}> のような複合形式で書けますが、 Perl は特性名および等号やコロンの区切り文字を省略できるように、 単一形式でのみ書ける追加の特性や、全ての 2 値特性と一部の後述する ものに対する単一形式のショートカットを提供します。 =begin original Most Unicode character properties have at least two synonyms (or aliases if you prefer): a short one that is easier to type and a longer one that is more descriptive and hence easier to understand. Thus the C<"L"> and C<"Letter"> properties above are equivalent and can be used interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">, and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically various synonyms for the values the property can be. For binary properties, C<"True"> has 3 synonyms: C<"T">, C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">, C<"No">, and C<"N">. But be careful. A short form of a value for one property may not mean the same thing as the same short form for another. Thus, for the C> property, C<"L"> means C<"Letter">, but for the L|/Bidirectional Character Types> property, C<"L"> means C<"Left">. A complete list of properties and synonyms is in L. 
=end original ほとんどの Unicode 文字特性には、少なくとも二つの同義語 (またはあなたが好むなら別名)があります; 簡単に入力できる短いものと、 より長いけれども説明的で理解しやすいものです。 したがって、前述の C<"L"> および C<"Letter"> 特性は等価であり、 交換可能です。 同様に、C<"Upper"> は C<"Uppercase"> の同義語であり、C<\p{Uppercase}> は 等価に C<\p{Upper}> と書けます。 また、典型的には特性の値に対してさまざまな同義語があります。 2 値特性の場合、C<"True"> には三つの同義語があります: C<"T">, C<"Yes">, C<"Y">; C<"False"> には C<"F">, C<"No">, C<"N"> が あります。 しかし注意してください。 ある特性に対する値の短い形式は、他の特性の同じ短い形式と同じものを 意味するとは限りません。 従って、C> 特性では C<"L"> は C<"Letter"> を 意味しますが、L|/Bidirectional Character Types> 特性では、 C<"L"> は C<"Left"> を意味します。 特性および同義語の完全な一覧は L にあります。 =begin original Upper/lower case differences in property names and values are irrelevant; thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. Similarly, you can add or subtract underscores anywhere in the middle of a word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space is irrelevant adjacent to non-word characters, such as the braces and the equals or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are equivalent to these as well. In fact, white space and even hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is equivalent. All this is called "loose-matching" by Unicode. The few places where stricter matching is used is in the middle of numbers, and in the Perl extension properties that begin or end with an underscore. Stricter matching cares about white space (except adjacent to non-word characters), hyphens, and non-interior underscores. 
=end original 特性名と値の大文字と小文字の違いは無関係です; したがって C<\p{Upper}> は C<\p{upper}>, さらには C<\p{UpPeR}> とも同じことを 意味します。 同様に、単語の中のどこにでも下線を追加または削除できるので、 これらは C<\p{U_p_p_e_r}> とも等価です。 また、中かっこや等号、コロンなどの非単語文字に隣接した空白は無視されるので、 C<\p{ Upper }> and C<\p{ Upper_case : Y }> も等価です。 実際には、通常、空白とハイフンさえどこにでも追加または削除できます。 したがって、C<\p{Upper case=Yes}> ですらも等価です。 これはすべて Unicode で「緩いマッチング」と呼ばれます。 数少ない厳密なマッチングが採用されている場所は数値の中と、下線で始まったり 終わったりする Perl 拡張特性です。 より厳密なマッチングは空白(非単語文字に隣接するものを除く)、ハイフン、 非内部下線を考慮します。 =begin original You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret (C<^>) between the first brace and the property name: C<\p{^Tamil}> is equal to C<\P{Tamil}>. =end original C<\p{}> と C<\P{}> の両方で、キャレット(C<^>) を最初のブレースと 特性名の間に置くことによって意味を反転することができます: C<\p{^Tamil}> は C<\P{Tamil}> と等価です。 =begin original Almost all properties are immune to case-insensitive matching. That is, adding a C regular expression modifier does not change what they match. There are two sets that are affected. The first set is C, C, and C, all of which match C under C matching. And the second set is C, C, and C, all of which match C under C matching. This set also includes its subsets C and C both of which under C match C. (The difference between these sets is that some things, such as Roman numerals, come in both upper and lower case so they are C, but aren't considered letters, so they aren't C's.) =end original ほとんど全ての特性は大文字小文字を考慮したマッチングの影響を受けません。 つまり、C 正規表現修飾子を追加することは、 それらがマッチングするものを変えません。 影響を受ける二つの集合があります。 最初の集合は、 C, C, C, C の下で C にマッチングする全てです。 二番目の集合は、 C, C, C, C マッチングの基で C にマッチングする全てです。 この集合はまた、C マッチングの基で C にマッチングする そのサブセット C と C を含みます。 (これらの集合の違いは、ローマ数字のような一部のもので、 大文字と小文字の両方に含まれるので C であるけれども、 しかし字と考えられないので、C ではありません。) =begin original See L for special considerations when matching Unicode properties against non-Unicode code points. 
=end original

非 Unicode 符号位置に対して Unicode 特性をマッチングしたときの
特殊処理については L</Beyond Unicode code points> を参照してください。

=head3 B<General_Category>

=begin original

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

=end original

全ての Unicode 文字は一つの一般カテゴリに割り当てられています; これは
「その文字の最も普通のカテゴライズ」
(L<http://www.unicode.org/reports/tr44> より)です。

=begin original

The compound way of writing these is like C<\p{General_Category=Number}>
(short: C<\p{gc:n}>).  But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted.  So you can instead just
write C<\pN>.

=end original

これらを書く複合的な方法は C<\p{General_Category=Number}>
(短縮形: C<\p{gc:n}>) のようなものです。
Perl は等号またはコロンの区切り文字までの全てを省略できる機能を
提供しています。
従って、代わりに単に C<\pN> と書けます。

=begin original

Here are the short and long forms of the values the C<General Category>
property can have:

=end original

以下は、Unicode の C<一般カテゴリ> 特性が持つことができる値の
短形式と長形式です:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

=begin original

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<L&> and C<LC> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.
=end original

単一文字の特性は同じ文字で始まる二文字の任意のサブ特性に含まれる
すべての文字にマッチします。
C<L&> と C<LC> は特別です: 両方とも C<Ll>, C<Lu>, C<Lt> に
マッチングする全てからなる集合への別名です。

=head3 B<Bidirectional Character Types>

(B<双方向文字型>)

=begin original

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example) Unicode supplies a C<Bidi_Class>
property.  Some of the values this property can have are:

=end original

用字はその方向性で異なるので (例えばヘブライ語とアラビア語は右から左に
書きます) Unicode は以下の特性を C<Bidi_Class> 特性で提供しています。
この特性が持つことができる値の一部は:

    Value       Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

=begin original

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.  Unlike the
L<C<General_Category>|/General_Category>
property, this property can have more values added in a future Unicode
release.  Those listed above comprised the complete set for many Unicode
releases, but others were added in Unicode 6.3; you can always find what the
current ones are in L<perluniprops>.  And
L<http://www.unicode.org/reports/tr9/> describes how to use them.

=end original

この特性は常に複合形式で書かれます。
たとえば、C<\p{Bidi_Class:R}> は通常右から左に書く文字にマッチします。
L<C<General_Category>|/General_Category> 特性とは異なり、
この特性は将来リリースされる Unicode でさらに値が追加されるかもしれません。
これらの上述したものは何回もの Unicode のリリースの間完全な一覧でしたが、
その他の物は Unicode 6.3 で追加されたものです; 現在の内容についてはいつでも
L<perluniprops> で確認できます。
これらの使い方については L<http://www.unicode.org/reports/tr9/> に
記述されています。

=head3 B<Scripts>

(B<用字>)

=begin original

The world's languages are written in many different scripts.  This sentence
(unless you're reading it in translation) is written in Latin, while Russian
is written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly
in Hiragana or Katakana.  There are many more.
=end original 世界の言語は多くの異なった用字で書かれています。 この文は(訳文を読んでいない限り)ラテン文字で書かれていますが、ロシア語は キリル文字で書かれています; そしてギリシャ語は、ええと、ギリシャ文字です; 日本語は主にひらがなやカタカナで書かれています。 もっとたくさんあります。 =begin original The Unicode C