=encoding euc-jp =head1 NAME =begin original perlunicode - Unicode support in Perl =end original perlunicode - Perl における Unicode サポート =head1 DESCRIPTION =head2 Important Caveats (重要な警告) =begin original Unicode support is an extensive requirement. While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. =end original Unicode サポートは大規模な要求です。 Perl は標準 Unicode や付随する技術的なレポートを一つ残らず実装しているわけではありませんが、多くの Unicode 機能をサポートしています。 =begin original People who want to learn to use Unicode in Perl, should probably read L, before reading this reference document. =end original Perl で Unicode を使うことを学びたい人は、多分このリファレンスを読む前に L を読んだ方がよいでしょう。 =over 4 =item Input and Output Layers (入出力層) =begin original Perl knows when a filehandle uses Perl's internal Unicode encodings (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with the ":utf8" layer. Other encodings can be converted to Perl's encoding on input or from Perl's encoding on output by use of the ":encoding(...)" layer. See L. =end original Perl は、ファイルハンドルが ":utf8" 層を指定してオープンされると、ファイルハンドルが Perl の内部 Unicode エンコーディング (UTF-8, または EBCDIC の時は UTF-EBCDIC) を使うことが分かります。その他のエンコーディングは、":encoding(...)" 層を使うことで、入力時の Perl のエンコーディングへの変換や出力時の Perl のエンコーディングからの変換を行えます。 L を参照してください。 =begin original To indicate that Perl source itself is in UTF-8, use C. =end original Perl のソース自身が UTF-8 であることを示すには、C を使ってください。 =item Regular Expressions (正規表現) =begin original The regular expression compiler produces polymorphic opcodes. That is, the pattern adapts to the data and automatically switches to the Unicode character scheme when presented with data that is internally encoded in UTF-8, or instead uses a traditional byte scheme when presented with byte data. 
=end original

正規表現コンパイラは多態的なオペコードを生成します。つまり、パターンはデータに適応し、データが内部で UTF-8 でエンコードされている場合には Unicode 文字スキームに自動的に切り替わります; さもなければ、バイトデータで表されている場合には伝統的なバイトスキームが使われます。

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

=begin original

As a compatibility measure, the C<use utf8> pragma must be explicitly included to enable recognition of UTF-8 in the Perl scripts themselves (in string or regular expression literals, or in identifier names) on ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based machines. B<These are the only times when an explicit C<use utf8> is needed.> See L.

=end original

互換性のために、ASCII ベースのマシンにおいて Perl スクリプトそれ自身の中の UTF-8 を(文字列や正規表現リテラル、あるいは識別子名で) 認識可能にするためや、EBCDIC ベースのマシンで UTF-EBCDIC を認識させるために C<use utf8> プラグマを明示的に含めなければなりません。 B<これらは明示的に C<use utf8> が必要な唯一の場合です。> L を参照してください。

=item BOM-marked scripts and UTF-16 scripts autodetected

=begin original

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE, or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either endianness, Perl will correctly read in the script as Unicode. (BOMless UTF-8 cannot be effectively recognized or differentiated from ISO 8859-1 or other eight-bit encodings.)

=end original

Unicode BOM (UTF-16LE, UTF-16BE, または UTF-8)で Perl スクリプトが始まっていたり、スクリプトが BOM がついていない UTF-16(BE か LE のいずれか) であった場合、Perl はそのスクリプトを Unicode であるとして正しく読み込みます。 (BOM がない UTF-8 は、効果的に ISO 8859-1 などの 8 ビットエンコーディングと区別したり認識することができません。)

=item C needed to upgrade non-Latin-1 byte strings

=begin original

By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in I<Latin-1>, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happen to agree with Latin-1.

=end original

デフォルトでは、Perl の Unicode モデルにおける基本的な非対称があります: バイト文字列から Unicode 文字列への暗黙の昇格はその文字列が I<Latin-1> でエンコードされているものと仮定しますが、 Unicode 文字列からのダウングレードは UTF-8 エンコーディングへと行われます。これは Unicode の最初の 256 文字が Latin-1 と共通であるからです。

=begin original

See L for more details.
=end original 詳細は L を参照してください。 =back =head2 Byte and Character Semantics (バイトと文字のセマンティクス) =begin original Beginning with version 5.6, Perl uses logically-wide characters to represent strings internally. =end original バージョン 5.6 から、Perl は論理的なワイド文字を内部的な文字列の表現のために使っています。 =begin original In future, Perl-level operations will be expected to work with characters rather than bytes. =end original 将来は、Perl レベルの操作はバイトではなく文字に対して働くことになるでしょう。 =begin original However, as an interim compatibility measure, Perl aims to provide a safe migration path from byte semantics to character semantics for programs. For operations where Perl can unambiguously decide that the input data are characters, Perl switches to character semantics. For operations where this determination cannot be made without additional information from the user, Perl decides in favor of compatibility and chooses to use byte semantics. =end original しかしながら、一時的な互換性の措置として、Perl はプログラムに対するバイトセマンティクスから文字セマンティクスへの安全な移行パスを提供することを目指します。入力データが文字であると Perl が曖昧さなく決定できる操作については、 Perl は文字セマンティクスに切り替えます。ユーザーからの付加的な情報抜きに決定することができない操作については Perl は互換性の観点からバイトセマンティクスを選択します。 =begin original Under byte semantics, when C is in effect, Perl uses the semantics associated with the current locale. Absent a C, and absent a C pragma, Perl currently uses US-ASCII (or Basic Latin in Unicode terminology) byte semantics, meaning that characters whose ordinal numbers are in the range 128 - 255 are undefined except for their ordinal numbers. This means that none have case (upper and lower), nor are any a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.) 
=end original バイトセマンティクスでは、C が有効の場合、Perl は現在のロケールに関連づけられたセマンティクスを使います。 C がなく、C もない場合、 Perl は現在のところ US-ASCII (または Unicode の用語では Basic Latin) バイトセマンティクスを使います; つまり番号 128 - 255 の範囲の文字は、その番号以外では未定義です。つまり、大文字小文字はなく、C<[:alpha:]> や C<\w> のような、どの文字クラスにも含まれません。 (しかし C<\W> クラスや Perl の正規表現拡張 C<[:^alpha:]> には属します。) =begin original This behavior preserves compatibility with earlier versions of Perl, which allowed byte semantics in Perl operations only if none of the program's inputs were marked as being a source of Unicode character data. Such data may come from filehandles, from calls to external programs, from information provided by the system (such as %ENV), or from literals and constants in the source text. =end original この動作は Perl の以前のバージョンとの互換性を維持し、プログラムの入力が Unicode の文字データのソースであるとマークされていない場合にのみ Perl の操作でバイトセマンティクスを許可します。そのようなデータは、ファイルハンドル、外部プログラムの呼び出し、システムから提供される情報(%ENV のような)、ソーステキスト中のリテラルや定数といったものからくるものです。 =begin original The C pragma will always, regardless of platform, force byte semantics in a particular lexical scope. See L. =end original C プラグマは常に、プラットフォームとは無関係に、特定のレキシカルスコープにおいてバイトセマンティクスを強制します。 L を参照してください。 =begin original The C pragma is intended to always, regardless of platform, force Unicode semantics in a particular lexical scope. In release 5.12, it is partially implemented, applying only to case changes. See L below. =end original C プラグマは、プラットフォームに関わらず常に特定のレキシカルスコープで Unicode セマンティクスを強制することを意図しています。リリース 5.12 では、これは部分的に実装されていて、大文字小文字変更にのみ適用されます。後述する L を参照してください。 =begin original The C pragma is primarily a compatibility device that enables recognition of UTF-(8|EBCDIC) in literals encountered by the parser. Note that this pragma is only required while Perl defaults to byte semantics; when character semantics become the default, this pragma may become a no-op. See L. 
=end original C プラグマは主としてパーサが遭遇するリテラル中の UTF-(8|EBCDIC) の認識を有効にする互換デバイス(compatibility device)です。このプラグマは Perl のデフォルトがバイトセマンティクスであるときにのみ必要であることに注意してください; 文字セマンティクスがデフォルトである場合には、このプラグマは何もしません。 L を参照してください。 =begin original Unless explicitly stated, Perl operators use character semantics for Unicode data and byte semantics for non-Unicode data. The decision to use character semantics is made transparently. If input data comes from a Unicode source--for example, if a character encoding layer is added to a filehandle or a literal Unicode string constant appears in a program--character semantics apply. Otherwise, byte semantics are in effect. The C pragma should be used to force byte semantics on Unicode data, and the C pragma to force Unicode semantics on byte data (though in 5.12 it isn't fully implemented). =end original 明示的に指定されない限り、Perl の演算子は Unicode データに対しては文字セマンティクスを用い、非 Unicode データに対してはバイトセマンティクスを用います。文字セマンティクスの使用の決定はトランスペアレントに行われます。もし入力データが Unicode ソースから来たもの -- たとえば、文字エンコーディング層がファイルハンドルに附加されているかリテラルの Unicode 文字列定数がプログラムの中にある -- のであれば文字セマンティクスが適用されます。そうでなければ、バイトセマンティクスが有効になります。 C プラグマは Unicode データに対してバイトセマンティクスを強制するときに使って、C プラグマをバイトデータで Unicode セマンティクスを強制するために使えます (しかし 5.12 ではこれは完全には実装されていません)。 =begin original If strings operating under byte semantics and strings with Unicode character data are concatenated, the new string will have character semantics. This can cause surprises: See L, below. You can choose to be warned when this happens. See L. =end original バイトセマンティクスの元での文字列の操作で、Unicode 文字データが連結された文字列であった場合、新たな文字列は文字セマンティックスを保ちます。これは驚きを引き起こすかもしれません: 後述する L を参照してください。これが起きたときに警告されるようにすることを選択できます。 L を参照してください。 =begin original Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is logically just a number ranging from 0 to 2**31 or so. Larger characters may encode into longer sequences of bytes internally, but this internal detail is mostly hidden for Perl code. See L for more. 
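As a rough illustration of the difference between the logical character and its internal bytes, here is a minimal sketch. It is not part of the original text; the sample character and the variable names are chosen only for this example, and the byte count assumes an ASCII (UTF-8) platform rather than EBCDIC.

```perl
use strict;
use warnings;

# One character above 255: U+263A WHITE SMILING FACE (sample chosen for
# this sketch).
my $smiley = "\x{263A}";

# Character semantics: the string is a single logical character.
my $chars     = length $smiley;
my $codepoint = ord $smiley;

# The bytes pragma (lexically scoped to the do block) exposes the internal
# encoding instead; this assumes an ASCII platform, where the character
# is stored as three bytes of UTF-8.
my $bytes = do { use bytes; length $smiley };

printf "chars=%d code point=U+%04X bytes=%d\n", $chars, $codepoint, $bytes;
```

Ordinary code should rarely need the byte count; it is shown here only to make the "internal detail" mentioned above visible.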
=end original

文字セマンティクスの元では、伝統的にバイトに対して働いていた操作の多くが文字に対して働きます。 Perl における文字は論理的には 0 から 2**31 程度までの範囲の単なる数値です。大きな文字は内部的にはより長いバイト列にエンコードされる可能性がありますが、この内部の詳細は Perl プログラムからほとんど隠されています。詳細は L を参照してください。

=head2 Effects of Character Semantics (文字セマンティクスの効果)

=begin original

Character semantics have the following effects:

=end original

文字セマンティクスは以下の効果を持っています:

=over 4

=item *

=begin original

Strings--including hash keys--and regular expression patterns may contain characters that have an ordinal value larger than 255.

=end original

文字列 -- ハッシュのキーを含め -- と正規表現パターンは序数値として 255 を超える値を持つ文字を含めることができます。

=begin original

If you use a Unicode editor to edit your program, Unicode characters may occur directly within the literal strings in UTF-8 encoding, or UTF-16. (The former requires a BOM or C<use utf8>, the latter requires a BOM.)

=end original

プログラムを編集するのに Unicode エディタを使っているのであれば、Unicode の文字を UTF-8 か UTF-16 のエンコーディングでリテラル文字列に直接含めることができます。 (前者は BOM か C<use utf8> を必要とし、後者は BOM を必要とします。)

=begin original

Unicode characters can also be added to a string by using the C<\N{U+...}> notation. The Unicode code for the desired character, in hexadecimal, should be placed in the braces, after the C<U+>. For instance, a smiley face is C<\N{U+263A}>.

=end original

Unicode の文字は C<\N{U+...}> 表記を使うことにより文字列に追加することもできます。目的の文字の Unicode コードは、16 進でブレースの中の C<U+> の後に置きます。たとえば、smiley face は C<\N{U+263A}> です。

=begin original

Alternatively, you can use the C<\x{...}> notation for characters 0x100 and above. For characters below 0x100 you may get byte semantics instead of character semantics; see L. On EBCDIC machines there is the additional problem that the value for such characters gives the EBCDIC character rather than the Unicode one.
=end original あるいは、0x100 以上の文字については C<\x{...}> 記法が使えます。 0x100 より小さい文字については文字セマンティクスではなくバイトセマンティクスを使います; L を参照してください。 EBCDIC マシンでは、このような文字の値が Unicode のものではなく EBCDIC のものになるという追加の問題があります。 =begin original Additionally, if you =end original これに加えて、 use charnames ':full'; =begin original you can use the C<\N{...}> notation and put the official Unicode character name within the braces, such as C<\N{WHITE SMILING FACE}>. See L. =end original とすると C<\N{...}> 表記を使うことができ、公式な Unicode 文字名を C<\N{WHITE SMILING FACE}> のようにブレースの中に置くことができます。 L を参照してください。 =item * =begin original If an appropriate L is specified, identifiers within the Perl script may contain Unicode alphanumeric characters, including ideographs. Perl does not currently attempt to canonicalize variable names. =end original 適切な L が指定されていれば、Perl スクリプトの中の識別子で表意文字を含めた Unicode の英数字を含めることができます。 Perl は現在、変数名を正規化しようとはしません。 =item * =begin original Regular expressions match characters instead of bytes. "." matches a character instead of a byte. =end original 正規表現はバイトではなく文字にマッチします。 "." は一バイトではなく、ひとつの文字にマッチします。 =item * =begin original Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. C<\w> can be used to match a Japanese ideograph, for instance. =end original 正規表現中の文字クラスはバイトではなく文字にマッチし、Unicode の特性データベースで定義されている文字特性に対してマッチを行います。たとえば、C<\w> は日本語の表意文字にマッチさせるために使うことができます。 =item * =begin original Named Unicode properties, scripts, and block ranges may be used like character classes via the C<\p{}> "matches property" construct and the C<\P{}> negation, "doesn't match property". See L for more details. =end original 名前付き Unicode 特性、用字、ブロック範囲は、 C<\p{}> 「特性にマッチング」構文および否定である C<\P{}> 「特性にマッチングしない」を使って文字クラスのように使えます。さらなる詳細については L を参照してください。 =begin original You can define your own character properties and use them in the regular expression with the C<\p{}> or C<\P{}> construct. See L for more details. 
=end original 独自の文字特性を定義して、C<\p{}> と C<\P{}> 構文によって正規表現でそれらを使うことができます。さらなる詳細については L を参照してください。 =item * =begin original The special pattern C<\X> matches a logical character, an "extended grapheme cluster" in Standardese. In Unicode what appears to the user to be a single character, for example an accented C, may in fact be composed of a sequence of characters, in this case a C followed by an accent character. C<\X> will match the entire sequence. =end original 特殊なパターン C<\X> は論理文字、標準で言うところの「拡張書記素クラスタ」にマッチングします。 Unicode では、ユーザーには単一の文字、例えばアクセント付きの C に見えるものが、実際には文字の並び、この場合では C に引き続いてアクセント文字から構成されるかもしれません。 C<\X> は並び全体にマッチングします。 =item * =begin original The C operator translates characters instead of bytes. Note that the C functionality has been removed. For similar functionality see pack('U0', ...) and pack('C0', ...). =end original C 演算子はバイトではなく文字で変換します。 C は削除されたことに注意してください。同様のことを行うには pack('U0', ...) と pack('C0', ...) を参照してください。 =item * =begin original Case translation operators use the Unicode case translation tables when character input is provided. Note that C, or C<\U> in interpolated strings, translates to uppercase, while C, or C<\u> in interpolated strings, translates to titlecase in languages that make the distinction (which is equivalent to uppercase in languages without the distinction). =end original 大小文字の変換演算子は Unicode の大小文字変換テーブルを、文字の入力があったときに使用します。 C や展開文字列中の C<\U> は大文字に変換し、C や展開文字列中の C<\u> はその言語で区別されているときにタイトルケースに変換します (これは、区別がない言語では大文字と等価です)。 =item * =begin original Most operators that deal with positions or lengths in a string will automatically switch to using character positions, including C, C, C, C, C, C, C, C, and C. An operator that specifically does not switch is C. Operators that really don't care include operators that treat strings as a bucket of bits such as C, and operators dealing with filenames. 
=end original 文字列の位置や長さを取り扱う演算子の大部分は自動的に文字の位置を使うように変更されました; これには C, C, C, C, C, C, C, C, C が含まれます。 C は変更されていません。文字列をビットのバケツのように扱う C、ファイル名を取り扱う演算子は文字かどうかを気にしません。 =item * =begin original The C/C letter C does I change, since it is often used for byte-oriented formats. Again, think C in the C language. =end original C/C の文字 C は I<変更されていません>; なぜなら、これらはしばしばバイト指向の書式のために使われるからです。繰り返しますが、C 言語の C を考えてください。 =begin original There is a new C specifier that converts between Unicode characters and code points. There is also a C specifier that is the equivalent of C/C and properly handles character values even if they are above 255. =end original Unicode の文字と符号位置の間の変換を行う新たな C 指定子があります。 C/C と等価で、文字の値が 255 を超えていても適切に扱える C 指定子もあります。 =item * =begin original The C and C functions work on characters, similar to C and C, I C and C. C and C are methods for emulating byte-oriented C and C on Unicode strings. While these methods reveal the internal encoding of Unicode strings, that is not something one normally needs to care about at all. =end original C 関数と C 関数は C や C のように文字に対して働き、C や C のようには I<働きません>。 C と C は Unicode 文字列においてバイト指向の C や C をエミュレートするためのメソッドです。これらのメソッドが Unicode 文字列の内部エンコーディングを明らかにするので、通常はケアする必要はありません。 =item * =begin original The bit string operators, C<& | ^ ~>, can operate on character data. However, for backward compatibility, such as when using bit string operations when characters are all less than 256 in ordinal value, one should not use C<~> (the bit complement) with characters of both values less than 256 and values greater than 256. Most importantly, DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>) will not hold. The reason for this mathematical I is that the complement cannot return B the 8-bit (byte-wide) bit complement B the full character-wide bit complement. 
=end original ビット文字列演算子 C<& | ^ ~> は文字データを操作できます。しかし、例えば全ての文字の値が 255 以下のときにビット文字列演算を使った場合の後方互換性のために、 256 以上の値の文字と 255 以下の値の文字の両方が含まれている文字列に C<~> (ビット補数) を使うべきではありません。最も重要なことは、ド・モルガンの法則 (C<~($x|$y) eq ~$x&~$y> と C<~($x&$y) eq ~$x|~$y>) が成り立たないということです。この数学的な I<過失> の理由は補数(complement)が 8 ビットのビット補数 B<および> 文字幅のビット補数の B<両方> を返すことができないためです。 =item * =begin original You can define your own mappings to be used in lc(), lcfirst(), uc(), and ucfirst() (or their string-inlined versions). See L for more details. =end original lc(), lcfirst(), uc(), ucfirst() (およびこれらの文字列インライン版) で使える独自のマッピングを定義できます。更なる詳細については L を参照してください。 =back =over 4 =item * =begin original And finally, C reverses by character rather than by byte. =end original そして最後に、C はバイト単位ではなく文字単位で反転を行います。 =back =head2 Unicode Character Properties (Unicode 文字特性) =begin original Most Unicode character properties are accessible by using regular expressions. They are used like character classes via the C<\p{}> "matches property" construct and the C<\P{}> negation, "doesn't match property". =end original ほとんどの Unicode 文字特性は正規表現を使ってアクセス可能です。それらは C<\p{}> "matches property" 構造やその否定形の C<\P{}> "doesn't match property" を使った文字クラスで使うことができます。 =begin original For instance, C<\p{Uppercase}> matches any character with the Unicode "Uppercase" property, while C<\p{L}> matches any character with a General_Category of "L" (letter) property. Brackets are not required for single letter properties, so C<\p{L}> is equivalent to C<\pL>. 
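The following is a small sketch of the property matches just described, not text from the original document. The two Greek sample characters are assumptions made for illustration; any uppercase and lowercase letter pair would do.

```perl
use strict;
use warnings;

# Sample characters chosen for this sketch.
my $alpha_uc = "\x{0391}";   # GREEK CAPITAL LETTER ALPHA, category Lu
my $alpha_lc = "\x{03B1}";   # GREEK SMALL LETTER ALPHA, category Ll

my @results = (
    ($alpha_uc =~ /^\p{Uppercase}$/ ? 1 : 0),   # has the Uppercase property
    ($alpha_lc =~ /^\p{Uppercase}$/ ? 1 : 0),   # does not
    ($alpha_lc =~ /^\P{Uppercase}$/ ? 1 : 0),   # \P{} is the negation
    ($alpha_lc =~ /^\p{L}$/         ? 1 : 0),   # any letter
    ($alpha_lc =~ /^\pL$/           ? 1 : 0),   # same thing without braces
    ("7"       =~ /^\pL$/           ? 1 : 0),   # a digit is not a letter
);

print "@results\n";   # prints "1 0 1 1 1 0"
```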
=end original

たとえば、C<\p{Uppercase}> は Unicode の "Uppercase" 特性を持つ任意の文字にマッチし、C<\p{L}> は一般カテゴリ "L" (letter) 特性を持つ任意の文字にマッチします。ブラケットは一文字の特性では省略することができるので、C<\p{L}> は C<\pL> と等価です。

=begin original

More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase property value is True, and C<\P{Uppercase}> matches any character whose Uppercase property value is False, and they could have been written as C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

=end original

より正式には、C<\p{Uppercase}> は Unicode の Uppercase 特性値が True である任意の文字とマッチングし、C<\P{Uppercase}> は Uppercase 特性値が False である任意の文字とマッチングします; そしてこれらはそれぞれ C<\p{Uppercase=True}>, C<\p{Uppercase=False}> と書けます。

=begin original

This formality is needed when properties are not binary, that is, if they can take on more values than just True and False. For example, the Bidi_Class property (see L below) can take on a number of different values, such as Left, Right, Whitespace, and others. To match these, one needs to specify the property name (Bidi_Class), and the value being matched against (Left, Right, I<etc.>). This is done, as in the examples above, by having the two components separated by an equals sign (or interchangeably, a colon), like C<\p{Bidi_Class: Left}>.

=end original

この形式は、特性が 2 値でない場合、つまり、単に True と False より多くの値を取ることができる場合に必要です。たとえば、Bidi_Class (L を参照)は、 Left、Right、Whitespace などのさまざまな値を取ることができます。これらにマッチングするには、特性名(Bidi_Class)と、マッチングする値 (Left、Right I<など>) を指定する必要があります。これは、前述の例のように、二つの要素を等号 (または、C<\p{Bidi_Class: Left}> のように交換可能なコロン)で区切ることによって、実行されます。

=begin original

All Unicode-defined character properties may be written in these compound forms of C<\p{property=value}> or C<\p{property:value}>, but Perl provides some additional properties that are written only in the single form, as well as single-form short-cuts for all binary properties and certain others described below, in which you may omit the property name and the equals or colon separator.
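A short sketch of the compound forms follows. It is an illustration only, not part of the original text: the sample characters are assumptions, and it uses the short Bidi_Class values C<R> and C<L> rather than the longer value names.

```perl
use strict;
use warnings;

# Sample characters chosen for this sketch.
my $alef = "\x{05D0}";   # HEBREW LETTER ALEF, Bidi_Class R

# Each entry is [ description, result of the match as 1 or 0 ].
my @checks = (
    [ 'Alpha \p{Uppercase=True}'  => "\x{0391}" =~ /^\p{Uppercase=True}$/  ? 1 : 0 ],
    [ 'alpha \p{Uppercase=False}' => "\x{03B1}" =~ /^\p{Uppercase=False}$/ ? 1 : 0 ],
    [ 'alef  \p{Bidi_Class:R}'    => $alef      =~ /^\p{Bidi_Class:R}$/    ? 1 : 0 ],
    [ 'alef  \p{Bidi_Class=R}'    => $alef      =~ /^\p{Bidi_Class=R}$/    ? 1 : 0 ],
    [ 'A     \p{Bidi_Class: L}'   => "A"        =~ /^\p{Bidi_Class: L}$/   ? 1 : 0 ],
    [ 'alef  \p{Bidi_Class=L}'    => $alef      =~ /^\p{Bidi_Class=L}$/    ? 1 : 0 ],
);

my @values = map { $_->[1] } @checks;
printf "%-28s %d\n", @$_ for @checks;
```

The colon and the equals sign are interchangeable here, so the third and fourth checks give the same answer.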
=end original すべての Unicode が定義した文字特性は、C<\p{property=value}> や C<\p{property:value}> のような複合形式で書けますが、 Perl は特性名および等号やコロンの区切り文字を省略できるように、単一形式でのみ書ける追加の特性や、全ての 2 値特性と一部の後述するものに対する単一形式のショートカットを提供します。 =begin original Most Unicode character properties have at least two synonyms (or aliases if you prefer), a short one that is easier to type, and a longer one which is more descriptive and hence it is easier to understand what it means. Thus the "L" and "Letter" above are equivalent and can be used interchangeably. Likewise, "Upper" is a synonym for "Uppercase", and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically various synonyms for the values the property can be. For binary properties, "True" has 3 synonyms: "T", "Yes", and "Y"; and "False has correspondingly "F", "No", and "N". But be careful. A short form of a value for one property may not mean the same thing as the same short form for another. Thus, for the General_Category property, "L" means "Letter", but for the Bidi_Class property, "L" means "Left". A complete list of properties and synonyms is in L. =end original ほとんどの Unicode 文字特性には、少なくとも二つの同義語 (またはあなたが好むなら別名)があります; 簡単に入力できる短いものと、より長いけれども説明的で意味が理解しやすいものです。したがって、前述の "L"および "Letter" は同等であり、交換可能です。同様に、"Upper" は "Uppercase" の同義語であり、C<\p{Uppercase}> は等価に C<\p{Upper}> と書けます。また、典型的には特性の値に対してさまざまな同義語があります。 2 値特性の場合、"True" には三つの同義語があります: "T", "Yes", "Y"; "False" には "F", "No", "N" があります。しかし注意してください。ある特性に対する値の短い形式は、他の特性の同じ短い形式と同じものを意味するとは限りません。従って、General_Category 特性では "L" は "Letter" を意味しますが、 Bidi_Class 特性では、"L" は "Left" を意味します。特性および同義語の完全な一覧は L にあります。 =begin original Upper/lower case differences in the property names and values are irrelevant, thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>. Similarly, you can add or subtract underscores anywhere in the middle of a word, so that these are also equivalent to C<\p{U_p_p_e_r}>. 
And white space is irrelevant adjacent to non-word characters, such as the braces and the equals or colon separators so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are equivalent to these as well. In fact, in most cases, white space and even hyphens can be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is equivalent. All this is called "loose-matching" by Unicode. The few places where stricter matching is employed is in the middle of numbers, and the Perl extension properties that begin or end with an underscore. Stricter matching cares about white space (except adjacent to the non-word characters) and hyphens, and non-interior underscores. =end original 特性名と値の大文字と小文字の違いは無関係です; したがって C<\p{Upper}> は C<\p{upper}>, さらには C<\p{UpPeR}> とも同じことを意味します。同様に、単語の中のどこにでも下線を追加または削除できるので、これらは C<\p{U_p_p_e_r}> とも等価です。また、中かっこや等号、コロンなどの非単語文字に隣接した空白は無視されるので、 C<\p{ Upper }> and C<\p{ Upper_case : Y }> も等価です。実際には、ほとんどの場合、空白とハイフンさえどこにでも追加または削除できます。したがって、C<\p{Upper case=Yes}> ですらも等価です。これはすべて Unicode で「緩いマッチング」と呼ばれます。数少ない厳密なマッチングが採用されている場所は数値の中と、下線で始まったり終わったりする Perl 拡張特性です。より厳密なマッチングは空白ス(非単語文字に隣接するものを除く)とハイフン、および非内部下線を考慮します。 =begin original You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret (^) between the first brace and the property name: C<\p{^Tamil}> is equal to C<\P{Tamil}>. =end original C<\p{}> と C<\P{}> の両方で、キャレット(^) を最初のブレースと特性名の間に置くことによって意味を反転することができます: C<\p{^Tamil}> は C<\P{Tamil}> と等価です。 =head3 B =begin original Every Unicode character is assigned a general category, which is the "most usual categorization of a character" (from L). =end original 全ての Unicode 文字は一つの一般カテゴリに割り当てられています; これは「その文字の最も普通のカテゴライズ」 (L より)です。 =begin original The compound way of writing these is like C<\p{General_Category=Number}> (short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up through the equal or colon separator is omitted. So you can instead just write C<\pN>. 
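To make the equivalence of the three spellings concrete, here is a minimal sketch (an illustration added here, not part of the original text). The sample characters are assumptions picked to cover the Nd, Nl, and No sub-categories plus one non-number.

```perl
use strict;
use warnings;

# The long compound form, the short compound form, and the Perl shortcut.
my @forms = (
    qr/^\p{General_Category=Number}$/,
    qr/^\p{gc:n}$/,
    qr/^\pN$/,
);

# Sample characters chosen for this sketch.
my @samples = (
    "5",          # DIGIT FIVE               (Nd)
    "\x{0663}",   # ARABIC-INDIC DIGIT THREE (Nd)
    "\x{2167}",   # ROMAN NUMERAL EIGHT      (Nl)
    "\x{2460}",   # CIRCLED DIGIT ONE        (No)
    "x",          # a letter, not a number
);

# One row per sample: a 1 or 0 for each of the three forms.
my @rows;
for my $ch (@samples) {
    push @rows, join '', map { $ch =~ $_ ? 1 : 0 } @forms;
}

print "@rows\n";   # prints "111 111 111 111 000"
```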
=end original

これらを書く複合的な方法は C<\p{General_Category=Number}> (短縮形は C<\p{gc:n}>) のようなものです。 Perl は等号またはコロンの区切り文字までの全てを省略できる機能を提供しています。従って、代わりに単に C<\pN> と書けます。

=begin original

Here are the short and long forms of the General Category properties:

=end original

以下は、Unicode の一般カテゴリ特性(General Category properties) の短形式と長形式です:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate (not usable)
    Co          Private_Use
    Cn          Unassigned

=begin original

Single-letter properties match all characters in any of the two-letter sub-properties starting with the same letter. C<LC> and C<L&> are special cases, which are aliases for the set of C<Ll>, C<Lu>, and C<Lt>.

=end original

単一文字の特性は同じ文字で始まる二文字の任意のサブ特性に含まれるすべての文字にマッチします。 C<LC> と C<L&> は特別なケースで、これは C<Ll>, C<Lu>, C<Lt> の別名です。

=begin original

Because Perl hides the need for the user to understand the internal representation of Unicode characters, there is no need to implement the somewhat messy concept of surrogates. C<Cs> is therefore not supported.
=end original

Perl はユーザーが Unicode 文字の内部表現について理解する必要がないようにしているので、サロゲートの面倒なコンセプトについて実装する必要はありません。従って、C<Cs> はサポートされていません。

=head3 B<Bidirectional Character Types> (B<双方向文字型>)

=begin original

Because scripts differ in their directionality--Hebrew is written right to left, for example--Unicode supplies these properties in the Bidi_Class class:

=end original

用字はその方向性で異なるので--たとえばヘブライ語は右から左に書きます -- Unicode は以下の特性を Bidi_Class クラスで提供しています:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

=begin original

This property is always written in the compound form. For example, C<\p{Bidi_Class:R}> matches characters that are normally written right to left.

=end original

この特性は常に複合形式で書かれます。たとえば、C<\p{Bidi_Class:R}> は通常右から左に書く文字にマッチします。

=head3 B<Scripts> (B<用字>)

=begin original

The world's languages are written in a number of scripts. This sentence (unless you're reading it in translation) is written in Latin, while Russian is written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in Hiragana or Katakana. There are many more.

=end original

世界の言語は様々な用字で書かれています。この文は(訳文を読んでいない限り)ラテン文字で書かれていますが、ロシア語はキリル文字で書かれています; そしてギリシャ語は、ええと、ギリシャ文字です; 日本語は主にひらがなやカタカナで書かれています。もっとたくさんあります。

=begin original

The Unicode Script property gives what script a given character is in, and can be matched with the compound form like C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>). Perl furnishes shortcuts for all script names. You can omit everything up through the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
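The script forms above can be tried with a minimal sketch such as the following. It is an illustration only, not part of the original text, and the three sample characters are assumptions chosen so that each belongs to exactly one of the scripts named.

```perl
use strict;
use warnings;

# Sample characters chosen for this sketch.
my $alef    = "\x{05D0}";   # HEBREW LETTER ALEF
my $zhe     = "\x{0416}";   # CYRILLIC CAPITAL LETTER ZHE
my $latin_a = "a";          # LATIN SMALL LETTER A

my @results = (
    ($alef    =~ /^\p{Script=Hebrew}$/ ? 1 : 0),   # long compound form
    ($alef    =~ /^\p{sc=hebr}$/       ? 1 : 0),   # short compound form
    ($latin_a =~ /^\p{Latin}$/         ? 1 : 0),   # single-form shortcut
    ($latin_a =~ /^\P{Cyrillic}$/      ? 1 : 0),   # "a" is not Cyrillic
    ($zhe     =~ /^\P{Cyrillic}$/      ? 1 : 0),   # ZHE is Cyrillic, so no match
    ($zhe     =~ /^\p{Cyrillic}$/      ? 1 : 0),
);

print "@results\n";   # prints "1 1 1 1 0 1"
```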
=end original Unicode Script特性は、指定された文字の中にある用字を示し、 C<\p{Script=Hebrew}> (短縮: C<\p{sc=hebr}>) のような複合形式でマッチングさせることができます。 Perlは、すべての用字名のショートカットを提供します。等号(またはコロン)までのすべてを省略できます; そして単に C<\p{Latin}> や C<\P{Cyrillic}> と書けます。 =begin original A complete list of scripts and their shortcuts is in L. =end original 用字とその省略形の完全な一覧は L にあります。 =head3 B (B<"Is" 接頭辞の使用>) =begin original For backward compatibility (with Perl 5.6), all properties mentioned so far may have C or C prepended to their name, so C<\P{Is_Lu}>, for example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to C<\p{Arabic}>. =end original (Perl 5.6 との)後方互換性のため、すべての特性はその名前の前に C または C を置くことができます; したがって、C<\P{Is_Lu}> は C<\P{Lu}> と等価で、C<\p{IsScript:Arabic}> は C<\p{Arabic}> と等価です。 =head3 B (B<ブロック>) =begin original In addition to B, Unicode also defines B of characters. The difference between scripts and blocks is that the concept of scripts is closer to natural languages, while the concept of blocks is more of an artificial grouping based on groups of Unicode characters with consecutive ordinal values. For example, the "Basic Latin" block is all characters whose ordinals are between 0 and 127, inclusive, in other words, the ASCII characters. The "Latin" script contains some letters from this block as well as several more, like "Latin-1 Supplement", "Latin Extended-A", I, but it does not contain all the characters from those blocks. It does not, for example, contain digits, because digits are shared across many scripts. Digits and similar groups, like punctuation, are in the script called C. There is also a script called C for characters that modify other characters, and inherit the script value of the controlling character. 
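The distinction between scripts and blocks can be seen in a small sketch like the one below. It is an illustration added here, not part of the original text; the sample characters are assumptions, and the compound forms C<\p{Script=...}> and C<\p{Block=...}> are used throughout to keep the two properties visibly apart.

```perl
use strict;
use warnings;

# Sample characters chosen for this sketch.
my $a_macron = "\x{0100}";   # LATIN CAPITAL LETTER A WITH MACRON

my @results = (
    ("a"       =~ /^\p{Script=Latin}$/           ? 1 : 0),   # letter in the Latin script
    ("5"       =~ /^\p{Script=Latin}$/           ? 1 : 0),   # digits are not Latin...
    ("5"       =~ /^\p{Script=Common}$/          ? 1 : 0),   # ...they are Common
    ("5"       =~ /^\p{Block=Basic_Latin}$/      ? 1 : 0),   # yet inside the Basic Latin block
    ($a_macron =~ /^\p{Script=Latin}$/           ? 1 : 0),   # Latin script
    ($a_macron =~ /^\p{Block=Basic_Latin}$/      ? 1 : 0),   # but outside Basic Latin
    ($a_macron =~ /^\p{Block=Latin_Extended_A}$/ ? 1 : 0),   # in Latin Extended-A instead
);

print "@results\n";   # prints "1 0 1 1 1 0 1"
```

The digit is the telling case: it sits in the "Basic Latin" block by code point, but its script is Common because digits are shared across many scripts.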
=end original B<用字> に加え、Unicode では文字の B<ブロック> を定義しています。用字とブロックの違いは、用字のコンセプトが自然言語に密着したものであるのに対して、ブロックのコンセプトは連続した番号を持つ Unicode 文字のグループに基づいたより人工的なグループ分けであることです。たとえば、"Basic Latin" ブロックは番号 0 から 127 までの全ての文字、言い換えると ASCII 文字です。 "Latin" 用字は、このブロックの文字と、"Latin-1 Supplement", "Latin Extended-A" I<など> のいくつかのブロックの文字を含んでいますが、それらのブロックのすべての文字を含んではいません。例を挙げると、数字は多くの用字を越えて共有されているので、 (Latin 用字は)数字を含みません。数字と、句読点のような同様のグループは C と呼ばれる用字にあります。他の文字を修正して、制御文字の用字の値を継承する文字のための C と呼ばれる用字もあります。 =begin original For more about scripts versus blocks, see UAX#24 "Unicode Script Property": L =end original 用字とブロックに違いに関する詳細については、 UAX#24 "Unicode Script Property" L を参照してください。 =begin original The Script property is likely to be the one you want to use when processing natural language; the Block property may be useful in working with the nuts and bolts of Unicode. =end original 用字特性は自然言語を処理するときにおそらく使いたいと思うようなものです; ブロック特性は Unicode の基本的な部分で動作させるのに有用です。 =begin original Block names are matched in the compound form, like C<\p{Block: Arrows}> or C<\p{Blk=Hebrew}>. Unlike most other properties only a few block names have a Unicode-defined short name. But Perl does provide a (slight) shortcut: You can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards compatibility, the C prefix may be omitted if there is no naming conflict with a script or any other property, and you can even use an C prefix instead in those cases. But it is not a good idea to do this, for a couple reasons: =end original ブロック名は C<\p{Block: Arrows}> や C<\p{Blk=Hebrew}> のような復号形式でマッチングします。その他のほとんどの特性と違って、いくつかのブロック名だけが Unicode が定義した短い名前を持ちます。しかし Perl は(多少の)ショートカットを提供します: 例えば C<\p{In_Arrows}> や C<\p{In_Hebrew}> のように書けます。後方互換性のために、C 接頭辞は用字や他の特性と衝突しなければ省略することも可能ですし、このような場合で C 接頭辞を使うこともできます。しかしそうするのはいい考えではありません; いくつかの理由があります: =over 4 =item 1 =begin original It is confusing. There are many naming conflicts, and you may forget some. For example, C<\p{Hebrew}> means the I