package with names like C (for lc() and lcfirst()), C (for the first character in ucfirst()), and C (for uc(), and the rest of the characters in ucfirst()). =end original Æ±ÍÍ¤Ë¡¢lc()¡¢lcfirst()¡¢uc()¡¢ucfirst() (¤¢¤ë¤¤¤Ï¤½¤ÎÊ¸»úÎóÁÈ¤ß¹þ¤ßÈÇ)¤Ç ¤¢¤Ê¤¿¼«¿È¤ÎÂÐ±þ´Ø·¸¤òÄêµÁ¤¹¤ë¤³¤È¤â¤Ç¤¤Þ¤¹¡£ ¸¶Â§¤Ï ¥æ¡¼¥¶¡¼ÄêµÁÊ¸»úÆÃÀ¤Î¾ì¹ç¤È»÷¤Æ¤¤¤Þ¤¹: C (lc() ¤È lcfirst()ÍÑ), C (ucfirst() ¤ÎºÇ½é¤ÎÊ¸»úÍÑ), C (uc() ÍÑ¤È ucfirst() ¤Î »Ä¤ê¤ÎÊ¸»úÍÑ) ¤Î¤è¤¦¤ÊÌ¾Á°¤Î¥µ¥Ö¥ë¡¼¥Á¥ó¤ò C

¥Ñ¥Ã¥±¡¼¥¸¤ÇÄêµÁ¤·¤Þ¤¹¡£ =begin original The string returned by the subroutines needs now to be three hexadecimal numbers separated by tabulators: start of the source range, end of the source range, and start of the destination range. For example: =end original ¥µ¥Ö¥ë¡¼¥Á¥ó¤«¤éÊÖ¤µ¤ì¤ëÊ¸»úÎó¤Ï¥¿¥Ö¤Ç¶èÀÚ¤é¤ì¤¿ 3 ¤Ä¤Î 16 ¿Ê¿ô¤ò É¬Í×¤È¤·¤Þ¤¹: ¥½¡¼¥¹¤ÎÈÏ°Ï¤Î»Ï¤Þ¤ê¡¢¥½¡¼¥¹¤ÎÈÏ°Ï¤Î½ª¤ï¤ê¡¢¤½¤·¤Æ ¥Ç¥¹¥Æ¥£¥Í¡¼¥·¥ç¥óÈÏ°Ï¤Î»Ï¤Þ¤ê¤Ç¤¹¡£ Îã¤òµó¤²¤Þ¤·¤ç¤¦: sub ToUpper { return </F. The mapping data is returned as the here-document, and the C are special exception mappings derived from <$Config{privlib}>/F. The C and C mappings that one can see in the directory are not directly user-accessible, one can use either the C module, or just match case-insensitively (that's when the C mapping is used). =end original (¿¿·õ¤Ê¥Ï¥Ã¥«¡¼ÀìÍÑ) ¥Ç¥Õ¥©¥ë¥È¤Î¥Þ¥Ã¥Ô¥ó¥°¤òÆâ¾Ê¤·¤¿¤¤¤Î¤Ê¤é¡¢ C<$Config{privlib}>/F ¤È¤¤¤¦¥Ç¥£¥ì¥¯¥È¥ê¤Ë¥Ç¡¼¥¿¤ò ¸«¤Ä¤±½Ð¤¹¤³¤È¤¬¤Ç¤¤Þ¤¹¡£ ¥Þ¥Ã¥Ô¥ó¥°¥Ç¡¼¥¿¤Ï¥Ò¥¢¥É¥¥å¥á¥ó¥È¤È¤·¤ÆÊÖ¤µ¤ì¡¢C ¤Ï C<$Config{privlib}>/F ¤«¤éÇÉÀ¸¤·¤¿ÆÃ¼ì¤Ê Îã³°¥Þ¥Ã¥Ô¥ó¥°¤Ç¤¹¡£ ¤½¤Î¥Ç¥£¥ì¥¯¥È¥ê¤Ç¸«¤Ä¤±¤ë¤³¤È¤Î¤Ç¤¤ë C ¤È C ¤Î¥Þ¥Ã¥Ô¥ó¥°¤Ï ¥æ¡¼¥¶¡¼¤¬¥À¥¤¥ì¥¯¥È¤Ë¥¢¥¯¥»¥¹¤Ç¤¤º¡¢C ¥â¥¸¥å¡¼¥ë¤ò»È¤¦¤« Âç¾®Ê¸»ú¤òÌµ»ë¤·¤Æ¥Þ¥Ã¥Ô¥ó¥°¤·¤Þ¤¹(C ¥Þ¥Ã¥Ô¥ó¥°¤¬»È¤ï¤ì¤Æ¤¤¤ë¤È¤)¡£ =begin original A final note on the user-defined case mappings: they will be used only if the scalar has been marked as having Unicode characters. Old byte-style strings will not be affected. =end original ¥æ¡¼¥¶¡¼ÄêµÁ¤ÎÂçÊ¸»ú¡¦¾®Ê¸»ú¤ÎÂÐ±þ´Ø·¸¤Ë´Ø¤¹¤ëºÇ¸å¤ÎÃí°Õ: ¤³¤ì¤é¤Ï¥¹¥«¥é¤¬ Unicode Ê¸»ú¤È¤·¤Æ¥Þ¡¼¥¯¤µ¤ì¤Æ¤¤¤ë¤È¤¤Ë¤Î¤ß»È¤ï¤ì¤Þ¤¹¡£ ¸Å¤¤¥Ð¥¤¥È·Á¼°¤ÎÊ¸»úÎó¤Ë¤Ï±Æ¶Á¤òµÚ¤Ü¤·¤Þ¤»¤ó¡£ =head2 Character Encodings for Input and Output (Æþ½ÐÎÏ¤Î¤¿¤á¤ÎÊ¸»ú¥¨¥ó¥³¡¼¥Ç¥£¥ó¥°) =begin original See L. =end original L ¤ò»²¾È¤·¤Æ¤¯¤À¤µ¤¤¡£ =head2 Unicode Regular Expression Support Level (Unicode Àµµ¬É½¸½ÂÐ±þ¥ì¥Ù¥ë) =begin original The following list of Unicode support for regular expressions describes all the features currently supported. 
The references to "Level N" and the section numbers refer to the Unicode Technical Standard #18, "Unicode Regular Expressions", version 11, in May 2005. =end original °Ê²¼¤Ëµó¤²¤ë¥ê¥¹¥È¤Ï¡¢¸½ºßÂÐ±þ¤·¤Æ¤¤¤ëÁ´¤Æ¤Îµ¡Ç½¤òµ½Ò¤¹¤ë¡¢ Àµµ¬É½¸½¤Î¤¿¤á¤Î Unicode ÂÐ±þ¤Î¥ê¥¹¥È¤Ç¤¹¡£ "Level N" ¤ËÂÐ¤¹¤ë»²¾È¤È¥»¥¯¥·¥ç¥óÈÖ¹æ¤Ï Unicode Technical Standard #18, "Unicode Regular Expressions", version 11, in May 2005 ¤ò»²¾È¤·¤Æ¤¤¤Þ¤¹¡£ =over 4 =item * Level 1 - Basic Unicode Support RL1.1 Hex Notation - done [1] RL1.2 Properties - done [2][3] RL1.2a Compatibility Properties - done [4] RL1.3 Subtraction and Intersection - MISSING [5] RL1.4 Simple Word Boundaries - done [6] RL1.5 Simple Loose Matches - done [7] RL1.6 Line Boundaries - MISSING [8] RL1.7 Supplementary Code Points - done [9] [1] \x{...} [2] \p{...} \P{...} [3] supports not only minimal list (general category, scripts, Alphabetic, Lowercase, Uppercase, WhiteSpace, NoncharacterCodePoint, DefaultIgnorableCodePoint, Any, ASCII, Assigned), but also bidirectional types, blocks, etc. (see "Unicode Character Properties") [4] \d \D \s \S \w \W \X [:prop:] [:^prop:] [5] can use regular expression look-ahead [a] or user-defined character properties [b] to emulate set operations [6] \b \B [7] note that Perl does Full case-folding in matching, not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9, not with 1F80. This difference matters mainly for certain Greek capital letters with certain modifiers: the Full case-folding decomposes the letter, while the Simple case-folding would map it to a single character. [8] should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r), CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029); should also affect <>, $., and script line numbers; should not split lines within CRLF [c] (i.e. there is no empty line between \r and \n) [9] UTF-8/UTF-EBDDIC used in perl allows not only U+10000 to U+10FFFF but also beyond U+10FFFF [d] =begin original [a] You can mimic class subtraction using lookahead. 
For example, what UTS#18 might write as =end original [a] class subtraction ¤òÀèÆÉ¤ß¤ò»È¤Ã¤ÆÌÏÊï¤¹¤ë¤³¤È¤¬¤Ç¤¤Þ¤¹¡£ ¤¿¤È¤¨¤Ð¡¢°Ê²¼¤Î UTR #18 ¤Ï [{Greek}-[{UNASSIGNED}]] =begin original in Perl can be written as: =end original °Ê²¼¤Î¤è¤¦¤Ë Perl ¤Çµ½Ò¤Ç¤¤Þ¤¹: (?!\p{Unassigned})\p{InGreekAndCoptic} (?=\p{Assigned})\p{InGreekAndCoptic} =begin original But in this particular example, you probably really want =end original ¤·¤«¤·¡¢¤³¤ÎÆÃÄê¤ÎÎã¤Ç¤Ï¡¢¤¢¤Ê¤¿¤¬¼ÂºÝ¤ËË¾¤ó¤Ç¤¤¤¿¤Î¤Ï¼¡¤Î¤â¤Î¤Ç¤·¤ç¤¦ \p{GreekAndCoptic} =begin original which will match assigned characters known to be part of the Greek script. =end original ¤³¤ì¤Ï Greek ÍÑ»ú¤Î°ìÉô¤È¤·¤ÆÃÎ¤é¤ì¤Æ¤¤¤ë assigned character ¤Ë¥Þ¥Ã¥Á¤·¤Þ¤¹¡£ =begin original Also see the Unicode::Regex::Set module, it does implement the full UTS#18 grouping, intersection, union, and removal (subtraction) syntax. =end original Æ±ÍÍ¤Ë Unicode::Regex::Set ¥â¥¸¥å¡¼¥ë¤ò»²¾È¤·¤Æ¤¯¤À¤µ¤¤¡£ ¤³¤ì¤Ï UTR #18¤Î¥°¥ë¡¼¥Ô¥ó¥°¡¢intersection¡¢union, removal(substraction)¹½Ê¸¤ò ¥Õ¥ë¤Ë¼ÂÁõ¤·¤Æ¤¤¤Þ¤¹¡£ =begin original [b] '+' for union, '-' for removal (set-difference), '&' for intersection (see L) =end original [b] ·ë¹ç¤Î¤¿¤á¤Ë¤Ï '+'¡¢½üµî(º¹½¸¹ç)¤Î¤¿¤á¤Ë¤Ï '-'¡¢ ¶¦ÄÌ½¸¹ç¤Î¤¿¤á¤Ë¤Ï '&' ¤Ç¤¹ (L ¤ò»²¾È¤·¤Æ¤¯¤À¤µ¤¤) =begin original [c] Try the C<:crlf> layer (see L). =end original [c] C<:crlf> ÁØ¤ò»î¤·¤Æ¤¯¤À¤µ¤¤ (L ¤ò»²¾È¤·¤Æ¤¯¤À¤µ¤¤)¡£ =begin original [d] Avoid C (or say C) to allow U+FFFF (C<\x{FFFF}>). =end original [d] U+FFFF (C<\x{FFFF}>) ¤òµö²Ä¤¹¤ë¤¿¤á¤Ë¡¢C ¤ò ¤·¤Ê¤¤¤Ç¤¯¤À¤µ¤¤ (¤Þ¤¿¤Ï C ¤È¤·¤Æ¤¯¤À¤µ¤¤)¡£ =item * Level 2 - Extended Unicode Support RL2.1 Canonical Equivalents - MISSING [10][11] RL2.2 Default Grapheme Clusters - MISSING [12][13] RL2.3 Default Word Boundaries - MISSING [14] RL2.4 Default Loose Matches - MISSING [15] RL2.5 Name Properties - MISSING [16] RL2.6 Wildcard Properties - MISSING [10] see UAX#15 "Unicode Normalization Forms" [11] have Unicode::Normalize but not integrated to regexes [12] have \X but at this level . 
should equal that [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable clusters as a single grapheme cluster. [14] see UAX#29, Word Boundaries [15] see UAX#21 "Case Mappings" [16] have \N{...} but neither compute names of CJK Ideographs and Hangul Syllables nor use a loose match [e] =begin original [e] C<\N{...}> allows namespaces (see L). =end original [e] C<\N{...}> ¤ÏÌ¾Á°¶õ´Ö¤òµö²Ä¤·¤Þ¤¹ (L ¤ò»²¾È¤·¤Æ¤¯¤À¤µ¤¤)¡£ =item * Level 3 - Tailored Support RL3.1 Tailored Punctuation - MISSING RL3.2 Tailored Grapheme Clusters - MISSING [17][18] RL3.3 Tailored Word Boundaries - MISSING RL3.4 Tailored Loose Matches - MISSING RL3.5 Tailored Ranges - MISSING RL3.6 Context Matching - MISSING [19] RL3.7 Incremental Matches - MISSING ( RL3.8 Unicode Set Sharing ) RL3.9 Possible Match Sets - MISSING RL3.10 Folded Matching - MISSING [20] RL3.11 Submatchers - MISSING [17] see UAX#10 "Unicode Collation Algorithms" [18] have Unicode::Collate but not integrated to regexes [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see outside of the target substring [20] need insensitive matching for linguistic features other than case; for example, hiragana to katakana, wide and narrow, simplified Han to traditional Han (see UTR#30 "Character Foldings") =back =head2 Unicode Encodings (Unicode ¤Î¥¨¥ó¥³¡¼¥Ç¥£¥ó¥°) =begin original Unicode characters are assigned to I

, which are abstract
numbers.  To use these numbers, various encodings are needed.

=end original

Unicode characters are assigned to I<code points>, which are abstract
numbers.  To use these numbers, various encodings are needed.

=over 4

=item *

UTF-8

UTF-8 is a variable-length, byte-order independent encoding.  The
original design allowed 1 to 6 bytes per character; current character
allocations require at most 4 bytes.  For ASCII (and we really do mean
7-bit ASCII, not another 8-bit encoding), UTF-8 is transparent.

The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding must always
be used.  So that's what Perl does.

Another way to look at it is via bits:

 Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa

As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
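
The bit layout above can be turned into code fairly directly.  The
following is only an illustrative sketch for ASCII platforms: the
helper names C<utf8_bytes> and C<hex_bytes> are invented for this
example and are not part of Perl, which normally does this work for
you.

```perl
use strict;
use warnings;

# Encode one code point as a list of UTF-8 byte values, following the
# bit layout in the table above.  Surrogates and values above U+10FFFF
# are refused.  (Hypothetical helper, not a Perl built-in.)
sub utf8_bytes {
    my ($cp) = @_;
    die "not a Unicode scalar value\n"
        if $cp > 0x10FFFF or ($cp >= 0xD800 and $cp <= 0xDFFF);
    return ($cp) if $cp < 0x80;
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp < 0x10000;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Format a list of byte values as two-digit hexadecimal numbers.
sub hex_bytes { return join ' ', map { sprintf '%02X', $_ } @_ }

for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X => %s\n", $cp, hex_bytes(utf8_bytes($cp));
}
```

Because each range of code points is handled by exactly one branch,
the sketch can only ever produce the shortest form.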

=item *

UTF-EBCDIC

Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

=item *

UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)

The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.

UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case
uses I<surrogates>, the first 16-bit unit being the I<high surrogate>,
and the second being the I<low surrogate>.

Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units.  The I<high
surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
are the range C<U+DC00..U+DFFF>.  The surrogate encoding is

	$hi = int(($uni - 0x10000) / 0x400) + 0xD800;
	$lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

	$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
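
The two formulas can be checked with a small round trip.  This is a
sketch only; the subroutine names C<to_surrogates> and
C<from_surrogates> are made up for the example, and shifts and masks
are used in place of division and modulus to keep the arithmetic in
integers.

```perl
use strict;
use warnings;

# Split a supplementary code point into a (high, low) surrogate pair.
# (Hypothetical helper names, chosen for this illustration.)
sub to_surrogates {
    my ($uni) = @_;
    die "outside U+10000..U+10FFFF\n"
        if $uni < 0x10000 or $uni > 0x10FFFF;
    my $v = $uni - 0x10000;
    return (0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF));
}

# Combine a (high, low) surrogate pair back into one code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 => %04X %04X\n", $hi, $lo;
printf "and back => U+%X\n", from_surrogates($hi, $lo);
```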

If you try to generate surrogates (for example by using chr()), you
will get a warning if warnings are turned on, because those code
points are not valid for a Unicode character.

Because its units are 16 bits wide, UTF-16 is byte-order dependent.
UTF-16 itself can be used for in-memory computations, but if storage
or transfer is required, either the UTF-16BE (big-endian) or the
UTF-16LE (little-endian) encoding must be chosen.

This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness?  Byte Order Marks, or
BOMs, are a solution to this.  A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the BOM.

The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>.  (And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)

The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in
big-endian format".

=item *

UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.
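
The BOM signatures listed for UTF-16, UTF-8, and UTF-32 can be
recognized by looking at the first bytes of the data.  The following
is a rough sketch; C<bom_encoding> is a name invented for this
example.  It is only a heuristic: UTF-16LE data that starts with a BOM
followed by U+0000 would be reported as UTF-32LE.

```perl
use strict;
use warnings;

# Guess the encoding announced by a leading BOM; return nothing if the
# data does not start with one.  The UTF-32 signatures are tested first
# because the UTF-32LE BOM begins with the two bytes of the UTF-16LE BOM.
# (Hypothetical helper, not provided by Perl.)
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

for my $sample ("\xFF\xFEh\x00", "\xFE\xFF\x00h", "\xEF\xBB\xBFh", "plain") {
    my $guess = bom_encoding($sample);
    print defined $guess ? "$guess\n" : "no BOM\n";
}
```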

=item *

UCS-2, UCS-4

Encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates.  UCS-4 is a 32-bit encoding,
functionally identical to UTF-32.

=item *

UTF-7

A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe.  Defined by RFC 2152.

=back

=head2 Security Implications of Unicode

=over 4

=item *

Malformed UTF-8

Unfortunately, the specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character.  Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection.  Perl always generates the
shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which do not represent real Unicode characters.
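
When bytes arrive from outside the program, one way to refuse such
malformations is to decode them strictly before use.  The sketch below
assumes the C<Encode> module and its strict C<UTF-8> encoding name;
C<is_strict_utf8> is a name invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# Return 1 if $bytes decodes as strict UTF-8, 0 otherwise.  decode() may
# modify its input when a CHECK value is given, so a copy is decoded.
# (Hypothetical helper built on Encode::decode.)
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

my %samples = (
    'shortest form of U+00E9' => "\xC3\xA9",
    'overlong form of U+0000' => "\xC0\x80",
    'encoded surrogate'       => "\xED\xA0\x80",
);
for my $label (sort keys %samples) {
    printf "%-24s %s\n", $label,
        is_strict_utf8($samples{$label}) ? 'accepted' : 'rejected';
}
```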

=item *

Regular expressions behave slightly differently between byte data and
character (Unicode) data.  For example, the "word character" character
class C<\w> will work differently depending on whether the data is
eight-bit bytes or Unicode.

In the first case, the set of C<\w> characters is either small--the
default set of alphabetic characters, digits, and the "_"--or, if you
are using a locale (see L), the C<\w> might contain a few
more letters according to your language and country.

In the second case, the C<\w> set of characters is much, much larger.
Most importantly, even in the set of the first 256 characters, it will
probably match different characters: unlike most locales, which are
specific to a language and country pair, Unicode classifies all the
characters that are letters I<somewhere> as C<\w>.  For example, your
locale might not think that LATIN SMALL LETTER ETH is a letter (unless
you happen to speak Icelandic), but Unicode does.

As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen.  Characters shouldn't get
downgraded to bytes, either.  It is possible to accidentally mix bytes
and characters, however (see L), in which case C<\w> in
regular expressions might start behaving differently.  Review your
code.  Use warnings and the C pragma.
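
The difference can be observed with LATIN SMALL LETTER ETH itself.
This sketch assumes a perl where neither a locale, the
C<unicode_strings> feature, nor a regular expression modifier such as
C</u> or C</a> is in effect; under those assumptions the same
character matches C<\w> only once it is stored as character data.

```perl
use strict;
use warnings;

my $bytes = "\xF0";       # one octet; U+00F0 is LATIN SMALL LETTER ETH
my $chars = $bytes;
utf8::upgrade($chars);    # same character, now stored internally as UTF-8

printf "byte string:      %s\n", $bytes =~ /\w/ ? 'matches \w' : 'no match';
printf "character string: %s\n", $chars =~ /\w/ ? 'matches \w' : 'no match';
print  "the two strings still compare equal\n" if $bytes eq $chars;
```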

=back

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental.  On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed.  There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
the platform's "natural" 8-bit encoding of Unicode.  See L
for more discussion of the issues.

=head2 Locales

Usually locale settings and Unicode do not affect each other, but
there are a couple of exceptions:

=over 4

=item *

You can enable automatic UTF-8-ification of your standard file
handles, default C layer, and C<@ARGV> by using either
the C<-C> command line switch or the C environment
variable; see L for the documentation of the C<-C> switch.

=item *

Perl tries really hard to work both with Unicode and the old
byte-oriented world.  Most often this is nice, but sometimes Perl's
straddling of the proverbial fence causes problems.

=back

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like @ARGV which can be interpreted
as Unicode (UTF-8), there still are many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces.  For all of these interfaces Perl
currently (as of 5.8.3) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the C pragma has been used.

One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for qx and system: how well will the
'command line interface' (and which of them?) handle Unicode?

=over 4

=item *

chdir, chmod, chown, chroot, exec, link, lstat, mkdir, 
rename, rmdir, stat, symlink, truncate, unlink, utime, -X

=item *

%ENV

=item *

glob (aka the <*>)

=item *

open, opendir, sysopen

=item *

qx (aka the backtick operator), system

=item *

readdir, readlink

=back
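
Since these interfaces deal in bytes, one workable approach is to
encode and decode names explicitly at the boundary.  The sketch below
rests on an assumption that is not true everywhere: that the file
system stores file names as UTF-8 bytes and returns them unchanged.
It uses the C<Encode> and C<File::Temp> modules.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $name = "smiley-\x{263A}.txt";    # a character string

# Assumption: this file system stores names as UTF-8 bytes.
my $path = "$dir/" . encode('UTF-8', $name);
open my $fh, '>', $path or die "open: $!";
close $fh or die "close: $!";

# readdir returns bytes, so decode them back into characters.
opendir my $dh, $dir or die "opendir: $!";
my @names = map { decode('UTF-8', $_) } grep { !/\A\.\.?\z/ } readdir $dh;
closedir $dh;

print $names[0] eq $name ? "round trip ok\n" : "names differ\n";
```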

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L) there are
situations where you simply need to force a byte
string into UTF-8, or vice versa.  The low-level calls
utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are
the answers.

Note that utf8::downgrade() can fail if the string contains characters
that don't fit into a byte.
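
A short sketch of both calls follows.  It also uses utf8::is_utf8() to
inspect the internal state; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $s = "caf\xE9";     # four characters, all below 256
utf8::upgrade($s);     # force the UTF-8 internal form
print "upgraded\n" if utf8::is_utf8($s);

utf8::downgrade($s);   # every character fits in a byte, so this succeeds
print "downgraded\n" unless utf8::is_utf8($s);
print "still four characters\n" if length($s) == 4;

my $wide = "snowman \x{2603}";
# With a true second argument (FAIL_OK), failure returns false
# instead of dying.
print "cannot downgrade\n" unless utf8::downgrade($wide, 1);
```

The string compares equal before and after each call; only its
internal representation changes.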

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful.  See also L for an
explanation about Unicode at the XS level, and L for the API
details.

=over 4

=item *

C returns true if the C flag is on and the bytes
pragma is not in effect.  C returns true if the C
flag is on; the bytes pragma is ignored.  The C flag being on
does B<not> mean that there are any characters with code points greater
than 255 (or 127) in the scalar, or even that there are any characters
in the scalar at all.  What the C flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.  The C flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C writes a Unicode character code point into
a buffer, encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.  It works appropriately on EBCDIC machines.

=item *

C reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence.  It works appropriately on EBCDIC machines.

=item *

C returns the length of the UTF-8 encoded buffer
in characters.  C returns the length of the UTF-8 encoded
scalar.

=item *

C converts the string of the scalar to its UTF-8
encoded form.  C does the opposite, if
possible.  C is like sv_utf8_upgrade except that
it does not set the C flag.  C does the
opposite of C.  Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C
for that.  C is affected by the encoding pragma
but C is not (since the encoding pragma is
designed to be a one-way street).

=item *

C returns true if the pointer points to a valid UTF-8
character.

=item *

C returns true if C bytes of the buffer
are valid UTF-8.

=item *

C will return the number of bytes in the UTF-8 encoded
character in the buffer.  C will return the number of bytes
required to UTF-8-encode the Unicode character code point.  C
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C will return a pointer to a UTF-8 encoded buffer
that is C (positive or negative) Unicode characters displaced
from the UTF-8 buffer C.  Be careful not to overstep the buffer:
C will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

=begin original

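C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars.  By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=end original

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars.
By default they are useful only for debugging--they display B<all>
characters as hexadecimal code points--but with the flags
C<UNI_DISPLAY_ISPRINT>, C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ>
you can make the output more readable.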

=item *

=begin original

C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.

=end original

C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to compare
two strings case-insensitively in Unicode.
For case-sensitive comparisons you can just use C<memEQ()> and
C<memNE()> as usual.

=back
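
The functions in the list above belong to the C-level API. As a rough
illustration only, the following sketch uses the Perl-level C<utf8::>
functions, on the assumption that C<utf8::upgrade>, C<utf8::encode> and
C<utf8::decode> behave like C<sv_utf8_upgrade()>, C<sv_utf8_encode()>
and C<sv_utf8_decode()> respectively. The variable names are made up
for the example.

```perl
use strict;
use warnings;

# Upgrade: the value stays the same, only the internal storage changes.
my $upgraded = "\xE9";        # one character (e-acute), one byte internally
utf8::upgrade($upgraded);     # now stored as UTF-8, UTF8 flag on

# Encode: the value itself changes into UTF-8 octets, UTF8 flag stays off.
my $encoded = "\xE9";
utf8::encode($encoded);       # now the two octets C3 A9

# Decode: the octets are reinterpreted as characters again.
my $decoded = $encoded;
utf8::decode($decoded);

for my $pair ([upgraded => $upgraded],
              [encoded  => $encoded],
              [decoded  => $decoded]) {
    my ($label, $value) = @$pair;
    printf "%-8s length=%d flag=%d\n",
        $label, length($value), utf8::is_utf8($value) ? 1 : 0;
}
```

The point of the comparison is that upgrading never changes what the
string means, whereas encoding does, which is why neither should be
used as a general-purpose conversion interface.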

=begin original

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=end original

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head1 BUGS

=head2 Interaction with Locales

=begin original

Use of locales with Unicode data may lead to odd results.  Currently,
Perl attempts to attach 8-bit locale info to characters in the range
0..255, but this technique is demonstrably incorrect for locales that
use characters above that range when mapped into Unicode.  Perl's
Unicode support will also tend to run slower.  Use of locales with
Unicode is discouraged.

=end original

Using locales with Unicode data may lead to odd results.
Currently, Perl attempts to attach 8-bit locale information to
characters in the range 0..255, but this technique is demonstrably
incorrect for locales that use characters above that range when mapped
into Unicode.
Perl's Unicode support will also tend to run slower.
Using locales together with Unicode is discouraged.

=head2 Problems with characters whose ordinal numbers are in the range 128 - 255 with no Locale specified

=begin original

Without a locale specified, unlike all other characters or code points,
these characters have very different semantics in byte semantics versus
character semantics.
In character semantics they are interpreted as Unicode code points, which means
they are viewed as Latin-1 (ISO-8859-1).
In byte semantics, they are considered to be unassigned characters,
meaning that the only semantics they have is their
ordinal numbers, and that they are not members of various character classes.
None are considered to match C<\w> for example, but all match C<\W>.
Besides these class matches,
the known operations that this affects are those that change the case,
regular expression matching while ignoring case,
and B<quotemeta()>.
This can lead to unexpected results in which a string's semantics suddenly
change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa.
This behavior is scheduled to change in version 5.12, but in the meantime,
a workaround is to always call utf8::upgrade($string), or to use the
standard modules L or L.

=end original

Without a locale specified, and unlike all other characters or code
points, these characters have very different semantics under byte
semantics than under character semantics.
Under character semantics they are interpreted as Unicode code points,
which means they are viewed as Latin-1 (ISO-8859-1).
Under byte semantics they are treated as unassigned characters: the
only semantics they have is their ordinal number, and they are not
members of the various character classes.
None of them match C<\w>, for example, but all of them match C<\W>.
Besides these class matches, the operations known to be affected are
those that change case, regular expression matching that ignores case,
and B<quotemeta()>.
This can lead to unexpected results: a string's semantics suddenly
change from byte to character (or vice versa) when a code point above
255 is appended to it or removed from it.
This behavior is scheduled to change in version 5.12, but in the
meantime a workaround is to always call utf8::upgrade($string), or to
use the standard modules L or L.
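
The difference described above can be observed with a single character.
The following is a minimal sketch, assuming no locale is in effect and
that the C<unicode_strings> feature is not enabled; the variable names
are illustrative only.

```perl
use strict;
use warnings;

my $bytes = "\xE9";        # e-acute held as a plain byte string
my $chars = "\xE9";
utf8::upgrade($chars);     # same character, now under character semantics

# The two scalars compare equal, yet they match \w differently.
my $same_string  = ($bytes eq $chars) ? 1 : 0;
my $byte_is_word = ($bytes =~ /\w/)   ? 1 : 0;
my $char_is_word = ($chars =~ /\w/)   ? 1 : 0;

print "equal as strings:        $same_string\n";
print "byte string matches \\w:  $byte_is_word\n";
print "upgraded string matches: $char_is_word\n";
```

Calling utf8::upgrade() on every string before matching, as suggested
above, makes both cases behave like the upgraded one.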

=head2 Interaction with Extensions

=begin original

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.

=end original

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly.
If the extension does not know about the flag, it is likely to return
incorrectly flagged data.

=begin original

So if you're working with Unicode data, consult the documentation of
every module you're using if there are any issues with Unicode data
exchange. If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

=end original

So if you are working with Unicode data, check the documentation of
every module you use for any issues with Unicode data exchange.
If the documentation does not mention Unicode at all, suspect the worst
and consider reading the source to learn how the module is implemented.
Modules written completely in Perl should not cause problems.
Modules that directly or indirectly access code written in other
programming languages are at risk.

=begin original

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

=end original

For affected functions, the simple strategy for avoiding data
corruption is to always make the encoding of the exchanged data
explicit.
Choose an encoding that you know the extension can handle.
Convert arguments passed to the extension to that encoding, and convert
results back from that encoding.
Write wrapper functions that do the conversions for you, so that you
can change those functions later when the extension catches up.

=begin original

To provide an example, let's say the popular Foo::Bar::escape_html
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

=end original

As an example, suppose the popular Foo::Bar::escape_html function does
not deal with Unicode data yet.
The wrapper function would convert the argument to raw UTF-8 and
convert the result back to Perl's internal representation like so:

    sub my_escape_html ($) {
      my($what) = shift;
      return unless defined $what;
      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }
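
The wrapper can be tried out without the real extension. The sketch
below is only an illustration: C<bytes_only_escape_html> is a
hypothetical stand-in for Foo::Bar::escape_html, written here in plain
Perl so that the example is self-contained.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a byte-oriented extension function.
sub bytes_only_escape_html {
    my ($octets) = @_;
    $octets =~ s/&/&amp;/g;
    $octets =~ s/</&lt;/g;
    $octets =~ s/>/&gt;/g;
    return $octets;
}

# Same shape as the wrapper above: encode on the way in, decode on the way out.
sub my_escape_html {
    my ($what) = @_;
    return unless defined $what;
    my $octets = Encode::encode_utf8($what);
    return Encode::decode_utf8(bytes_only_escape_html($octets));
}

my $in  = "<b>\x{263A}</b>";       # contains WHITE SMILING FACE
my $out = my_escape_html($in);

binmode STDOUT, ":encoding(UTF-8)";
print "$out\n";
```

Because the conversion lives in one place, only the wrapper has to
change once the extension handles Unicode itself.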

=begin original

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be in a position to use the otherwise
dangerous Encode::_utf8_on() function. Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

=end original

Sometimes, when the extension does not convert data but just stores
and retrieves it, you will be in a position to use the otherwise
dangerous Encode::_utf8_on() function.
Suppose the popular C<Foo::Bar> extension, written in C, provides a
C<param> method that lets you store and retrieve data according to
these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

=begin original

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

=end original

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }
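
To see the derived class at work, a base class is needed. The sketch
below invents one: C<My::ByteStore> is a hypothetical pure-Perl
stand-in for the C extension, and it imitates "forgetting" the UTF8
flag by switching the flag off on everything it stores. The class
names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a C extension that keeps only raw bytes.
package My::ByteStore;

sub new {
    my ($class) = @_;
    return bless({ data => {} }, $class);
}

sub param {
    my ($self, $name, $value) = @_;
    Encode::_utf8_off($name);            # imitate losing the UTF8 flag
    if (defined $value) {
        Encode::_utf8_off($value);
        $self->{data}{$name} = $value;
        return $value;
    }
    return $self->{data}{$name};
}

# Derived class following the param method shown above.
package My::UnicodeStore;
our @ISA = ('My::ByteStore');

sub param {
    my ($self, $name, $value) = @_;
    utf8::upgrade($name);                # make sure it is UTF-8 encoded
    if (defined $value) {
        utf8::upgrade($value);           # make sure it is UTF-8 encoded
        return $self->SUPER::param($name, $value);
    }
    my $ret = $self->SUPER::param($name);
    Encode::_utf8_on($ret) if defined $ret;   # we stored UTF-8, so this is safe
    return $ret;
}

package main;

my $store = My::UnicodeStore->new;
$store->param("gr\xFC\xDF", "\x{263A} hi");
my $got = $store->param("gr\xFC\xDF");
printf "retrieved %d characters\n", length($got);
```

Switching the flag back on is only safe here because every value went
through utf8::upgrade() before it was stored.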

=begin original

Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family. Look out for such filters in
the documentation of your extensions, they can make the transition to
Unicode data much easier.

=end original

Some extensions provide filters on data entry and exit points, such as
DB_File::filter_store_key and family.
Look out for such filters in the documentation of your extensions.
They can make the transition to Unicode data much easier.

=head2 Speed

=begin original

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings.  All functions that need to hop over
characters such as length(), substr() or index(), or matching regular
expressions can work B<much> faster when the underlying data are
byte-encoded.

=end original

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings.
All functions that need to hop over characters, such as length(),
substr() or index(), as well as regular expression matching, can work
B<much> faster when the underlying data are byte-encoded.

=begin original

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<d>).

=end original

In Perl 5.8.0 the slowness was often quite spectacular.
Perl 5.8.1 introduced a caching scheme that will hopefully make the
slowness somewhat less spectacular, at least for some operations.
In general, operations on UTF-8 encoded strings are still slower.
As an example, Unicode properties (character classes) like C<\p{Nd}>
are known to be quite a bit slower (5-20 times) than their simpler
counterparts like C<\d> (then again, there are 268 Unicode characters
matching C<Nd> compared with the 10 ASCII characters matching C<d>).

=head2 Possible problems on EBCDIC platforms

=begin original

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.

=end original

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by decoding the byte strings as
I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.

=begin original

If you find any of these, please report them as bugs.

=end original

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

=begin original

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=end original

Perl 5.8 has a different Unicode model from 5.6.
In 5.6 the programmer was required to use the C<utf8> pragma to declare
that a given scope expected to deal with Unicode data, and had to make
sure that only Unicode data reached that scope.
If you have code that works with 5.6, you will need some of the
following adjustments.
The examples are written so that the code continues to work under 5.6,
so it should be safe to try them out.

=over 4

=item *

=begin original

A filehandle that should read or write UTF-8

=end original

A filehandle that should read or write UTF-8

  if ($] > 5.007) {
    binmode $fh, ":encoding(utf8)";
  }

=item *

=begin original

A scalar that is going to be passed to some extension

=end original

A scalar that is going to be passed to some extension

=begin original

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

=end original

Be it Compress::Zlib, Apache::Request or any other extension whose
manpage makes no mention of Unicode, you need to make sure that the
UTF8 flag is stripped off.
Note that at the time of this writing (October 2002) the modules
mentioned are not UTF-8-aware.
Please check their documentation to verify whether this is still true.

  if ($] > 5.007) {
    require Encode;
    $val = Encode::encode_utf8($val); # make octets
  }

=item *

=begin original

A scalar we got back from an extension

=end original

A scalar we got back from an extension

=begin original

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

=end original

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

  if ($] > 5.007) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

=item *

=begin original

Same thing, if you are really sure it is UTF-8

=end original

Same thing, if you are really sure it is UTF-8

  if ($] > 5.007) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

=begin original

A wrapper for fetchrow_array and fetchrow_hashref

=end original

A wrapper for fetchrow_array and fetchrow_hashref

=begin original

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data. Please check the documentation to verify if
that is still true.

=end original

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls.
A wrapper function will also make it easier to adapt to future
enhancements in your database driver.
Note that at the time of this writing (October 2002), the DBI has no
standardized way to deal with UTF-8 data.
Please check the documentation to verify whether that is still true.

  sub fetchrow {
    my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
    if ($] < 5.007) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }

=item *

=begin original

A large scalar that you know can only contain ASCII

=end original

A large scalar that you know can only contain ASCII

=begin original

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
the UTF8 flag:

=end original

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag on your program.
If you recognize such a situation, just remove the UTF8 flag:

  utf8::downgrade($val) if $] > 5.007;

=back
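
The conversions used in the items above can be observed in isolation.
The following is a minimal sketch with no extension involved; the
sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Before calling a byte-only extension: characters become octets.
my $text   = "caf\x{E9} \x{263A}";           # 6 characters
my $octets = Encode::encode_utf8($text);     # 9 octets, UTF8 flag off

# After the extension returns UTF-8 octets: octets become characters.
my $back = Encode::decode_utf8($octets);

# An ASCII-only scalar that happens to carry the UTF8 flag.
my $ascii = "plain ascii";
utf8::upgrade($ascii);                       # simulate the unwanted flag
my $was_flagged = utf8::is_utf8($ascii) ? 1 : 0;
utf8::downgrade($ascii);                     # drop the flag again
my $now_flagged = utf8::is_utf8($ascii) ? 1 : 0;

printf "characters=%d octets=%d\n", length($text), length($octets);
printf "flag before downgrade=%d after=%d\n", $was_flagged, $now_flagged;
```

Note that encode_utf8() and decode_utf8() change the value of the
scalar, whereas utf8::downgrade() only changes how an unchanged value
is stored.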

=head1 SEE ALSO

L, L, L, L, L, L,
L, L

=begin meta

Translate: KIMURA Koichi (-5.8.5)
Update: SHIRAKATA Kentaro  (5.10.0-)
Status: completed

=end meta

=cut