

         perlunicode - Unicode support in Perl


         Important Caveat
         WARNING: The implementation of Unicode support in Perl is
         incomplete.  The following areas need further work.
         Input and Output Disciplines
             There is currently no easy way to mark data read from a
             file or other external source as being utf8.  This will
             be one of the major areas of focus in the near future.
         Regular Expressions
             The existing regular expression compiler does not
             produce polymorphic opcodes.  This means that the
             determination on whether to match Unicode characters is
             made when the pattern is compiled, based on whether the
             pattern contains Unicode characters, and not when the
             matching happens at run time.  This needs to be changed
             to adaptively match Unicode if the string to be matched
             is Unicode.
         `use utf8' still needed to enable a few features
             The `utf8' pragma implements the tables used for Unicode
             support.  These tables are automatically loaded on
             demand, so the `utf8' pragma need not normally be used.
             However, as a compatibility measure, this pragma must be
             explicitly used to enable recognition of UTF-8 encoded
             literals and identifiers in the source text.
         Byte and Character semantics
         Beginning with version 5.6, Perl uses logically wide
         characters to represent strings internally.  This internal
         representation of strings uses the UTF-8 encoding.
         In future, Perl-level operations can be expected to work
         with characters rather than bytes, in general.
         However, as strictly an interim compatibility measure, Perl
         v5.6 aims to provide a safe migration path from byte
         semantics to character semantics for programs.  For
         operations where Perl can unambiguously decide that the
         input data is characters, Perl now switches to character
         semantics.  For operations where this determination cannot
         be made without additional information from the user, Perl
         decides in favor of compatibility, and chooses to use byte
         semantics.
         This behavior preserves compatibility with earlier versions
         of Perl, which allowed byte semantics in Perl operations,
         but only as long as none of the program's inputs are marked
         as being a source of Unicode character data.  Such data may
         come from filehandles, from calls to external programs, from
         information provided by the system (such as %ENV), or from
         literals and constants in the source text.
         If the `-C' command line switch is used (or the
         ${^WIDE_SYSTEM_CALLS} global flag is set to `1'), all system
         calls will use the corresponding wide character APIs.  This
         is currently only implemented on Windows.
         Regardless of the above, the `bytes' pragma can always be
         used to force byte semantics in a particular lexical scope.
         See the bytes manpage.
         The `utf8' pragma is primarily a compatibility device that
         enables recognition of UTF-8 in literals encountered by the
         parser.  It may also be used for enabling some of the more
         experimental Unicode support features.  Note that this
         pragma is only required until a future version of Perl in
         which character semantics will become the default.  This
         pragma may then become a no-op.  See the utf8 manpage.
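
         As a rough illustration of what recognizing UTF-8 in literals
         means, the sketch below compares the length of the same
         literal with and without the `utf8' pragma.  It assumes the
         source file is saved as UTF-8 and that a later perl (5.8 or
         newer) is used; the variable names are only illustrative.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);
{
    use utf8;                        # literals are decoded as UTF-8
    $with_pragma = length("é");      # one character, U+00E9
}
{
    no utf8;                         # literals are taken as raw bytes
    $without_pragma = length("é");   # two bytes, 0xC3 0xA9
}
print "with utf8: $with_pragma, without utf8: $without_pragma\n";
```

         Under those assumptions the first length is 1 and the second
         is 2, because the pragma only changes how the parser reads
         the literal.
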
         Unless mentioned otherwise, Perl operators will use
         character semantics when they are dealing with Unicode data,
         and byte semantics otherwise.  Thus, character semantics for
         these operations apply transparently; if the input data came
         from a Unicode source (for example, by adding a character
         encoding discipline to the filehandle whence it came, or a
         literal UTF-8 string constant in the program), character
         semantics apply; otherwise, byte semantics are in effect.
         To force byte semantics on Unicode data, the `bytes' pragma
         should be used.
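
         The effect of the `bytes' pragma on a single operation can be
         sketched as follows.  The string and variable names are
         chosen for illustration only, and the example assumes a perl
         where `\x{...}' literals above 255 are stored as UTF-8
         internally, as described above.

```perl
use strict;
use warnings;

# One character above 255: WHITE SMILING FACE (U+263A).
my $smiley = "\x{263A}";

# Character semantics: the string is one character long.
my $chars = length($smiley);

my $bytes;
{
    # Inside this lexical scope, length() counts the bytes of the
    # internal UTF-8 representation instead.
    use bytes;
    $bytes = length($smiley);
}

print "characters: $chars, bytes: $bytes\n";   # characters: 1, bytes: 3
```

         The pragma is lexically scoped, so code outside the inner
         block keeps character semantics.
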
         Under character semantics, many operations that formerly
         operated on bytes change to operating on characters.  For
         ASCII data this makes no difference, because UTF-8 stores
         ASCII in single bytes, but for any character greater than
         `chr(127)', the character may be stored in a sequence of two
         or more bytes, all of which have the high bit set.  But by
         and large, the user need not worry about this, because Perl
         hides it from the user.  A character in Perl is logically
         just a number ranging from 0 to 2**32 or so.  Larger
         characters encode to longer sequences of bytes internally,
         but again, this is just an internal detail which is hidden
         at the Perl level.
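
         The relationship between a character's number and the length
         of its UTF-8 encoding can be checked with a small sketch.
         The helper `utf8_byte_count' is hypothetical, and
         `utf8::encode' is assumed to be available (it is built in
         from perl 5.8 onward).

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes in the UTF-8 form of a code point.
sub utf8_byte_count {
    my ($code) = @_;
    my $octets = chr($code);   # one logical character
    utf8::encode($octets);     # replace it with its UTF-8 bytes
    return length($octets);
}

# ASCII, Latin-1, and a character above 255.
for my $code (0x41, 0xE9, 0x263A) {
    printf "U+%04X: ord=%d, characters=%d, UTF-8 bytes=%d\n",
        $code, ord(chr($code)), length(chr($code)), utf8_byte_count($code);
}
```

         Each value is a single character at the Perl level, while the
         encoded forms take 1, 2, and 3 bytes respectively.
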
         Effects of character semantics
         Character semantics have the following effects:
         o   Strings and patterns may contain characters that have an
             ordinal value larger than 255.
             Presuming you use a Unicode editor to edit your program,
             such characters will typically occur directly within the
             literal strings as UTF-8 characters, but you can also
             specify a particular character with an extension of the
             `\x' notation.  UTF-8 characters are specified by
             putting the hexadecimal code within curlies after the
             `\x'.  For instance, a Unicode smiley face is
             `\x{263A}'.  A character in the Latin-1 range (128..255)
             should be written `\x{ab}' rather than `\xab', since the
             former will turn into a two-byte UTF-8 code, while the
             latter will continue to be interpreted as generating an
             8-bit byte rather than a character.  In fact, if the
             `use warnings' pragma or the `-w' switch is turned on,
             it will produce a warning that you might be generating
             invalid UTF-8.
         o   Identifiers within the Perl script may contain Unicode
             alphanumeric characters, including ideographs.  (You are
             currently on your own when it comes to using the
             canonical forms of characters--Perl doesn't (yet)
             attempt to canonicalize variable names for you.)
         o   Regular expressions match characters instead of bytes.
             For instance, "." matches a character instead of a byte.
             (However, the `\C' pattern is provided to force a match
             of a single byte ("`char'" in C, hence `\C').)
         o   Character classes in regular expressions match
             characters instead of bytes, and match against the
             character properties specified in the Unicode properties
             database.  So `\w' can be used to match an ideograph,
             for instance.
         o   Named Unicode properties and block ranges may be used
             as character classes via the new `\p{}' (matches
             property) and `\P{}' (doesn't match property)
             constructs.  For instance, `\p{Lu}' matches any
             character with the Unicode uppercase property, while
             `\p{M}' matches any mark character.  Single letter
             properties may omit the brackets, so that can be written
             `\pM' also.  Many predefined character classes are
             available, such as `\p{IsMirrored}' and
         o   The special pattern `\X' matches any extended
             Unicode sequence (a "combining character sequence" in
             Standardese), where the first character is a base
             character and subsequent characters are mark characters
             that apply to the base character.  It is equivalent to
         o   The `tr///' operator translates characters instead of
             bytes.  It can also be forced to translate between 8-bit
             codes and UTF-8.  For instance, if you know your input
             is in Latin-1, you can say:
                 while (<>) {
                     tr/\0-\xff//CU;         # latin1 char to utf8
                     ...
                 }
             Similarly you could translate your output with
                 tr/\0-\x{ff}//UC;           # utf8 to latin1 char
             No, `s///' doesn't take /U or /C (yet?).
         o   Case translation operators use the Unicode case
             translation tables when provided character input.  Note
             that `uc()' translates to uppercase, while `ucfirst'
             translates to titlecase (for languages that make the
             distinction).  Naturally the corresponding backslash
             sequences have the same semantics.
         o   Most operators that deal with positions or lengths in
             the string will automatically switch to using character
             positions, including `chop()', `substr()', `pos()',
             `index()', `rindex()', `sprintf()', `write()', and
             `length()'.  Operators that specifically don't switch
             include `vec()', `pack()', and `unpack()'.  Operators
             that really don't care include `chomp()', as well as any
             other operator that treats a string as a bucket of bits,
             such as `sort()', and the operators dealing with
         o   The `pack()'/`unpack()' letters "`c'" and "`C'" do not
             change, since they're often used for byte-oriented
             formats.  (Again, think "`char'" in the C language.)
             However, there is a new "`U'" specifier that will
             convert between UTF-8 characters and integers.  (It
             works outside of the utf8 pragma too.)
         o   The `chr()' and `ord()' functions work on characters.
             This is like `pack("U")' and `unpack("U")', not like
             `pack("C")' and `unpack("C")'.  In fact, the latter are
             how you now emulate byte-oriented `chr()' and `ord()'
             under utf8.
         o   And finally, `scalar reverse()' reverses by character
             rather than by byte.
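
         Several of the effects listed above can be observed together
         in a short sketch.  The sample strings and variable names are
         illustrative, and the example assumes a later perl (5.8 or
         newer) in which these operators behave as described.

```perl
use strict;
use warnings;

# "e" + COMBINING ACUTE ACCENT (U+0301) + WHITE SMILING FACE (U+263A)
my $str = "e\x{301}\x{263A}";

my $length   = length($str);        # counts characters, not bytes
my @singles  = $str =~ /(.)/g;      # "." matches one character each
my @clusters = $str =~ /(\X)/g;     # base character plus its marks
my @marks    = $str =~ /(\pM)/g;    # mark characters only

# U+01C6 (small letter dz with caron) has distinct upper and title cases.
my $upper = sprintf "U+%04X", ord(uc("\x{1C6}"));
my $title = sprintf "U+%04X", ord(ucfirst("\x{1C6}"));

# Reversal works on whole characters, so the smiley stays intact.
my $reversed    = scalar reverse("a\x{263A}b");
my $reversed_ok = $reversed eq "b\x{263A}a" ? 1 : 0;

# pack("U") and unpack("U*") convert between integers and characters.
my $packed_ok = pack("U", 0x263A) eq chr(0x263A) ? 1 : 0;
my @codes     = unpack("U*", "\x{263A}A");

printf "length=%d singles=%d clusters=%d marks=%d\n",
    $length, scalar @singles, scalar @clusters, scalar @marks;
print "uc=$upper ucfirst=$title\n";
printf "reverse ok=%d, pack ok=%d, codes=%s\n",
    $reversed_ok, $packed_ok,
    join(",", map { sprintf "U+%04X", $_ } @codes);
```

         Here the string has three characters but only two combining
         character sequences, since the accent attaches to the
         preceding "e".
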
         Character encodings for input and output
         [XXX: This feature is not yet implemented.]


         As of yet, there is no method for automatically coercing
         input and output to some encoding other than UTF-8.  This is
         planned in the near future, however.
         Whether an arbitrary piece of data will be treated as
         "characters" or "bytes" by internal operations cannot be
         divined at the current time.
         Use of locales with utf8 may lead to odd results.  Currently
         there is some attempt to apply 8-bit locale info to
         characters in the range 0..255, but this is demonstrably
         incorrect for locales that use characters above that range
         (when mapped into Unicode).  It will also tend to run
         slower.  Avoidance of locales is strongly encouraged.


         See also the bytes manpage, the utf8 manpage, and the section
         on "${^WIDE_SYSTEM_CALLS}" in the perlvar manpage.
