NAME¶

perlreguts - Description of the Perl regular expression engine.

DESCRIPTION¶

This document is an attempt to shine some light on the guts of the regex engine and how it works. The regex engine represents a significant chunk of the perl codebase, but is relatively poorly understood. This document is a meagre attempt at addressing this situation. It is derived from the author's experience, comments in the source code, other papers on the regex engine, feedback on the perl5-porters mail list, and no doubt other places as well.

NOTICE! It should be clearly understood that the behavior and structures discussed in this document represent the state of the engine as the author understood it at the time of writing. It is NOT an API definition; it is purely an internals guide for those who want to hack the regex engine, or understand how the regex engine works. Readers of this document are expected to understand perl's regex syntax and its usage in detail. If you want to learn about the basics of Perl's regular expressions, see perlre. And if you want to replace the regex engine with your own, see perlreapi.

OVERVIEW¶

A quick note on terms¶

There is some debate as to whether to say "regexp" or "regex". In this document we will use the term "regex" unless there is a special reason not to, in which case we will explain why.

When speaking about regexes we need to distinguish between their source code form and their internal form. In this document we will use the term "pattern" when we speak of their textual, source code form, and the term "program" when we speak of their internal representation. These correspond to the terms S-regex and B-regex that Mark Jason Dominus employs in his paper on "Rx" ([1] in "REFERENCES").

What is a regular expression engine?¶

A regular expression engine is a program that takes a set of constraints specified in a mini-language, and then applies those constraints to a target string, and determines whether or not the string satisfies the constraints. See perlre for a full definition of the language.

In less grandiose terms, the first part of the job is to turn a pattern into something the computer can efficiently use to find the matching point in the string, and the second part is performing the search itself.

To do this we need to produce a program by parsing the text. We then need to execute the program to find the point in the string that matches. And we need to do the whole thing efficiently.

Structure of a Regexp Program¶

High Level¶

Although it is a bit confusing and some people object to the terminology, it is worth taking a look at a comment that has been in regexp.h for years:

This is essentially a linear encoding of a nondeterministic finite-state machine (aka syntax charts or "railroad normal form" in parsing technology).

The term "railroad normal form" is a bit esoteric, with "syntax diagram/charts", or "railroad diagram/charts" being more common terms. Nevertheless it provides a useful mental image of a regex program: each node can be thought of as a unit of track, with a single entry and in most cases a single exit point (there are pieces of track that fork, but statistically not many), and the whole forms a layout with a single entry and single exit point. The matching process can be thought of as a car that moves along the track, with the particular route through the system being determined by the character read at each possible connector point. A car can fall off the track at any point but it may only proceed as long as it matches the track.

Thus the pattern /foo(?:\w+|\d+|\s+)bar/ can be thought of as the following chart:

                      [start]
                         |
                       <foo>
                         |
                   +-----+-----+
                   |     |     |
                 <\w+> <\d+> <\s+>
                   |     |     |
                   +-----+-----+
                         |
                       <bar>
                         |
                       [end]

The truth of the matter is that perl's regular expressions these days are much more complex than this kind of structure, but visualising it this way can help when trying to get your bearings, and it matches the current implementation pretty closely.

To be more precise, we will say that a regex program is an encoding of a graph. Each node in the graph corresponds to part of the original regex pattern, such as a literal string or a branch, and has a pointer to the nodes representing the next component to be matched. Since "node" and "opcode" already have other meanings in the perl source, we will call the nodes in a regex program "regops".

The program is represented by an array of regnode structures, one or more of which represent a single regop of the program. Struct regnode is the smallest struct needed, and has a field structure which is shared with all the other larger structures.

The "next" pointers of all regops except BRANCH implement concatenation; a "next" pointer with a BRANCH on both ends of it is connecting two alternatives. [Here we have one of the subtle syntax dependencies: an individual BRANCH (as opposed to a collection of them) is never concatenated with anything because of operator precedence.]

The operand of some types of regop is a literal string; for others, it is a regop leading into a sub-program. In particular, the operand of a BRANCH node is the first regop of the branch.

NOTE: As the railroad metaphor suggests, this is not a tree structure: the tail of the branch connects to the thing following the set of BRANCHes. It is like a single line of railway track that splits as it goes into a station or railway yard and rejoins as it comes out the other side.

Regops¶

The base structure of a regop is defined in regexp.h as follows:

    struct regnode {
        U8  flags;    /* Various purposes, sometimes overridden */
        U8  type;     /* Opcode value as specified by regnodes.h */
        U16 next_off; /* Offset in size regnode */
    };

Other larger regnode-like structures are defined in regcomp.h. They are almost like subclasses in that they have the same fields as regnode, with possibly additional fields following in the structure, and in some cases the specific meaning (and name) of some of base fields are overridden. The following is a more complete description.

regnode_1

regnode_2

regnode_1 structures have the same header, followed by a single four-byte argument; regnode_2 structures contain two two-byte arguments instead:

    regnode_1                U32 arg1;
    regnode_2                U16 arg1;  U16 arg2;
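
The layouts described so far can be sketched in plain C. This is only an illustration: the sk_ names are hypothetical, and the fixed-width types from stdint.h stand in for Perl's U8, U16 and U32. On common ABIs the base structure occupies four bytes and each of the two larger forms occupies exactly two base units.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct regnode: a 4-byte header. */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
} sk_regnode;

/* Header followed by one four-byte argument (like regnode_1). */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
    uint32_t arg1;
} sk_regnode_1;

/* Header followed by two two-byte arguments (like regnode_2). */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
    uint16_t arg1;
    uint16_t arg2;
} sk_regnode_2;

/* Size of a structure measured in base-regnode units. */
#define SK_NODE_UNITS(T) (sizeof(T) / sizeof(sk_regnode))

/* Checks the layout assumptions stated above; aborts if they do not hold. */
static void sk_check_layout(void) {
    assert(sizeof(sk_regnode) == 4);
    assert(SK_NODE_UNITS(sk_regnode_1) == 2);
    assert(SK_NODE_UNITS(sk_regnode_2) == 2);
}
```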

regnode_string

regnode_string structures, used for literal strings, follow the header with a one-byte length and then the string data. Strings are padded on the end with zero bytes so that the total length of the node is a multiple of four bytes:

    regnode_string           char string[1];
                             U8 str_len; /* overrides flags */

regnode_charclass

Character classes are represented by regnode_charclass structures, which have a four-byte argument and then a 32-byte (256-bit) bitmap indicating which characters are included in the class.

    regnode_charclass        U32 arg1;
                             char bitmap[ANYOF_BITMAP_SIZE];
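
A 256-bit bitmap of this kind can be queried with a shift and a mask. The sketch below is an assumption-laden illustration, not the engine's code: the sk_ names are hypothetical, and the choice of which bit within a byte represents which character is made here for simplicity.

```c
#include <assert.h>
#include <string.h>

#define SK_BITMAP_SIZE 32   /* bytes: 256 bits, one per character code */

typedef struct {
    unsigned char bitmap[SK_BITMAP_SIZE];
} sk_charclass;

/* Empty the class. */
static void sk_class_clear(sk_charclass *cc) {
    memset(cc->bitmap, 0, sizeof cc->bitmap);
}

/* Add character c: byte index is c / 8, bit index is c % 8. */
static void sk_class_add(sk_charclass *cc, unsigned char c) {
    cc->bitmap[c >> 3] |= (unsigned char)(1u << (c & 7));
}

/* Return 1 if c is in the class, 0 otherwise. */
static int sk_class_has(const sk_charclass *cc, unsigned char c) {
    return (cc->bitmap[c >> 3] >> (c & 7)) & 1;
}
```

Building the class [Rr] then amounts to clearing the bitmap and adding the two characters; membership tests cost one array read regardless of how many characters the class contains.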

regnode_charclass_class

There is also a larger form of a char class structure used to represent POSIX char classes called regnode_charclass_class which has an additional 4-byte (32-bit) bitmap indicating which POSIX char classes have been included.

    regnode_charclass_class  U32 arg1;
                             char bitmap[ANYOF_BITMAP_SIZE];
                             char classflags[ANYOF_CLASSBITMAP_SIZE];

regnodes.h defines an array called regarglen[] which gives the size of each opcode in units of size regnode (4-byte). A macro is used to calculate the size of an EXACT node based on its str_len field.
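
The size calculation for a string node can be sketched as follows. This assumes, as the struct comment above indicates, that the length byte lives in the header's flags slot, so a node is one header unit plus the string data rounded up to whole 4-byte units. The function names are hypothetical and this is not the engine's actual macro.

```c
#include <assert.h>
#include <stddef.h>

#define SK_REGNODE_BYTES 4u

/* Units needed for the string data alone, rounded up to whole regnodes. */
static size_t sk_str_units(size_t str_len) {
    return (str_len + SK_REGNODE_BYTES - 1) / SK_REGNODE_BYTES;
}

/* Whole node: one header unit (holding str_len) plus the padded string. */
static size_t sk_exact_units(size_t str_len) {
    assert(str_len <= 255);   /* the length must fit in a single byte */
    return 1 + sk_str_units(str_len);
}
```

Under these assumptions a three-character string such as "foo" needs two units, which agrees with the program dumps shown later in this document, where an EXACT <foo> at position 1 is followed by a regop at position 3.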

The regops are defined in regnodes.h which is generated from regcomp.sym by regcomp.pl. Currently the maximum possible number of distinct regops is restricted to 256, with about a quarter already used.

A set of macros makes accessing the fields easier and more consistent. These include OP(), which is used to determine the type of a regnode-like structure; NEXT_OFF(), which is the offset to the next node (more on this later); ARG(), ARG1(), ARG2(), ARG_SET(), and equivalents for reading and setting the arguments; and STR_LEN(), STRING() and OPERAND() for manipulating strings and regop bearing types.

What regop is next?¶

There are three distinct concepts of "next" in the regex engine, and it is important to keep them clear.

There is the "next regnode" from a given regnode, a value which is rarely useful except that sometimes it matches up in terms of value with one of the others, and that sometimes the code assumes this to always be so.

There is the "next regop" from a given regop/regnode. This is the regop physically located after the current one, as determined by the size of the current regop. This is often useful: when dumping the structure, for example, we traverse it in this order. Sometimes the code assumes that the "next regnode" is the same as the "next regop", or in other words assumes that the size of a given regop type is always going to be one regnode.

There is the "regnext" from a given regop. This is the regop which is reached by jumping forward by the value of NEXT_OFF(), or in a few cases for longer jumps by the arg1 field of the regnode_1 structure. The subroutine regnext() handles this transparently. This is the logical successor of the node, which in some cases, like that of the BRANCH regop, has special meaning.
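
The "regnext" lookup can be sketched as below. This is a simplified model and not the engine's regnext(): every node here has the same size, the off_by_arg flag is a hypothetical way of marking nodes whose jump is stored in arg1, and a zero offset is treated as "no successor", as the (0) entries in the program dumps later in this document suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified node: uniform size, unlike the real variable-sized regops. */
typedef struct {
    uint8_t  type;
    uint8_t  off_by_arg;  /* hypothetical: nonzero if the jump lives in arg1 */
    uint16_t next_off;    /* short jump, in node units */
    uint32_t arg1;        /* long jump, in node units */
} sk_node;

/* Logical successor of p, or NULL if the stored offset is zero. */
static const sk_node *sk_regnext(const sk_node *p) {
    uint32_t offset;
    assert(p != NULL);
    offset = p->off_by_arg ? p->arg1 : p->next_off;
    return offset ? p + offset : NULL;
}
```

The point of routing every lookup through one function is that callers never need to know which of the two fields holds the jump.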

Process Overview¶

Broadly speaking, performing a match of a string against a pattern involves the following steps:

A. Compilation

1. Parsing for size
2. Parsing for construction
3. Peep-hole optimisation and analysis

B. Execution

4. Start position and no-match optimisations
5. Program execution

Where these steps occur in the actual execution of a perl program is determined by whether the pattern involves interpolating any string variables. If interpolation occurs, then compilation happens at run time. If it does not, then compilation is performed at compile time. (The /o modifier changes this, as does qr// to a certain extent.) The engine doesn't really care that much.

Compilation¶

This code resides primarily in regcomp.c, along with the header files regcomp.h, regexp.h and regnodes.h.

Compilation starts with pregcomp(), which is mostly an initialisation wrapper which farms work out to two other routines for the heavy lifting: the first is reg(), which is the start point for parsing; the second, study_chunk(), is responsible for optimisation.

Initialisation in pregcomp() mostly involves the creation and data-filling of a special structure, RExC_state_t (defined in regcomp.c). Almost all internally-used routines in regcomp.h take a pointer to one of these structures as their first argument, with the name pRExC_state. This structure is used to store the compilation state and contains many fields. Likewise there are many macros which operate on this variable: anything that looks like RExC_xxxx is a macro that operates on this pointer/structure.

Parsing for size¶

In this pass the input pattern is parsed in order to calculate how much space is needed for each regop we would need to emit. The size is also used to determine whether long jumps will be required in the program.

This stage is controlled by the macro SIZE_ONLY being set.

The parse proceeds pretty much exactly as it does during the construction phase, except that most routines are short-circuited to change the size field RExC_size and not do anything else.

Parsing for construction¶

Once the size of the program has been determined, the pattern is parsed again, but this time for real. Now SIZE_ONLY will be false, and the actual construction can occur.
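
The two-pass idea can be illustrated with a toy compiler that emits one 32-bit word per pattern character. Everything here is hypothetical (the sk_ names, the state structure, the one-word-per-character "program"); the only point is that the same parse routine runs twice, first counting and then writing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int       size_only;  /* stand-in for the SIZE_ONLY condition */
    size_t    size;       /* stand-in for RExC_size, in 4-byte units */
    uint32_t *emit;       /* next free slot, used only in pass 2 */
} sk_state;

/* Emit one node: count it in pass 1, store it in pass 2. */
static void sk_emit(sk_state *st, uint32_t node) {
    if (st->size_only) {
        st->size++;
        return;
    }
    *st->emit++ = node;
}

/* The "parser": identical code path for both passes. */
static void sk_parse(sk_state *st, const char *pat) {
    for (; *pat; pat++)
        sk_emit(st, (uint32_t)(unsigned char)*pat);
    sk_emit(st, 0);   /* toy END marker */
}

/* Returns a malloc'd program of *out_len words, or NULL on failure. */
static uint32_t *sk_compile(const char *pat, size_t *out_len) {
    sk_state st = {1, 0, NULL};
    uint32_t *prog;
    assert(pat != NULL && out_len != NULL);
    sk_parse(&st, pat);                      /* pass 1: size only */
    prog = malloc(st.size * sizeof *prog);
    if (!prog)
        return NULL;
    st.size_only = 0;
    st.emit = prog;
    sk_parse(&st, pat);                      /* pass 2: construction */
    *out_len = st.size;
    return prog;
}
```

Because both passes share the parse code, the size computed in the first pass cannot drift from what the second pass writes, which is what makes a single up-front allocation safe.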

reg() is the start of the parse process. It is responsible for parsing an arbitrary chunk of pattern up to either the end of the string, or the first closing parenthesis it encounters in the pattern. This means it can be used to parse the top-level regex, or any section inside of a grouping parenthesis. It also handles the "special parens" that perl's regexes have. For instance when parsing /x(?:foo)y/ reg() will at one point be called to parse from the "?" symbol up to and including the ")".

Additionally, reg() is responsible for parsing the one or more branches from the pattern, and for "finishing them off" by correctly setting their next pointers. In order to do the parsing, it repeatedly calls out to regbranch(), which is responsible for handling up to the first | symbol it sees.

regbranch() in turn calls regpiece() which handles "things" followed by a quantifier. In order to parse the "things", regatom() is called. This is the lowest level routine, which parses out constant strings, character classes, and the various special symbols like $. If regatom() encounters a "(" character it in turn calls reg().

The routine regtail() is called by both reg() and regbranch() in order to "set the tail pointer" correctly. During execution, when we reach the end of a branch, we need to go to the node following the grouping parens. When parsing, however, we don't know where the end will be until we get there, so once we reach it we must go back and update the offsets as appropriate. regtail() is used to make this easier.

A subtlety of the parsing process means that a regex like /foo/ is originally parsed into an alternation with a single branch. It is only afterwards that the optimiser converts single branch alternations into the simpler form.

Parse Call Graph and a Grammar¶

The call graph looks like this:

    reg()                        # parse a top level regex, or inside of parens
        regbranch()              # parse a single branch of an alternation
            regpiece()           # parse a pattern followed by a quantifier
                regatom()        # parse a simple pattern
                    regclass()   #   used to handle a class
                    reg()        #   used to handle a parenthesised subpattern
                    ....
            ...
            regtail()            # finish off the branch
        ...
        regtail()                # finish off the branch sequence. Tie each
                                 # branch's tail to the tail of the sequence
                                 # (NEW) In Debug mode this is
                                 # regtail_study().

A grammar form might be something like this:

    atom  : constant | class
    quant : '*' | '+' | '?' | '{min,max}'
    _branch: piece
           | piece _branch
           | nothing
    branch: _branch
          | _branch '|' branch
    group : '(' branch ')'
    _piece: atom | group
    piece : _piece
          | _piece quant
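
The grammar maps naturally onto a recursive-descent recognizer whose functions mirror the call graph above. The sketch below is a deliberately reduced version: constants are single alphanumeric characters, a class is anything between [ and ], the {min,max} quantifier and the special parens are left out, and all sk_ names are hypothetical. It only accepts or rejects; it builds no program.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *sk_alt(const char *p);

/* _piece: atom | group. Returns the position after it, or NULL on error. */
static const char *sk_atom(const char *p) {
    if (*p == '(') {                       /* group: '(' branch ')' */
        p = sk_alt(p + 1);
        return (p && *p == ')') ? p + 1 : NULL;
    }
    if (*p == '[') {                       /* class: non-empty [...] */
        const char *q = p + 1;
        while (*q && *q != ']')
            q++;
        return (*q == ']' && q > p + 1) ? q + 1 : NULL;
    }
    return isalnum((unsigned char)*p) ? p + 1 : NULL;   /* constant */
}

/* piece: _piece optionally followed by one quantifier. */
static const char *sk_piece(const char *p) {
    p = sk_atom(p);
    if (p && (*p == '*' || *p == '+' || *p == '?'))
        p++;
    return p;
}

/* _branch: zero or more pieces, stopping at '|', ')' or the end. */
static const char *sk_branch(const char *p) {
    while (*p && *p != '|' && *p != ')') {
        p = sk_piece(p);
        if (!p)
            return NULL;
    }
    return p;
}

/* branch: _branch ('|' _branch)* */
static const char *sk_alt(const char *p) {
    p = sk_branch(p);
    while (p && *p == '|')
        p = sk_branch(p + 1);
    return p;
}

/* Returns 1 if the whole pattern is accepted by the reduced grammar. */
static int sk_accepts(const char *pat) {
    const char *end;
    assert(pat != NULL);
    end = sk_alt(pat);
    return end != NULL && *end == '\0';
}
```

Note how sk_atom() calls back into sk_alt() on an opening parenthesis, in the same way that regatom() calls reg().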

Debug Output¶

In the 5.9.x development version of perl you can say "use re Debug => 'PARSE'" to see some trace information about the parse process. We will start with some simple patterns and build up to more complex patterns.

So when we parse /foo/ we see something like the following table. The left shows what is being parsed, and the number indicates where the next regop would go. The stuff on the right is the trace output of the graph. The names are chosen to be short to make it less dense on the screen. 'tsdy' is a special form of regtail() which does some extra analysis.

 >foo<             1    reg
                          brnc
                            piec
                              atom
 ><                4      tsdy~ EXACT <foo> (EXACT) (1)
                              ~ attach to END (3) offset to 2

The resulting program then looks like:

   1: EXACT <foo>(3)
   3: END(0)

As you can see, even though we parsed out a branch and a piece, it was ultimately only an atom. The final program shows us how things work. We have an EXACT regop, followed by an END regop. The number in parens indicates where the regnext of the node goes. The regnext of an END regop is unused, as END regops mean we have successfully matched. The number on the left indicates the position of the regop in the regnode array.

Now let's try a harder pattern. We will add a quantifier, so now we have the pattern /foo+/. We will see that regbranch() calls regpiece() twice.

 >foo+<            1    reg
                          brnc
                            piec
                              atom
 >o+<              3        piec
                              atom
 ><                6        tail~ EXACT <fo> (1)
                   7      tsdy~ EXACT <fo> (EXACT) (1)
                              ~ PLUS (END) (3)
                              ~ attach to END (6) offset to 3

And we end up with the program:

   1: EXACT <fo>(3)
   3: PLUS(6)
   4:   EXACT <o>(0)
   6: END(0)

Now we have a special case. The EXACT regop has a regnext of 0. This is because if it matches it should try to match itself again. The PLUS regop handles the actual failure of the EXACT regop and acts appropriately (going to regnode 6 if the EXACT matched at least once, or failing if it didn't).
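
The division of labour between PLUS and its EXACT operand can be sketched by hand-coding the four-regop program above. This is a toy under stated assumptions: the sk_ names are hypothetical, matching is anchored at the start of the string, and no backtracking is modelled because nothing but END follows the PLUS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One-or-more repetitions of lit at s. Returns bytes consumed, or -1
 * if lit did not match even once (the case where PLUS must fail). */
static long sk_plus(const char *s, const char *lit) {
    size_t n = strlen(lit);
    size_t count = 0;
    assert(n > 0);
    /* The operand "fails" when strncmp reports a mismatch; the loop,
     * playing the role of PLUS, decides what that failure means. */
    while (strncmp(s + count * n, lit, n) == 0)
        count++;
    return count ? (long)(count * n) : -1;
}

/* Hand-coded walk of the program for /foo+/, anchored at s.
 * Returns the match length, or -1 if there is no match. */
static long sk_match_foo_plus(const char *s) {
    long r;
    if (strncmp(s, "fo", 2) != 0)     /* regnode 1: EXACT <fo> */
        return -1;
    r = sk_plus(s + 2, "o");          /* regnodes 3 and 4: PLUS, EXACT <o> */
    return r < 0 ? -1 : 2 + r;        /* regnode 6: END */
}
```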

Now for something much more complex: /x(?:foo*|b[a][rR])(foo|bar)$/

 >x(?:foo*|b...    1    reg
                          brnc
                            piec
                              atom
 >(?:foo*|b[...    3        piec
                              atom
 >?:foo*|b[a...                 reg
 >foo*|b[a][...                   brnc
                                    piec
                                      atom
 >o*|b[a][rR...    5                piec
                                      atom
 >|b[a][rR])...    8                tail~ EXACT <fo> (3)
 >b[a][rR])(...    9              brnc
                  10                piec
                                      atom
 >[a][rR])(f...   12                piec
                                      atom
 >a][rR])(fo...                         clas
 >[rR])(foo|...   14                tail~ EXACT <b> (10)
                                    piec
                                      atom
 >rR])(foo|b...                         clas
 >)(foo|bar)...   25                tail~ EXACT <a> (12)
                                  tail~ BRANCH (3)
                  26              tsdy~ BRANCH (END) (9)
                                      ~ attach to TAIL (25) offset to 16
                                  tsdy~ EXACT <fo> (EXACT) (4)
                                      ~ STAR (END) (6)
                                      ~ attach to TAIL (25) offset to 19
                                  tsdy~ EXACT <b> (EXACT) (10)
                                      ~ EXACT <a> (EXACT) (12)
                                      ~ ANYOF[Rr] (END) (14)
                                      ~ attach to TAIL (25) offset to 11
 >(foo|bar)$<               tail~ EXACT <x> (1)
                            piec
                              atom
 >foo|bar)$<                    reg
                  28              brnc
                                    piec
                                      atom
 >|bar)$<         31              tail~ OPEN1 (26)
 >bar)$<                          brnc
                  32                piec
                                      atom
 >)$<             34              tail~ BRANCH (28)
                  36              tsdy~ BRANCH (END) (31)
                                      ~ attach to CLOSE1 (34) offset to 3
                                  tsdy~ EXACT <foo> (EXACT) (29)
                                      ~ attach to CLOSE1 (34) offset to 5
                                  tsdy~ EXACT <bar> (EXACT) (32)
                                      ~ attach to CLOSE1 (34) offset to 2
 >$<                        tail~ BRANCH (3)
                                ~ BRANCH (9)
                                ~ TAIL (25)
                            piec
                              atom
 ><               37        tail~ OPEN1 (26)
                                ~ BRANCH (28)
                                ~ BRANCH (31)
                                ~ CLOSE1 (34)
                  38      tsdy~ EXACT <x> (EXACT) (1)
                              ~ BRANCH (END) (3)
                              ~ BRANCH (END) (9)
                              ~ TAIL (END) (25)
                              ~ OPEN1 (END) (26)
                              ~ BRANCH (END) (28)
                              ~ BRANCH (END) (31)
                              ~ CLOSE1 (END) (34)
                              ~ EOL (END) (36)
                              ~ attach to END (37) offset to 1

Resulting in the program

   1: EXACT <x>(3)
   3: BRANCH(9)
   4:   EXACT <fo>(6)
   6:   STAR(26)
   7:     EXACT <o>(0)
   9: BRANCH(25)
  10:   EXACT <ba>(14)
  12:   OPTIMIZED (2 nodes)
  14:   ANYOF[Rr](26)
  25: TAIL(26)
  26: OPEN1(28)
  28:   TRIE-EXACT(34)
        [StS:1 Wds:2 Cs:6 Uq:5 #Sts:7 Mn:3 Mx:3 Stcls:bf]
          <foo>
          <bar>
  30:   OPTIMIZED (4 nodes)
  34: CLOSE1(36)
  36: EOL(37)
  37: END(0)

Here we can see a much more complex program, with various optimisations in play. At regnode 10 we see an example where a character class with only one character in it was turned into an EXACT node. We can also see where an entire alternation was turned into a TRIE-EXACT node. As a consequence, some of the regnodes have been marked as optimised away. We can see that the $ symbol has been converted into an EOL regop, a special piece of code that looks for \n or the end of the string.

The next pointer for BRANCHes is interesting in that it points at where execution should go if the branch fails. When executing, if the engine tries to traverse from a branch to a regnext that isn't a branch then the engine will know that the entire set of branches has failed.

Peep-hole Optimisation and Analysis¶

The regular expression engine can be a weighty tool to wield. On long strings and complex patterns it can end up having to do a lot of work to find a match, and even more to decide that no match is possible. Consider a situation like the following pattern.

   'ababababababababababab' =~ /(a|b)*z/

The (a|b)* part can match at every char in the string, and then fail every time because there is no z in the string. So obviously we can avoid using the regex engine unless there is a z in the string. Likewise in a pattern like:

   /foo(\w+)bar/

In this case we know that the string must contain a foo which must be followed by bar. We can use Fast Boyer-Moore matching as implemented in fbm_instr() to find the location of these strings. If they don't exist then we don't need to resort to the much more expensive regex engine. Even better, if they do exist then we can use their positions to reduce the search space that the regex engine needs to cover to determine if the entire pattern matches.
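
The fixed-string pre-check can be sketched with the standard strstr() standing in for fbm_instr(); that substitution, and the function name, are assumptions made for illustration. The function answers only "could /foo(\w+)bar/ possibly match?": a 0 result means the regex engine need not run at all, while a 1 result means the engine still has to confirm the match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Necessary (not sufficient) condition for /foo(\w+)bar/ to match s:
 * "foo" occurs, and "bar" occurs at least one character after it. */
static int sk_might_match(const char *s) {
    const char *foo;
    assert(s != NULL);
    foo = strstr(s, "foo");
    if (!foo)
        return 0;                 /* no "foo": the pattern cannot match */
    if (foo[3] == '\0')
        return 0;                 /* nothing follows "foo" */
    /* \w+ needs at least one character, so "bar" starts at foo + 4 or later */
    return strstr(foo + 4, "bar") != NULL;
}
```

Using the first occurrence of "foo" is enough for a filter: any later "foo" with a suitable "bar" after it also leaves that "bar" far enough past the first one.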

There are various aspects of the pattern that can be used to facilitate optimisations along these lines:

anchored fixed strings
floating fixed strings
minimum and maximum length requirements
start class
Beginning/End of line positions

Another form of optimisation that can occur is the post-parse "peep-hole" optimisation, where inefficient constructs are replaced by more efficient constructs. The TAIL regops which are used during parsing to mark the end of branches and the end of groups are examples of this. These regops are used as place-holders during construction and "always match" so they can be "optimised away" by making the things that point to the TAIL point to the thing that TAIL points to, thus "skipping" the node.

Another optimisation that can occur is that of "EXACT merging" which is where two consecutive EXACT nodes are merged into a single regop. An even more aggressive form of this is that a branch sequence of the form EXACT BRANCH ... EXACT can be converted into a TRIE-EXACT regop.

もう1つの最適化は、2つの連続するEXACTノードが1つのregopにマージされる「EXACTマージ」の最適化です。これのさらに積極的な形式は、EXACT BRANCH ... EXACT形式の分岐シーケンスをTRIE-EXACT regopに変換できることです。 (TBR)
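
EXACT merging can be pictured with another toy list. Again this is only a sketch under stated assumptions: the fixed-size text buffer, the mnode struct and merge_exact() are all hypothetical, and the real engine has additional limits and special cases that are not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MERGE_MAX 32            /* toy limit on the merged literal */

enum mnode_op { MNODE_EXACT, MNODE_OTHER };

struct mnode {
    enum mnode_op op;
    char text[MERGE_MAX];       /* literal text, used only by MNODE_EXACT */
    struct mnode *next;
};

/* Merge each run of consecutive EXACT nodes into the first node of the run. */
static void merge_exact(struct mnode *head)
{
    assert(head != NULL);
    struct mnode *n = head;
    while (n != NULL) {
        struct mnode *nx = n->next;
        if (n->op == MNODE_EXACT && nx != NULL && nx->op == MNODE_EXACT
            && strlen(n->text) + strlen(nx->text) < MERGE_MAX) {
            strcat(n->text, nx->text);  /* absorb the following literal */
            n->next = nx->next;         /* unlink it, then retry this node */
        } else {
            n = n->next;
        }
    }
}
```

Two nodes holding "foo" and "bar" end up as a single node holding "foobar", so the interpreter has one comparison to dispatch instead of two.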

All of this occurs in the routine study_chunk() which uses a special structure scan_data_t to store the analysis that it has performed, and does the "peep-hole" optimisations as it goes.

これらはすべてルーチンstudy_chunk()で行われます。このルーチンは特殊な構造体scan_data_tを使用して実行した分析を格納し、処理を進めながら"のぞき穴"最適化を行います。 (TBR)

The code involved in study_chunk() is extremely cryptic. Be careful. :-)

study_chunk()に含まれるコードは非常に難解です。注意してください。:-) (TBR)

Execution¶

Execution of a regex generally involves two phases, the first being finding the start point in the string where we should match from, and the second being running the regop interpreter.

通常、regexの実行には2つのフェーズがあります。 1つ目はストリング内でマッチすべき開始点を見つけるフェーズで、2つ目はregopインタープリターを実行するフェーズです。 (TBR)

If we can tell that there is no valid start point then we don't bother running the interpreter at all. Likewise, if we know from the analysis phase that we cannot detect a short-cut to the start position, we go straight to the interpreter.

有効な開始点がないことがわかれば、インタプリタを実行する必要はありません。同様に、解析フェーズから開始位置へのショートカットを検出できないことがわかっている場合は、そのままインタプリタに進みます。 (TBR)

The two entry points are re_intuit_start() and pregexec(). These routines have a somewhat incestuous relationship with overlap between their functions, and pregexec() may even call re_intuit_start() on its own. Nevertheless, other parts of the perl source code may call into either, or both.

2つのエントリポイントは、re_intuit_start()とpregexec()です。これらのルーチンは、機能が重複するという多少近親相姦的な関係を持っており、pregexec()が自分でre_intuit_start()を呼び出すことさえあります。それでも、perlソースコードの他の部分は、どちらか、または両方を呼び出すことがあります。 (TBR)

Execution of the interpreter itself used to be recursive, but thanks to the efforts of Dave Mitchell in the 5.9.x development track, that has changed: now an internal stack is maintained on the heap and the routine is fully iterative. This can make it tricky as the code is quite conservative about what state it stores, with the result that two consecutive lines in the code can actually be running in totally different contexts due to the simulated recursion.

インタープリター自体の実行は再帰的でしたが、5.9.x開発トラックでのDave Mitchell氏の努力のおかげで、それは変わりました:内部スタックがヒープ上で維持され、ルーチンは完全に反復的になりました。これは、コードが格納する状態に関して非常に保守的であり、コード内の2つの連続した行が、シミュレートされた再帰のために、実際にはまったく異なるコンテキストで実行される可能性があるため、注意が必要になる可能性があります。 (TBR)
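
To give a feel for what "simulated recursion" means, here is a minimal sketch of a backtracking matcher that keeps its own stack on the heap instead of recursing. It is a toy and not the engine's code: the pattern language (literal characters plus '*' for any run of characters), the bt_frame struct and match_iter() are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* One saved backtrack point: where to resume in the pattern and string. */
struct bt_frame { size_t pi, si; };

/* Toy matcher: literal characters plus '*' (any run of characters).
 * Returns 1 on match, 0 on no match, -1 if memory runs out. */
static int match_iter(const char *pat, const char *str)
{
    assert(pat != NULL && str != NULL);
    size_t cap = 8, top = 0, pi = 0, si = 0;
    struct bt_frame *stack = malloc(cap * sizeof *stack);
    int result;
    if (stack == NULL)
        return -1;
    for (;;) {
        if (pat[pi] == '\0' && str[si] == '\0') { result = 1; break; }
        if (pat[pi] == '*') {
            if (str[si] != '\0') {
                /* Save the alternative "star eats one more character". */
                if (top == cap) {
                    struct bt_frame *grown =
                        realloc(stack, 2 * cap * sizeof *stack);
                    if (grown == NULL) { result = -1; break; }
                    stack = grown;
                    cap *= 2;
                }
                stack[top].pi = pi;
                stack[top].si = si + 1;
                top++;
            }
            pi++;               /* first try the star matching nothing */
            continue;
        }
        if (pat[pi] != '\0' && pat[pi] == str[si]) { pi++; si++; continue; }
        /* Failure: resume the most recently saved alternative, if any. */
        if (top == 0) { result = 0; break; }
        top--;
        pi = stack[top].pi;     /* from here on we are in another "call" */
        si = stack[top].si;
    }
    free(stack);
    return result;
}
```

Note how the two assignments after the pop silently switch the loop to a different logical context. Even in a toy this small, the line before the pop and the line after it belong to different "invocations", which is the effect described above.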

Start position and no-match optimisations¶

re_intuit_start() is responsible for handling start points and no-match optimisations as determined by the results of the analysis done by study_chunk() (and described in "Peep-hole Optimisation and Analysis").

re_intuit_start()は、study_chunk()によって行われた(そして"Peep-hole Optimization and Analysis"に記述されている)分析の結果によって決定された、開始点およびマッチしない最適化を処理する責任がある。 (TBR)

The basic structure of this routine is to try to find the start- and/or end-points of where the pattern could match, and to ensure that the string is long enough to match the pattern. It tries to use more efficient methods over less efficient methods and may involve considerable cross-checking of constraints to find the place in the string that matches. For instance it may try to determine that a given fixed string must be not only present but a certain number of chars before the end of the string, or whatever.

このルーチンの基本的な構造は、パターンがマッチする場所の開始点および/または終了点を見つけようとし、文字列がパターンにマッチするのに十分な長さであることを確認しようとすることです。このルーチンは、効率の低いメソッドよりも効率の高いメソッドを使用しようとし、制約条件のかなりのクロスチェックを行って、マッチする文字列の場所を見つけることがあります。例えば、特定の固定文字列が存在するだけでなく、文字列の終わりの前に特定の数の文字が存在する必要があることなどを判断しようとするかもしれません。 (TBR)

It calls several other routines, such as fbm_instr() which does Fast Boyer-Moore matching and find_byclass() which is responsible for finding the start using the first mandatory regop in the program.

Fast Boyer-Mooreマッチングを行うfbm_instr()や、プログラム内の最初の必須regopを使って開始位置を見つけるfind_byclass()など、他のいくつかのルーチンを呼び出します。 (TBR)

When the optimisation criteria have been satisfied, regtry() is called to perform the match.

最適化基準が満たされると、regtry()が呼び出されてマッチングが実行される。 (TBR)

Program execution¶

pregexec() is the main entry point for running a regex. It contains support for initialising the regex interpreter's state, running re_intuit_start() if needed, and running the interpreter on the string from various start positions as needed. When it is necessary to use the regex interpreter, pregexec() calls regtry().

pregexec()は、regexを実行するための主要なエントリポイントです。これには、regexインタプリタの状態を初期化するサポート、必要に応じてre_intuit_start()を実行するサポート、および必要に応じて文字列のさまざまな開始位置からインタプリタを実行するサポートが含まれています。 regexインタプリタを使用する必要がある場合、pregexec()はregtry()を呼び出します。 (TBR)
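
The division of labour between the two routines can be sketched as a driver loop around a "try here" function. This is a deliberately naive model: toy_try() and toy_exec() are hypothetical stand-ins for regtry() and pregexec(), the "pattern" is just a literal string, and the real routines take quite different arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for regtry(): does the literal `pat` match exactly at `at`? */
static int toy_try(const char *pat, const char *at)
{
    return strncmp(at, pat, strlen(pat)) == 0;
}

/* Stand-in for pregexec(): cheap checks first, then one attempt per
 * start position.  Returns the offset of the match, or -1. */
static long toy_exec(const char *pat, const char *str)
{
    assert(pat != NULL && str != NULL);
    size_t plen = strlen(pat), slen = strlen(str);
    if (plen > slen)
        return -1;              /* too short: the "interpreter" never runs */
    for (size_t start = 0; start + plen <= slen; start++)
        if (toy_try(pat, str + start))
            return (long)start;
    return -1;
}
```

For example, toy_exec("bar", "foobar") makes four attempts and succeeds at offset 3, while a pattern longer than the string is rejected by the length check alone.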

regtry() is the entry point into the regex interpreter. It expects as arguments a pointer to a regmatch_info structure and a pointer to a string. It returns an integer 1 for success and a 0 for failure. It is basically a set-up wrapper around regmatch().

regtry()は、regexインタプリタへのエントリポイントです。引数として、regmatch_info構造体へのポインタと文字列へのポインタを受け取ります。成功した場合は整数1を返し、失敗した場合は0を返します。基本的にはregmatch()の設定ラッパーです。 (TBR)

regmatch() is the main "recursive loop" of the interpreter. It is basically a giant switch statement that implements a state machine, where the possible states are the regops themselves, plus a number of additional intermediate and failure states. A few of the states are implemented as subroutines but the bulk are inline code.

regmatchはインタプリタの主要な"再帰ループ"です。これは基本的にステートマシンを実装する巨大なswitch文であり、可能な状態はregops自体といくつかの中間状態と障害状態です。いくつかの状態はサブルーチンとして実装されていますが、バルクはインラインコードです。 (TBR)
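
The "giant switch" shape is easy to show in miniature. The three opcodes below are invented for this sketch and there is no backtracking, so it is far simpler than regmatch(), but the structure (fetch a node, switch on its type, advance or fail) is the same idea.

```c
#include <assert.h>
#include <stddef.h>

/* Toy opcodes, invented for this sketch. */
enum sw_op { SW_CHAR, SW_ANY, SW_END };

struct sw_node {
    enum sw_op op;
    char arg;                   /* the character to match, for SW_CHAR */
};

/* Run a toy program against `s`, anchored at its start.
 * Returns 1 if the program reaches SW_END, otherwise 0. */
static int sw_run(const struct sw_node *prog, const char *s)
{
    assert(prog != NULL && s != NULL);
    for (size_t pc = 0; ; pc++) {
        switch (prog[pc].op) {
        case SW_CHAR:           /* one specific character */
            if (*s != prog[pc].arg)
                return 0;
            s++;
            break;
        case SW_ANY:            /* any single character */
            if (*s == '\0')
                return 0;
            s++;
            break;
        case SW_END:            /* end of program: success */
            return 1;
        default:                /* unknown opcode: treat as failure */
            return 0;
        }
    }
}
```

A program of SW_CHAR 'a', SW_ANY, SW_CHAR 'c', SW_END behaves like an anchored /a.c/ without backtracking: it accepts "abc" and "axcdef" and rejects "ab".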

MISCELLANEOUS¶

Unicode and Localisation Support¶

When dealing with strings containing characters that cannot be represented using an eight-bit character set, perl uses an internal representation that is a permissive version of Unicode's UTF-8 encoding[2]. This uses single bytes to represent characters from the ASCII character set, and sequences of two or more bytes for all other characters. (See perlunitut for more information about the relationship between UTF-8 and perl's encoding, utf8 -- the difference isn't important for this discussion.)

8ビット文字セットで表現できない文字を含む文字列を扱う場合、perlはUnicodeのUTF-8エンコーディング[2]の許容バージョンである内部表現を使用します。これはASCII文字セットの文字を表現するのに1バイトを使用し、他のすべての文字は2バイト以上のシーケンスを使用します(UTF-8とperlのエンコーディングutf8との関係についての詳細はperlunitutを参照してください。この説明では違いは重要ではありません)。 (TBR)

No matter how you look at it, Unicode support is going to be a pain in a regex engine. Tricks that might be fine when you have 256 possible characters often won't scale to handle the size of the UTF-8 character set. Things you can take for granted with ASCII may not be true with Unicode. For instance, in ASCII, it is safe to assume that sizeof(char1) == sizeof(char2), but in UTF-8 it isn't. Unicode case folding is vastly more complex than the simple rules of ASCII, and even when not using Unicode but only localised single byte encodings, things can get tricky (for example, LATIN SMALL LETTER SHARP S (U+00DF, ß) should match 'SS' in localised case-insensitive matching).

どのような見方をしても、regexエンジンではUnicodeのサポートは厄介なものになります。可能な文字が256文字ある場合にはうまくいくかもしれないトリックは、UTF-8文字セットのサイズに対応できないことがよくあります。 ASCIIで当然と考えられることは、Unicodeには当てはまらない可能性があります。例えば、ASCIIではsizeof(char1) == sizeof(char2)と仮定しても安全ですが、UTF-8ではそうではありません。 Unicodeの大文字小文字の折り畳みはASCIIの単純な規則よりもはるかに複雑であり、Unicodeを使用せずローカライズされた単一バイトエンコーディングのみを使用している場合でも、扱いが難しい場合があります(例えば、LATIN SMALL LETTER SHARP S (U+00DF, ß)はローカライズされた大文字小文字を区別しないマッチングでは'SS'と一致する必要があります)。 (TBR)
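
The point about character sizes can be made concrete with a small sketch. The helpers below follow standard UTF-8 lead-byte rules for sequences of up to four bytes; they do not model perl's more permissive internal encoding, they do not validate continuation bytes, and the names utf8_seq_len() and utf8_char_count() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`, judged
 * from the lead byte alone.  Returns 0 for a continuation byte or a byte
 * that cannot start a sequence of one to four bytes. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;    /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;    /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;    /* 11110xxx */
    return 0;
}

/* Count characters (not bytes) in a NUL-terminated string.
 * Sketch only: assumes the input is well-formed UTF-8. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        assert(len != 0);
        s += len;
        n++;
    }
    return n;
}
```

The two bytes 0xC3 0x9F (the UTF-8 encoding of U+00DF) count as one character here, so "advance by one character" and "advance by one byte" are no longer the same operation.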

Making things worse is that UTF-8 support was a later addition to the regex engine (as it was to perl) and this necessarily made things a lot more complicated. Obviously it is easier to design a regex engine with Unicode support in mind from the beginning than it is to retrofit it to one that wasn't.

事態をさらに悪化させているのは、UTF-8サポートがregexエンジンに後から追加された(perlの場合と同じように)ため、事態は必然的に複雑になりました。明らかに、最初からUnicodeサポートを念頭に置いてregexエンジンを設計する方が、そうでないものに後から組み込むよりも簡単です。 (TBR)

Nearly all regops that involve looking at the input string have two cases, one for UTF-8, and one not. In fact, it's often more complex than that, as the pattern may be UTF-8 as well.

入力文字列を参照するほとんどすべてのregopsには、UTF-8とUTF-8以外の2つのケースがあります。実際、パターンもUTF-8である可能性があるため、これはより複雑なことがよくあります。 (TBR)

Care must be taken when making changes to make sure that you handle UTF-8 properly, both at compile time and at execution time, including when the string and pattern are mismatched.

変更を行う際には、コンパイル時と実行時(文字列とパターンが一致しない場合を含む)の両方で、UTF-8が正しく処理されるように注意する必要があります。 (TBR)

The following comment in regcomp.h gives an example of exactly how tricky this can be:

regcomp.hの次のコメントは、これがいかに厄介かを正確に示す例を示しています。 (TBR)

    Two problematic code points in Unicode casefolding of EXACT nodes:

    U+0390 - GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
    U+03B0 - GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS

    which casefold to

    Unicode                      UTF-8

    U+03B9 U+0308 U+0301         0xCE 0xB9 0xCC 0x88 0xCC 0x81
    U+03C5 U+0308 U+0301         0xCF 0x85 0xCC 0x88 0xCC 0x81

    This means that in case-insensitive matching (or "loose matching",
    as Unicode calls it), an EXACTF of length six (the UTF-8 encoded
    byte length of the above casefolded versions) can match a target
    string of length two (the byte length of UTF-8 encoded U+0390 or
    U+03B0). This would rather mess up the minimum length computation.

    What we'll do is to look for the tail four bytes, and then peek
    at the preceding two bytes to see whether we need to decrease
    the minimum length by four (six minus two).

    Thanks to the design of UTF-8, there cannot be false matches:
    A sequence of valid UTF-8 bytes cannot be a subsequence of
    another valid sequence of UTF-8 bytes.
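
The byte counts quoted in that comment can be checked with a few lines of C. This is a sketch only: utf8_encode() and utf8_encode_seq() are hypothetical helpers that handle code points up to U+FFFF, which is all this example needs, and they do no validation beyond that.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point (U+0000..U+FFFF only in this sketch) as UTF-8.
 * Returns the number of bytes written to `out`. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(cp <= 0xFFFF && out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Encode `n` code points back to back; returns the total byte length. */
static size_t utf8_encode_seq(const unsigned long *cps, size_t n,
                              unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += utf8_encode(cps[i], out + len);
    return len;
}
```

Encoding U+0390 on its own takes two bytes (0xCE 0x90), while its casefolded form U+03B9 U+0308 U+0301 takes six, which is exactly the six-versus-two mismatch the comment is worried about.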

Base Structures¶

The regexp structure described in perlreapi is common to all regex engines. Two of its fields are intended for the private use of the regex engine that compiled the pattern. These are the intflags and pprivate members. The pprivate is a void pointer to an arbitrary structure whose use and management is the responsibility of the compiling engine. perl will never modify either of these values. In the case of the stock engine the structure pointed to by pprivate is called regexp_internal.

perlreapiで説明されているregexp構造体は、すべてのregexエンジンに共通です。この構造体の2つのフィールドは、パターンをコンパイルしたregexエンジンのプライベートな使用を目的としています。これらはintflagsメンバとpprivateメンバです。 pprivateは、任意の構造体へのvoidポインタであり、その使用と管理はコンパイルエンジンの責任で行われます。 perlはこれらの値を変更することはありません。ストックエンジンの場合、pprivateが指す構造体はregexp_internalと呼ばれます。 (TBR)

Its pprivate and intflags fields contain data specific to each engine.

pprivateとintflagsフィールドには、各エンジンに固有のデータが含まれています。 (TBR)

There are two structures used to store a compiled regular expression. One, the regexp structure described in perlreapi, is populated by the engine currently being used, and some of its fields are read by perl to implement things such as the stringification of qr//.

コンパイルされた正規表現を格納するために使用される構造体は2つあります。 1つは、perlreapiに記述されているregexp構造体が現在使用されているエンジンによって取り込まれ、そのフィールドのいくつかがperlによって読み取られてqr//のstringificationのようなものを実装します。 (TBR)

The other structure is pointed to by the regexp struct's pprivate and is, in addition to intflags in the same struct, considered to be the property of the regex engine which compiled the regular expression.

もう1つの構造体は、regexp構造体のpprivateであり、正規表現をコンパイルしたregexエンジンのプロパティと考えられる同じ構造体のintflagsに追加されます。 (TBR)

The regexp structure contains all the data that perl needs to be aware of to properly work with the regular expression. It includes data about optimisations that perl can use to determine if the regex engine should really be used, and various other control info that is needed to properly execute patterns in various contexts, such as whether the pattern is anchored in some way, what flags were used during the compile, or whether the program contains special constructs that perl needs to be aware of.

regexp構造体には、正規表現で適切に動作するためにperlが認識する必要があるすべてのデータが含まれています。この構造体には、perlがregexエンジンを実際に使用すべきかどうかを判断するために使用できる最適化に関するデータや、さまざまなコンテキストでパターンを適切に実行するために必要なその他のさまざまな制御情報(パターンが何らかの方法で固定されているか、コンパイル時にどのフラグが使用されたか、プログラムがperlが認識する必要がある特殊な構造体を含んでいるかどうかなど)が含まれています。 (TBR)

In addition it contains two fields that are intended for the private use of the regex engine that compiled the pattern. These are the intflags and pprivate members. The pprivate is a void pointer to an arbitrary structure whose use and management is the responsibility of the compiling engine. perl will never modify either of these values.

さらに、パターンをコンパイルしたregexエンジンのプライベートな使用を目的とした2つのフィールドが含まれています。これらはintflagsメンバとpprivateメンバです。 pprivateは任意の構造体へのvoidポインタで、その使用と管理はコンパイルエンジンの責任となります。 perlはこれらの値を変更することはありません。 (TBR)

As mentioned earlier, in the case of the default engines, the pprivate will be a pointer to a regexp_internal structure which holds the compiled program and any additional data that is private to the regex engine implementation.

先に述べたように、デフォルトエンジンの場合、pprivateはregexp_internal構造体へのポインタになります。この構造体は、コンパイルされたプログラムと、regexエンジンの実装にとってプライベートな追加データを保持します。 (TBR)

Perl's pprivate structure¶

The following structure is used as the pprivate struct by perl's regex engine. Since it is specific to perl it is only of curiosity value to other engine implementations.

以下の構造体は、perlのregexエンジンでpprivate構造体として使用されています。これはperlに特有のものなので、他のエンジンの実装では興味深い価値しかありません。 (TBR)

    typedef struct regexp_internal {
            regexp_paren_ofs *swap; /* Swap copy of *startp / *endp */
            U32 *offsets;           /* offset annotations 20001228 MJD 
                                       data about mapping the program to the 
                                       string*/
            regnode *regstclass;    /* Optional startclass as identified or constructed
                                       by the optimiser */
            struct reg_data *data;  /* Additional miscellaneous data used by the program.
                                       Used to make it easier to clone and free arbitrary
                                       data that the regops need. Often the ARG field of
                                       a regop is an index into this structure */
            regnode program[1];     /* Unwarranted chumminess with compiler. */
    } regexp_internal;

swap

swap is an extra set of startp/endp stored in a regexp_paren_ofs struct. This is used when the last successful match was from the same pattern as the current pattern, so that a partial match doesn't overwrite the previous match's results. When this field is populated, the matching engine will swap buffers before every match attempt. If the match fails, then it swaps them back. If it's successful it leaves them. This field is populated on demand and is by default null.

swapは、regexp_paren_ofs構造体に保存されたstartp/endpの特別なセットです。これは、最後に成功したマッチが現在のパターンと同じパターンからのものであった場合に使用され、部分的なマッチが前のマッチの結果を上書きしないようにします。このフィールドにデータが入力されると、マッチングエンジンはすべてのマッチが試行される前にバッファをスワップします。マッチが失敗した場合は、バッファをスワップして戻します。マッチが成功した場合は、バッファを残します。このフィールドは要求時に設定され、デフォルトではnullです。 (TBR)

offsets

Offsets holds a mapping of offset in the program to offset in the precomp string. This is only used by ActiveState's visual regex debugger.

Offsetsは、program内のoffsetからprecomp文字列内のoffsetへのマッピングを保持します。これは、ActiveStateのvisual regexデバッガでのみ使用されます。 (TBR)

regstclass

Special regop that is used by re_intuit_start() to check if a pattern can match at a certain position. For instance if the regex engine knows that the pattern must start with a 'Z' then it can scan the string until it finds one and then launch the regex engine from there. The routine that handles this is called find_byclass(). Sometimes this field points at a regop embedded in the program, and sometimes it points at an independent synthetic regop that has been constructed by the optimiser.

re_intuit_start()が特定の位置でパターンが一致するかどうかをチェックするために使用する特殊なregop。たとえば、パターンが'Z'で始まる必要があることをregexエンジンが認識している場合、'Z'が見つかるまで文字列をスキャンし、そこからregexエンジンを起動できます。これを処理するルーチンはfind_byclass()と呼ばれます。このフィールドがプログラムに組み込まれているregopを指す場合もあれば、オプティマイザによって構築された独立した合成regopを指す場合もあります。 (TBR)
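
The scanning step can be sketched with a byte-indexed lookup table. The start_class struct and the sc_add() and sc_find() helpers are assumptions made for this example; the real start class is a regop and find_byclass() handles many more cases, including UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Toy start class: one flag per possible byte value. */
struct start_class {
    unsigned char member[256];
};

/* Mark byte `c` as a possible first character of a match. */
static void sc_add(struct start_class *sc, unsigned char c)
{
    sc->member[c] = 1;
}

/* Return the first position in s[0..len) whose byte is in the class,
 * or NULL when no position can start a match. */
static const char *sc_find(const struct start_class *sc,
                           const char *s, size_t len)
{
    assert(sc != NULL && s != NULL);
    for (size_t i = 0; i < len; i++)
        if (sc->member[(unsigned char)s[i]])
            return s + i;
    return NULL;
}
```

With only 'Z' in the class, scanning "abcZdef" returns a pointer to offset 3, and that is the first position at which it is worth launching the full matcher.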

data

This field points at a reg_data structure, which is defined as follows:

このフィールドは、次のように定義されるreg_data構造体を指します。 (TBR)

    struct reg_data {
        U32 count;
        U8 *what;
        void* data[1];
    };

This structure is used for handling data structures that the regex engine needs to handle specially during a clone or free operation on the compiled product. Each element in the data array has a corresponding element in the what array. During compilation regops that need special structures stored will add an element to each array using the add_data() routine and then store the index in the regop.

この構造体は、コンパイルされた製品でのクローン操作またはフリー操作中にregexエンジンが特別に処理する必要があるデータ構造体を処理するために使用されます。データ配列内の各要素には、what配列内の対応する要素があります。コンパイル時に、特別な構造体を格納する必要があるregopsは、add_data()ルーチンを使用して各配列に要素を追加し、regopにインデックスを格納します。 (TBR)
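
The parallel-array idea can be sketched as follows. This is a toy version under stated assumptions: toy_reg_data uses two separately allocated arrays rather than the inline data[1] member shown above, toy_add_data() and toy_free_data() are hypothetical and do not have the signature of the real add_data(), and the tag characters 'm' and 'b' are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy counterpart of reg_data: a type tag and a payload per slot. */
struct toy_reg_data {
    unsigned count;
    unsigned char *what;        /* what[i] says how to treat data[i] */
    void **data;
};

/* Append one slot and return its index (the value a regop would store),
 * or -1 if memory runs out. */
static long toy_add_data(struct toy_reg_data *rd, unsigned char what,
                         void *payload)
{
    assert(rd != NULL);
    unsigned char *w = realloc(rd->what, (rd->count + 1) * sizeof *w);
    if (w == NULL)
        return -1;
    rd->what = w;
    void **d = realloc(rd->data, (rd->count + 1) * sizeof *d);
    if (d == NULL)
        return -1;
    rd->data = d;
    rd->what[rd->count] = what;
    rd->data[rd->count] = payload;
    return (long)rd->count++;
}

/* The free operation consults the tags to decide what it owns.
 * 'm' (hypothetical tag): payload was malloc'd and must be freed.
 * Anything else: payload is borrowed and left alone. */
static void toy_free_data(struct toy_reg_data *rd)
{
    assert(rd != NULL);
    for (unsigned i = 0; i < rd->count; i++)
        if (rd->what[i] == 'm')
            free(rd->data[i]);
    free(rd->what);
    free(rd->data);
    rd->what = NULL;
    rd->data = NULL;
    rd->count = 0;
}
```

The point of the tag array is that generic code can clone or free every slot correctly without knowing which regop put it there.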

program

Compiled program. Inlined into the structure so the entire struct can be treated as a single blob.

コンパイルされたプログラム。構造体にインライン化されているため、構造体全体を1つのblobとして扱うことができます。 (TBR)
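
The benefit of inlining the program can be sketched like this. The example uses a C99 flexible array member instead of the program[1] idiom shown in the structure above, and toy_internal, toy_internal_new() and toy_internal_clone() are hypothetical names; the byte-sized "ops" are likewise an assumption made to keep the sketch short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy structure with its program stored inline at the end. */
struct toy_internal {
    size_t nops;                /* number of program bytes that follow */
    unsigned char program[];    /* C99 flexible array member */
};

/* Allocate header and program with a single malloc(). */
static struct toy_internal *toy_internal_new(const unsigned char *ops,
                                             size_t nops)
{
    assert(ops != NULL || nops == 0);
    struct toy_internal *ri = malloc(sizeof *ri + nops);
    if (ri == NULL)
        return NULL;
    ri->nops = nops;
    if (nops != 0)
        memcpy(ri->program, ops, nops);
    return ri;
}

/* Because everything is one blob, cloning is one malloc() and one memcpy(). */
static struct toy_internal *toy_internal_clone(const struct toy_internal *src)
{
    assert(src != NULL);
    size_t size = sizeof *src + src->nops;
    struct toy_internal *copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, src, size);
    return copy;
}
```

Freeing is equally simple: a single free() releases the header and the program together.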

作者¶

by Yves Orton, 2006.

With excerpts from Perl, and contributions and suggestions from Ronald J. Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus, Stephen McCamant, and David Landgren.

LICENCE¶

Same terms as Perl.

REFERENCES¶

[1] http://perl.plover.com/Rx/paper/

[2] http://www.unicode.org
