Java Basics - Regular Expressions (Regex)

Text Symbol

1. Special characters - metacharacters

a. The characters [ ] \ ^ $ . | ? * + ( ) are reserved by the regex engine for special purposes.

b. To use one of these characters as an ordinary literal, put the escape character '\' in front of it.

For example, to match the text '1+1=2', use the regex "1\+1=2".
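A minimal Java sketch of the escaping rule, assuming the standard java.util.regex package. The class name EscapeDemo and its helper methods are hypothetical and exist only for this illustration. Inside a Java string literal the backslash itself has to be doubled, so the regex 1\+1=2 is written "1\\+1=2".

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class EscapeDemo {

    // Escaped "+": the pattern matches the literal text 1+1=2.
    static boolean matchesEscaped(String input) {
        return Pattern.matches("1\\+1=2", input);
    }

    // Unescaped "+": a quantifier meaning "one or more 1s", then "1=2".
    static boolean matchesUnescaped(String input) {
        return Pattern.matches("1+1=2", input);
    }

    // Pattern.quote escapes a whole literal in one step.
    static boolean matchesQuoted(String input) {
        return Pattern.matches(Pattern.quote("1+1=2"), input);
    }

    public static void main(String[] args) {
        System.out.println(matchesEscaped("1+1=2"));   // true
        System.out.println(matchesUnescaped("1+1=2")); // false
        System.out.println(matchesUnescaped("111=2")); // true
        System.out.println(matchesQuoted("1+1=2"));    // true
    }
}
```

Pattern.quote is convenient when the literal comes from user input and may contain any of the metacharacters.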

2. Non-printable characters

"\t" represents the Tab character (0x09)

"\r" represents the Carriage Return character (0x0D)

"\n" represents the Line Feed character (0x0A)

Windows ends each line with "\r\n", while Linux uses "\n" alone.

Regex Charset

A regex charset (character class) is a group of characters enclosed in "[]".

1. [abc] and [a-z]

[abc] matches one character that is a, b, or c.

[a-z] matches any one character from a to z.

2. Negated charset '^'

a. Syntax - a '^' placed right after '[' negates the charset: it matches any character that is not listed in it.

Examples:

1. "[^a]+" applied to "abacad" finds "b", "c", and "d", the runs of characters other than 'a'.

2. "[^abc]+" applied to "abcde" finds "de", the run of characters other than 'a', 'b', and 'c'.

b. First, note that a negated charset can also match '\r' or '\n'.

c. Second, note that a negated charset must still match one character.

For example, "q[^u]" matches a "q" that is followed by one character other than "u". It does not match "Iraq", because nothing follows the "q", but it matches the "q " in "Iraq is a country".

d. Finally, note that

'^' negates the charset only when it comes right after '['.

Anywhere else it is an ordinary character: in "[a^b]+" the '^' is a literal member of the charset, so the regex matches "a^b".
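The negated-charset rules above can be checked with a short Java sketch. The class NegatedClassDemo and its findAll helper are hypothetical names introduced for this illustration; they collect every match that Matcher.find() reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class NegatedClassDemo {

    // Collects every non-overlapping match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll("[^a]+", "abacad"));           // [b, c, d]
        System.out.println(findAll("[^abc]+", "abcde"));          // [de]
        System.out.println(findAll("q[^u]", "Iraq"));             // []
        System.out.println(findAll("q[^u]", "Iraq is a country")); // [q ]
        System.out.println(findAll("[a^b]+", "a^b"));             // [a^b]
        // A negated charset also matches a line break.
        System.out.println(findAll("a[^b]c", "a\nc").size());     // 1
    }
}
```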

3. Shorthand charsets

"\d" stands for "[0-9]"

"\w" stands for "[a-zA-Z0-9_]"

"\s" stands for a whitespace character, which includes the space, Tab, '\r', and '\n'

Negated forms:

"\D" stands for "[^\d]"

"\W" stands for "[^\w]"

"\S" stands for "[^\s]"

Examples:

a. "\s\d" matches a whitespace character immediately followed by a digit.

b. "[\s\d]" matches one character that is either a whitespace character or a digit.

Regex Engine

1. Text-directed

Also known as DFA. I will not go into it, because these notes focus on the Java regex engine.

2. Regex-directed

Also known as NFA, the most popular kind of engine today.

Features

1). Eager matching - the engine always returns the leftmost match

Examples:

a. Apply the regex "regex|regex not" to the text "regex not".

It returns "regex", because the engine returns the first match it finds, not the longest one.

b. Apply the regex "cat" to "He captured a catfish for his cat" (not a global search).

It returns the "cat" inside "catfish" rather than the final word "cat", because it returns the leftmost match.

2). Lazy quantifiers

- A lazy quantifier repeats the preceding token as few times as possible, and extends the match only when the rest of the regex fails.

3). How to tell greedy quantifiers from lazy quantifiers

Consider the following example.

Apply the regex "<.+>" to "This is a <EM>first</EM> test" to find all the HTML tags. The expected results are "<EM>" and "</EM>".

But we get "<EM>first</EM>" back because of the greedy matching mechanism.

1. The process of greedy matching

a. The first token "<" matches the "<" in the text.

b. The next token ".+" matches all of the remaining text, "EM>first</EM> test", and stops only at the end of the text. (The matched text is now "<EM>first</EM> test".)

c. The next token ">" has nothing left to match, so the engine backtracks: ".+" gives back one character, leaving "<EM>first</EM> tes", and ">" is tried against "t". This still fails, and the engine keeps giving back one character at a time until ">" matches the last ">" in the text.

d. The engine then returns the match "<EM>first</EM>".

Conclusion: a greedy quantifier repeats the current token as many times as possible before moving on to the next token.

Like ".+", ".*" is also greedy.

2. The process of lazy matching

To use lazy matching, put a "?" after the quantifier: change "<.+>" to "<.+?>", and ".*" to ".*?".

a. The first token "<" matches the "<" in the text.

b. The next token ".+" matches one character, "E". The matched text is now "<E".

c. The next token ">" is tried against the next character, "M", and fails, so the engine backtracks.

(Backtracking here keeps the current position in the text and goes back one token in the regex, to ".+". Compare this with the greedy mechanism, which gives back characters of the text.)

d. The previous token ".+" matches the current character "M". The matched text is now "<EM".

e. The next token ">" matches the next character, ">".

f. The engine returns "<EM>".

g. Steps a to f are repeated on the rest of the text, and the engine then matches "</EM>".

Conclusion: a lazy quantifier repeats the previous token as few times as possible, and extends it by one character each time the next token fails.

However, the performance of lazy matching is poor, because it backtracks after every character.

3. An alternative to lazy matching

We can replace "<.+>" with "<[^>]+>". With the negated charset the engine does not need to backtrack.
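The three variants can be compared side by side in Java. This is a sketch that assumes the sample text is "This is a <EM>first</EM> test" (the EM tag is taken from the step-by-step walkthrough); GreedyLazyDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class GreedyLazyDemo {

    // Assumed sample text for the tag-matching example.
    static final String TEXT = "This is a <EM>first</EM> test";

    // Collects every non-overlapping match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Greedy: ".+" runs to the end, then backtracks to the last ">".
        System.out.println(findAll("<.+>", TEXT));    // [<EM>first</EM>]
        // Lazy: ".+?" stops at the first ">" it can reach.
        System.out.println(findAll("<.+?>", TEXT));   // [<EM>, </EM>]
        // Negated charset: same result as lazy, without backtracking.
        System.out.println(findAll("<[^>]+>", TEXT)); // [<EM>, </EM>]
    }
}
```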

3. How to tell a text-directed engine from a regex-directed engine

Apply the regex "regex|regex not" to the text "regex not".

If it returns "regex", the engine is regex-directed. If it returns "regex not", the engine is text-directed.
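The probe above can be run against java.util.regex directly. In this sketch EngineProbe, firstMatch, and firstStart are hypothetical helper names. The second call repeats the "cat" example and reports the index of the first match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class EngineProbe {

    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns the start index of the first match, or -1 if there is none.
    static int firstStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The first alternative that succeeds wins: regex-directed behavior.
        System.out.println(firstMatch("regex|regex not", "regex not")); // regex
        // Swapping the alternatives changes the result.
        System.out.println(firstMatch("regex not|regex", "regex not")); // regex not
        // The leftmost "cat" is the one inside "catfish" (index 14).
        System.out.println(firstStart("cat", "He captured a catfish for his cat")); // 14
    }
}
```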

Repetition Quantifiers

1. "?" matches zero times or one time

2. "+" matches one or more times

3. "*" matches zero or more times

4. "{min,max}" matches at least min times and at most max times

"{0,}" is the same as "*", and "{1,}" is the same as "+"

The Most Commonly Used Regex Character "."

1. "." matches any character, with the one exception below.

2. The exception - line break characters (line feed and carriage return)

This has historical reasons: early tools read a file line by line, so a line never contained a line break. So "." behaves like the charset "[^\r\n]" or "[^\n]".

3. How to make "." match '\r\n' or '\n'

- In Java, a pattern compiled with the default Pattern.compile(regex) does not let "." match line breaks.

- Use Pattern.compile(regex, Pattern.DOTALL) to enable this.
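A small sketch of the difference, using the real Pattern.DOTALL flag and its inline form "(?s)". The class DotAllDemo and its helper are hypothetical names for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class DotAllDemo {

    // Returns true if the whole input matches regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Default mode: "." refuses to match the line feed.
        System.out.println(matches("a.b", 0, "a\nb"));              // false
        // DOTALL: "." matches any character, line breaks included.
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb")); // true
        // The inline flag (?s) has the same effect as Pattern.DOTALL.
        System.out.println(matches("(?s)a.b", 0, "a\nb"));          // true
    }
}
```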

The Anchors

1. Notion

Anchors differ from other regex tokens: they do not match any character. They only match a position, such as the start or the end of a string.

2. Simple Examples

"^\d+$" -- checks that the user input consists of digits only.

"^\s*" -- matches the leading whitespace characters.

3. Multiple-line support

Suppose we have the text "first line\r\nsecond line"

1) Single line (default)

"^" matches the start position, before "f"

"$" matches the end position, after the final "e"

2) Multiple lines

"^" matches 2 start positions: the first before "f", the second between "\r\n" and "s"

"$" matches 2 end positions: the first between "e" and "\r\n", the second after the final "e"

3) Practice in Java

In Java, multiple-line mode is enabled with Pattern.compile(regex, Pattern.MULTILINE).
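A sketch of the single-line and multiple-line cases in Java, assuming the sample text uses the Windows line ending "\r\n". Pattern.MULTILINE is the real flag; MultilineDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class MultilineDemo {

    // Assumed sample text: two lines separated by "\r\n".
    static final String TEXT = "first line\r\nsecond line";

    // Collects every match of regex in input under the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Default: "^" matches only at the start of the whole input.
        System.out.println(findAll("^\\w+", 0, TEXT));                 // [first]
        // MULTILINE: "^" also matches right after the line break.
        System.out.println(findAll("^\\w+", Pattern.MULTILINE, TEXT)); // [first, second]
        // Default: "$" matches only at the end of the input.
        System.out.println(findAll("\\w+$", 0, TEXT).size());                 // 1
        // MULTILINE: "$" also matches just before the line break.
        System.out.println(findAll("\\w+$", Pattern.MULTILINE, TEXT).size()); // 2
    }
}
```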

The Matching Mode

1. "/i" -- ignore case

2. "/s" -- activate single-line mode ("." also matches line breaks)

3. "/m" -- activate multiple-line mode ("^" and "$" match at every line)

4. How to activate a mode inside the regex

Insert the mode modifier into the regex; it takes effect from the position where it appears. "(?i)" -- ignore case; "(?-i)" -- stop ignoring case

Example

The regex "(?i)te(?-i)st" matches "TEst" but does not match "TEST" or "teST".
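The inline modifiers work unchanged in Java, where the same modes are also available as the flags Pattern.CASE_INSENSITIVE, Pattern.DOTALL, and Pattern.MULTILINE. A minimal sketch with a hypothetical class name:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class InlineFlagDemo {

    // "te" is matched case-insensitively, "st" case-sensitively.
    static final String REGEX = "(?i)te(?-i)st";

    // Returns true if the whole input matches REGEX.
    static boolean matches(String input) {
        return Pattern.matches(REGEX, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("TEst")); // true
        System.out.println(matches("test")); // true
        System.out.println(matches("TEST")); // false
        System.out.println(matches("teST")); // false
    }
}
```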

Word Boundaries

1. Notion

"\b" is the word boundary anchor.

Three positions count as word boundaries:

a. The start of the string, if the first character is a word character

b. The end of the string, if the last character is a word character

c. Any position between a word character and a non-word character, in either order
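The three boundary positions can be probed with a short Java sketch. WordBoundaryDemo and findAll are hypothetical names. Note the doubled backslash that a Java string literal needs for "\b".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class WordBoundaryDemo {

    // Collects every non-overlapping match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Boundaries at the start and at the end of the string.
        System.out.println(findAll("\\b44\\b", "44"));  // [44]
        // Boundaries between a space and a digit.
        System.out.println(findAll("\\b4\\b", " 4 "));  // [4]
        // No boundary between the two digits of "44".
        System.out.println(findAll("\\b4\\b", " 44 ")); // []
    }
}
```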

2. Example

"\b4\b" matches the "4" in " 4 " but matches nothing in " 44 ".

Alternation (Selector)

1. Notion

"|" is the selector. It matches one of several possibilities.

The selector has the lowest precedence of all regex operators. It tells the regex engine to match either everything on its left or everything on its right.

2. Example

To match the string "cat" or "dog", use "cat|dog".

To match only the whole word "cat" or "dog", use "\b(cat|dog)\b".

3. Eager matching and how to work around it

("()" marks a group, which is covered later.)

Suppose we have the string "get getValue set setValue".

a. With the regex "set|setValue", the engine finds only "set", even inside "setValue", because the first alternative already succeeds and the second one is never tried.

The problem is that we want the best match, "setValue".

b. We can simply swap the order of the alternatives, "setValue|set", or add word boundaries, "\b(set|setValue)\b", so that "set" fails at the boundary check inside "setValue" and the engine falls back to the second alternative.

But neither style is very elegant.

4. "()" to delimit a sub-expression ("()" as a group is covered later)

"()" tells the regex engine to treat its contents as a single unit.

So the example in #3 can be simplified to "\bset(Value)?\b".

a. (Value) is treated as a single unit.

b. "?" makes that unit optional. Because "?" is greedy, the engine tries to match "Value" first.

----> So we get the best match.
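The variants discussed in #3 and #4 can be compared on the sample string "get getValue set setValue". This is a sketch with hypothetical names (AlternationDemo, findAll). The last pattern uses an optional group, "(Value)?", which is one common way to write it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class AlternationDemo {

    static final String TEXT = "get getValue set setValue";

    // Collects every non-overlapping match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // First alternative always wins, so "setValue" is never returned.
        System.out.println(findAll("set|setValue", TEXT));          // [set, set]
        // Longer alternative first.
        System.out.println(findAll("setValue|set", TEXT));          // [set, setValue]
        // Word boundaries force the engine to try the second alternative.
        System.out.println(findAll("\\b(set|setValue)\\b", TEXT)); // [set, setValue]
        // Optional group: "Value" is matched whenever it is present.
        System.out.println(findAll("\\bset(Value)?\\b", TEXT));    // [set, setValue]
    }
}
```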

Groups and Backreferences

1. Group

A sub-expression enclosed in "()" is a group.

- The regex engine stores the text matched by the sub-expression inside "()".

- The first "()" is called the first group, the second "()" the second group, and so on.

- To refer to the stored text, use "\1" for the first group, "\2" for the second, and so on.

- Note that group 0 stands for the entire match. In Java it is read with Matcher.group() or Matcher.group(0), not with a "\0" backreference inside the pattern.

- "?:" tells the engine not to store the group for backreferences, as in "set(?:Value)".

2. Example

Example-1:

Assume we have a string such as "This is a <B>test</B>".

We want to find the contents of each node.

Clues:

1) The regex pattern is "<([A-Z]+?)>.*</\\1>" (written as a Java string literal, hence the doubled backslash).

2) "([A-Z]+?)" is the first group. (Its value is "B" after matching.)

3) "\\1" is a backreference to the text matched by group 1, "B".

4) In effect, "<([A-Z]+?)>.*</\\1>" behaves like "<B>.*</B>".
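A sketch of Example-1 in Java. The input "This is a <B>test</B>" is an assumed sample (only the tag name B is given by the clues), and BackrefDemo with its helpers is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class BackrefDemo {

    // "\\1" must repeat exactly what group 1 captured.
    static final Pattern TAG = Pattern.compile("<([A-Z]+?)>.*</\\1>");

    // Returns the tag name captured by group 1, or null if nothing matches.
    static String tagName(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Returns the whole matched element (group 0), or null if nothing matches.
    static String wholeMatch(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "This is a <B>test</B>"; // assumed sample input
        System.out.println(tagName(text));     // B
        System.out.println(wholeMatch(text));  // <B>test</B>
        // A closing tag with a different name does not satisfy "\\1".
        System.out.println(tagName("This is a <B>test</I>")); // null
    }
}
```

To get only the contents of the node, a second group around ".*" could be added and read with group(2).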

Example-2

Remove words that were repeated by careless typing.

Suppose we have the string "the the world is beautiful".

Clues:

1) The regex is "\b(\w+)\b\s+\b\1\b".

2) Then we can use a replacement to substitute the matched text with a single copy of the word.
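One possible replacement strategy, sketched with String.replaceAll, where "$1" in the replacement string refers to group 1. DuplicateWordDemo and dedupe are hypothetical names.

```java
// Hypothetical demo class; names are illustrative only.
final class DuplicateWordDemo {

    // A word, whitespace, then the same word again (backreference \1).
    static final String REGEX = "\\b(\\w+)\\b\\s+\\b\\1\\b";

    // Replaces each doubled word with a single copy of it.
    static String dedupe(String input) {
        return input.replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        System.out.println(dedupe("the the world is beautiful")); // the world is beautiful
        // The word boundaries keep the "is" inside "this" from pairing with "is".
        System.out.println(dedupe("this is is fine"));            // this is fine
    }
}
```

A word typed three times in a row would need a second pass or a "(\s+\1\b)+" style repetition; that variation is not covered by the regex above.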

3. Cache Strategy

Cache strategy - for each group, the regex engine keeps only the text of its most recent match.

1) "([abc]+)=\1" matches "cab=cab".

2) "([abc])+=\1" does not match "cab=cab" but matches "cab=b", because the last text captured by group 1 in "([abc])+" is "b".
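The two cases can be verified in Java by reading group 1 after a full match. LastCaptureDemo and lastCapture are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class LastCaptureDemo {

    // Returns what group 1 holds after the whole input matched, or null if it did not match.
    static String lastCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Quantifier inside the group: the group captures "cab" as a whole.
        System.out.println(lastCapture("([abc]+)=\\1", "cab=cab")); // cab
        // Quantifier outside the group: only the last iteration, "b", is kept.
        System.out.println(lastCapture("([abc])+=\\1", "cab=cab")); // null
        System.out.println(lastCapture("([abc])+=\\1", "cab=b"));   // b
    }
}
```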

Meta-Group and Avoiding Backtracking

In some special situations, backtracking can lead to disastrous performance in the regex engine and may even crash it. So we need to avoid backtracking with a Meta-Group.

... in progress

Lookahead and Lookbehind

Note - JavaScript engines before ES2018 support only lookahead.

1. Positive/negative lookahead

Syntax

Positive lookahead - "(?=...)"

Negative lookahead - "(?!...)"

Example

Negative lookahead

We want to check that the character "q" is not immediately followed by the character "u".

So the regex is "q(?!u)".

---- Positive lookahead works the same way, with "q(?=u)".

Further notes

A lookahead does not store its match for backreferences. If you need a backreference, put a group inside it: "(?=(regex))".
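A short Java sketch of both lookahead forms, written as "q(?!u)" and "q(?=u)". LookaheadDemo and its helpers are hypothetical names. Unlike "q[^u]", the lookahead consumes no character, so "q(?!u)" also matches the final "q" of "Iraq".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class LookaheadDemo {

    // Collects every non-overlapping match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Returns what the group inside the lookahead captured, or null if nothing matches.
    static String capturedByLookahead(String input) {
        Matcher m = Pattern.compile("q(?=(u))").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findAll("q[^u]", "Iraq"));  // []
        System.out.println(findAll("q(?!u)", "Iraq")); // [q]
        System.out.println(findAll("q(?!u)", "quit")); // []
        // The "u" is checked but is not part of the match.
        System.out.println(findAll("q(?=u)", "quit")); // [q]
        System.out.println(capturedByLookahead("quit")); // u
    }
}
```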

2. Positive/negative lookbehind

Syntax

Positive lookbehind - "(?<=...)"

Negative lookbehind - "(?<!...)"

(Compared with lookahead, there is an additional "<" after the "?".)

Example
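As an illustration, here is a minimal Java sketch using the lookbehind syntax that java.util.regex accepts, "(?<=...)" and "(?<!...)". The sample text "ab cb" and the names LookbehindDemo and starts are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class LookbehindDemo {

    // Returns the start index of every match of regex in input.
    static List<Integer> starts(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "ab cb"; // assumed sample: "b" at index 1 and index 4
        // Positive lookbehind: a "b" that is preceded by "a".
        System.out.println(starts("(?<=a)b", text)); // [1]
        // Negative lookbehind: a "b" that is not preceded by "a".
        System.out.println(starts("(?<!a)b", text)); // [4]
    }
}
```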

3. A deeper look at lookahead/lookbehind

..... in progress

Reference Links

****: http://dragon.cnblogs.com/archive/2006/05/08/394078.html
