python regular expression

Started by ggnfs000, January 05, 2017, 06:03:10 PM


ggnfs000


This time I decided to work through most, if not all, of Python's regular expression syntax to improve my understanding. Whenever I have had to deal with a more advanced regular expression, I have either made do with simpler expressions as a substitute (which results in a ridiculously long expression that could have been shortened to almost nothing had I explored the more advanced syntax) or just taken a nap. Explanations of regexps on various internet sites can be hard to follow without examples, so I have tried to provide examples along with a description for each piece of syntax.

I tried each of the examples below in the Python console as I went along to check how it works, so the results should be accurate. In other words, you can launch the Python console, run import re, repeat these examples, and see the same results. The result shown for each example is the matched text, i.e. what the match object's group() returns.

Some of the examples, especially the advanced regexp features, can be hard to understand without first covering how Python builds strings containing special characters, so I decided to cover that in detail in this blog: how python declares normal string vs. raw string
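
As a quick reminder of why the normal vs. raw string distinction matters for regexps, here is a minimal sketch. It is my own illustration rather than anything taken from the linked post, and the sample strings are made up:

```python
import re

normal = "\n"   # one character: an actual newline
raw = r"\n"     # two characters: a backslash followed by n
lengths = (len(normal), len(raw))
print(lengths)

# In a normal string "\b" is a backspace character, so this pattern
# looks for backspace + "cat" + backspace and finds nothing.
not_found = re.search("\bcat\b", "a cat sat")

# In a raw string the backslash reaches the re module untouched,
# where \b means "word boundary".
word = re.search(r"\bcat\b", "a cat sat")

print(not_found, word.group())
```

Writing patterns as raw strings avoids having to remember which escapes Python itself interprets before the re module ever sees them.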

. (DOT)  - ANY CHARACTER(S)

1. single char search with . dot.
re.search(".", a)
when a = "abcd" results in "a"

Starting with the simplest: the . (dot) matches any single char except a newline.

2. single char greedy search with . dot.
re.search(".*", a) 
when a = "abcd" results in "abcd" 
when a = "abcd\nabcd" results in "abcd" 

The dot used together with the greedy operator * matches as many characters as possible until the match can go no further. It stops when a newline is encountered, as the second example above shows.

3. single char greedy search with . dot + re.DOTALL which includes new line
re.search(".*", a, re.DOTALL)
when a = "abcd\nabcd" results in "abcd\nabcd"

Here is an example where the dot matches all chars, including the newline. In 2. the match stopped at the newline, but in this example the match includes the newline (\n), since with re.DOTALL it counts as a match for the dot.
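
Examples 1. to 3. can be run together as one small script. This is just a sketch of the same three searches; the variable names are mine:

```python
import re

a = "abcd\nabcd"

# "." matches exactly one character that is not a newline.
first_char = re.search(".", a).group()

# ".*" is greedy, but by default it still stops at the newline.
one_line = re.search(".*", a).group()

# With re.DOTALL the dot also matches "\n", so the match spans both lines.
both_lines = re.search(".*", a, re.DOTALL).group()

print(repr(first_char), repr(one_line), repr(both_lines))
```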

MULTILINE option / ^ and $ (beginning or end of line match)
The MULTILINE option was new to me, but it is very useful. Here are a few examples to make it clearer. To demonstrate it, I also use ^ (match start of line) and $ (match end of line):

^ (CARET) - MATCH THE PATTERN AT THE BEGINNING OF A LINE
$ (DOLLAR) - MATCH THE PATTERN AT THE END OF A LINE
re.MULTILINE - MAKE ^ AND $ MATCH AT EVERY LINE, NOT JUST AT THE START AND END OF THE WHOLE STRING

4. search matching pattern at the beginning of line with no MULTILINE
re.search("^asdf", a) 
when a = "qwer\nasdf\nzxcv" results in None

This returns None because the MULTILINE option is NOT used, so ^ matches only at the very beginning of the string. The string begins with "qwer", not "asdf", so no match is found.

5. search matching pattern at the beginning of line with MULTILINE
re.search("^asdf", a, re.MULTILINE)
when a = "qwer\nasdf\nzxcv" results in "asdf"

Notice the difference in syntax: adding re.MULTILINE makes ^ match at the beginning of every line (right after each newline) as well as at the beginning of the string, so the search finds "asdf" on the second line.

The following two examples, 6. and 7., are similar to 4. and 5., except that they search for a pattern at the end of a line:

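6. search matching pattern at the end of line without MULTILINE
re.search("df$", a)
when a = "qwer\nasdf\nzxcv" results in None

Without MULTILINE, $ matches only at the end of the string (or just before a final newline), and the string ends with "zxcv", not "df".

7. search matching pattern at the end of line with MULTILINE
re.search("df$", a, re.MULTILINE)
when a = "qwer\nasdf\nzxcv" results in "df"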
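
Examples 4. to 7. side by side, as a sketch you can paste into the console. The last line is an extra illustration of mine (not one of the numbered examples) showing that with re.MULTILINE the ^ anchor matches at the start of every line:

```python
import re

a = "qwer\nasdf\nzxcv"

# Without re.MULTILINE, ^ matches only at the start of the whole string
# and $ only at its end (or just before a final newline).
start_plain = re.search("^asdf", a)
end_plain = re.search("df$", a)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
start_multi = re.search("^asdf", a, re.MULTILINE)
end_multi = re.search("df$", a, re.MULTILINE)

print(start_plain, end_plain)
print(start_multi.group(), end_multi.group())

# Extra: collect the first word of every line.
line_starts = re.findall(r"^\w+", a, re.MULTILINE)
print(line_starts)
```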

* (STAR) - (ZERO OR MORE REPETITIONS OF MATCHING PATTERN)
+ (PLUS) - (ONE OR MORE REPETITIONS OF MATCHING PATTERN)
? (QUESTION) - (ZERO OR ONE REPETITIONS OF MATCHING PATTERN)

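8. greedy search for zero or more repetitions of a pattern
re.search("ab*", a)
a = "abbbbccccdddd" results in abbbb
a = "aaaabbbbccccdddd" results in a
a = "aaaaccccdddd" results in a

This searches for a followed by b, but since b is followed by *, any number of repetitions of b counts as a match, including zero. So a, ab, abb, abbb and abbbb are all matching patterns, and the greedy * takes as many b's as are available at that position, which gives abbbb in the first example. Note that re.search returns the leftmost match: in the second example the very first a (followed by zero b's) already matches, so the result is just a and the search never reaches the b's further on.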
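
One thing worth checking in the console for this example is where the match starts, because re.search always returns the leftmost match. A small sketch (the findall call and the shortened second string are my additions for illustration):

```python
import re

a = "aaaabbbbccccdddd"

# At index 0 the pattern "ab*" already succeeds: an "a" followed by
# zero "b"s. search() stops there and returns the leftmost match.
leftmost = re.search("ab*", a)
print(leftmost.group(), leftmost.span())

# findall() lists every non-overlapping match; only the last "a"
# is followed by "b"s, and the greedy * takes all four of them.
all_matches = re.findall("ab*", a)
print(all_matches)

# If the string starts with the "a" that precedes the "b"s,
# search() returns the long match directly.
greedy = re.search("ab*", "abbbbccccdddd").group()
print(greedy)
```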

9. greedy search for pattern looking one or more repetitions of matching pattern
re.search("ab+", a)
a = "aaaabbbbccccdddd" results in abbbb
re.search("ab+", a) 
a = "aaaaccccdddd" results in None.

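This is almost the same as 8., except that it looks for one or more repetitions rather than zero or more. Because at least one b is required, the first three a's in "aaaabbbbccccdddd" cannot start a match, and the match begins at the fourth a. "aaaaccccdddd" results in None, since b must appear at least once after a.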
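
To see where the + match actually starts, you can look at the span of the match object. A short sketch:

```python
import re

# "+" needs at least one "b", so the first three "a"s cannot start a
# match; the search moves on until it reaches the fourth "a".
plus_match = re.search("ab+", "aaaabbbbccccdddd")
print(plus_match.group(), plus_match.span())

# With no "b" anywhere after an "a", there is no match at all.
plus_none = re.search("ab+", "aaaaccccdddd")
print(plus_none)
```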

10. search for zero or one repetition of a pattern
re.search("a?", a)
a = "aaaa" results in a
re.search("a?", a)
a = "cccc" results in empty string but not None.

This differs from 8. and 9. in that ? matches at most one repetition. It is still greedy (it takes one a whenever one is available), but since it can never take more than one, searching for a? in aaaa matches only the first a. Searching for a? in cccc results in an empty string rather than None. The empty string means the pattern matched with zero repetitions, whereas None means nothing matched at all.
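
The difference between an empty match and no match is easy to trip over in real code, so here is a small sketch of how to tell them apart:

```python
import re

one = re.search("a?", "aaaa")
zero = re.search("a?", "cccc")

# Both calls return a match object; only the matched text differs.
print(repr(one.group()), repr(zero.group()))

# A zero-length match is still a match, so compare against None
# instead of testing whether group() is an empty string.
found = zero is not None
print(found, zero.span())
```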

{M,N} / {M,N}? - (CURLY BRACKET) - MATCH A PATTERN REPEATED ANYWHERE BETWEEN M AND N TIMES, THE LOWER AND UPPER LIMITS SET INSIDE THE CURLY BRACKETS

The best way to illustrate this is through examples. The six examples below search for between 2 and 5 repetitions of a. The greedy form {2,5} takes as many repetitions as allowed (up to five), while the non-greedy form {2,5}? takes as few as allowed (two). To show how this works, the three pairs use input strings of different lengths: the first is longer than 5 chars (10), the second is between 2 and 5 (3), and the last is shorter than 2 (1).

re.search("a{2,5}", a)
a = "aaaaaaaaaa" results in aaaaa
re.search("a{2,5}?", a)
a = "aaaaaaaaaa" results in aa

In the example above, the first search results in aaaaa because it takes up to 5 repetitions, finds five, and stops. The second search also accepts anywhere between 2 and 5 occurrences, but being non-greedy it stops as soon as it has found two a's.

re.search("a{2,5}", a)
a = "aaa" results in aaa
re.search("a{2,5}?", a)
a = "aaa" results in aa

In the example above, the first search results in aaa because it takes up to 5 repetitions but finds only 3, which is still a match. The second search again stops as soon as it has found two a's.

re.search("a{2,5}", a)
a = "a" results in None
re.search("a{2,5}?", a)
a = "a" results in None

In the last example above, both forms require a minimum of 2 repetitions, so neither search finds a match and the result is None.
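
All six cases can be checked in one go. This is a sketch; the helper name matched is something I made up for the example:

```python
import re

def matched(pattern, text):
    """Return the matched text, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

results = {}
for text in ("aaaaaaaaaa", "aaa", "a"):
    greedy = matched("a{2,5}", text)    # as many as allowed, up to 5
    lazy = matched("a{2,5}?", text)     # as few as allowed, i.e. 2
    results[text] = (greedy, lazy)
    print(text, greedy, lazy)
```

Note that the ? goes inside the pattern string, directly after the closing curly bracket.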

[] - (SQUARE BRACKET) - MATCH A SET OF CHARACTER(S)
[^] - (CARET INSIDE A BRACKET) - MATCH THE INVERSE OF A SET OF CHAR(S)



Any character in the set inside the square brackets is a potential match, so brackets can be used to selectively match a certain set of chars. To match any char at all, we already know that . (DOT) can be used.

[] - square bracket, search for any digits
Finding as many consecutive digits as possible:
re.search("[0-9]*", a) 
a = "12345" results in 12345

[.*] - special characters inside a bracket lose their meaning
Special characters such as . (DOT) and * (STAR) lose their special meaning inside the bracket and are interpreted literally
re.search("[.*]", a)
a = "12345" results in None
This is because .* inside the bracket is not a greedy search for any char; instead it literally searches for a . or a * character.
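
A short sketch of the bracket examples. The strings "3.14*2" and "abc123" are inputs I made up to show a positive match for [.*] and to point out that [0-9]* can also match an empty string when the input does not start with a digit:

```python
import re

# Inside [...] the dot and the star are ordinary characters.
no_match = re.search("[.*]", "12345")
literal = re.search("[.*]", "3.14*2").group()
print(no_match, literal)

# [0-9]* takes as many digits as it can at the leftmost position...
digits = re.search("[0-9]*", "12345").group()

# ...but zero digits is also a match, so here the result is "".
empty = re.search("[0-9]*", "abc123").group()

# Use + when at least one digit is required.
at_least_one = re.search("[0-9]+", "abc123").group()
print(repr(digits), repr(empty), repr(at_least_one))
```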

[^] - inverse of a set of characters
The ^ sign is the beginning-of-line character described above in 4. and 5., but as the first character inside square brackets it is interpreted as a negation sign. That is, find any char that is NOT in the set inside the square brackets.

re.search("[^ ]*", a)
a = "can I go" results in can
In the example above, it performs a greedy search for any character that is NOT a space (due to ^). Because it is greedy it takes all of "can", but it stops at the first space, because a space is not a match for the bracket.
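
The same negated set can be tried with findall to pull out every word. This is a sketch; the findall call and the "x^y" string are my additions:

```python
import re

a = "can I go"

# Greedy run of non-space characters, starting at the leftmost position.
first_word = re.search("[^ ]*", a).group()

# With + and findall, every run of non-space characters is returned.
words = re.findall("[^ ]+", a)
print(first_word, words)

# The caret only negates when it is the first character in the brackets;
# anywhere else it is a literal "^".
literal_caret = re.search("[a^]", "x^y").group()
print(literal_caret)
```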


More to come...

Source: python regular expression