Advanced Regular Expressions
Latest revision as of 09:22, 1 September 2020
Useful patterns
Pattern | Description | REGEX | Sample match | Sample not match |
---|---|---|---|---|
\d | Digit. Matches any digit. Equivalent to [0-9]. | \d\d\d | 123 | 1-3 |
\D | Non-digit. Matches any character that is not a digit. | \d\D\d | 1-3 | 123 |
\w | Word. Matches any alphanumeric character or underscore. Equivalent to [a-zA-Z0-9_]. | \w\w\w | a_A | a-A |
\W | Not word. Matches any character that is not a word character (alphanumeric character or underscore). | \W\W\W | +-$ | +_@ |
\s | Whitespace. Matches any whitespace character (space, tab, line break). | \d\s\w | 1 a | 1ab |
\S | Not whitespace. Matches any character that is not a whitespace character (space, tab, line break). | \w\w\w\w\S\d | Test#1 | test 1 |
\b | Word boundary. Matches the position between a word character and a non-word character (or the start or end of the string). Can be used to match a complete word. | \bis\b | is; | This island: |
{} | The curly braces {…}. They repeat the preceding character (or group of characters) as many times as the value inside the braces. {min,} means the preceding character is matched min times or more. {min,max} means the preceding character is repeated at least min and at most max times. | abc{2} | abcc | abc |
.* | Matches any character (except for line terminators) between zero and unlimited times. | .* | abbb, empty string | |
.+ | Matches any character (except for line terminators) between one and unlimited times. | .+ | a, abbcc | Empty string |
^ | Anchor ^. The start of the line. Matches the position just before the first character of the string. | ^The\s\w+ | The contest | One contest |
$ | Anchor $. The end of the line. Matches the position just after the last character of the string. | \d{4}\sACSL$ | 2020 ACSL | 2020 STAR |
\ | Escape a special character. To use a metacharacter as a literal in a regex, escape it with a backslash, like: \. \* \+ \[ etc. | \w\w\w\. | cat. | lion |
() | Groups. Regular expressions allow us not just to match text but also to extract information for further processing. This is done by defining groups of characters and capturing them with parentheses (). | ^(file.+)\.docx$ | file_graphs.docx, file_lisp.docx | data.docx |
\number | Backreference. Part of a regular expression can be enclosed in parentheses so that it acts as a single unit (a group). \n means that the text matched by the n-th group must appear again at the current position. | | | |
\1 | Contents of Group 1. | r(\w)g\1x | regex (Group \1 is e) | regxx |
\2 | Contents of Group 2. | (\d\d)\+(\d\d)=\2\+\1 | 20+21=21+20 (Group \1 is 20, Group \2 is 21) | 20+21=20+21 |
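The sample columns of the table can be checked mechanically. The following is a minimal sketch in Python, assuming the standard `re` module and using `re.search` to look for a match anywhere in the sample. The names `SAMPLES` and `failing_rows` are made up for this illustration. Note that in Python 3, `\d` and `\w` also accept non-ASCII digits and letters unless the `re.ASCII` flag is given, so the equivalences with [0-9] and [a-zA-Z0-9_] hold only for ASCII input.

```python
import re

# (pattern, sample that should match, sample that should not match),
# copied from rows of the table above.
SAMPLES = [
    (r"\d\d\d", "123", "1-3"),
    (r"\d\D\d", "1-3", "123"),
    (r"\w\w\w", "a_A", "a-A"),
    (r"\W\W\W", "+-$", "+_@"),
    (r"\d\s\w", "1 a", "1ab"),
    (r"\w\w\w\w\S\d", "Test#1", "test 1"),
    (r"\bis\b", "is;", "This island:"),
    (r"abc{2}", "abcc", "abc"),
    (r"^The\s\w+", "The contest", "One contest"),
    (r"\d{4}\sACSL$", "2020 ACSL", "2020 STAR"),
    (r"\w\w\w\.", "cat.", "lion"),
    (r"^(file.+)\.docx$", "file_graphs.docx", "data.docx"),
    (r"r(\w)g\1x", "regex", "regxx"),
    (r"(\d\d)\+(\d\d)=\2\+\1", "20+21=21+20", "20+21=20+21"),
]


def failing_rows(samples=SAMPLES):
    """Return the (pattern, sample) pairs that do not behave as the table says."""
    bad = []
    for pattern, should_match, should_not_match in samples:
        if re.search(pattern, should_match) is None:
            bad.append((pattern, should_match))
        if re.search(pattern, should_not_match) is not None:
            bad.append((pattern, should_not_match))
    return bad


if __name__ == "__main__":
    print(failing_rows())  # an empty list means every row behaves as listed
    # Captured groups can be read back from the match object.
    m = re.search(r"(\d\d)\+(\d\d)=\2\+\1", "20+21=21+20")
    print(m.group(1), m.group(2))
```

An empty list from `failing_rows()` means each listed sample behaves as the table describes. The last two lines show how the contents of Group 1 and Group 2 can be extracted after a match.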
Sample Problems
Problem 1
Which of the following strings match the regular expression pattern "^w{3}\.([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])+\.)+[a-z0-9][-a-z0-9]{0,61}[a-z0-9]" ?
- 1. www.google.com
- 2. www.-petsmart.com
- 3. www.edu-.ro
- 4. www.google.co.in
- 5. www.examples.c.net
- 6. www.edu.training.computer-science.org
- 7. www.everglades_holidaypark.com
Solution: This regular expression matches a domain name used to access web sites.
The pattern starts with the subdomain www and continues with one or more domain names, each followed by a dot (third-level domain, second-level domain, and so on), and ends with the top-level domain (TLD).
A domain name contains only lowercase letters, digits, and hyphens. It cannot begin or end with a hyphen, and it must be at least 2 characters long. The intended maximum length is 63 characters, although, as written, the + after the inner group allows a name to run longer.
^w{3}\.: the string starts with www, followed by a dot character;
[a-z0-9]: the first and the last character of a domain name can only be a lowercase letter or a digit;
[-a-z0-9]{0,61}: the characters in between can be lowercase letters, digits, or hyphens, at most 61 of them;
The last sequence [a-z0-9][-a-z0-9]{0,61}[a-z0-9] is for the top-level domain, which is not followed by a dot.
The strings that match this pattern are 1, 4 and 6. String 2 has a name that begins with a hyphen, string 3 has a name that ends with a hyphen, string 5 has a one-character name (c), and string 7 contains an underscore.
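The answer to Problem 1 can be checked with a short Python sketch. It assumes the standard `re` module and uses `fullmatch`, so the whole string has to fit the pattern (the pattern itself has no trailing `$`). The names `DOMAIN`, `CANDIDATES`, and `matching_numbers` are made up for this illustration.

```python
import re

# The pattern from Problem 1, split across two lines for readability.
DOMAIN = re.compile(
    r"^w{3}\.([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])+\.)+"
    r"[a-z0-9][-a-z0-9]{0,61}[a-z0-9]"
)

# The seven candidate strings, in the order given in the problem.
CANDIDATES = [
    "www.google.com",
    "www.-petsmart.com",
    "www.edu-.ro",
    "www.google.co.in",
    "www.examples.c.net",
    "www.edu.training.computer-science.org",
    "www.everglades_holidaypark.com",
]


def matching_numbers(candidates=CANDIDATES):
    """Return the 1-based numbers of the candidates that match the pattern."""
    return [
        number
        for number, text in enumerate(candidates, start=1)
        if DOMAIN.fullmatch(text)
    ]


if __name__ == "__main__":
    print(matching_numbers())  # [1, 4, 6]
```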
Problem 2
Write a regular expression describing the set of strings formed according to the following rules:
1. Contain only lowercase letters of the English alphabet and the character '.';
2. Start and end with the same letter;
3. Contain a sequence of at least one and at most 3 vowels, followed by zero or more '.' characters and then a sequence of at least one consonant.
Solution:
The Regular Expression is:
([a-z])[aeiou]{1,3}\.*[b-df-hj-np-tv-z]+(\1)
([a-z]) is group number 1, which captures the first letter;
\1 is a backreference to group 1, so the string must end with the same letter it starts with;
[aeiou]{1,3} describes a sequence of one to three vowels (the letters are listed without commas, since a comma inside the brackets would itself be matched);
\.* the character '.' appears zero or more times;
[b-df-hj-np-tv-z]+ a sequence of at least one consonant.
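A few sample strings can be tried against the solution with a minimal Python sketch, assuming the standard `re` module. The vowel class is written here as [aeiou], without commas, because a comma inside a character class is treated as a literal comma and would conflict with rule 1. The names `RULES` and `follows_rules` and the test strings are made up for this illustration.

```python
import re

# Group 1 captures the first letter; \1 requires the same letter at the end.
RULES = re.compile(r"([a-z])[aeiou]{1,3}\.*[b-df-hj-np-tv-z]+(\1)")


def follows_rules(text):
    """Return True if the whole string fits the pattern from Problem 2."""
    return RULES.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["ae.ba", "tootht", "eau.ste", "abca", "ae.bc", "a,.ba"]:
        print(sample, follows_rules(sample))
```

The first three samples fit the rules. "abca" has no vowel after the first letter, "ae.bc" does not end with its first letter, and "a,.ba" contains a comma.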