Introduction To Regular Expressions [ Regex ]


What Is A Regular Expression

A regular expression, or regex for short, is a sequence of characters that defines a search pattern.
Let me simplify this for you.
Imagine you're writing a huge assignment or report and you realize you misspelled a word that was used a couple of hundred times throughout the document. Any sane person living in the 21st century would use:
Find and Replace -> “misspelled word” -> “correct word”.

Have you ever wondered how the computer checks for the word?
How the words are magically found and replaced?
It just searches the entire document for the “word-to-be-searched” and replaces every occurrence of that string with the new one.
Now imagine a different scenario: you are asked to redact the phone numbers from a letter. You don’t know any of the numbers in advance, and the letter also contains plenty of numbers that are not phone numbers and should not be redacted. What would you do then?
This is where regular expressions come into play. A regular expression is basically a phrase, or a sequence of characters, that describes a whole set of strings built from a given alphabet. To simplify it further:
Let G be the alphabet we use, such that G = {a, b, c}
This basically means that we are only going to use a, b, and c as the letters of our language instead of all 26 letters.

Now the letters can come in any order and any number of times, say abc, bac, aaa, bbb, acc, and so on. Suppose that we have to find all the three-letter words that start with a, have any of the 3 letters in the middle, and end with c. So our possible options are abc, aac, and acc.
[Figure: state transition diagram of the regex state machine]
Looking at the state transition diagram, A, B and C denote the states and a, b and c denote the input letters. We always begin at the start state, and the automaton only proceeds if the first character is ‘a’; otherwise we skip the word altogether.

From state A, the next state is reached when any of the three letters is encountered. From B we move to state C, which is the final state (denoted by a double circle), if and only if ‘c’ is encountered. Otherwise we move to the dead state, which means that the automaton halts and reports that the word is not one of the accepted words (aac, abc and acc).
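The walk through the states can also be sketched in code. This is only a rough sketch of my reading of the description: the names ThreeLetterAutomaton, Accepts and Next are made up for illustration, and I am assuming a separate Start state that moves to A on ‘a’, since the diagram itself is not reproduced here.

```csharp
using System;

public static class ThreeLetterAutomaton
{
    // Start, A, B and C follow the description above; Dead is the halt state.
    // (The separate Start state is an assumption about the diagram.)
    private enum State { Start, A, B, C, Dead }

    // Returns true only for the words aac, abc and acc.
    public static bool Accepts(string word)
    {
        State state = State.Start;
        foreach (char symbol in word)
        {
            state = Next(state, symbol);
            if (state == State.Dead)
            {
                return false; // the automaton halts: the word is rejected
            }
        }
        return state == State.C; // C is the only final state
    }

    // One transition of the machine for a single input letter.
    private static State Next(State state, char symbol)
    {
        switch (state)
        {
            case State.Start:
                return symbol == 'a' ? State.A : State.Dead;
            case State.A:
                return (symbol == 'a' || symbol == 'b' || symbol == 'c') ? State.B : State.Dead;
            case State.B:
                return symbol == 'c' ? State.C : State.Dead;
            default:
                return State.Dead; // any letter after reaching C is not allowed
        }
    }
}
```

Calling ThreeLetterAutomaton.Accepts("abc") returns true, while "abb" and "abcc" both end up in the dead state and return false.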
You don’t always need to draw out the transition diagram to use a regular expression. The whole diagram can be represented in text as:
Regular Expression: a(a+b+c)c
Where the symbols represent the following:
+ :- OR operator
* :- 0 or more occurrences
So if we need to find a sequence where the above word repeats itself over and over, we just use the regular expression (a(a+b+c)c)*
Now that you have some understanding of what regular expressions are and how they work, let’s move on to the actual coding side and the implementation of regex.
Implementation In C#

To access regular expression functionality you have to import the System.Text.RegularExpressions namespace with a using directive.
Some of the symbols are interpreted differently in programming languages like C#. For example, ‘+’ no longer means OR; it means one or more occurrences.
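As one illustration of how the notation changes, here is a small sketch of the formal expressions a(a+b+c)c and (a(a+b+c)c)* rewritten in C# syntax. The class and member names are hypothetical, and using a character class [abc] for the OR is just one possible choice (a(a|b|c)c would work too). The ^ and $ anchors are added so that the whole input has to match, not just a part of it.

```csharp
using System.Text.RegularExpressions;

public static class FormalToCSharp
{
    // a(a+b+c)c : the formal OR is written here as the character class [abc].
    public const string SingleWord = "^a[abc]c$";

    // (a(a+b+c)c)* : the star still means zero or more repetitions.
    public const string RepeatedWord = "^(a[abc]c)*$";

    public static bool IsSingleWord(string input)
    {
        return Regex.IsMatch(input, SingleWord);
    }

    public static bool IsRepeatedWord(string input)
    {
        return Regex.IsMatch(input, RepeatedWord);
    }
}
```

With this sketch, IsSingleWord("abc") is true and IsSingleWord("abd") is false, while IsRepeatedWord("abcaac") is true because the word repeats twice.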

List Of Symbols And Their Use In C# Regex

  • ‘\b’: A word boundary. It tells the regex engine to match the position at the beginning or end of a word, without consuming any characters.
  • ‘\d’: A digit. By default in .NET this matches any Unicode decimal digit, not only 0-9; use [0-9] if you only want ASCII digits.
  • {n}: Used after a symbol or group to be searched; the preceding element must occur exactly n times.
  • ‘+’: One or more occurrences.
  • ‘\w’: A word character (alphanumeric characters plus the underscore).
  • ‘.’: Any character except a new line.
  • ‘\s’: A whitespace character.
  • ‘^’: Beginning of a string.
  • ‘$’: End of a string.
  • ‘*’: Zero or more repetitions.
  • {n,m}: Repeat the symbol at least n times but not more than m times.
  • {n,}: Repeat the symbol at least n times with no upper limit.
  • ‘\W’: Not a word character
  • ‘\S’: Not whitespace
  • ‘\D’: Not a digit
  • ‘\B’: Not a word boundary
  • [^x]: Any character that is not x
  • [^aeiou]: Any character that is not a lowercase vowel (consonants, but also digits, spaces and punctuation)
  • ‘*?’: Zero or more occurrences but as few as possible
  • ‘+?’: One or more occurrences but as few as possible
  • ‘??’: 0 or 1 occurrences but as few as possible
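Going back to the phone number scenario from earlier, a few of the symbols above can be combined into a pattern. This is only a sketch under a big assumption: that every phone number in the letter looks like 555-123-4567 (three digits, a dash, three digits, a dash, four digits). The PhoneRedactor class and the "[REDACTED]" placeholder are made up for illustration; Regex.Replace is the standard .NET method for replacing every match.

```csharp
using System.Text.RegularExpressions;

public static class PhoneRedactor
{
    // Assumed format only: 3 digits, dash, 3 digits, dash, 4 digits.
    // \b keeps the pattern from matching inside a longer run of digits.
    private const string PhonePattern = @"\b\d{3}-\d{3}-\d{4}\b";

    // Replaces every match of the assumed phone format with a placeholder.
    public static string Redact(string text)
    {
        return Regex.Replace(text, PhonePattern, "[REDACTED]");
    }
}
```

Other numbers in the text, such as a year or a room number, do not fit the pattern and are left alone. Real phone numbers come in many more formats, so a real pattern would need to be wider than this one.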
The main function that we will be using in the Regex class is:
Regex.Matches(string input, string pattern) :- returns a MatchCollection object containing every match of the pattern in the input.
We will go over the MatchCollection class in the future.
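To get a feel for what Regex.Matches hands back, here is a small sketch that copies the text of every match into a list. The GreedyVersusLazy class and its FindAll method are hypothetical helpers, not part of .NET; they are only there to compare the ‘*’ and ‘*?’ symbols from the list.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    // Collects the matched text of every match into a list.
    public static List<string> FindAll(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

For the input "<b>one</b> <b>two</b>", the greedy pattern <b>.*</b> gives a single match covering the whole input, because ‘*’ grabs as much as it can. The lazy pattern <b>.*?</b> gives two matches, "<b>one</b>" and "<b>two</b>", because ‘*?’ stops as early as possible.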

Example Program:

using System;
using System.Text.RegularExpressions;

public class BitshiftProgrammer
{
    private static void CheckForCapitalsAtSentenceStart(string text)
    {
        // Match a '.', then a space, then a capital letter between A and Z,
        // followed by any word characters until a non-word character is seen.
        MatchCollection mc = Regex.Matches(text, @"\. [A-Z]\w*");
        foreach (Match m in mc)
        {
            Console.WriteLine(m);
        }
    }

    public static void Main(string[] args)
    {
        Console.WriteLine("Checking for capital rule non-violating words");
        CheckForCapitalsAtSentenceStart("This is first sentence. Second sentence is better. third sentence needs some work. Fourth has become better");
    }
}
Output:
Checking for capital rule non-violating words
. Second
. Fourth
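The same idea can be flipped around to find the words that break the rule. The sketch below is a variation on the example program, not a replacement for it: the SentenceStartChecker class and FindLowercaseStarts method are hypothetical names, and the only real change is [a-z] in place of [A-Z].

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceStartChecker
{
    // Returns every ". word" match where the word starts with a lowercase letter.
    public static List<string> FindLowercaseStarts(string text)
    {
        var violations = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\. [a-z]\w*"))
        {
            violations.Add(m.Value);
        }
        return violations;
    }
}
```

Run against the same text as the example program, this returns a single match, ". third". Like the original, it cannot check the very first sentence, because there is no '.' and space in front of it.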
We will go over many more C# examples in the future, including much more complicated ones.
Well, I hope you learnt something of value.