@Ronin1702
Last active September 11, 2023 22:30
URL Matching RegEx Tutorial by Kai Chen

RegEx Tutorial: Understanding URL Matching


What is regex?

A regex, short for regular expression, is a sequence of characters that defines a search pattern. In code or search tools, regular expressions can find certain patterns of characters within a string, or find and replace a character or sequence of characters within a string. They are also frequently used to validate input such as email addresses, URLs, and usernames.

  • Purpose of the tutorial: This tutorial aims to dissect and explain the components of a specific regular expression (regex) used for URL matching.

  • Goal of the tutorial: By the end of this tutorial, you should have a better understanding of how this regex works and how to use it in your own projects.

Summary

The regular expression featured in this tutorial is designed for validating URLs. This regex ensures that the URL meets certain criteria for being a well-formatted URL. Below is the regex code snippet we will be focusing on:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
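As a quick sanity check, the pattern can be run against a few sample strings. This is a minimal sketch: the helper name isUrlLike and the sample inputs are illustrative assumptions, not part of the original regex.

```javascript
// The pattern from this tutorial, written as a JavaScript regex literal.
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical helper (an assumption for illustration): returns true when
// the whole string fits the pattern.
function isUrlLike(text) {
  return urlRegex.test(text);
}

console.log(isUrlLike("https://www.example.com")); // true
console.log(isUrlLike("example.org/docs/"));       // true
console.log(isUrlLike("not a url"));               // false
```

Keep test inputs short when experimenting: the nested quantifier in the last group can make failed matches slow on long strings.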


Regex Components

Anchors

The anchors ^ and $ define the boundaries of the URL string. The ^ asserts the start of the string, and $ asserts the end. Each matches a position, not a character. Because the pattern is wrapped in both anchors, the regex matches only if the entire string conforms to the pattern, not just some part of it.
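The effect of the anchors is easiest to see by comparing an anchored and an unanchored pattern. The sketch below uses a simple digit pattern (chosen only for illustration) and then the tutorial's regex on a sentence that merely contains a URL.

```javascript
// Same digit pattern, with and without anchors.
const anchored = /^\d+$/;  // the whole string must consist of digits
const unanchored = /\d+/;  // digits may appear anywhere in the string

console.log(anchored.test("2023"));        // true
console.log(anchored.test("year 2023"));   // false
console.log(unanchored.test("year 2023")); // true

// The tutorial's regex is anchored, so text around a URL makes it fail.
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;
console.log(urlRegex.test("example.com"));        // true
console.log(urlRegex.test("Visit example.com!")); // false
```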

What is the purpose of the forward slashes (/) before and after the regex string anchors?

The forward slashes (/) at the beginning and end of the regular expression string serve as delimiters. They indicate the start and end of the regular expression pattern. This is common syntax in many programming languages like JavaScript, Perl, and others that use regular expressions.

For example, in JavaScript, you might define a regular expression for matching URLs like this:

const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

In this example, the forward slashes / at the beginning and the end are NOT part of the pattern to be matched. Instead, they denote where the regular expression starts and ends. Anything between these slashes is considered part of the actual regular expression pattern.
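One way to confirm that the slashes are delimiters rather than pattern content is to inspect the regex object. A small sketch using the standard source and flags properties of JavaScript RegExp objects:

```javascript
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// source holds the text between the delimiting slashes.
console.log(urlRegex.source.startsWith("^")); // true
console.log(urlRegex.source.endsWith("$"));   // true

// flags holds whatever follows the closing slash (nothing, in this case).
console.log(urlRegex.flags === "");           // true
```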


Quantifiers

Quantifiers are elements that specify the number of occurrences for a particular element in the regex. In other words, each quantifier in the regex operates on its immediately preceding element, specifying how many times that element is allowed to occur in a matching string. In our URL matching regex, we have several quantifiers below:

  • ?

    • The question mark (?) makes the preceding character or group optional. It indicates that the preceding pattern element may occur zero or one time. In this regular expression, (https?:\/\/) is made optional by the ? right after the group. This means URLs with or without "http://" or "https://" will be matched. The ? right after the "s" makes the "s" optional, allowing for both "http" and "https".
  • +

    • The plus sign (+) quantifier indicates that the preceding character or group must occur one or more times. In [\da-z\.-]+, the + means that one or more of the characters in this character set (digits, lowercase letters, period, or hyphen) must appear for a match to occur. This part of the regex matches the domain name, including any subdomains: because + is greedy, it matches "www.google" in "www.google.com".
  • {2,6}

    • The {n,m} quantifier specifies that the preceding character or group must occur at least n times but not more than m times. In [a-z\.]{2,6}, it specifies that the characters (which are lowercase alphabets or periods) must appear between 2 and 6 times. This is generally for matching top-level domains like ".com", ".org", ".co.uk", etc.
  • *

    • The asterisk (*) quantifier indicates that the preceding character or group can appear zero or more times.
    • In ([\/\w \.-]*)*, the first asterisk works on individual characters from the bracket expression, while the second asterisk works on the whole parenthesized group. Below is a detailed explanation:
      • The first *, directly after the square brackets in [\/\w \.-]*: This asterisk is a quantifier that applies to the character set [\/\w \.-]. It means that any character in this set (a forward slash, word characters, space, period, or hyphen) can appear zero or more times in a row. For example:
        • Matches "/"
        • Matches "abc-./"
        • Matches "" (empty string, because zero occurrences are allowed)
      • The second * after the closing parenthesis )*: This asterisk is also a quantifier but applies to the entire group ([\/\w \.-]*). This means that the entire group can appear zero or more times. It allows for repeated sequences that match the group's pattern. For example:
        • Matches "/abc//def//"
        • Matches "abc-def.abc-def"
        • Matches "" (empty string, because zero occurrences are allowed)
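The four quantifiers above can be tried in isolation. The sketch below lifts each quantified piece out of the URL regex and wraps it in anchors; the variable names are illustrative assumptions.

```javascript
// Each quantifier applies to the element immediately before it.
const optionalS = /^https?$/;        // ?     : "s" zero or one time
const oneOrMore = /^[\da-z\.-]+$/;   // +     : one or more characters from the set
const twoToSix = /^[a-z\.]{2,6}$/;   // {2,6} : between 2 and 6 characters
const zeroOrMore = /^[\/\w \.-]*$/;  // *     : zero or more characters

console.log(optionalS.test("http"), optionalS.test("https")); // true true
console.log(optionalS.test("httpss"));                        // false

console.log(oneOrMore.test("www.google")); // true
console.log(oneOrMore.test(""));           // false (at least one is required)

console.log(twoToSix.test("co.uk"));   // true  (5 characters)
console.log(twoToSix.test("c"));       // false (too short)
console.log(twoToSix.test("example")); // false (7 characters)

console.log(zeroOrMore.test("/abc-./")); // true
console.log(zeroOrMore.test(""));        // true (zero occurrences allowed)
```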


Grouping Constructs

Grouping constructs define a sub-expression to which quantifiers or other operators can apply. The parentheses () serve this purpose. In our URL regex above, we have four grouping constructs. Each of these groups captures a specific part of a URL and allows us to separate a URL into its constituent parts: the scheme, domain, top-level domain, and path.

  • Group 1: (https?:\/\/)?

    • This is the first group in the expression. It captures the URL scheme. It matches either "http://" or "https://".
      • The https? part will match either "http" or "https" because the ? quantifier makes the "s" optional.
      • The :\/\/ part ensures that "://" follows "http" or "https".
    • The ? after the closing parenthesis indicates that this whole group is optional, meaning it will match URLs that either do or don't specify a scheme.
  • Group 2: ([\da-z\.-]+)

    • This is the second group in the expression. It captures the domain name, excluding the top-level domain (like .com).
      • The character set [\da-z\.-] matches digits (\d), lowercase alphabets (a-z), a period (\.), or a hyphen (-).
      • The + quantifier means that one or more of these characters must be present.
    • This group is not optional, meaning at least one character from the defined set must be present.
  • Group 3: ([a-z\.]{2,6})

    • This is the third group in the expression. It captures the top-level domain (like .com, .org, .co.uk).
      • The character set [a-z\.] matches any lowercase alphabet or a period.
      • The {2,6} quantifier specifies that this group must have between 2 and 6 of these characters.
  • Group 4: ([\/\w \.-]*)

    • This is the fourth group in the expression. It captures the path that follows the domain. The character set does not include ?, =, &, or #, so URLs that carry query parameters or fragments will not match this regex.
      • The character set [\/\w \.-] matches a forward slash (\/), any word character (\w), a space ( ), a period (\.), or a hyphen (-).
      • The * quantifier means that zero or more of these characters can be present.
    • The asterisk (*) after this group means that this pattern can repeat zero or more times.
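The captured groups can be read back with exec(), which returns an array whose index 0 is the whole match and whose indexes 1 to 4 are the groups. A minimal sketch with made-up sample URLs:

```javascript
const urlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

const match = urlRegex.exec("https://www.example.com/path/page.html");
console.log(match[1]); // "https://"        (Group 1: scheme)
console.log(match[2]); // "www.example"     (Group 2: domain, subdomain included)
console.log(match[3]); // "com"             (Group 3: top-level domain)
console.log(match[4]); // "/path/page.html" (Group 4: path)

// An optional group that did not take part in the match is undefined.
const noScheme = urlRegex.exec("example.org");
console.log(noScheme[1]); // undefined
console.log(noScheme[2]); // "example"

// A query string contains characters outside every set, so exec() returns null.
console.log(urlRegex.exec("example.com/search?q=1")); // null
```

Note that the subdomain lands in Group 2: the greedy + takes as much as it can and gives back only the final dot-separated label to Group 3.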


Bracket Expressions

Bracket expressions are a set of characters enclosed in square brackets []. They define a set of characters to match at a particular point in the expression. Bracket expressions allow you to create flexible patterns that can match a wide range of individual characters, making them a powerful tool in regex.

General Points About Bracket Expressions:

  • Order Doesn't Matter: [az] is equivalent to [za].
  • No Duplication: [aaaa] is equivalent to [a].
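  • Literal Characters: Most special characters lose their special meaning inside brackets. For instance, . is always literal inside a bracket expression, and - is literal when it comes first or last (between two characters it defines a range, as in a-z). The caret (^) in first position, the closing bracket (]), and the backslash (\) keep their special roles.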
  • Negation: If the first character inside the brackets is a caret (^), then the bracket expression is negated. For example, [^a] will match any character that is not a.

The URL regex in this tutorial has 3 bracket expressions.

  • Bracket Expression 1: [\da-z\.-]

    • This is the first bracket expression in the regex, and it appears within Group 2. Here's the breakdown:
      • \d: Matches any digit, equivalent to [0-9].
      • a-z: Matches any lowercase alphabetical character from 'a' to 'z'.
      • \.: Matches the period character literally.
      • -: Matches the hyphen character literally.
    • So, this bracket expression will match any single character that is a digit, a lowercase letter, a period, or a hyphen.
  • Bracket Expression 2: [a-z\.]

    • This is the second bracket expression in the regex, and it appears within Group 3. It includes:
      • a-z: Matches any lowercase alphabetical character from 'a' to 'z'.
      • \.: Matches the period character literally.
    • This bracket expression will match any single character that is either a lowercase letter or a period.
  • Bracket Expression 3: [\/\w \.-]

    • This is the third bracket expression in the regex, and it appears within Group 4. It includes:
      • \/: Matches the forward slash character literally.
      • \w: Matches any word character, equivalent to [a-zA-Z0-9_].
      • (space): Matches the space character literally.
      • \.: Matches the period . character literally.
      • -: Matches the hyphen character literally.
    • This bracket expression will match any single character that is either a forward slash, a word character, a space, a period, or a hyphen.
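The general points about bracket expressions can be checked with a few one-character tests. The patterns below are small illustrations, not pieces of the URL regex.

```javascript
// Order and duplication do not matter.
console.log(/^[az]$/.test("z") && /^[za]$/.test("z")); // true
console.log(/^[aaaa]$/.test("a"));                     // true

// A leading caret negates the set.
console.log(/^[^a]$/.test("b")); // true
console.log(/^[^a]$/.test("a")); // false

// A dot inside brackets is a literal dot, not "any character".
console.log(/^[.]$/.test("."));  // true
console.log(/^[.]$/.test("x"));  // false

// A hyphen between two characters is a range; placed last it is literal.
console.log(/^[a-c]$/.test("b")); // true
console.log(/^[ac-]$/.test("-")); // true
console.log(/^[ac-]$/.test("b")); // false
```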


Character Classes

In regular expressions, character classes are used to specify a set of characters that you wish to match. Character classes can appear within square brackets [ ] or as standalone elements, usually prefixed with a backslash \. Character classes provide a way to simplify your regular expressions and make them easier to read and maintain. They are especially useful when you want to match a range of characters that fall into a specific category, like digits or word characters.

General Points About Character Classes:

  • Shorthand: Character classes provide a shorthand for certain commonly used sets of characters. For example, \d is easier to read and write than [0-9].
  • Case Sensitivity: The letter in a shorthand class is case-sensitive, so \d and \D are different classes, as are \w and \W. This concerns the name of the class, not what it matches: \w does match an uppercase 'W'.
  • Negation: Some character classes can be negated or inverted by capitalizing them. For example, \D will match any character that is NOT a digit.
  • Combination with Custom Sets: You can combine character classes with custom sets within bracket expressions. For example, [a-zA-Z\d] will match all alphabetic characters and digits.
  • Still Special Inside Brackets: Shorthand classes keep their meaning when they appear inside a square bracket expression, which is why they can be mixed with other characters there. For example, [\d] is the same as \d.
  • Locale-Dependent: Some character classes can be locale-dependent, meaning their behavior could change based on the system's locale settings. However, this is usually not a concern for the most commonly used character classes like \d and \w.

In this tutorial's URL matching regex, there are two character classes.

  • Character Class 1: \d

    • This is a pre-defined character class that matches any digit. It is equivalent to [0-9]. In the regex, it appears in the second group within the bracket expression [\da-z\.-]. This means that this part of the URL can contain any digit from 0 to 9.
  • Character Class 2: \w

    • This is another pre-defined character class that matches any word character, which includes alphabetic characters (both upper and lower case), digits, and the underscore. It is equivalent to [a-zA-Z0-9_]. In the regex, it appears in the fourth group within the bracket expression [\/\w \.-]. This part of the URL can contain any word character.
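A few single-character tests illustrate what the shorthand classes do and do not match in JavaScript. This is a small sketch with arbitrary sample characters.

```javascript
// \d matches a digit; the letter "D" is not a digit.
console.log(/^\d$/.test("7")); // true
console.log(/^\d$/.test("D")); // false

// \w covers letters of both cases, digits, and the underscore.
console.log(/^\w$/.test("W")); // true
console.log(/^\w$/.test("_")); // true
console.log(/^\w$/.test("-")); // false

// The capitalized form is the negation.
console.log(/^\D$/.test("x")); // true
console.log(/^\D$/.test("7")); // false

// Shorthand classes keep their meaning inside brackets.
console.log(/^[\d]$/.test("7"));            // true
console.log(/^[a-zA-Z\d]+$/.test("Abc123")); // true
```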


The OR Operator

The OR operator (|) adds flexibility to regular expressions by allowing a match on any one of several alternatives. This specific URL-matching regex does not contain a literal |; it gets an OR-like effect from the ? quantifier, which allows the URL to start with either "http" or "https".

General Notes on the OR Operator:

  • Grouping: It's often used in conjunction with parentheses ( ) to specify the scope of the options. For example, (abc|def) will match either "abc" or "def".
  • Multiple Choices: You can have more than two choices. For example, (a|b|c) will match any single "a", "b", or "c".
  • Whole Strings: The OR operator can also work on whole strings, not just characters. For example, (http|ftp|https) would match any of the strings "http", "ftp", or "https".
  • The URL regex does not use | itself. The OR-like behavior comes from the first grouping construct: (https?:\/\/).
    • This part of the regular expression is enclosed in parentheses, making it a grouping construct. Inside the group, we have http followed by an optional s (indicated by the ? quantifier), followed by ://.
  • The s? essentially means "either s or nothing," thanks to the ? quantifier making the s optional.
  • Through the ? quantifier, the regular expression achieves what (http|https) would express with an explicit OR: it allows URLs that start with either "http://" or "https://". This flexibility is often needed since web URLs can use either of these two schemes.

Importance of the OR Operator

  • The OR operator is very useful when several different patterns are all acceptable. In this case, both "http" and "https" are valid starts for a URL; because they differ by a single optional character, s? covers both without needing an explicit |.
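To see that the optional s and an explicit alternation accept the same schemes, the two forms can be compared side by side. The variable names and the sample list are assumptions made for this sketch.

```javascript
// Two ways to accept "http://" or "https://".
const withOptional = /^https?:\/\/$/;      // the form used in the URL regex
const withOr = /^(http|https):\/\/$/;      // the same idea with an explicit |

for (const scheme of ["http://", "https://", "ftp://"]) {
  console.log(scheme, withOptional.test(scheme), withOr.test(scheme));
}
// http:// true true
// https:// true true
// ftp:// false false
```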


Flags

Our regex does NOT use any flags, as indicated by the absence of any characters after the final slash. Flags are still worth knowing: they are a powerful feature of regular expressions that fine-tune how matching is performed, and they could be added to this URL-matching regex depending on the requirements of the matching operation you're performing.

Commonly Used Flags:

  • Global (g): This flag is used to perform a global match, meaning the pattern will be searched for in all of the string instead of stopping after the first match.

  • Case-Insensitive (i): By using this flag, you can make the entire regular expression case-insensitive. For example, /abc/i would match "abc", "Abc", "aBc", etc.

  • Multiline (m): This flag is used when you want to perform multiline matching. When enabled, the ^ and $ anchors will match the start and end of lines, not just the start and end of the entire string.

  • DotAll (s): This makes the . special character match any character at all, including a newline, which it normally would not match.

  • Unicode (u): This makes the expression Unicode-aware, enabling improved handling of surrogate pairs.

  • Sticky (y): This flag "sticks" the match to one exact position: the regex matches only if the match starts at the index stored in its lastIndex property, which is where the previous match ended.

How to Use Multiple Flags:

You can use multiple flags by just putting them together. For example, /abc/gi would make the search both global and case-insensitive.
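As an illustration of what a flag changes, the sketch below adds the i flag to the URL regex and shows the g and m flags on small patterns. The names strict and relaxed are assumptions for this example; the tutorial's regex itself stays flag-free.

```javascript
const strict = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Reuse the same pattern text, adding the case-insensitive flag.
const relaxed = new RegExp(strict.source, "i");

console.log(strict.test("WWW.EXAMPLE.COM"));  // false (sets list lowercase only)
console.log(relaxed.test("WWW.EXAMPLE.COM")); // true

// g: collect every match instead of stopping at the first one.
console.log("a.b.c".match(/\./g).length); // 2

// m: ^ and $ also match at line breaks.
console.log(/^b$/.test("a\nb"));  // false
console.log(/^b$/m.test("a\nb")); // true
```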


Character Escapes

Character escaping is fundamental to regex as it allows you to find exact matches for characters that would otherwise be treated as special symbols. It's what makes the pattern versatile and precise, ensuring that you find exactly what you're looking for in a text. Certain characters have special meanings, like . and ? in the pattern itself, or / as the delimiter of a regex literal. To treat them as literal characters, we precede them with backslashes (\).

General Notes on Character Escapes:

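  • Special Characters: In regex, special characters like ., ?, *, +, ^, $, (, ), [, ], {, }, |, and \ must be escaped if you want to search for them literally. The forward slash / is not a metacharacter, but it must be escaped inside a /.../ literal, where it would otherwise end the pattern.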
  • Alphanumeric Characters: When a backslash precedes a regular alphanumeric character, it usually turns that character into a special symbol (like turning d into \d, which matches any digit).
  • Double Escaping: When a regular expression is built from an ordinary string, the backslash itself must be escaped. For example, in JavaScript a literal dot is written \. in a regex literal (/\./) but "\\." in a string passed to new RegExp().
  • Unicode and Hexadecimal Codes: You can also use escaped sequences to represent special characters based on their Unicode or ASCII values, though this feature is not used in the given URL-matching regex.
  • Common Escapes: Some common escapes include \n for a new line, \t for a tab, and \r for a carriage return, although these are not used in the URL regex.
  • Context Matters: The need to escape a character can depend on its position in the regex. For example, a hyphen - inside a bracket expression needs to be escaped only when it sits between two characters, where it would otherwise define a range; placed first or last, it is literal.

In our URL matching regex, we escape the forward slashes and the dot to treat them as literals.

  • Escaping the Forward Slash in https?:\/\/: The forward slash / has no special meaning to the regex engine, but in a JavaScript regex literal it marks the start and end of the pattern. To match a literal forward slash in a URL, it's preceded by a backslash, making it \/.
  • Escaping the Dot: The dot . is a special character that matches any single character (except for a newline). To match a literal dot, such as the one that separates the domain from the top-level domain, place a backslash before it: \.. Inside a bracket expression the dot is already literal, so the \. in [\da-z\.-] is harmless but not required. The hyphen at the end of that bracket expression does not require escaping either, because it is in last position; the same set can be written [\da-z.-].
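The difference an escape makes, and the extra backslash needed when a pattern is built from a string, can be sketched as follows. The sample strings are arbitrary.

```javascript
// Unescaped dot: any single character. Escaped dot: a literal period.
console.log(/^a.c$/.test("abc"));  // true
console.log(/^a\.c$/.test("abc")); // false
console.log(/^a\.c$/.test("a.c")); // true

// Built from a string, the backslash must itself be escaped.
const fromString = new RegExp("^a\\.c$");
console.log(fromString.test("a.c")); // true
console.log(fromString.test("abc")); // false

// With the constructor there are no slash delimiters, so / needs no escape.
console.log(new RegExp("^https?://$").test("https://")); // true

// Inside brackets an unescaped dot is already literal.
console.log(/^[\da-z.-]+$/.test("my-site.example")); // true
console.log(/^[\da-z.-]+$/.test("my_site"));         // false (underscore not in set)
```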


Author

For any questions and inquiries, please Visit Copyright on Github
