JavaScript regular expressions. JavaScript for Beginners: Learning Regular Expressions

Last update: 1.11.2015

A regular expression is a pattern that is used to search or modify a string. To work with regular expressions, JavaScript defines the RegExp object.

There are two ways to define a regular expression:

var myExp = /hello/;
var myExp = new RegExp("hello");

The regular expression used here is quite simple: it consists of the single word "hello". In the first case, the expression is placed between two slashes, and in the second case, the RegExp constructor is used, in which the expression is passed as a string.

RegExp Methods

To determine if a regular expression matches a string, the RegExp object defines the test() method. This method returns true if the string matches the regular expression and false otherwise.

var initialText = "hello world!";
var exp = /hello/;
var result = exp.test(initialText);
document.write(result + "<br/>"); // true

initialText = "beautiful weather";
result = exp.test(initialText);
document.write(result); // false: there is no "hello" in the initialText string

The exec method works similarly: it also checks whether the string matches the regular expression, but it returns the part of the string that matched. If there is no match, it returns null.

var initialText = "hello world!";
var exp = /hello/;
var result = exp.exec(initialText);
document.write(result + "<br/>"); // hello

initialText = "beautiful weather";
result = exp.exec(initialText);
document.write(result); // null

Character groups

A regular expression does not necessarily consist of regular strings, but can also include special elements of regular expression syntax. One such element is a group of characters enclosed in square brackets. For example:

var initialText = "defensiveness";
var exp = /[abc]/;
var result = exp.test(initialText);
document.write(result + "<br/>"); // false: the string contains none of a, b, c

initialText = "city";
result = exp.test(initialText);
document.write(result); // true: the string contains the letter c

The expression [abc] indicates that the string must have one of three letters.

If we need to check whether a string contains letters from a certain range, we can specify the whole range at once:

var initialText = "defensiveness";
var exp = /[a-z]/;
var result = exp.test(initialText);
document.write(result + "<br/>"); // true

initialText = "30789";
result = exp.test(initialText);
document.write(result); // false: the string contains only digits

In this case, the string must contain at least one character from the range a-z.

If, on the contrary, we need to find a character that is not in the set, we put the ^ sign immediately after the opening square bracket, before the list of characters:

var initialText = "defensiveness";
var exp = /[^a-z]/;
var result = exp.test(initialText);
document.write(result + "<br/>"); // false

initialText = "3di0789";
exp = /[^0-9]/;
result = exp.test(initialText);
document.write(result); // true

In the first case, the expression looks for any character outside the range a-z. Since the string "defensiveness" consists only of characters from this range, the test() method returns false, that is, the regular expression does not match the string.

In the second case ("3di0789"), the expression looks for any character that is not a digit. Since the string also contains letters, it matches the regular expression, so the test() method returns true.
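A negated set is often used to check that a string contains nothing except the allowed characters: if no character outside the set can be found, every character belongs to it. A minimal sketch of that idea follows; the helper name isDigitsOnly is made up for this illustration.

```javascript
// Hypothetical helper: true when the string contains no character other than 0-9.
function isDigitsOnly(text) {
  // /[^0-9]/ finds any single character that is NOT a digit.
  // Note: an empty string contains no non-digit either, so it also returns true.
  return !/[^0-9]/.test(text);
}

console.log(isDigitsOnly("30789"));   // true
console.log(isDigitsOnly("3di0789")); // false
```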

If necessary, we can collect combinations of expressions:

var initialText = "tomorrow";
var exp = /[dt]o[nm]/;
var result = exp.test(initialText);
document.write(result); // true: "tomorrow" contains "tom"

The expression [dt]o[nm] matches strings that contain one of the substrings "dom", "tom", "don" or "ton".

Expression Properties

    The global property (the g flag) finds all substrings that match the regular expression. By default, a search stops at the first matching substring, even though the string may contain many more. To find them all, add the symbol g after the expression.

    The ignoreCase property (the i flag) finds substrings that match the regular expression regardless of the case of the characters in the string. To enable it, add the symbol i after the expression.

    The multiline property (the m flag) makes the ^ and $ symbols match at the beginning and end of every line of a multiline text, not only at the beginning and end of the whole string. To enable it, add the symbol m after the expression.

For example:

var initialText = "hello World";
var exp = /world/;
var result = exp.test(initialText); // false

There is no match between the string and the expression here, since "World" differs from "world" in case. To get a match, change the regular expression by adding the ignoreCase flag to it:

var exp = /world/i;

Several flags can also be used at once, for example /world/gi.
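The effect of the three flags can be seen in a short sketch. It uses the string method match (which returns the matches it finds) and sample strings invented for this illustration.

```javascript
const text = "World, world, WORLD";

// Without flags: only the first exact-case match is returned.
const first = text.match(/world/);
// With g and i together: every occurrence, regardless of case.
const all = text.match(/world/gi);

console.log(first[0]); // "world"
console.log(all);      // ["World", "world", "WORLD"]

// With m, ^ matches at the start of every line, not only at the start of the string.
const lines = "one\ntwo\nthree";
const lineStarts = lines.match(/^t/gm);
console.log(lineStarts); // ["t", "t"]
```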

JavaScript regexp is an object type that is used to match a sequence of characters in strings.

Creating the first regular expression

There are two ways to create a regular expression: using a regular expression literal or using the regular expression constructor. Each one represents the same pattern: the character "c" followed by "a" followed by "t".

// Regular expression literal enclosed in slashes (/)
var option1 = /cat/;
// Regular expression constructor
var option2 = new RegExp("cat");

As a general rule, if a regular expression is constant, meaning it won't change, it's better to use a regular expression literal. If it will change or depends on other variables, it is better to use the constructor.
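As a rough illustration of the second case, the sketch below builds a pattern from a value that is only known at run time. The helper name containsWord is hypothetical, and the example assumes the word consists only of letters, so no special characters need escaping. The second constructor argument is a string of flags.

```javascript
// Hypothetical helper: builds a case-insensitive pattern from a word known only at run time.
// Assumes the word contains only letters, so no escaping is needed.
function containsWord(text, word) {
  const pattern = new RegExp(word, "i");
  return pattern.test(text);
}

console.log(containsWord("The Cat says meow", "cat")); // true
console.log(containsWord("The dog says bark", "cat")); // false
```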

RegExp.prototype.test() method

Remember when I said that regular expressions are objects? This means that they have a number of methods. The simplest method is test, which returns a boolean:

true: the string contains a match for the regular expression pattern.

false: no match was found.

console.log(/cat/.test("the cat says meow")); // true
console.log(/cat/.test("the dog says bark")); // false

Regular Expression Basics Reminder

The secret of regular expressions is to remember common characters and groups. I highly recommend spending a few hours on the table below and then coming back and continuing your study.

Symbols

  • . – (dot) matches any single character except a line break;
  • * – matches the previous expression repeated 0 or more times;
  • + – matches the previous expression repeated 1 or more times;
  • ? – makes the previous expression optional (matches 0 or 1 times);
  • ^ – matches the beginning of the string;
  • $ – matches the end of the string.

Character groups

  • \d – matches any single digit.
  • \w – matches any single word character (letter, digit, or underscore).
  • [XYZ] – character set. Matches any single character from the set given in brackets. You can also specify ranges of characters, for example, [A-Z].
  • [XYZ]+ – matches a character from the set repeated one or more times.
  • [^A-Z] – within a character set, "^" is used as a negation sign. In this example, the pattern matches any character that is not an uppercase letter.

Flags :

JavaScript regular expressions have several optional flags (five as of ES2015: g, i, m, u and y). They can be used alone or together, and are placed after the closing slash. For example: /[A-Z]/g. Here I will cover only two flags.

g–  global search.

i–  case insensitive search.

Additional constructs

(x) – capturing parentheses. This expression matches x and remembers the match so you can use it later.

(?:x) – non-capturing parentheses. The expression matches x but does not remember the match.

x(?=y) – lookahead. Matches x only if it is followed by y.
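The three constructs can be compared side by side. The following is only a sketch with made-up sample strings; it uses exec, which returns an array holding the whole match followed by the captured groups.

```javascript
// Capturing group: the matched text is remembered in the result array.
const captured = /(\d+) cats/.exec("3 cats");
console.log(captured[1]); // "3"

// Non-capturing group: groups "ha" for the + operator, but remembers nothing.
const laugh = /(?:ha)+/.exec("hahaha!");
console.log(laugh[0]);     // "hahaha"
console.log(laugh.length); // 1 (only the whole match, no groups)

// Lookahead: matches "cat" only when "s" follows; the "s" is not part of the match.
const plural = /cat(?=s)/;
console.log(plural.test("cats"));    // true
console.log(plural.test("cat"));     // false
console.log(plural.exec("cats")[0]); // "cat"
```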

Testing the studied material

Let's first test everything covered above. Let's say we want to check whether a string contains any digits. You can use the "\d" construct for this.

console.log(/\d/.test("12-34")); // true

The above code returns true if there is at least one digit in the string. What if you need to check whether a string matches a format? You can use multiple "\d" characters to define the format:

console.log(/\d\d-\d\d/.test("12-34")); // true
console.log(/\d\d-\d\d/.test("1234")); // false

If it doesn't matter how many digits come before and after the "-" sign, you can use the "+" character to indicate that the "\d" pattern occurs one or more times:

console.log(/\d+-\d+/.test("12-34")); // true
console.log(/\d+-\d+/.test("1-234")); // true
console.log(/\d+-\d+/.test("-34")); // false

Parentheses can be used to group expressions. Let's say we have a cat meowing and we want to match the pattern "meow":

console.log(/me+(ow)+w/.test("meeeeowowoww")); // true

Now let's break it down.

m => match one letter "m";

e+ => match the letter "e" one or more times;

(ow)+ => match the letters "ow" one or more times;

w => match the letter "w";

"m" + "eeee" + "owowow" + "w".

When operators like "+" are used immediately after parentheses, they affect the entire contents of the parentheses.

Next is the "?" operator. It indicates that the previous character is optional. As you will see below, both test cases return true because the "s" characters are marked as optional.

console.log(/cats? says?/i.test("the Cat says meow")); // true
console.log(/cats? says?/i.test("the Cats say meow")); // true

If you want to find a slash character, you need to escape it with a backslash. The same is true for other characters that have a special meaning, such as the question mark. Here is an example of how to search for them:

var slashSearch = /\//;
var questionSearch = /\?/;

  • \d is the same as [0-9]: each construct matches a single digit.
  • \w is the same as [A-Za-z0-9_]: both expressions match any single alphanumeric character or underscore.

Example: adding spaces to strings written in camel case

In this example, we are very tired of the camel-style writing and we need a way to add spaces between words. Here is an example:

removeCc("camelCase") // => should return "camel Case"

There is a simple solution using a regular expression. First, we need to find all the capital letters. This can be done with a character set and the global modifier:

/[A-Z]/g

This matches the "C" character in "camelCase".

Now, how do we add a space before "C"?

We need to use capturing parentheses! They allow you to find a match and remember it for later use. Use capturing parentheses to remember the capital letter that was found:

/([A-Z])/g

You can access the captured value later like this:

$1

Above, we use $1 to access the captured value. By the way, if we had two sets of capturing parentheses, we would use $1 and $2 to refer to the captured values, and similarly for more capturing parentheses.

If you need to use parentheses but don't need to capture that value, you can use non-capturing parentheses: (?: x ). In this case, a match x is found, but it is not remembered.

Let's return to the task at hand. How do we put the capturing parentheses to use? With the replace method! We pass "$1" as the second argument. It is important to use quotation marks here.

function removeCc(str) {
  return str.replace(/([A-Z])/g, "$1");
}

Let's look at the code again. We capture the uppercase letter and then replace it with the same letter. Now, inside the quotes, insert a space before the $1 variable. As a result, we get a space before each capital letter.

function removeCc(str) {
  return str.replace(/([A-Z])/g, " $1");
}
removeCc("camelCase") // "camel Case"
removeCc("helloWorldItIsMe") // "hello World It Is Me"
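With two sets of capturing parentheses, $1 and $2 can be used together in the replacement string. A small sketch with an invented sample name:

```javascript
// Two capturing groups: $1 is the first word, $2 is the second word.
const swapped = "John Smith".replace(/(\w+) (\w+)/, "$2, $1");
console.log(swapped); // "Smith, John"
```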

Example: remove capital letters

Now we have a string with a bunch of unnecessary capital letters. Have you guessed how to remove them? First, we need to select all the capital letters, again using a character set with the global modifier:

/[A-Z]/g

We'll use the replace method again, but how do we make the character lowercase this time?

function lowerCase(str) {
  return str.replace(/[A-Z]/g, ???);
}

Hint: In the replace() method, you can specify a function as the second parameter.

We will use an arrow function, so we don't need capturing parentheses to get at the match: the matched text is passed to the function as its first argument. When a function is given to replace, it is called after a match is found, and its return value is used as the replacement string. Even better, if the search is global and multiple matches are found, the function is called for each match.

function lowerCase(str) {
  return str.replace(/[A-Z]/g, (u) => u.toLowerCase());
}
lowerCase("camel Case") // "camel case"
lowerCase("hello World It Is Me") // "hello world it is me"

A regular expression is a sequence of characters that forms a search pattern.

When you need to find something in a text, a search pattern describes what you are looking for.

A regular expression can be a single character or a more complex pattern.

Regular expressions can be used to perform all kinds of text search and replace operations.

Syntax:

/pattern/modifiers;

Example:

var patt = /msiter/i;

Example explanation:

  • /msiter/i - regular expression.
  • msiter – the pattern used in the search operation.
  • i - modifier (specifies that the search should be case-insensitive).

Using String Methods

In JavaScript, regular expressions are often used with two string methods: search() and replace().

The search() method uses an expression to find a match and returns the position of the match found.

The replace() method returns a modified string in which the pattern has been replaced.

search() method with regular expression

The following example uses a regular expression for a case-insensitive search:

var str = "Visit the MSiter site";
var n = str.search(/msiter/i);

As a result, the variable n will contain 10.

search() method with string

The search() method can also take a string as a parameter. In this case, the string parameter is converted into a regular expression.

The following example uses the string "MSiter" to search:

var str = "Visit the MSiter site!";
var n = str.search("MSiter");
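Because the string is converted into a regular expression, characters with a special meaning keep that meaning. The sketch below (with an invented sample string) contrasts search() with indexOf(), which always treats its argument as plain text.

```javascript
const price = "Total: 1.50";

// The string "." becomes the pattern /./, which matches any character,
// so the very first character (index 0) is reported.
console.log(price.search("."));  // 0

// Escaping the dot in a regular expression literal finds the real dot.
console.log(price.search(/\./)); // 8

// indexOf treats its argument as plain text.
console.log(price.indexOf(".")); // 8
```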

replace() method with regular expression

The following example uses a regular expression to replace the substring "Microsoft" with "MSiter" in a case-insensitive manner:

var str = "Visit the Microsoft site!";
var res = str.replace(/microsoft/i, "MSiter");

As a result, the res variable will contain the string "Visit the MSiter site!".

replace() method with string

The replace() method can also take a string as a parameter:

var str = "Visit the Microsoft site!";
var res = str.replace("Microsoft", "MSiter");

Did you notice that

  • In the described methods, regular expressions (instead of a string) can be used as parameters.
  • Regular expressions provide much more control over the search process (for example, you can search in a case-insensitive manner).

Regular expression modifiers

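Modifiers change how the search is performed: i makes it case-insensitive, g finds all matches instead of stopping at the first one, and m enables multiline matching.

Regular expression patterns

Square brackets are used to search within a range of characters, for example [abc] or [0-9].

Metacharacters are characters with a special meaning, for example \d (a digit) or \s (a whitespace character).

Quantifiers determine the number of repetitions, for example + (one or more), * (zero or more) and ? (zero or one).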

RegExp object

In JavaScript, RegExp is a regular expression object with predefined properties and methods.

test() method

The test() method of the RegExp object is used to search for a pattern in the given string. It returns true or false depending on the result.

var patt = /e/;
patt.test("The best things in life are free!");

Since the string contains the character "e" in this example, the result will be true.

To work with a RegExp object, it is not necessary to first place the regular expression in a variable. The two lines of code from the previous example can be reduced to one:

/e/.test("The best things in life are free!");

exec() method

The exec() method of the RegExp object is used to search for a pattern in the given string. It returns the text that was found. If nothing was found, it returns null.

The following example looks for the character "e" in a string:

/e/.exec("The best things in life are free!");

Since the string contains the character "e" in this example, the result will be e.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Jamie Zawinski

Yuan-Ma said, "When you cut against the grain of the wood, much strength is needed. When you program against the grain of the problem, much code is needed."
Master Yuan-Ma, "The Book of Programming"

Programming tools and techniques survive and spread in a chaotic, evolutionary way. It is not always the beautiful or brilliant ones that win, but those that work well enough in their niche, for example because they are integrated into another successful technology.

In this chapter, we will discuss one such tool, regular expressions. They are a way to describe patterns in string data. They form a small, separate language that is part of JavaScript and many other languages and tools.

Regular expressions are both very strange and extremely useful. Their syntax is cryptic, and the programming interface JavaScript provides for them is clumsy. But they are a powerful tool for inspecting and processing strings. By understanding them, you will become a more effective programmer.

Create a regular expression

A regular expression is a type of object. It can be created by calling the RegExp constructor, or by writing the desired pattern surrounded by slashes.

var re1 = new RegExp("abc");
var re2 = /abc/;

Both of these regular expressions represent the same pattern: the character "a" followed by the character "b" followed by the character "c".

If you use the RegExp constructor, then the pattern is written as a regular string, so all the rules about backslashes apply.

The second entry, where the pattern is between slashes, handles backslashes differently. First, since the pattern ends with a forward slash, we need to put a backslash before the forward slash we want to include in our pattern. Also, backslashes that are not part of special characters like \n will be preserved (rather than ignored as in strings) and change the meaning of the pattern. Some characters, such as the question mark or the plus sign, have a special meaning in regular expressions, and if you need to find such a character, it must also be preceded by a backslash.

var eighteenPlus = /eighteen\+/;

To know which characters to precede with a backslash, you need to learn the list of all special characters in regular expressions. This isn't realistic yet, so when in doubt, just put a backslash before any character that isn't a letter, number, or space.
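The two notations and the "escape when in doubt" advice can be tried out in a short sketch. The escapeRegExp helper below is hypothetical (it is not a built-in function); it relies on the special replacement sequence $&, which stands for the matched text.

```javascript
// In a string passed to RegExp, each backslash must be doubled,
// because the string literal itself consumes one level of escaping.
const fromLiteral = /eighteen\+/;
const fromString = new RegExp("eighteen\\+");
console.log(fromLiteral.source === fromString.source); // true

// Hypothetical helper (not a built-in): puts a backslash before every
// character that has a special meaning in a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const userInput = "1+1=2?";
const safe = new RegExp(escapeRegExp(userInput));
console.log(escapeRegExp(userInput));     // 1\+1=2\?
console.log(safe.test("Is 1+1=2? Yes.")); // true
```

Such a helper is mainly useful when part of a pattern comes from user input and should be matched literally.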

Checking for matches

Regular expressions have several methods. The simplest is test. If you pass it a string, it will return a boolean indicating whether the string contains an occurrence of the given pattern.

console.log(/abc/.test("abcde")); // → true
console.log(/abc/.test("abxde")); // → false

A regular expression consisting only of non-special characters simply represents that sequence of characters. If abc occurs anywhere in the string we are testing (not just at the beginning), test will return true.

Looking for a character set

Finding out whether a string contains abc could also be done using indexOf. Regular expressions allow you to go further and compose more complex patterns.

Let's say we need to find any digit. When we put a set of characters in square brackets in a regular expression, that part of the expression matches any of the characters in the brackets.

Both of the following expressions match all strings that contain a digit.

console.log(/[0123456789]/.test("in 1992")); // → true
console.log(/[0-9]/.test("in 1992")); // → true

In square brackets, a dash between two characters specifies a range of characters, where the ordering is determined by the characters' Unicode codes. The characters from 0 to 9 sit right next to each other (codes 48 to 57), so [0-9] covers them all and matches any digit.

Several character groups have their own built-in abbreviations.

\d   Any digit
\w   An alphanumeric character
\s   Any whitespace character (space, tab, newline, etc.)
\D   A character that is not a digit
\W   A non-alphanumeric character
\S   A non-whitespace character
.    Any character except newline

Thus, you can set the date and time format like 30-01-2003 15:20 with the following expression:

var dateTime = /\d\d-\d\d-\d\d\d\d \d\d:\d\d/;
console.log(dateTime.test("30-01-2003 15:20")); // → true
console.log(dateTime.test("30-jan-2003 15:20")); // → false

Looks terrible, doesn't it? There are too many backslashes that make it difficult to understand the pattern. Later we will improve it slightly.

Backslashes can also be used in square brackets. For example, [\d.] means any digit or a dot. Note that the dot inside the square brackets loses its special meaning and becomes just a dot. The same goes for other special characters like +.

You can invert a set of characters, that is, say that you want to match any character other than those in the set, by putting a ^ sign immediately after the opening square bracket.

var notBinary = /[^01]/;
console.log(notBinary.test("1100100010100110")); // → false
console.log(notBinary.test("1100100010200110")); // → true

Repeating Pattern Parts

We know how to find one digit. But what if we need to find the whole number - a sequence of one or more digits?

If you put a + sign after something in a regular expression, it means that this element may be repeated one or more times. /\d+/ means one or more digits.

console.log(/'\d+'/.test("'123'")); // → true
console.log(/'\d+'/.test("''")); // → false
console.log(/'\d*'/.test("'123'")); // → true
console.log(/'\d*'/.test("''")); // → true

The asterisk * has almost the same meaning, but it allows the pattern to appear zero times. If there is an asterisk after something, then it never prevents the pattern from being in the string - it just appears there zero times.

The question mark makes the pattern part optional, meaning it can occur zero or one time. In the following example, the character u may occur, but the pattern matches when it does not.

var neighbor = /neighbou?r/;
console.log(neighbor.test("neighbour")); // → true
console.log(neighbor.test("neighbor")); // → true

Curly braces are used to specify the exact number of times a pattern must occur. {4} after an element means that it must occur exactly 4 times. You can also specify a range: {2,4} means that the element must occur at least 2 and at most 4 times.

Here is another version of the date and time format, where days, months, and hours of one or two digits are allowed. It is also a little more readable.

var dateTime = /\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}/;
console.log(dateTime.test("30-1-2003 8:45")); // → true

You can also use an open-ended range by omitting the number after the comma: {5,} means five or more times. The lower bound cannot be omitted: to allow zero to five occurrences, write {0,5}, because JavaScript treats {,5} as literal text rather than as a quantifier.
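The behavior of bounded and open-ended ranges can be checked with a few tests. This is only a sketch; the ^ and $ marks (start and end of the string) are added so that the whole string has to consist of the counted digits.

```javascript
const fiveOrMore = /^\d{5,}$/; // five or more digits
const upToFive = /^\d{0,5}$/;  // zero to five digits

console.log(fiveOrMore.test("1234"));   // false
console.log(fiveOrMore.test("123456")); // true
console.log(upToFive.test("123"));      // true
console.log(upToFive.test("123456"));   // false

// In JavaScript (without the u flag), {,5} is not a quantifier:
// the pattern below looks for the literal text "a{,5}".
const literalBraces = /a{,5}/;
console.log(literalBraces.test("aaa"));   // false
console.log(literalBraces.test("a{,5}")); // true
```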

Grouping subexpressions

Parentheses can be used to use the * or + operators on multiple elements at once. The part of the regular expression enclosed in brackets is considered one element from the point of view of operators.

var cartoonCrying = /boo+(hoo+)+/i;
console.log(cartoonCrying.test("Boohoooohoohooo")); // → true

The first and second pluses only apply to the second o in boo and hoo. The third + refers to the whole group (hoo+), finding one or more such sequences.

The letter i at the end of the expression makes the regular expression case-insensitive, so that B is the same as b.

Matches and groups

The test method is the simplest method for checking regular expressions. It only reports if a match was found or not. Regular expressions also have an exec method that will return null if nothing was found, and otherwise return an object with information about the match.

var match = /\d+/.exec("one two 100");
console.log(match); // → ["100"]
console.log(match.index); // → 8

The object returned by exec has an index property that tells us where in the string the successful match begins. Apart from that, the object looks like (and in fact is) an array of strings, whose first element is the string that was matched. In our example, this is the sequence of digits we were looking for.
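A slightly longer sketch of the same call shows the other parts of the result: besides index, the returned array also has an input property holding the string that was searched. Because exec returns null when nothing matches, it is worth checking the result before reading from it.

```javascript
const match = /\d+/.exec("one two 100");

console.log(match[0]);    // "100" (the matched text)
console.log(match.index); // 8 (where the match starts)
console.log(match.input); // "one two 100" (the string that was searched)

// When nothing matches, exec returns null, so check before using the result.
const noMatch = /\d+/.exec("no digits here");
console.log(noMatch);     // null
```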

Strings have a match method that works in much the same way.

console.log("one two 100".match(/\d+/)); // → ["100"]

When a regular expression contains subexpressions grouped in parentheses, the text that matches those groups will also appear in the array. The first element is always the whole match. The second is the part that matched the first group (the one with parentheses encountered first), then the second group, and so on.

var quotedText = /'([^']*)'/;
console.log(quotedText.exec("she said 'hello'")); // → ["'hello'", "hello"]

When a group does not end up being matched at all (for example, when it is followed by a question mark), its position in the array holds undefined. If a group has matched multiple times, only the last match ends up in the array.

console.log(/bad(ly)?/.exec("bad")); // → ["bad", undefined]
console.log(/(\d)+/.exec("123")); // → ["123", "3"]

Groups are useful for extracting parts of strings. If we don't just want to check if a string contains a date, but extract it and create an object representing the date, we can enclose sequences of numbers in parentheses and select the date from the result of exec.

But first, a little digression in which we learn the preferred way to store dates and times in JavaScript.

The Date type

JavaScript has a standard object type for dates, or more precisely, for moments in time. It is called Date. If you simply create a date object with new, you get the current date and time.

console.log(new Date()); // → Sun Nov 09 2014 00:07:57 GMT+0100 (CET)

You can also create an object for a specific time. The exact text printed depends on the time zone of the machine running the code.

console.log(new Date(2015, 9, 21)); // → Wed Oct 21 2015 00:00:00 GMT+0200 (CEST)
console.log(new Date(2009, 11, 9, 12, 59, 59, 999)); // → Wed Dec 09 2009 12:59:59 GMT+0100 (CET)

JavaScript uses a convention where month numbers start at zero and day numbers start at one. This is confusing and silly, so be careful.

The last four arguments (hours, minutes, seconds, and milliseconds) are optional and set to zero if not present.

Timestamps are stored as the number of milliseconds since the beginning of 1970. For times before 1970, negative numbers are used (this follows the Unix time convention, which was created around that time). The getTime method of the date object returns this number. It is big, of course. The value below assumes the Central European time zone.

console.log(new Date(2013, 11, 19).getTime()); // → 1387407600000
console.log(new Date(1387407600000)); // → Thu Dec 19 2013 00:00:00 GMT+0100 (CET)

If you give the Date constructor a single argument, it is treated as this number of milliseconds. You can get the current millisecond value by creating a Date object and calling the getTime method, or by calling the Date.now function.

The Date object has getFullYear, getMonth, getDate, getHours, getMinutes, and getSeconds methods for retrieving its components. There is also a getYear method, which returns the rather useless year minus 1900 (such as 93 or 114) and is best avoided.
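A short sketch of these getter methods, using an arbitrary sample date. All of them work in the local time zone, so they return the same components that were passed to the constructor.

```javascript
const date = new Date(2013, 11, 19, 8, 45);

console.log(date.getFullYear()); // 2013
console.log(date.getMonth());    // 11 (December, since months start at zero)
console.log(date.getDate());     // 19
console.log(date.getHours());    // 8
console.log(date.getMinutes());  // 45
console.log(date.getSeconds());  // 0 (omitted arguments default to zero)
```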

By enclosing the desired parts of the pattern in parentheses, we can create a date object directly from the string.

function findDate(string) {
  var dateTime = /(\d{1,2})-(\d{1,2})-(\d{4})/;
  var match = dateTime.exec(string);
  // match[1] is the day, match[2] the month, match[3] the year
  return new Date(Number(match[3]), Number(match[2]) - 1, Number(match[1]));
}
console.log(findDate("30-1-2003")); // → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)

Word and line boundaries

Unfortunately, findDate will just as happily extract the meaningless date 00-1-3000 from the string "100-1-30000". The match can happen anywhere in the string, so in this case it will just start at the second character and end at the penultimate one.

If we need to force the match to span the entire string, we use the ^ and $ marks. ^ matches the beginning of the string, and $ matches the end. Therefore, /^\d+$/ matches a string consisting only of one or more digits, /^!/ matches a string that starts with an exclamation point, and /x^/ does not match any string (there cannot be an x before the beginning of the string).

If, on the other hand, we just want to make sure the date starts and ends on a word boundary, we use the \b mark. A word boundary can be the beginning or end of the string, or any place in the string that has a word character (as in \w) on one side and a non-word character on the other.

console.log(/cat/.test("concatenate")); // → true
console.log(/\bcat\b/.test("concatenate")); // → false

Note that the boundary label is not a character. It's just a constraint, meaning that a match only happens if a certain condition is met.
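Applied to the date example, word boundaries could be used roughly as follows. This is a sketch of one possible variant, not the only way to do it; the name findDateStrict is made up, and the function returns null when no date is found.

```javascript
// Variant of findDate that requires word boundaries around the date
// and returns null instead of failing when there is no match.
function findDateStrict(string) {
  const dateTime = /\b(\d{1,2})-(\d{1,2})-(\d{4})\b/;
  const match = dateTime.exec(string);
  if (match === null) return null;
  // match[1] is the day, match[2] the month, match[3] the year
  return new Date(Number(match[3]), Number(match[2]) - 1, Number(match[1]));
}

console.log(findDateStrict("100-1-30000"));             // null
console.log(findDateStrict("we met on 30-1-2003") !== null); // true
```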

Choice patterns

Let's say we need to find out whether the text contains not just a number, but a number followed by pig, cow, or chicken in the singular or plural.

It would be possible to write three regular expressions and check them one by one, but there is a better way. The | symbol denotes a choice between the pattern to its left and the pattern to its right. So we can say the following:

var animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
console.log(animalCount.test("15 pigs")); // → true
console.log(animalCount.test("15 pigchickens")); // → false

The parentheses delimit the part of the pattern to which | is applied, and you can put many such operators one after the other to indicate a choice of more than two options.

The matching mechanism

Regular expressions can be thought of as flowcharts. The following diagram describes the livestock expression from the previous example.

An expression matches a string if a path can be found from the left side of the diagram to the right. We remember the current position in the string, and each time we pass the rectangle, we check that the part of the string just past our position in it matches the contents of the rectangle.

So, checking the match of our regular expression in the string "the 3 pigs" when passing through the flowchart looks like this:

- at position 4 there is a word boundary, and we go through the first rectangle
- still at position 4, we find a digit, and go through the second rectangle
- at position 5, one path loops back to before the second rectangle, and the other goes on to the rectangle with a space. We have a space, not a digit, so we choose the second path.
- now we are at position 6, the beginning of "pigs", and at the triple fork of the paths. There is no "cow" or "chicken" in the string, but there is "pig", so we choose this path.
- at position 9, after the triple fork, one path bypasses the "s" and goes to the last rectangle with a word boundary, and the other goes through the "s". We have an "s", so we go there.
- at position 10 we are at the end of the string, and only a word boundary can match. The end of the string counts as a boundary, so we go through the last rectangle. And so we have successfully found our pattern.

Basically, regular expressions work like this: the algorithm starts at the beginning of the string and tries to find a match there. In our case, there's a word boundary there, so it goes past the first rectangle - but there's no number, so it stumbles on the second rectangle. Then it moves to the second character in the string, and tries to find a match there... And so on, until it finds a match or reaches the end of the string, in which case no match is found.
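The "try position 0, then position 1, and so on" loop can be imitated in code. The sketch below is only an illustration of the idea, not how the engine is actually implemented. The helper name firstMatchPosition is made up, and it relies on the y (sticky) flag, which makes a pattern match only at the position stored in lastIndex. For simplicity the helper ignores any flags on the original pattern.

```javascript
// Hypothetical helper that mimics the search loop: try to match at position 0,
// then 1, and so on. The y (sticky) flag makes the pattern match only at lastIndex.
function firstMatchPosition(pattern, text) {
  const sticky = new RegExp(pattern.source, "y");
  for (let start = 0; start <= text.length; start++) {
    sticky.lastIndex = start;
    if (sticky.test(text)) return start;
  }
  return -1; // reached the end of the string without a match
}

const animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
console.log(firstMatchPosition(animalCount, "the 3 pigs")); // 4
console.log(firstMatchPosition(animalCount, "no animals")); // -1
```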

Kickbacks

The regular /\b(+b|\d+|[\da-f]h)\b/ matches either a binary number followed by a b, or a decimal number without a suffix, or a hexadecimal number (digits from 0 to 9 or characters from a to h), followed by h. Relevant chart:

When searching for a match, it may happen that the algorithm takes the upper path (binary number), even if there is no such number in the string. If there is a string “103”, for example, it is clear that only after reaching the number 3 the algorithm will understand that it is on the wrong path. In general, the string matches the regular expression, just not in this thread.

Then the algorithm rolls back. At the fork, it remembers the current position (in our case, this is the beginning of the line, just after the word boundary) so that you can go back and try another path if the chosen one does not work. For the string “103”, after encountering a triple, it will come back and try to walk the path for decimal numbers. This will work, so a match will be found.

The algorithm stops as soon as it finds a complete match. This means that even if several branches could match, only one of them is used (the first one that succeeds, in the order in which they appear in the regular expression).
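A quick way to see the branches in action is to run the pattern against one input per branch. The sketch below assumes the pattern is /\b([01]+b|\d+|[\da-f]+h)\b/, as the description suggests; the variable names are illustrative.

```javascript
// Pattern assumed from the description: binary + "b", decimal, or hex + "h".
var numberPattern = /\b([01]+b|\d+|[\da-f]+h)\b/;

var decimal = numberPattern.exec("103")[0];  // binary branch fails at "3", decimal branch succeeds
var binary = numberPattern.exec("101b")[0];  // binary branch succeeds right away
var hex = numberPattern.exec("7fh")[0];      // first two branches fail, hexadecimal branch succeeds

console.log(decimal); // → 103
console.log(binary);  // → 101b
console.log(hex);     // → 7fh
```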

Backtracking also occurs with repetition operators such as + and *. If you search for /^.*x/ in the string "abcxe", the .* part will first try to consume the entire string. The algorithm then realizes that it still needs an "x". Since there is no "x" past the end of the string, it backtracks so that .* matches one character less. There is no x after abcx either, so it backtracks again, to the substring abc. Right after that it finds the x and reports a successful match spanning positions 0 to 4.
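The result of that search can be inspected directly. A minimal sketch (the variable name is illustrative):

```javascript
var greedyMatch = /^.*x/.exec("abcxe");

console.log(greedyMatch[0]);        // → abcx
console.log(greedyMatch.index);     // → 0
console.log(greedyMatch[0].length); // → 4, so the match spans positions 0 to 4
```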

It is possible to write a regular expression that causes a great deal of backtracking. The problem arises when a pattern can match a piece of input in many different ways. For example, if we get confused while writing a regular expression for binary numbers, we might accidentally write something like /([01]+)+b/.

If the algorithm looks for such a pattern in a long string of 0s and 1s that does not end with a "b", it first goes through the inner loop until it runs out of digits. Then it notices that there is no "b" at the end, backtracks one position, goes through the outer loop once, gives up again, backtracks one more position in the inner loop... and it keeps searching this way, trying every combination of the two loops. The amount of work therefore doubles with each additional character. Even for a few dozen characters, the search takes a very long time.
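The effect can be observed on a short input. The sketch below uses a hypothetical helper, timeTest, and assumes the faulty pattern is /([01]+)+b/; the input is kept to 20 digits so that it finishes quickly. Actual timings depend on the engine, so none are stated here.

```javascript
// Hypothetical helper: runs a pattern against an input and reports the elapsed time.
function timeTest(pattern, input) {
  var start = Date.now();
  var matched = pattern.test(input);
  return {matched: matched, ms: Date.now() - start};
}

// 20 binary digits with no trailing "b".
var digits = "";
for (var i = 0; i < 20; i++) digits += i % 2;

var slow = timeTest(/([01]+)+b/, digits); // nested repetition, may backtrack heavily
var fast = timeTest(/[01]+b/, digits);    // single repetition

console.log(slow.matched, fast.matched); // → false false
console.log(slow.ms, fast.ms);           // timings vary by engine
```

Both patterns reject the input; the difference is only in how much work it takes to find that out. Adding digits to the input makes the gap grow quickly for the nested version.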

replace method

Strings have a replace method that can replace part of a string with another string.

console.log("papa".replace("p", "m"));
// → mapa

The first argument can also be a regular expression, in which case the first match of the regular expression is replaced. When the g (global) option is added to the regular expression, all matches in the string are replaced, not just the first.

console.log("Borobudur".replace(/[ou]/, "a"));
// → Barobudur
console.log("Borobudur".replace(/[ou]/g, "a"));
// → Barabadar

It would make sense to pass the "replace all" option through a separate argument, or through a separate method like replaceAll. But unfortunately, the option is passed through the regular expression itself.

The real power of using regular expressions with replace comes from the fact that the replacement string can refer back to groups matched by the regular expression. For example, say we have a string containing people's names, one name per line, in the format "Lastname, Firstname". To swap the names and remove the comma, producing "Firstname Lastname", we can write the following:

console.log(
  "Hopper, Grace\nMcCarthy, John\nRitchie, Dennis"
    .replace(/([\w ]+), ([\w ]+)/g, "$2 $1"));
// → Grace Hopper
//   John McCarthy
//   Dennis Ritchie

$1 and $2 in the replacement string refer to the parenthesized groups in the pattern. $1 is replaced by the text that matched the first group, $2 by the second, and so on, up to $9. The whole match can be referred to with $&.
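As a small illustration of these placeholders, the sketch below wraps every match in brackets using $& and reorders three groups using $1 to $3. The sample strings are made up for the example.

```javascript
// "$&" inserts the whole match.
var bracketed = "hello world".replace(/o/g, "[$&]");
// "$1", "$2", "$3" insert the first, second, and third group.
var reordered = "2015-11-01".replace(/(\d+)-(\d+)-(\d+)/, "$3.$2.$1");

console.log(bracketed); // → hell[o] w[o]rld
console.log(reordered); // → 01.11.2015
```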

You can also pass a function as the second argument. For each replacement, the function is called with the whole match and the matched groups as arguments, and its return value is inserted into the new string.

Simple example:

var s = "the cia and fbi";
console.log(s.replace(/\b(fbi|cia)\b/g, function(str) {
  return str.toUpperCase();
}));
// → the CIA and FBI

And here's a more interesting one:

var stock = "1 lemon, 2 cabbages, and 101 eggs";
function minusOne(match, amount, unit) {
  amount = Number(amount) - 1;
  if (amount == 1) // only one left, remove the "s" at the end
    unit = unit.slice(0, unit.length - 1);
  else if (amount == 0)
    amount = "no";
  return amount + " " + unit;
}
console.log(stock.replace(/(\d+) (\w+)/g, minusOne));
// → no lemon, 1 cabbage, and 100 eggs

The code takes a string, finds all occurrences of numbers followed by a word, and returns a string where each number is reduced by one.

The group (\d+) ends up in the amount argument, and the group (\w+) ends up in unit. The function converts amount to a number, which always works because our pattern is just \d+. It then adjusts the word in case there is only one item left.

Greed

It's easy to write a function using replace to remove all comments from JavaScript code. Here is the first try:

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*\*\//g, "");
}
console.log(stripComments("1 + /* 2*/3"));
// → 1 + 3
console.log(stripComments("x = 10;// ten!"));
// → x = 10;
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1  1

The part before the "or" operator matches two slashes followed by any number of non-newline characters. The part for multi-line comments is more involved. We use [^] (any character that is not in the empty set of characters) as a way to match any character. We cannot use a dot here because block comments can continue on a new line, and the dot does not match newline characters.

But the output of the previous example is wrong. Why?

The [^]* part first tries to match as many characters as it can. If that causes the next part of the regular expression to fail, it backtracks one character and tries again. In the example, the algorithm first tries to match the whole rest of the string and then moves back from there. After going back four characters it finds an occurrence of */ and matches there, which is not what we wanted. The intention was to match a single comment, not to go all the way to the end of the string and find the end of the last comment.

Because of this behavior, we say that the repetition operators (+, *, ?, and {}) are greedy: they first match as much as they can and then backtrack. If you put a question mark after such an operator (+?, *?, ??, {}?), it becomes non-greedy and starts by matching as little as possible, matching more only when the rest of the pattern does not fit the smaller match.

And that's what we need. By making the asterisk match the minimum possible number of characters on the line, we consume only one block of comments, and no more.

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*?\*\//g, "");
}
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1 + 1

Many bugs in regular expressions come from unintentionally using a greedy operator where a non-greedy one would work better. When using a repetition operator, consider the non-greedy variant first.
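The difference between the two variants is easiest to see side by side. A minimal sketch with a made-up input string:

```javascript
var tags = "<a><b>";

// Greedy: .+ grabs as much as possible, then backtracks to the last ">".
var greedy = tags.match(/<.+>/)[0];
// Non-greedy: .+? grabs as little as possible, stopping at the first ">".
var lazy = tags.match(/<.+?>/)[0];

console.log(greedy); // → <a><b>
console.log(lazy);   // → <a>
```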

Dynamic creation of RegExp objects

In some cases, the exact pattern is not known at the time the code is written. Say you want to look for a user's name in a piece of text and enclose it in underscores. Since you will know the name only once the program is running, you cannot use the slash notation.

But you can build a string and use the RegExp constructor. Here is an example:

var name = "harry";
var text = "Harry has a scar on his forehead.";
var regexp = new RegExp("\\b(" + name + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → _Harry_ has a scar on his forehead.

When writing the \b boundary markers, we have to use two backslashes, because we are writing them in a normal string rather than in a slash-enclosed regular expression. The second argument to the RegExp constructor contains the options for the regular expression, in our case "gi", for global and case-insensitive.

But what if the name is "dea+hlrd" (because our user is a would-be hacker)? The result would be a nonsensical regular expression that does not match the user's name.

To work around this, we can add a backslash before any character we do not trust. Adding backslashes before letters is a bad idea, because things like \b and \n have a special meaning. But escaping everything that is not alphanumeric or whitespace is safe.

var name = "dea+hlrd";
var text = "This dea+hlrd got everyone.";
var escaped = name.replace(/[^\w\s]/g, "\\$&");
var regexp = new RegExp("\\b(" + escaped + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → This _dea+hlrd_ got everyone.

search method

The indexOf method on strings cannot be called with a regular expression. But there is another method, search, which does expect a regular expression. Like indexOf, it returns the index of the first match, or -1 if there is none.

console.log("  word".search(/\S/));
// → 2
console.log("    ".search(/\S/));
// → -1

Unfortunately, there is no way to tell search to start looking at a given offset (as you can with the second argument to indexOf), which would often be useful.
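One possible workaround is to cut off the start of the string before searching and add the offset back afterwards. The helper below, searchFrom, is hypothetical and not part of the language. It is only a sketch: because it searches a sliced copy, anchors such as ^ and \b see the slice rather than the original string.

```javascript
// Hypothetical helper, not a built-in: index of the first match at or after offset.
function searchFrom(string, pattern, offset) {
  var index = string.slice(offset).search(pattern);
  return index == -1 ? -1 : index + offset;
}

var fromTwo = searchFrom("a1 b2 c3", /\d/, 2);
var fromEnd = searchFrom("a1 b2 c3", /\d/, 8);

console.log(fromTwo); // → 4
console.log(fromEnd); // → -1
```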

lastIndex property

The exec method also does not provide a convenient way to start searching from a given position in a string. But it gives an inconvenient way.

Regular expression objects have properties. One of them is source, which contains the string the expression was created from. Another is lastIndex, which controls, in some limited circumstances, where the next match will start.

These conditions include that the global option g must be present, and that the search must be done using the exec method. A smarter solution would be to simply allow an extra argument to be passed to exec, but sanity is not a fundamental feature in the JavaScript regex interface.

var pattern = /y/g;
pattern.lastIndex = 3;
var match = pattern.exec("xyzzy");
console.log(match.index);
// → 4
console.log(pattern.lastIndex);
// → 5

If the search was successful, the exec call updates the lastIndex property to point to the position after the found occurrence. If there was no success, lastIndex is set to zero - just like the lastIndex of the newly created object.

When using a global regular expression for multiple exec calls, these automatic lastIndex updates can cause problems. Your regular expression might accidentally start searching at an index left over from a previous call.

var digit = /\d/g;
console.log(digit.exec("here it is: 1"));
// → ["1"]
console.log(digit.exec("and now: 1"));
// → null
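One way to avoid this surprise, when a global regular expression must be reused on a different string, is to reset lastIndex by hand before the next call. A minimal sketch (variable names are illustrative):

```javascript
var digitPattern = /\d/g;

var first = digitPattern.exec("here it is: 1");
var leftover = digitPattern.lastIndex; // position just after the first match

digitPattern.lastIndex = 0;            // start from the beginning again
var second = digitPattern.exec("and now: 1");

console.log(first[0]);  // → 1
console.log(leftover);  // → 13
console.log(second[0]); // → 1
```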

Another interesting effect of the g option is that it changes how the match method works. When called with this option, instead of returning an array similar to the result of an exec, it finds all occurrences of the pattern in a string and returns an array of the substrings it finds.

console.log("Banana".match(/an/g));
// → ["an", "an"]

So be careful with global regular expressions. The cases where they are necessary (calls to replace and places where you want to use lastIndex explicitly) are typically the only places where you want to use them.

Loops over occurrences

A common task is to scan through all occurrences of a pattern in a string in a way that gives us access to the match object in the loop body. This can be done with lastIndex and exec.

var input = "A string with 3 numbers in it... 42 and 88.";
var number = /\b(\d+)\b/g;
var match;
while (match = number.exec(input))
  console.log("Found", match[1], "at", match.index);
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40

This makes use of the fact that the value of an assignment expression is the assigned value. By using match = number.exec(input) as the condition of the while loop, we perform the match at the start of each iteration, save its result in a variable, and stop looping when no more matches are found.

Parsing INI files

At the end of the chapter, we will consider a problem using regular expressions. Imagine that we are writing a program that automatically collects information about our enemies via the Internet. (We will not write the whole program, only the part that reads the settings file. Sorry.) The file looks like this:

searchengine=http://www.google.com/search?q=$1
spitefulness=9.7

; comments are preceded by a semicolon
; each section refers to a different enemy
[larry]
fullname=Larry Doe
type=kindergarten bully
website=http://www.geocities.com/CapeCanaveral/11451

[gargamel]
fullname=Gargamel
type=evil wizard
outputdir=/home/marijn/enemies/gargamel

The exact format of the file (which is quite widely used, and commonly referred to as INI) is as follows:

- blank lines and lines starting with a semicolon are ignored
- lines enclosed in square brackets start a new section
- lines containing an alphanumeric identifier followed by = add a setting to the current section

Anything else is invalid.

Our task is to convert such a string into an array of objects, each with a name property and an array of settings. Each section needs one object, and another one for global settings at the top of the file.

Since the file needs to be parsed line by line, it is a good idea to start by splitting the file into lines. To do this, we used string.split("\n") in Chapter 6. Some operating systems, however, separate lines not with a single \n character but with the two characters \r\n. Since the split method also accepts a regular expression as its argument, we can split on /\r?\n/, which allows both \n and \r\n between lines.
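The difference between the two ways of splitting shows up as soon as the input mixes both kinds of line breaks. A minimal sketch with a made-up input:

```javascript
var mixed = "first\r\nsecond\nthird";

var byRegExp = mixed.split(/\r?\n/); // handles both \n and \r\n
var byString = mixed.split("\n");    // leaves a stray \r on the first line

console.log(byRegExp); // → ["first", "second", "third"]
console.log(byString); // → ["first\r", "second", "third"]
```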

function parseINI(string) {
  // Start with an object holding the top-level settings
  var currentSection = {name: null, fields: []};
  var categories = [currentSection];

  string.split(/\r?\n/).forEach(function(line) {
    var match;
    if (/^\s*(;.*)?$/.test(line)) {
      return;
    } else if (match = line.match(/^\[(.*)\]$/)) {
      currentSection = {name: match[1], fields: []};
      categories.push(currentSection);
    } else if (match = line.match(/^(\w+)=(.*)$/)) {
      currentSection.fields.push({name: match[1], value: match[2]});
    } else {
      throw new Error("The line '" + line + "' contains invalid data.");
    }
  });

  return categories;
}

The code goes over every line in the file, updating the "current section" object as it goes along. First, it checks whether the line can be ignored, using the expression /^\s*(;.*)?$/. Do you see how it works? The part between the parentheses matches comments, and the ? after it makes sure that the expression also matches lines consisting only of whitespace.

If the line is not a comment, the code checks to see if it starts a new section. If so, it creates a new object for the current section, to which subsequent settings are added.

The last meaningful possibility is that the line is a normal setting, in which case it is added to the current section object.

If none of the options worked, the function throws an error.
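To see what the parser produces, it can be run on a small input. The sketch below repeats the parsing logic described above so that it is self-contained; the input string and the section name "larry" are made up for illustration.

```javascript
// Same logic as the parser described above, repeated so the example runs on its own.
function parseINI(string) {
  var currentSection = {name: null, fields: []};
  var categories = [currentSection];
  string.split(/\r?\n/).forEach(function(line) {
    var match;
    if (/^\s*(;.*)?$/.test(line)) {
      return; // blank line or comment
    } else if (match = line.match(/^\[(.*)\]$/)) {
      currentSection = {name: match[1], fields: []};
      categories.push(currentSection);
    } else if (match = line.match(/^(\w+)=(.*)$/)) {
      currentSection.fields.push({name: match[1], value: match[2]});
    } else {
      throw new Error("The line '" + line + "' contains invalid data.");
    }
  });
  return categories;
}

// Hypothetical input: one global setting, one comment, one section.
var ini = "spitefulness=9.7\n; a comment\n[larry]\n" +
          "fullname=Larry Doe\ntype=kindergarten bully";
var parsed = parseINI(ini);

console.log(parsed.length);             // → 2
console.log(parsed[0].name);            // → null (global settings)
console.log(parsed[1].name);            // → larry
console.log(parsed[1].fields[0].value); // → Larry Doe
```

The first object collects the settings that appear before any section header, which is why its name is null.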

Notice how the recurring use of ^ and $ ensures that each expression matches the entire line, not just part of it. If they are left out, the code mostly works, but it behaves strangely for some input, and such a bug can be difficult to track down.

The pattern if (match = string.match(...)) is similar to the trick of using an assignment as the condition of a while loop. You often are not sure that a call to match will succeed, so you can access the resulting object only inside an if statement that tests for it. To avoid breaking the neat chain of if checks, we assign the result of the match to a variable and immediately use that assignment as the test.

International symbols

Because of JavaScript's initially simplistic implementation, and the fact that this implementation was then set in stone, JavaScript regular expressions are rather dumb about characters that do not appear in English. For example, as far as JavaScript regular expressions are concerned, a "word character" is one of the 26 letters of the Latin alphabet (uppercase or lowercase), a decimal digit, or, for some reason, the underscore. Characters like é or β, which are unquestionably letters, do not match \w (and do match the uppercase \W, the non-word category).

By a strange coincidence, historically \s (space) matches all characters that are considered whitespace in Unicode, including things like the non-breaking space or the Mongolian vowel separator.
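Both observations are easy to verify. The sketch below uses escape sequences for the non-English characters so that it does not depend on the file encoding; the variable names are illustrative.

```javascript
var eAcute = "\u00e9"; // é
var beta = "\u03b2";   // β
var nbsp = "\u00a0";   // non-breaking space

var eIsWord = /\w/.test(eAcute);
var betaIsNonWord = /\W/.test(beta);
var nbspIsSpace = /\s/.test(nbsp);

console.log(eIsWord);       // → false
console.log(betaIsNonWord); // → true
console.log(nbspIsSpace);   // → true
```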

Some regular expression implementations in other languages have special syntax for matching specific categories of Unicode characters, such as "all uppercase letters", "all punctuation", or "control characters". At the time of writing, JavaScript had no such categories, though there were plans to add them.

Summary

Regular expressions are objects that represent patterns in strings. They use their own syntax to express these patterns.

/abc/      A sequence of characters
/[abc]/    Any character from a set of characters
/[^abc]/   Any character not in a set of characters
/[0-9]/    Any character in a range of characters
/x+/       One or more occurrences of the pattern x
/x+?/      One or more occurrences, non-greedy
/x*/       Zero or more occurrences
/x?/       Zero or one occurrence
/x{2,4}/   Between two and four occurrences
/(abc)/    A group
/a|b|c/    Any one of several patterns
/\d/       Any digit character
/\w/       An alphanumeric character ("word character")
/\s/       Any whitespace character
/./        Any character except newlines
/\b/       A word boundary
/^/        Start of input
/$/        End of input

Regular expression objects have a test method to check whether a pattern occurs in a string. They also have an exec method that, when a match is found, returns an array containing the whole match followed by all matched groups. That array has an index property holding the position where the match starts.

Strings have a match method to search for patterns, and a search method that returns only the starting position of the occurrence. The replace method can replace occurrences of a pattern with another string. In addition, you can pass a function to replace that will build a replacement string based on the pattern and the found groups.

Regular expressions have settings that are written after the closing slash. The i option makes the regular expression case-insensitive, and the g option makes it global, which, among other things, causes the replace method to replace all occurrences it finds, not just the first one.

The RegExp constructor can be used to create regular expressions from strings.

Regular expressions are a sharp tool with an awkward handle. They greatly simplify some tasks, and can become unmanageable when applied to other, complex tasks. Part of knowing how to use regular expressions is resisting the temptation to stuff them into a task they were not designed for.

Exercises

Inevitably, when solving problems, you will encounter incomprehensible cases, and you can sometimes despair, seeing the unpredictable behavior of some regular expressions. Sometimes it helps to study the behavior of a regular expression through an online service like debuggex.com, where you can see its visualization and compare it with the desired effect.
Regexp golf
"Code golf" is the name of a game where you try to express a given program in as few characters as possible. Regexp golf is the practice of writing the smallest possible regular expression to match a given pattern, and only that pattern.

For each of the following items, write a regular expression that tests whether any of the given substrings occur in a string. The regular expression should match only strings containing one of the substrings described. Don't worry about word boundaries unless they are specifically mentioned. When you have a working expression, try to make it shorter.

- car and cat
- pop and prop
- ferret, ferry, and ferrari
- Any word ending in ious
- A space followed by a period, comma, colon, or semicolon
- A word longer than six letters
- A word without the letter e

// Enter your regular expressions
verify(/.../, ["my car", "bad cats"], ["camper", "high art"]);
verify(/.../, ["pop culture", "mad props"], ["plop"]);
verify(/.../, ["ferret", "ferry", "ferrari"], ["ferrum", "transfer A"]);
verify(/.../, ["how delicious", "spacious room"], ["ruinous", "consciousness"]);
verify(/.../, ["bad punctuation ."], ["escape the dot"]);
verify(/.../, ["hottentottententen"], ["no", "hotten totten tenten"]);
verify(/.../, ["red platypus", "wobbling nest"], ["earth bed", "learning ape"]);

function verify(regexp, yes, no) {
  // Ignore unfinished exercises
  if (regexp.source == "...") return;
  yes.forEach(function(s) {
    if (!regexp.test(s))
      console.log("Did not find '" + s + "'");
  });
  no.forEach(function(s) {
    if (regexp.test(s))
      console.log("Unexpected occurrence of '" + s + "'");
  });
}

Quotation marks in text
Let's say you've written a story and used single quotes throughout to mark pieces of dialogue. Now you want to replace the dialogue quotes with double quotes, while keeping the single quotes used in contractions like aren't.

Come up with a pattern that differentiates between these two uses of quotes, and write a call to the replace method that does the replacement.

Numbers again
Sequences of digits can be found with a simple regular expression /\d+/.

Write an expression that finds only numbers written in JavaScript style. It must support a possible minus or plus in front of the number, a decimal point, and exponential notation 5e-3 or 1E10 - again with a plus or minus possible. Also note that the dot does not have to be preceded or followed by digits, but the number cannot consist of a single dot. That is, .5 or 5. are valid numbers, but one dot by itself is not.

// Enter a regular expression here.
var number = /^...$/;

// Tests:
["1", "-1", "+15", "1.55", ".5", "5.", "1.3e2", "1E-4", "1e+12"].forEach(function(s) {
  if (!number.test(s))
    console.log("Didn't find '" + s + "'");
});
["1a", "+-1", "1.2.3", "1+1", "1e4.5", ".5.", "1f5", "."].forEach(function(s) {
  if (number.test(s))
    console.log("Invalid '" + s + "'");
});

Regular Expressions allow flexible search for words and expressions in texts in order to delete, extract or replace them.

Syntax:

// First way to create a regular expression
var regexp = new RegExp(pattern, modifiers);
// Second way to create a regular expression
var regexp = /pattern/modifiers;

pattern specifies the character pattern to search for.

modifiers allow you to customize the behavior of the search:

  • i - case-insensitive search;
  • g - global search (all matches in the text will be found, not just the first one);
  • m - multiline search.
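The effect of the modifiers can be seen on a two-line string. With m, the ^ and $ anchors match at the start and end of every line rather than only at the start and end of the whole string. A minimal sketch with a made-up input:

```javascript
var lines = "one\ntwo";

var withoutM = lines.match(/^\w+/g);  // ^ matches only at the start of the string
var withM = lines.match(/^\w+/gm);    // ^ matches at the start of each line
var ignoreCase = /ONE/i.test(lines);  // i ignores the case of letters

console.log(withoutM);   // → ["one"]
console.log(withM);      // → ["one", "two"]
console.log(ignoreCase); // → true
```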

Search for words and expressions

The simplest use of regular expressions is to search for words and expressions in various texts.

Here is an example of using search with modifiers:

// Set the regular expression rv1
rv1 = /Russia/;
// Set the regular expression rv2
rv2 = /Russia/g;
// Set the regular expression rv3
rv3 = /Russia/ig;

// Text being searched:
// "Russia is the largest country in the world. Russia borders on 18 countries. RUSSIA is the successor state of the USSR."
// rv1 matches only the first "Russia".
// rv2 matches both occurrences of "Russia", but not "RUSSIA".
// rv3 matches all three: both occurrences of "Russia" and "RUSSIA".

Special symbols

In addition to ordinary characters, regular expression patterns can use special characters (metacharacters). The special characters are described in the table below:

Special character   Description
.                   Matches any character except the end-of-line character.
\w                  Matches any letter, digit, or underscore.
\W                  Matches any character that is not a letter, digit, or underscore.
\d                  Matches digit characters.
\D                  Matches characters that are not digits.
\s                  Matches whitespace characters.
\S                  Matches non-whitespace characters.
\b                  Matches only at word boundaries (beginning or end of a word).
\B                  Matches only at positions that are not word boundaries.
\n                  Matches a newline character.

/* The reg1 expression finds all words that start with two arbitrary characters and end with "vet". Since the words in the sentence are separated by spaces, we add the special character \s at the beginning and at the end. The sample words are transliterated Russian words. */
reg1 = /\s..vet\s/g;
txt = " privet zavet velvet klozet ";
document.write(txt.match(reg1) + "<br />");
/* The reg2 expression finds all words that start with three arbitrary characters and end with "vet" */
reg2 = /\s...vet\s/g;
document.write(txt.match(reg2) + "<br />");
txt1 = " pri2vet privet pri1vet ";
/* The reg3 expression finds all words that start with "pri", followed by one digit, and end with "vet" */
var reg3 = /pri\dvet/g;
document.write(txt1.match(reg3) + "<br />");
// The reg4 expression finds all the digits in the text
var reg4 = /\d/g;
txt2 = "5 years of study, 3 years of swimming, 9 years of shooting.";
document.write(txt2.match(reg4) + "<br />");

Characters in square brackets

Square brackets, as in [keyu], let you specify a set of characters to search for.

The ^ character at the start of the brackets, as in [^kvg], means that any character except the listed ones should match.

A dash (-) between characters in square brackets, as in [a-h], specifies a range of characters to search for.

You can also search for digits using square brackets, for example with [0-9].

// Set the regular expression reg1
reg1 = /\sko[tdm]\s/g;
// Set the text string txt1 (transliterated Russian words)
txt1 = " kot kosa kod komod kom kover ";
// Use the regular expression reg1 to search the string txt1
document.write(txt1.match(reg1) + "<br />");
reg2 = /\sslo[^tg]/g;
txt2 = " slot slon slog";
document.write(txt2.match(reg2) + "<br />");
reg3 = /[0-9]/g;
txt3 = "5 years of study, 3 years of swimming, 9 years of shooting";
document.write(txt3.match(reg3));

Quantifiers

A quantifier is a construct that specifies how many times the preceding character or group of characters must occur in a match.

Syntax:

// The preceding character must occur exactly x times
{x}
// The preceding character must occur from x to y times, inclusive
{x,y}
// The preceding character must occur at least x times
{x,}
// The preceding character must occur 0 or more times
*
// The preceding character must occur 1 or more times
+
// The preceding character must occur 0 or 1 times
?


// Set the regular expression rv1
rv1 = /ko{5}shka/g
// Set the regular expression rv2
rv2 = /ko{3,}shka/g
// Set the regular expression rv3
rv3 = /ko+shka/g
// Set the regular expression rv4
rv4 = /ko?shka/g
// Set the regular expression rv5
rv5 = /ko*shka/g

// Text being searched (the same word with zero to five letters "o"):
// "kshka koshka kooshka koooshka kooooshka koooooshka"
// rv1 matches: koooooshka
// rv2 matches: koooshka, kooooshka, koooooshka
// rv3 matches: koshka, kooshka, koooshka, kooooshka, koooooshka
// rv4 matches: kshka, koshka
// rv5 matches: all six words

Note: to use a special character (such as ., *, +, ?, or {}) as an ordinary character, precede it with a backslash (\).
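The effect of escaping is easy to check with test. A minimal sketch with made-up inputs:

```javascript
// Unescaped "+" is a quantifier: /1+1/ needs at least two 1s in a row.
var plusAsQuantifier = /1+1/.test("1+1");
// Escaped "\+" matches a literal plus sign.
var plusAsLiteral = /1\+1/.test("1+1");
// Unescaped "." matches any character; escaped "\." matches only a dot.
var dotAsWildcard = /a.c/.test("abc");
var dotAsLiteral = /a\.c/.test("abc");

console.log(plusAsQuantifier); // → false
console.log(plusAsLiteral);    // → true
console.log(dotAsWildcard);    // → true
console.log(dotAsLiteral);     // → false
```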

Using parentheses

By enclosing part of a regular expression pattern in parentheses, you tell the expression to remember the match found by that part of the pattern. The saved match can be used later in your code.

For example, the regular expression /(Dmitry)\sVasilyev/ will find the string "Dmitry Vasilyev" and remember the substring "Dmitry".

In the example below, we use the replace() method to change the order of words in text. We use $1 and $2 to access stored matches.

var regexp = /(Dmitry)\s(Vasilyev)/;
var text = "Dmitry Vasilyev";
var newtext = text.replace(regexp, "$2 $1");
document.write(newtext); // → Vasilyev Dmitry

Parentheses can be used to group characters before quantifiers.
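For instance, a quantifier placed after a group applies to the whole group rather than to the last character only. A minimal sketch with a made-up input:

```javascript
// Without a group, "+" applies only to the letter b.
var lastCharOnly = /ab+/.exec("ababab")[0];
// With a group, "+" applies to the whole sequence "ab".
var wholeGroup = /(ab)+/.exec("ababab")[0];

console.log(lastCharOnly); // → ab
console.log(wholeGroup);   // → ababab
```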

A computer