Using regular expressions (regex) in Linux

Linux grep regular expressions

One of the most useful and versatile commands in the Linux terminal is "grep". The name is an acronym for "global regular expression print" (that is, "search globally for lines matching a regular expression and print them"). In other words, grep checks whether its input matches a given pattern and prints the lines that do.

This seemingly trivial program is very powerful when used correctly. Its ability to filter input based on complex rules makes it a popular link in many command pipelines.

This tutorial looks at some of the features of the grep command and then moves on to using regular expressions. All the techniques described in this guide can be applied to managing a virtual server.

Usage Basics

In its simplest form, grep is used to find matches for literal patterns in a text file. This means that if you give grep a search word, it will print every line of the file that contains that word.

As an example, you can use grep to search for lines containing the word "GNU" in version 3 of the GNU General Public License on an Ubuntu system.

cd /usr/share/common-licenses
grep "GNU" GPL-3
GNU GENERAL PUBLIC LICENSE





13. Use with the GNU Affero General Public License.
under version 3 of the GNU Affero General Public License into a single
...
...

The first argument, "GNU", is the pattern to search for, and the second argument, "GPL-3", is the input file to search.

As a result, all lines containing the text pattern will be displayed. In some Linux distributions the searched pattern will be highlighted in the displayed lines.

General options

By default, grep simply searches for exactly the specified pattern in the input file and prints the lines it finds. However, grep's behavior can be changed by adding optional flags.

If you want to ignore the case of the search pattern and match both uppercase and lowercase variations, you can use the "-i" or "--ignore-case" option.

For example, you can use grep to search the same file for the word "license" in upper, lower, or mixed case.

grep -i "license" GPL-3
GNU GENERAL PUBLIC LICENSE
of this license document, but changing it is not allowed.
The GNU General Public License is a free, copyleft license for
The licenses for most software and other practical works are designed
the GNU General Public License is intended to guarantee your freedom to
GNU General Public License for most of our software; it also applies to


"This License" refers to version 3 of the GNU General Public License.
"The Program" refers to any copyrightable work licensed under this
...
...

As you can see, the output contains "LICENSE", "license", and "License". If there were an instance of "LiCeNsE" in the file, it would also be output.

If you want to find all lines that do not contain the specified pattern, you can use the "-v" or "--invert-match" flag.

As an example, you can use the following command to search the BSD license for all lines that do not contain the word "the":

grep -v "the" BSD
All rights reserved.
Redistribution and use in source and binary forms, with or without
are met:
may be used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
...
...

As you can see, because the "ignore case" option was not used, the last two lines were returned as not containing the word "the".

It is often useful to know the line numbers where matches were found. You can get them with the "-n" or "--line-number" flag.

If you apply this flag in the previous example, the following output will be displayed:

grep -vn "the" BSD
2:All rights reserved.
3:
4:Redistribution and use in source and binary forms, with or without
6:are met:
13: may be used to endorse or promote products derived from this software
14: without specific prior written permission.
15:
16:THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
17:ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
...
...

You can now refer to the line number as needed to make changes on each line that does not contain "the".
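
The three options above can be tried without the license files. The following is a minimal sketch: the sample text is made up for illustration, and the variable names are arbitrary.

```shell
# Hypothetical sample text standing in for a license file.
sample=$(mktemp)
printf '%s\n' \
  'GNU GENERAL PUBLIC LICENSE' \
  'This License applies to the work.' \
  'All rights reserved.' \
  'THE SOFTWARE IS PROVIDED AS IS' \
  'see the license text' > "$sample"

# -i: match "license" in any letter case (lines 1, 2 and 5).
any_case=$(grep -i "license" "$sample")

# -v: keep only the lines that do NOT contain lowercase "the".
without_the=$(grep -v "the" "$sample")

# -v and -n together: the same lines, prefixed with their line numbers.
numbered=$(grep -vn "the" "$sample")

printf '%s\n' "$any_case" '---' "$without_the" '---' "$numbered"
rm -f "$sample"
```

Note that line 4 survives the "-v" filter because "THE" is not the same string as "the"; adding "-i" would remove it as well.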

Regular Expressions

As mentioned in the introduction, grep stands for "global regular expression print". A regular expression is a text string that describes a specific search pattern.

Different applications and programming languages implement regular expressions slightly differently. This guide covers only a small subset of the ways grep describes its patterns.

Literal matches

The above examples of searching for "GNU" and "the" used very simple regular expressions that exactly matched the character strings "GNU" and "the".

It is more accurate to think of them as matching strings of characters than as matching words. As you become familiar with more complex patterns, this distinction will become more significant.

Patterns that exactly specify the characters to match are called "literals" because they match the pattern literally, character for character.

All alphabetic and numeric characters (as well as certain other characters) are matched literally unless they are modified by other expression mechanisms.
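
The difference between matching a string of characters and matching a word can be seen in a small sketch. The input lines are invented for illustration; the "-w" option used for comparison is the word-match flag listed in the grep options later in this text.

```shell
# Hypothetical input lines, chosen to show substring versus whole-word matching.
input=$(printf '%s\n' 'the cat' 'another dog' 'Theory')

# A literal pattern matches the character sequence anywhere, even inside "another".
as_substring=$(printf '%s\n' "$input" | grep "the")

# -w restricts the match to whole words.
as_word=$(printf '%s\n' "$input" | grep -w "the")

printf 'substring matches:\n%s\nword matches:\n%s\n' "$as_substring" "$as_word"
```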

Anchor matches

Anchors are special characters that indicate the location in a string of a desired match.

For example, you can specify that the search only matches lines that have "GNU" at the very beginning. To do this, put the "^" anchor before the literal string.

In this example, only the lines that begin with "GNU" are output.

grep "^GNU" GPL-3
GNU General Public License for most of our software; it also applies to
GNU General Public License, you may choose any version ever published

Similarly, the "$" anchor can be used after a literal string to indicate that the match is valid only if the string occurs at the very end of a line.

The following regular expression outputs only those lines that end with "and":

grep "and$" GPL-3
that there is no warranty for this free software. For both users' and
The precise terms and conditions for copying, distribution and


alternative is allowed only occasionally and noncommercially, and
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
provisionally, unless and until the copyright holder explicitly and
receives a license from the original licensors, to run, modify and
make, use, sell, offer for sale, import and otherwise run, modify and
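
Both anchors can be checked on a few made-up lines. This is only a sketch; the sample text and variable names are assumptions. Single quotes are used around 'and$' so that the shell leaves the dollar sign alone.

```shell
# Hypothetical sample lines for the two anchors.
sample=$(mktemp)
printf '%s\n' \
  'GNU software is free' \
  'The GNU project' \
  'copying, distribution and' \
  'and modification' > "$sample"

# ^ anchors the match to the beginning of the line.
at_start=$(grep '^GNU' "$sample")

# $ anchors the match to the end of the line.
at_end=$(grep 'and$' "$sample")

printf 'starts with GNU: %s\nends with and: %s\n' "$at_start" "$at_end"
rm -f "$sample"
```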

Match any character

The dot (.) is used in regular expressions to indicate that any character can appear at the specified location.

For example, if you want to find matches containing two characters and then the sequence "cept", you would use the following pattern:

grep "..cept" GPL-3
use, which is precisely where it is most unacceptable. Therefore, we
infringement under applicable copyright law, except executing it on a
tells the user that there is no warranty for the work (except to the

form of a separately written license, or stated as exceptions;
You may not propagate or modify a covered work except as expressly
9. Acceptance Not Required for Having Copies.
...
...

As you can see, the words "accept" and "except" are displayed in the results, as well as variations of these words. The pattern would also match the sequence "z2cept" if there was one in the text.
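
A short sketch of the same pattern on an invented word list shows that each dot must consume exactly one character, so words with fewer than two characters before "cept" are not matched.

```shell
# Hypothetical word list; "cept" and "xcept" have fewer than two characters before "cept".
words=$(printf '%s\n' accept except concept z2cept cept xcept)

# Each dot stands for exactly one arbitrary character.
matches=$(printf '%s\n' "$words" | grep "..cept")

printf '%s\n' "$matches"
```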

Expressions in brackets

By placing a group of characters in square brackets ("[" and "]"), you can indicate that the character at that position can be any one of the characters in the brackets.

This means that if you need to find lines containing "too" or "two", you can specify both variations concisely with the following pattern:

grep "t[wo]o" GPL-3
your programs, too.

Developers that use the GNU GPL protect your rights with two steps:
a computer network, with no transfer of a copy, is not conveying.

Corresponding Source from a network server at no charge.
...
...

As you can see, both variations were found in the file.

Bracket expressions also provide several other useful features. You can specify that the pattern matches anything except the characters in the brackets by starting the list with the "^" character.

This example matches the pattern ".ode" but must not match the sequence "code".

grep "[^c]ode" GPL-3
1. Source code.
model, to give anyone who possesses the object code either (1) a
the only significant mode of use of the product.
notice like this when it starts in an interactive mode:

It is worth noting that the second output line does contain the word "code". This is not a regex or grep error.

Rather, this line was returned because it also contains the sequence "mode", found in the word "model", which matches the pattern. The line was output because at least one match was found in it.
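
Both bracket forms can be reproduced on a few invented lines. The sketch below assumes nothing beyond standard grep; the input text is made up to mirror the "code"/"model" situation described above.

```shell
# "t[wo]o" accepts either "w" or "o" in the middle position.
too_or_two=$(printf '%s\n' 'too' 'two' 'tao' 'to' | grep "t[wo]o")

# "[^c]ode" accepts any character except "c" before "ode".
# The second line contains "code" but is still printed because of "model".
not_code=$(printf '%s\n' 'Source code.' 'object code model' 'interactive mode' | grep "[^c]ode")

printf '%s\n' "$too_or_two" '---' "$not_code"
```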

Another useful feature of brackets is the ability to specify a range of characters instead of typing each character separately.

This means that if you want to find every line that starts with a capital letter, you can use the following pattern:

grep "^[A-Z]" GPL-3
GNU General Public License for most of our software; it also applies to

License. Each licensee is addressed as "you". "Licensees" and


System Libraries, or general-purpose tools or generally available free
Source.

...
...

Because of some inherent sorting (collation) issues, POSIX character classes often give a more accurate result than character ranges like the one used in the example above.

There are many character classes not covered in this guide. For example, to perform the same search as in the example above, you can use the character class "[:upper:]" inside a bracket expression.

grep "^[[:upper:]]" GPL-3
GNU General Public License for most of our software; it also applies to
States should not allow patents to restrict development and use of
License. Each licensee is addressed as "you". "Licensees" and
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
System Libraries, or general-purpose tools or generally available free
Source.
User Product is transferred to the recipient in perpetuity or for a
...
...
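
The range and the character class can be compared side by side. In this sketch the input lines are invented, and setting LC_ALL=C for the range is an assumption made here so that "A-Z" covers exactly the ASCII capital letters regardless of the system locale.

```shell
# Hypothetical input lines: two start with a capital letter, two do not.
input=$(printf '%s\n' 'GNU project' 'source code' 'License text' '9 lives')

# Character range, forced into the C locale to avoid collation surprises.
by_range=$(printf '%s\n' "$input" | LC_ALL=C grep "^[A-Z]")

# POSIX character class, which is defined independently of the sort order.
by_class=$(printf '%s\n' "$input" | grep "^[[:upper:]]")

printf 'range:\n%s\nclass:\n%s\n' "$by_range" "$by_class"
```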

Repeat pattern (0 or more times)

One of the most commonly used metacharacters is the character "*", which means "repeat the previous character or expression 0 or more times".

For example, to find every line that contains an opening and a closing parenthesis with only letters and single spaces between them, you can use the following expression:

grep "([A-Za-z ]*)" GPL-3

distribution (with or without modification), making available to the
than the work as a whole, that (a) is included in the normal form of
Component, and (b) serves only to enable use of the work with that
(if any) on which the executable work runs, or a compiler used to
(including a physical distribution medium), accompanied by the
(including a physical distribution medium), accompanied by a
place (gratis or for a charge), and offer equivalent access to the
...
...
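
Because "*" also allows zero repetitions, an empty pair of parentheses matches too. The following sketch uses invented lines to show this; in basic regular expressions the unescaped parentheses are literal characters.

```shell
# Hypothetical lines with different contents between the parentheses.
matches=$(printf '%s\n' \
  '(a) is included' \
  '(if any) on which' \
  '(see section 7b)' \
  'empty () pair' \
  'no parentheses' | grep "([A-Za-z ]*)")

# "(see section 7b)" is rejected because of the digit; "()" matches with zero repetitions.
printf '%s\n' "$matches"
```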

Escaping metacharacters

Sometimes you may want to search for a literal dot or a literal opening parenthesis. Because these characters have a special meaning in regular expressions, you need to "escape" them to tell grep not to use their special meaning in this case.

These characters can be escaped by placing a backslash (\) before the character that would normally have a special meaning.

For example, if you want to find every line that starts with a capital letter and ends with a dot, you can use the following expression. The backslash before the last dot "escapes" it, so that it represents a literal dot and does not have the meaning "any character":

grep "^[A-Z].*\.$" GPL-3
Source.
License by making exceptions from one or more of its conditions.
License would be to refrain entirely from conveying the Program.
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
SUCH DAMAGES.
Also add information on how to contact you by electronic and paper mail.
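
The effect of the backslash is easiest to see by running the same search with and without it. This is a sketch on made-up lines; the "-F" (fixed string) option shown last is an alternative that treats the whole pattern literally.

```shell
# Hypothetical input: only the second line contains a real "3.0".
input=$(printf '%s\n' 'Version 3x0' '3.0 release' 'End of terms.')

# Unescaped dot: matches any character, so "3x0" is found as well.
unescaped=$(printf '%s\n' "$input" | grep '3.0')

# Escaped dot: matches only a literal ".".
escaped=$(printf '%s\n' "$input" | grep '3\.0')

# -F disables regular expression interpretation altogether.
fixed=$(printf '%s\n' "$input" | grep -F '3.0')

printf '%s\n' "$unescaped" '---' "$escaped" '---' "$fixed"
```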

Extended regular expressions

The grep command can also be used with the extended regular expression language by using the "-E" flag, or by calling the "egrep" command instead of "grep".

These commands open up the possibilities of "extended regular expressions". Extended regular expressions include all the basic metacharacters, as well as additional metacharacters to express more complex matches.

Grouping

One of the simplest and most useful features of extended regular expressions is the ability to group expressions and use them as a whole.

Parentheses are used to group expressions. If you need grouping outside of extended regular expressions (that is, in basic regular expressions), you can "escape" the parentheses with a backslash to enable it.

grep "\(grouping\)" file.txt
grep -E "(grouping)" file.txt
egrep "(grouping)" file.txt

The above expressions are equivalent.
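
Grouping becomes useful once something is applied to the whole group. As a minimal sketch on invented input, the basic and extended forms below repeat the group "ab" zero or more times and should select the same lines.

```shell
# Hypothetical input lines.
input=$(printf '%s\n' 'ababc' 'abc' 'c' 'ac')

# Basic regular expression: grouping parentheses are written with backslashes.
bre=$(printf '%s\n' "$input" | grep '^\(ab\)*c$')

# Extended regular expression: plain parentheses group.
ere=$(printf '%s\n' "$input" | grep -E '^(ab)*c$')

printf 'BRE:\n%s\nERE:\n%s\n' "$bre" "$ere"
```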

Alternation

Just as square brackets specify different possible matches for a single character, alternation allows you to specify alternative matches for strings of characters or sets of expressions.

The vertical bar character "|" is used to denote alternation. Alternation is often used in grouping to indicate that one of two or more possible choices should be considered a match.

In this example, you need to find "GPL" or "General Public License":

grep -E "(GPL|General Public License)" GPL-3
The GNU General Public License is a free, copyleft license for
the GNU General Public License is intended to guarantee your freedom to
GNU General Public License for most of our software; it also applies to
price. Our General Public Licenses are designed to make sure that you
Developers that use the GNU GPL protect your rights with two steps:
For the developers' and authors' protection, the GPL clearly explains
authors' sake, the GPL requires that modified versions be marked as
have designed this version of the GPL to prohibit the practice for those
...
...

Alternation can choose between more than two options: add the remaining options to the group, separating each one with the pipe character "|".

Quantifiers

In extended regular expressions, there are metacharacters that indicate how many times a character repeats, much as the "*" metacharacter matches the previous character or character set 0 or more times.

To match a character 0 or 1 times, use the "?" character. It makes the previous character or character set essentially optional.

In this example, putting the sequence "copy" in an optional group makes the pattern match both "copyright" and "right":
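
A sketch of alternation with two and with three options, on made-up lines (the strings "MIT" and "BSD" are only illustrative alternatives):

```shell
# Hypothetical input lines.
input=$(printf '%s\n' 'the GNU GPL protects' 'General Public License' 'Lesser license' 'MIT')

# Two alternatives inside one group.
two_way=$(printf '%s\n' "$input" | grep -E "(GPL|General Public License)")

# Three alternatives: every additional option is separated by another "|".
three_way=$(printf '%s\n' "$input" | grep -E "(GPL|MIT|BSD)")

printf '%s\n' "$two_way" '---' "$three_way"
```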

grep -E "(copy)?right" GPL-3
Copyright (C) 2007 Free Software Foundation, Inc.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
"Copyright" also means copyright-like laws that apply to other kinds of
...
...

The "+" symbol matches expressions 1 or more times. It works almost like the "*" character, but when using "+", the expression must match at least 1 time.

The following expression matches the string "free" plus 1 or more non-whitespace characters:

grep -E "free[^[:space:]]+" GPL-3
The GNU General Public License is a free, copyleft license for
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
When we speak of free software, we are referring to freedom, not
have the freedom to distribute copies of free software (and charge for

freedoms that you received. You must make sure that they, too, receive
protecting users' freedom to change the software. The systematic
of the GPL, as needed to protect the freedom of users.
patents cannot be used to render the program non-free.
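
The two quantifiers can be compared on short invented inputs. In this sketch, "?" makes the group optional, while "+" demands at least one more non-whitespace character after "free".

```shell
# "(copy)?" may be present or absent, so both words with "right" match.
optional=$(printf '%s\n' copyright right copy righ | grep -E "(copy)?right")

# "+" needs at least one non-whitespace character directly after "free".
one_or_more=$(printf '%s\n' 'free' 'freedom' 'free software' 'non-free.' \
  | grep -E "free[^[:space:]]+")

printf '%s\n' "$optional" '---' "$one_or_more"
```

The bare word "free" and "free software" are not printed, because "free" is followed by nothing or by a space in those lines.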

Number of match repetitions

Curly braces ("{" and "}") can be used to specify the number of times a match is repeated. They can indicate an exact number, a range, or an upper or lower bound on the number of times an expression can match.

If you want to find all lines that contain three vowels in a row, you can use the following expression:

grep -E "[AEIOUaeiou]{3}" GPL-3
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
receive it, in any medium, provided that you conspicuously and
give under the previous paragraph, plus a right to possession of the
covered work so as to satisfy simultaneously your obligations under this

If you need to find all words that are 16 to 20 characters long, use the following expression:

grep -E "[[:alpha:]]{16,20}" GPL-3
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
c) Prohibiting misrepresentation of the origin of that material, or
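
Both interval forms can be tried on individual words. The word lists below are invented for this sketch. Note that without anchors or word boundaries, "{16,20}" also matches inside longer words, since any 16 to 20 consecutive letters are enough.

```shell
# Exact count: three vowels in a row ("iou" in "previous", "ueu" in "queue").
three_vowels=$(printf '%s\n' previous queue tree book | grep -E "[AEIOUaeiou]{3}")

# Range: runs of 16 to 20 letters; "simultaneously" has only 14.
long_words=$(printf '%s\n' responsibilities misrepresentation simultaneously \
  | grep -E "[[:alpha:]]{16,20}")

printf '%s\n' "$three_vowels" '---' "$long_words"
```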

Conclusions

The grep command is useful in many cases for finding patterns within files or within a file system hierarchy. It saves a lot of time, so it is worth familiarizing yourself with its options and syntax.

Regular expressions are even more versatile and can be used with many popular programs. For example, many text editors use regular expressions to find and replace text.

Moreover, most modern programming languages use regular expressions to perform operations on specific pieces of data. The ability to work with regular expressions will be useful in solving common computing problems.


grep stands for 'global regular expression printer'. grep extracts the lines you want from text files: those that contain the text specified by the user.

grep can be used in two ways: on its own or in combination with streams (pipes).

grep is very rich in functionality thanks to the large number of options it supports, such as searching with a fixed string, a regular expression (RegExp) pattern, or a Perl-based regex.

Because of this range of functionality, the grep tool has many variants, including egrep (Extended GREP), fgrep (Fixed GREP), pgrep (Process GREP), rgrep (Recursive GREP), etc. These variants differ only slightly from the original grep.

grep options

$ grep -V
grep (GNU grep) 2.10
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+

There are variants of the grep utility: egrep (which processes extended regular expressions), fgrep (which treats the characters $*^|()\ as literals), and rgrep (with recursive search enabled).

    egrep is the same as grep -E

    fgrep is the same as grep -F

    rgrep is the same as grep -r

    grep [-b] [-c] [-i] [-l] [-n] [-s] [-v] restricted_regex_BRE [file ...]

The grep command matches the lines of the source files against the pattern given as a restricted regular expression. If no files are specified, standard input is used. Normally, each successfully matched line is copied to standard output; if there are several source files, each matched line is preceded by the file name. grep uses a compact, non-deterministic algorithm. Restricted regular expressions (expressions whose values are strings of characters and which use a limited set of alphanumeric and special characters) are accepted as patterns. They have the same meaning as regular expressions in ed.

The easiest way to protect the characters $, *, [, ^, |, (, ), and \ from interpretation by the shell is to enclose the restricted regular expression in single quotes.
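
The effect of quoting can be shown with a pattern that contains a dollar sign. This is a sketch: the variable "price" and the input lines are hypothetical, and "-F" is used only so that grep itself treats the pattern as a fixed string.

```shell
# Hypothetical shell variable that the shell would substitute inside double quotes.
price='cost'
input=$(printf '%s\n' 'cost' '$price')

# Single quotes: the shell passes the text $price to grep unchanged.
single=$(printf '%s\n' "$input" | grep -F '$price')

# Double quotes: the shell replaces $price with "cost" before grep runs.
double=$(printf '%s\n' "$input" | grep -F "$price")

printf 'single quotes found: %s\ndouble quotes found: %s\n' "$single" "$double"
```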

Options:

    -b  Prefix each line with the number of the block in which it was found. This can be useful when locating blocks by context (blocks are numbered from 0).

    -c  Print only the number of lines that contain the pattern.

    -h  Do not print the name of the file containing the matched line before the line itself. Used when searching multiple files.

    -i  Ignore case in comparisons.

    -l  Print only the names of files that contain matching lines, one per line. If the pattern is found on several lines of a file, the file name is not repeated.

    -n  Prefix each line with its number in the file (lines are numbered from 1).

    -s  Suppress messages about nonexistent or unreadable files.

    -v  Print all lines except those that contain the pattern.

    -w  Search for the expression as a word, as if it were surrounded by the metacharacters \< and \>.

grep --help

Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
By default, PATTERN is a basic regular expression (BRE).
Example: grep -i "hello world" menu.h main.c

Regular expression selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression (ERE)
  -F, --fixed-strings       PATTERN is a set of newline-separated fixed strings
  -G, --basic-regexp        PATTERN is a basic regular expression (BRE)
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN for matching
  -f, --file=FILE           take PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
  -w, --word-regexp         PATTERN must match only whole words
  -x, --line-regexp         PATTERN must match only whole lines
  -z, --null-data           lines end with a zero byte, not a newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select non-matching lines
  -V, --version             print version information and exit
      --help                show this help and exit
      --mmap                ignored, kept for backwards compatibility

Output control:
  -m, --max-count=NUM       stop after NUM matches
  -b, --byte-offset         print the byte offset with output lines
  -n, --line-number         print the line number with output lines
      --line-buffered       flush the buffer after each line
  -H, --with-filename       print the file name for each match
  -h, --no-filename         do not prefix output with the file name
      --label=LABEL         use LABEL as the file name for standard input
  -o, --only-matching       show only the part of a line that matches PATTERN
  -q, --quiet, --silent     suppress all normal output
      --binary-files=TYPE   assume that binary files are TYPE:
                            binary, text, or without-match
  -a, --text                same as --binary-files=text
  -I                        same as --binary-files=without-match
  -d, --directories=ACTION  how to handle directories;
                            ACTION can be read, recurse, or skip
  -D, --devices=ACTION      how to handle devices, FIFOs, and sockets;
                            ACTION can be read or skip
  -R, -r, --recursive       same as --directories=recurse
      --include=F_PATTERN   process only files matching F_PATTERN
      --exclude=F_PATTERN   skip files and directories matching F_PATTERN
      --exclude-from=FILE   skip files matching any file pattern from FILE
      --exclude-dir=PATTERN directories matching PATTERN will be skipped
  -L, --files-without-match print only names of FILEs without matches
  -l, --files-with-matches  print only names of FILEs with matches
  -c, --count               print only the number of matching lines per FILE
  -T, --initial-tab         align with tabs (if needed)
  -Z, --null                print a zero byte after the FILE name

Context control:
  -B, --before-context=NUM  print NUM lines of leading context
  -A, --after-context=NUM   print NUM lines of trailing context
  -C, --context[=NUM]       print NUM lines of context
  -NUM                      same as --context=NUM
      --color[=WHEN],
      --colour[=WHEN]       use markers to highlight the matching strings;
                            WHEN can be always, never, or auto
  -U, --binary              do not strip CR characters at end of line (MSDOS)
  -u, --unix-byte-offsets   report offsets as if there were no CRs (MSDOS)

"egrep" means "grep -E". "fgrep" means "grep -F".
Running grep as "egrep" or "fgrep" is best avoided.
When FILE is not given, or when FILE is -, standard input is read.
If fewer than two files are given, -h is assumed.
The exit code is 0 if a match is found and 1 if not.
If an error occurs and the -q option is not given, the exit code is 2.
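
The exit codes described at the end of the help text are what make grep useful in script conditions. A minimal sketch, with invented input and a deliberately missing file (the file name no-such-file is hypothetical):

```shell
# A match was found: exit code 0 (-q suppresses the normal output).
found=0
printf 'hello world\n' | grep -q "hello" || found=$?

# No match: exit code 1.
not_found=0
printf 'hello world\n' | grep -q "absent" || not_found=$?

# Error (the file does not exist): exit code 2; -s hides the error message.
empty_dir=$(mktemp -d)
error=0
grep -s "hello" "$empty_dir/no-such-file" || error=$?
rmdir "$empty_dir"

printf 'match: %s, no match: %s, error: %s\n' "$found" "$not_found" "$error"
```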

To work fully with text in bash scripts using sed and awk, you need to understand regular expressions. Implementations of this extremely useful tool can be found almost everywhere, and although all regular expressions are built in a similar way and on the same ideas, working with them has its own particularities in different environments. Here we will talk about regular expressions that are suitable for use in Linux command-line scripts.

This material is intended as an introduction to regular expressions for those who may not know what regular expressions are. Therefore, let's start from the very beginning.

What are regular expressions

For many, when they first see regular expressions, the thought immediately arises that they have a meaningless jumble of characters in front of them. But this, of course, is far from the case. Take a look at this regex for example


In our opinion, even an absolute beginner will immediately understand how it works and why you need it :) If you don’t quite understand, just read on and everything will fall into place.
A regular expression is a pattern that programs like sed or awk use to filter text. Patterns use ordinary ASCII characters, which represent themselves, and so-called metacharacters, which play a special role, for example by letting you refer to certain groups of characters.

Regular expression types

Implementations of regular expressions in various environments, for example in programming languages like Java, Perl and Python, or in Linux tools like sed, awk and grep, have their own peculiarities. These depend on the so-called regular expression engines, which interpret the patterns.
Linux has two regular expression engines:
  • An engine that supports the POSIX Basic Regular Expression (BRE) standard.
  • An engine that supports the POSIX Extended Regular Expression (ERE) standard.
Most Linux utilities conform to at least the POSIX BRE standard, but some utilities (including sed) understand only a subset of the BRE standard. One of the reasons for this limitation is the desire to make such utilities as fast as possible at text processing.

The POSIX ERE standard is often implemented in programming languages. It offers a wide range of tools for building regular expressions. For example, these can be special character sequences for frequently used patterns, such as searching for individual words or sets of digits in the text. Awk supports the ERE standard.

There are many ways to write regular expressions, depending both on the programmer's preferences and on the features of the engine they are written for. It is not easy to write generic regular expressions that any engine can understand. Therefore, we will focus on the most commonly used regular expressions and look at the specifics of their implementation in sed and awk.

POSIX BRE regular expressions

Perhaps the simplest BRE pattern is a regular expression that looks for an exact occurrence of a sequence of characters in the text. This is what searching for a string looks like in sed and awk:

$ echo "This is a test" | sed -n '/test/p'
$ echo "This is a test" | awk '/test/{print $0}'

Finding text by pattern in sed


Finding text by pattern in awk

You may notice that the search for a given pattern is performed without taking into account the exact location of the text in the string. In addition, the number of occurrences does not matter. After the regular expression finds the given text anywhere in the string, the string is considered suitable and is passed for further processing.

When working with regular expressions, keep in mind that they are case sensitive:

$ echo "This is a test" | awk '/Test/{print $0}'
$ echo "This is a test" | awk '/test/{print $0}'

Regular expressions are case sensitive

The first regular expression found no matches, since the word "Test", starting with a capital letter, does not occur in the text. The second, which searches for the word written in lowercase letters, found a matching line in the stream.

In regular expressions, you can use not only letters, but also spaces and numbers:

$ echo "This is a test 2 again" | awk '/test 2/{print $0}'

Finding a piece of text containing spaces and numbers

Spaces are treated by the regular expression engine as regular characters.
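
The searches above can be collected into one small script. This is a sketch with an invented input line; the awk programs are written in single quotes so that the shell does not try to expand $0 itself.

```shell
# Hypothetical input line.
line='This is a test 2 again'

# The same literal pattern in sed and in awk.
sed_match=$(printf '%s\n' "$line" | sed -n '/test/p')
awk_match=$(printf '%s\n' "$line" | awk '/test/{print $0}')

# Matching is case sensitive: "Test" does not occur, so nothing is printed.
awk_upper=$(printf '%s\n' "$line" | awk '/Test/{print $0}')

# Spaces and digits in a pattern are ordinary characters.
awk_space=$(printf '%s\n' "$line" | awk '/test 2/{print $0}')

printf 'sed: %s\nawk: %s\nTest: [%s]\ntest 2: %s\n' \
  "$sed_match" "$awk_match" "$awk_upper" "$awk_space"
```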

Special symbols

When using different characters in regular expressions, there are a few things to keep in mind. For example, there are some special characters, or metacharacters, that need special handling when used in a pattern. Here they are:

.*[]^${}\+?|()
If you need one of these as a literal character in a pattern, escape it with a backslash (\).

For example, if you need to find a dollar sign in the text, include it in the pattern preceded by the escape character. Let's say there is a file myfile with the following text:

There is 10$ on my pocket

The dollar sign can be detected with a pattern like this:

$ awk '/\$/{print $0}' myfile

Using a special character in a pattern

The backslash is also a special character, so if you want to use it in a pattern, you need to escape it too. This looks like two backslashes in a row:

$ echo "\ is a special character" | awk '/\\/{print $0}'

Backslash escaping

Although the forward slash is not in the above list of special characters, trying to use it unescaped in a regular expression written for sed or awk results in an error, because the slash delimits the pattern:

$ echo "3 / 2" | awk '///{print $0}'

Incorrect use of a forward slash in a pattern

If you need it, it must also be escaped:

$ echo "3 / 2" | awk '/\//{print $0}'

Escaping a forward slash
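
The three escapes discussed in this section can be tried together. In this sketch the second input line of each pair is invented so that there is something that should not match; printf with '%s\n' is used instead of echo so that the backslash in the input is passed through unchanged.

```shell
# Escaped dollar sign: matches a literal "$" instead of the end of the line.
dollar=$(printf '%s\n' 'There is 10$ on my pocket' 'no money here' | awk '/\$/{print $0}')

# Escaped backslash: two backslashes in the pattern match one in the text.
backslash=$(printf '%s\n' '\ is a special character' 'plain text' | awk '/\\/{print $0}')

# Escaped forward slash: needed because "/" delimits the awk pattern.
slash=$(printf '%s\n' '3 / 2' '3 2' | awk '/\//{print $0}')

printf '%s\n' "$dollar" "$backslash" "$slash"
```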

Anchor symbols

There are two special characters for anchoring a pattern to the beginning or end of a text line. The caret (^) lets you describe character sequences that appear at the beginning of lines. If the pattern you are looking for appears elsewhere in the line, the regular expression will not match it. The use of this symbol looks like this:

$ echo "welcome to likegeeks website" | awk '/^likegeeks/{print $0}'
$ echo "likegeeks website" | awk '/^likegeeks/{print $0}'

Search for a pattern at the beginning of a string

The ^ symbol searches for a pattern at the beginning of a line, and the case of the characters is also taken into account. Let's see how this affects the processing of a text file:

$ awk '/^this/{print $0}' myfile


Search for a pattern at the beginning of a line in text from a file

When using sed, if you put the caret anywhere other than the beginning of a pattern, it is treated like any other ordinary character:

$ echo "This ^ is a test" | sed -n '/s ^/p'

Caret not at the start of a pattern in sed

In awk, when using the same pattern, the caret must be escaped:

$ echo "This ^ is a test" | awk '/s \^/{print $0}'

Caret not at the beginning of a pattern in awk

That covers searching for text at the beginning of a line. What if you need to find something at the end of a line?

The dollar sign ($), which is the anchor character for the end of the line, will help with this:

$ echo "This is a test" | awk '/test$/{print $0}'

Finding text at the end of a line

Both anchor characters can be used in the same pattern. Let's process the file myfile, whose contents are shown in the figure below, using the following regular expression:

$ awk '/^this is a test$/{print $0}' myfile


A pattern that uses special characters for the beginning and end of a string

As you can see, the pattern matched only the line that corresponded exactly to the given sequence of characters and their position.

Here's how to filter out empty lines using anchor characters:

$ awk '!/^$/{print $0}' myfile

This pattern uses the negation symbol, the exclamation mark (!). The pattern /^$/ matches lines that contain nothing between the beginning and the end of the line, and thanks to the exclamation mark, only the lines that do not match it are printed.
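
Since the contents of myfile are not reproduced here, the following sketch builds a stand-in file with invented lines (two of them empty) and applies the anchored patterns to it.

```shell
# Hypothetical stand-in for myfile; the second and fourth lines are empty.
sample=$(mktemp)
printf '%s\n' 'this is a test' '' 'this is another test' '' 'This is a test' > "$sample"

# ^ only: lines that begin with lowercase "this".
starts=$(awk '/^this/{print $0}' "$sample")

# ^ and $ together: the whole line must be exactly this text.
whole=$(awk '/^this is a test$/{print $0}' "$sample")

# Negated /^$/: everything except empty lines.
non_empty=$(awk '!/^$/{print $0}' "$sample")

printf '%s\n' "$starts" '---' "$whole" '---' "$non_empty"
rm -f "$sample"
```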

Dot symbol

The dot matches any single character except the newline character. Let's apply such a regular expression to the file myfile, whose contents are given below:

$ awk '/.st/{print $0}' myfile


Using dot in regular expressions

As the output shows, only the first two lines of the file match the pattern, since they contain the character sequence "st" preceded by another character. The third line contains no suitable sequence, and the fourth line does contain "st", but at the very beginning of the line, with no character before it.
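The four cases described above can be reproduced with made-up input. The sample lines below are an assumption chosen to mirror the description, not the actual file from the figure.

```shell
# Hypothetical sample lines: two contain "st" after another character,
# one has no "st" at all, and one has "st" only at the very start.
sample=$(printf '%s\n' 'this is a test' 'the best of times' 'nothing here' 'st at the beginning')

# .st requires some character immediately before "st".
dot_matches=$(printf '%s\n' "$sample" | awk '/.st/{print $0}')

echo "$dot_matches"
```

The last line is skipped because the dot has to consume a character, and there is nothing in front of the leading "st".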

Character classes

A dot matches any single character, but what if you want to limit the set of characters you're looking for more flexibly? In such a situation, you can use character classes.

With a character class, you can search for any one character from a given set. Square brackets, [], are used to describe a character class:

$ awk '/[oi]th/{print $0}' myfile


Description of a character class in a regular expression

Here we are looking for a sequence of characters "th" preceded by the character "o" or the character "i".

Classes come in handy when looking for words that can start with either an uppercase or lowercase letter:

$ echo "this is a test" | awk '/[Tt]his is a test/{print $0}'
$ echo "This is a test" | awk '/[Tt]his is a test/{print $0}'

Search for words that may start with a lowercase or uppercase letter

Character classes are not limited to letters. Other characters can be used here as well. It is impossible to say in advance in what situation the classes will be needed - it all depends on the problem being solved.

Negating character classes

Character classes can also be used to solve the reverse problem. Instead of searching for characters included in the class, you can search for everything that is not included in it. To get this behavior, put a ^ sign in front of the list of characters inside the brackets. It looks like this:

$ awk '/[^oi]th/{print $0}' myfile


Search for characters not in a class

In this case, sequences of characters "th" will be found, before which there is neither "o" nor "i".
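A small side-by-side run of a class and its negation, as a sketch with invented sample lines and hypothetical variable names:

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' 'go with me' 'both of them' 'the other one' 'a math test')

# "th" preceded by "o" or "i".
in_class=$(printf '%s\n' "$sample" | awk '/[oi]th/{print $0}')

# "th" preceded by any character other than "o" or "i".
not_in_class=$(printf '%s\n' "$sample" | awk '/[^oi]th/{print $0}')

echo "$in_class"
echo "---"
echo "$not_in_class"
```

Note that "both of them" appears in both results: it contains "oth" in "both" and " th" in "them". The line "the other one" is not matched by the negated class, because a negated class still has to match one character, and the leading "th" has nothing in front of it.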

Character ranges

In character classes, you can describe ranges of characters using dashes:

$ awk '/[e-p]st/{print $0}' myfile


Describing a range of characters in a character class

In this example, the regular expression matches the character sequence "st" preceded by any character located, in alphabetical order, between the characters "e" and "p".

Ranges can also be created from numbers:

$ echo "123" | awk '/[0-9][0-9][0-9]/'
$ echo "12a" | awk '/[0-9][0-9][0-9]/'

Regular expression for finding any three digits

A character class can contain multiple ranges:

$ awk '/[a-fm-z]st/{print $0}' myfile


Character class consisting of multiple ranges

This regular expression will match all sequences of "st" preceded by characters from the ranges a-f and m-z .
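Ranges are easy to check on a handful of words. The sketch below uses invented input, and the helper name is_three_digits is hypothetical.

```shell
# Hypothetical helper: succeeds if the argument contains three digits in a row.
is_three_digits() {
    printf '%s\n' "$1" | awk '/[0-9][0-9][0-9]/{found=1} END{exit !found}'
}

# "st" preceded by a letter from a-f or m-z.
range_hits=$(printf '%s\n' 'best' 'list' 'most' 'just' | awk '/[a-fm-z]st/{print $0}')

if is_three_digits '123'; then echo '123: three digits'; fi
if ! is_three_digits '12a'; then echo '12a: no three digits'; fi
echo "$range_hits"
```

The word "list" is left out because "i" falls between the two ranges.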

Special character classes

BRE has special character classes that can be used when writing regular expressions:
  • [[:alpha:]] - matches any alphabetic character, in upper or lower case.
  • [[:alnum:]] - matches any alphanumeric character, namely characters in the ranges 0-9, A-Z, a-z.
  • [[:blank:]] - matches a space and a tab.
  • [[:digit:]] - matches any numeric character from 0 to 9.
  • [[:upper:]] - matches upper case alphabetic characters, A-Z.
  • [[:lower:]] - matches lower case alphabetic characters, a-z.
  • [[:print:]] - matches any printable character.
  • [[:punct:]] - matches punctuation marks.
  • [[:space:]] - matches whitespace characters, in particular space, tab, and the characters NL, FF, VT, CR.
You can use special classes in patterns like this:

$ echo "abc" | awk '/[[:alpha:]]/{print $0}'
$ echo "abc" | awk '/[[:digit:]]/{print $0}'
$ echo "abc123" | awk '/[[:digit:]]/{print $0}'


Special character classes in regular expressions
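The named classes combine naturally with negation. A brief sketch with invented sample lines (it assumes an awk that supports POSIX character classes, which current gawk and mawk builds do):

```shell
# Hypothetical sample lines.
sample=$(printf '%s\n' 'abc' 'abc123' 'two words' '42')

# Lines containing at least one digit.
with_digit=$(printf '%s\n' "$sample" | awk '/[[:digit:]]/{print $0}')

# Lines containing whitespace.
with_space=$(printf '%s\n' "$sample" | awk '/[[:space:]]/{print $0}')

# Lines with no alphabetic character at all.
no_alpha=$(printf '%s\n' "$sample" | awk '!/[[:alpha:]]/{print $0}')

echo "$with_digit"
echo "$with_space"
echo "$no_alpha"
```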

Asterisk symbol

An asterisk placed after a character in a pattern means that the regular expression matches if the character appears any number of times in a row, including not at all.

$ echo "test" | awk '/tes*t/{print $0}'
$ echo "tessst" | awk '/tes*t/{print $0}'


Using the * character in regular expressions

This metacharacter is often used for words that are frequently misspelled or that have more than one accepted spelling:

$ echo "I like green color" | awk '/colou*r/{print $0}'
$ echo "I like green colour" | awk '/colou*r/{print $0}'

Finding a word that has different spellings

In this example, the same regular expression matches both the word "color" and the word "colour". This is because the character "u", followed by an asterisk, can either be absent or occur several times in a row.

Another useful feature stemming from the asterisk character is to combine it with a dot. This combination allows the regular expression to respond to any number of any characters:

$ awk '/this.*test/{print $0}' myfile


Template that responds to any number of any characters

In this case, it does not matter how many and what characters are between the words "this" and "test".

The asterisk can also be used with character classes:

$ echo "st" | awk '/s[ae]*t/{print $0}'
$ echo "sat" | awk '/s[ae]*t/{print $0}'
$ echo "set" | awk '/s[ae]*t/{print $0}'


Using the asterisk with character classes

In all three examples, the regular expression works because the asterisk after the character class means that if any number of "a" or "e" characters are found, or if they are not found, the string will match the given pattern.
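Because the asterisk also allows many repetitions, it accepts more than the two intended spellings. A sketch with made-up words and hypothetical variable names:

```shell
# u* accepts zero, one, or many "u" characters.
star=$(printf '%s\n' 'color' 'colour' 'colouur' 'colr' | awk '/colou*r/{print $0}')

# .* accepts any run of characters (including none) between the two words.
dotstar=$(printf '%s\n' 'this is a test' 'thistest' 'test this' | awk '/this.*test/{print $0}')

echo "$star"
echo "$dotstar"
```

The first command also lets "colouur" through, and the second rejects "test this" because "this" has to come before "test".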

POSIX ERE regular expressions

The POSIX ERE patterns supported by some Linux utilities may contain additional characters. As already mentioned, awk supports this standard, while sed uses BRE by default (GNU sed switches to ERE when run with the -E option).

Here we will look at the most commonly used characters in ERE patterns, which will be useful for you when creating your own regular expressions.

▍Question mark

The question mark indicates that the preceding character may occur once or not at all in the text. This character is one of the repetition metacharacters. Here are some examples:

$ echo "tet" | awk '/tes?t/{print $0}'
$ echo "test" | awk '/tes?t/{print $0}'
$ echo "tesst" | awk '/tes?t/{print $0}'


Question mark in regular expressions

As you can see, in the third case, the letter “s” occurs twice, so the regular expression does not respond to the word “tesst”.

The question mark can also be used with character classes:

$ echo "tst" | awk '/t[ae]?st/{print $0}'
$ echo "test" | awk '/t[ae]?st/{print $0}'
$ echo "tast" | awk '/t[ae]?st/{print $0}'
$ echo "taest" | awk '/t[ae]?st/{print $0}'
$ echo "teest" | awk '/t[ae]?st/{print $0}'


Question mark and character classes

If there are no characters from the class in the string, or one of them occurs once, the regular expression works, but as soon as two characters appear in the word, the system no longer finds a match for the pattern in the text.

▍Plus symbol

The plus sign in a pattern indicates that the regular expression matches if the preceding character occurs one or more times in the text. Unlike the asterisk, this construction does not match when the character is absent:

$ echo "test" | awk '/te+st/{print $0}'
$ echo "teest" | awk '/te+st/{print $0}'
$ echo "tst" | awk '/te+st/{print $0}'


Plus sign in regular expressions

In this example, if there is no "e" character in the word, the regular expression engine will not find a match in the text. The plus symbol also works with character classes, and in this respect it is similar to the asterisk and the question mark:

$ echo "tst" | awk '/t[ae]+st/{print $0}'
$ echo "test" | awk '/t[ae]+st/{print $0}'
$ echo "teast" | awk '/t[ae]+st/{print $0}'
$ echo "teeast" | awk '/t[ae]+st/{print $0}'


Plus sign and character classes

In this case, if the string contains any character from the class, the text will be considered to match the pattern.
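The question mark and the plus sign are easiest to compare on the same input. In this sketch the patterns are anchored with ^ and $ so that each word is tested as a whole; the word list and variable names are invented.

```shell
# Hypothetical word list.
words=$(printf '%s\n' 'tst' 'test' 'teest')

# e? : zero or one "e".
optional=$(printf '%s\n' "$words" | awk '/^te?st$/{print $0}')

# e+ : one or more "e".
one_or_more=$(printf '%s\n' "$words" | awk '/^te+st$/{print $0}')

# [ae]? : at most one character from the class.
class_optional=$(printf '%s\n' 'tst' 'tast' 'taest' | awk '/^t[ae]?st$/{print $0}')

echo "$optional"
echo "$one_or_more"
echo "$class_optional"
```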

▍ Curly braces

Curly braces in ERE patterns are similar to the metacharacters discussed above, but they let you specify more precisely how many times the preceding character must occur. You can specify the limit in two formats:
  • n - a number specifying the exact number of occurrences.
  • n,m - two numbers interpreted as "at least n times, but not more than m".
Here are examples of the first option:

$ echo "tst" | awk '/te{1}st/{print $0}'
$ echo "test" | awk '/te{1}st/{print $0}'

Curly braces in patterns, finding the exact number of occurrences

In older versions of gawk, you had to use the --re-interval command-line option for the program to recognize intervals in regular expressions, but in newer versions this is not necessary. Here are examples of the second option:

$ echo "tst" | awk '/te{1,2}st/{print $0}'
$ echo "test" | awk '/te{1,2}st/{print $0}'
$ echo "teest" | awk '/te{1,2}st/{print $0}'
$ echo "teeest" | awk '/te{1,2}st/{print $0}'


Interval given in curly braces

In this example, the character "e" must occur once or twice in the string for the regular expression to match.

Curly braces can also be used with character classes. The principles already familiar to you apply here:

$ echo "tst" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "test" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teest" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teeast" | awk '/t[ae]{1,2}st/{print $0}'


Curly braces and character classes

The template will react to the text if the character "a" or the character "e" occurs once or twice in it.
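Interval expressions are part of ERE in general, so the same patterns can be tried with grep -E. The sketch below uses grep rather than awk only because some awk builds may not recognize intervals; the word lists and variable names are invented.

```shell
# Hypothetical word list.
words=$(printf '%s\n' 'tst' 'test' 'teest' 'teeest')

# Exactly one "e" between "t" and "st".
exactly_one=$(printf '%s\n' "$words" | grep -E 'te{1}st')

# One or two "e" characters.
one_or_two=$(printf '%s\n' "$words" | grep -E 'te{1,2}st')

# One or two characters from the class [ae].
class_interval=$(printf '%s\n' 'tst' 'tast' 'taest' 'teeast' | grep -E 't[ae]{1,2}st')

echo "$exactly_one"
echo "$one_or_two"
echo "$class_interval"
```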

▍Logical “or” symbol

The vertical bar, |, means logical "or" in regular expressions. When a regular expression contains several fragments separated by this character, the engine treats the text as a match if it matches any of the fragments. Here is an example:

$ echo "This is a test" | awk '/test|exam/{print $0}'
$ echo "This is an exam" | awk '/test|exam/{print $0}'
$ echo "This is something else" | awk '/test|exam/{print $0}'


Boolean "or" in regular expressions

In this example, the regular expression searches the text for the word "test" or the word "exam". Note that there must be no spaces between the pattern fragments and the | symbol separating them; a space would become part of the pattern.

Regular expression fragments can be grouped using parentheses. A grouped sequence of characters is treated by the engine as a single unit, so, for example, repetition metacharacters can be applied to it. Here's what it looks like:

$ echo "Like" | awk '/Like(Geeks)?/{print $0}'
$ echo "LikeGeeks" | awk '/Like(Geeks)?/{print $0}'


Grouping Regular Expression Fragments

In these examples, the word "Geeks" is enclosed in parentheses, followed by a question mark. Recall that the question mark means "0 or 1 repetition", as a result, the regular expression will match both the string "Like" and the string "LikeGeeks".
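Alternation and grouping can be checked together on a few lines. This is a sketch: the third input line of the grouped test and the ^ and $ anchors are additions made here to show that the group is optional while "Like" is not.

```shell
# Either fragment is enough for a match.
either=$(printf '%s\n' 'This is a test' 'This is an exam' 'This is something else' |
    awk '/test|exam/{print $0}')

# The group (Geeks) is optional as a whole; "Like" is still required.
grouped=$(printf '%s\n' 'Like' 'LikeGeeks' 'Geeks' | awk '/^Like(Geeks)?$/{print $0}')

echo "$either"
echo "$grouped"
```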

Practical examples

Now that we've covered the basics of regular expressions, it's time to do something useful with them.

▍Counting the number of files

Let's write a bash script that counts the files located in the directories listed in the PATH environment variable. To do this, you first need to build a list of directory paths. Let's do this with sed, replacing colons with spaces:

$ echo $PATH | sed 's/:/ /g'
The substitute command supports regular expressions as patterns for searching text. In this case everything is extremely simple, since we are looking for a colon, but nothing prevents you from using something more elaborate here; it all depends on the specific task.
Now we need to loop over the resulting list and count the files in each directory. The general outline of the script is as follows:

mypath=$(echo $PATH | sed 's/:/ /g')
for directory in $mypath
do
    : # count the files in $directory here
done
Now let's write the full text of the script, using the ls command to get information about the number of files in each of the directories:

#!/bin/bash
mypath=$(echo $PATH | sed 's/:/ /g')
count=0
for directory in $mypath
do
    check=$(ls $directory)
    for item in $check
    do
        count=$(( count + 1 ))
    done
    echo "$directory - $count"
    count=0
done
When you run the script, it may turn out that some directories from PATH do not exist; however, this does not prevent it from counting files in the existing directories.


File count

The main value of this example is that using the same approach, you can solve much more complex problems. Which one depends on your needs.
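One possible variation on the same idea is sketched below. It wraps the loop in a function, skips directories that do not exist, and lets wc count the entries instead of an inner loop. The function name count_files is hypothetical, and like the script above it splits the list on spaces, so directory names containing spaces are not handled.

```shell
# Hypothetical helper: prints "directory - N" for every existing directory
# in a colon-separated list such as $PATH.
count_files() {
    for directory in $(echo "$1" | sed 's/:/ /g'); do
        [ -d "$directory" ] || continue      # skip entries that do not exist
        count=$(ls "$directory" | wc -l)     # one line per directory entry
        echo "$directory - $((count))"       # $(( )) drops any padding from wc
    done
}

count_files "$PATH"
```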

▍Verifying email addresses

There are websites with huge collections of regular expressions for checking email addresses, phone numbers, and so on. However, it is one thing to take something ready-made, and quite another to create it yourself. So let's write a regular expression to validate email addresses. Let's start by analyzing the input data. For example, here is an address:

[email protected]
The username can consist of alphanumeric characters and some other characters, namely a dot, dash, underscore, and plus sign. The username is followed by the @ sign.

Armed with this knowledge, let's start building the regular expression from its left side, which checks the username. Here's what we got:

^([a-zA-Z0-9_.+-]+)@
This regular expression can be read as follows: "At the beginning of the line there must be at least one character from the set given in square brackets, followed by an @ sign."

Next comes the hostname. Much the same rules apply here as for the username (a plus sign is not valid in a hostname), so the pattern for it looks like this:

([a-zA-Z0-9_.-]+)
The top-level domain name is subject to special rules. It may contain only alphabetic characters, at least two of them (country-code domains, for example, have two letters) and no more than five. This means that the pattern for checking the last part of the address looks like this:

\.([a-zA-Z]{2,5})$
You can read it like this: "First there must be a period, then from 2 to 5 alphabetic characters, and after that the line ends."

Having prepared the patterns for the individual parts of the regular expression, let's put them together:

^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$
Now it only remains to test the result:

$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'
$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'


Validating an email address with regular expressions

The fact that the text passed to awk is displayed on the screen means that the system recognized it as an email address.
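The three parts described in this section (username, hostname, top-level domain) can be wrapped in a small shell function for reuse. This is a sketch under assumptions: the bracket expressions are written out from the character sets listed in the prose, the helper name is_email and the sample addresses are hypothetical, and grep -E is used in place of awk so that the {2,5} interval does not depend on the local awk build.

```shell
# Hypothetical helper: succeeds when the argument looks like an email address
# according to the rules described in this section.
is_email() {
    printf '%s\n' "$1" |
        grep -Eq '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'
}

# Made-up addresses for illustration.
for addr in 'user.name+tag@example.com' 'user@example' 'user@example.technology'; do
    if is_email "$addr"; then
        echo "$addr: valid"
    else
        echo "$addr: rejected"
    fi
done
```

The last address shows a limitation of the 2 to 5 letter rule: it rejects longer top-level domains, so the upper bound may need to be raised for real-world use.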

Results

If the regular expression for checking email addresses that you encountered at the very beginning of the article seemed completely incomprehensible then, we hope that it no longer looks like a meaningless set of characters. If so, this material has served its purpose. Regular expressions are a topic you can keep studying all your life, but even the little we have covered here can already help you write scripts that process text in fairly advanced ways.

In this series of materials, we have usually shown very simple examples of bash scripts that consisted of literally a few lines. Let's look at something bigger next time.

Dear readers! Do you use regular expressions when processing text in command line scripts?

To process text fully in bash scripts with sed and awk, you simply need to understand regular expressions. Implementations of this most useful tool can be found literally everywhere, and although all regular expressions are built in a similar way, on the same ideas, working with them has its own peculiarities in different environments. Here we will talk about regular expressions that are suitable for use in Linux command line scripts.

This material is intended as an introduction to regular expressions for those who may not know what regular expressions are. Therefore, let's start from the very beginning.

What are regular expressions

When many people first see regular expressions, they immediately think they are looking at a meaningless jumble of characters. But this, of course, is far from the case. Take a look at this regex, for example:


In our opinion, even an absolute beginner will immediately understand how it works and why you need it :) If you don't quite understand, just read on and everything will fall into place.
A regular expression is a pattern that programs like sed or awk use to filter text. Patterns use ordinary ASCII characters that represent themselves, and so-called metacharacters that play a special role, for example, allowing you to refer to certain groups of characters.

Regular expression types

Implementations of regular expressions in various environments, for example in programming languages like Java, Perl and Python, and in Linux tools like sed, awk and grep, have their own peculiarities. These depend on the so-called regular expression engines, which interpret the patterns.
Linux has two regular expression engines:
  • An engine that supports the POSIX Basic Regular Expression (BRE) standard.
  • An engine that supports the POSIX Extended Regular Expression (ERE) standard.
Most Linux utilities conform to at least the POSIX BRE standard, but some utilities (including sed) understand only a subset of the BRE standard. One of the reasons for this limitation is the desire to make such utilities as fast as possible at text processing.

The POSIX ERE standard is often implemented in programming languages. It offers many more tools for building regular expressions, for example special character sequences for frequently used patterns, such as searching for individual words or sets of digits in text. Awk supports the ERE standard.

There are many ways to develop regular expressions, depending on the opinion of the programmer, and on the features of the engine under which they are created. It's not easy to write generic regular expressions that any engine can understand. Therefore, we will focus on the most commonly used regular expressions and look at the specifics of their implementation for sed and awk.
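The difference between the two standards can be seen directly from the shell. This is a minimal sketch, assuming a sed that accepts the -E option for extended patterns (GNU and BSD sed do); the variable names are only for illustration.

```shell
# BRE (sed's default): + is an ordinary character, so this finds nothing.
bre_plus=$(echo 'teest' | sed -n '/te+st/p')

# BRE writes intervals with escaped braces.
bre_interval=$(echo 'teest' | sed -n '/te\{1,2\}st/p')

# ERE: awk uses it natively, sed when started with -E.
ere_awk=$(echo 'teest' | awk '/te+st/{print $0}')
ere_sed=$(echo 'teest' | sed -E -n '/te+st/p')

echo "BRE +: [$bre_plus]"
echo "BRE interval: [$bre_interval]"
echo "ERE awk: [$ere_awk]"
echo "ERE sed -E: [$ere_sed]"
```

Only the first command comes back empty, because in BRE the pattern te+st looks for a literal plus sign.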

POSIX BRE regular expressions

Perhaps the simplest BRE pattern is a regular expression that finds an exact match of a sequence of characters in text. This is what searching for a string looks like in sed and awk:

$ echo "This is a test" | sed -n '/test/p'
$ echo "This is a test" | awk '/test/{print $0}'

Finding text by pattern in sed


Finding text by pattern in awk

You may notice that the search for a given pattern is performed without taking into account the exact location of the text in the string. In addition, the number of occurrences does not matter. After the regular expression finds the given text anywhere in the string, the string is considered suitable and is passed for further processing.

When working with regular expressions, keep in mind that they are case sensitive:

$ echo "This is a test" | awk '/Test/{print $0}'
$ echo "This is a test" | awk '/test/{print $0}'

Regular expressions are case sensitive

The first regular expression did not find any matches, since the word "Test", beginning with a capital letter, does not occur in the text. The second, which searches for the word written in lowercase letters, found a matching string in the stream.

In regular expressions, you can use not only letters, but also spaces and numbers:

$ echo "This is a test 2 again" | awk '/test 2/{print $0}'

Finding a piece of text containing spaces and numbers

Spaces are treated by the regular expression engine as regular characters.
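The three literal searches above can be run side by side. In this sketch the variable names are hypothetical, and the patterns are single-quoted so that the shell leaves $0 for awk to interpret.

```shell
line='This is a test 2 again'

# Case matters: "Test" does not occur in the line.
upper=$(echo "$line" | awk '/Test/{print $0}')

# The lowercase word does occur.
lower=$(echo "$line" | awk '/test/{print $0}')

# Spaces and digits are matched literally.
spaced=$(echo "$line" | awk '/test 2/{print $0}')

echo "upper: [$upper]"
echo "lower: [$lower]"
echo "spaced: [$spaced]"
```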

Special symbols

When using various characters in regular expressions, there are a few things to keep in mind. For example, there are some special characters, or metacharacters, that need special handling when used in a pattern. Here they are:

.*[]^${}\+?|()
If one of these is needed in the pattern as a literal character, it must be escaped with a backslash, \.

For example, if you need to find a dollar sign in the text, it must be included in the template, preceded by an escape character. Let's say there is a file myfile with the following text:

There is 10$ on my pocket
The dollar sign can be detected with a pattern like this:

$ awk '/\$/{print $0}' myfile

Using a special character in a template

In addition, the backslash is itself a special character, so if you want to use it in a pattern, you need to escape it too. This looks like two backslashes in a row:

$ echo "\ is a special character" | awk '/\\/{print $0}'

Backslash escaping

Although the forward slash is not in the above list of special characters, attempting to use it unescaped in a regular expression written for sed or awk will result in an error, because the slash serves as the pattern delimiter:

$ echo "3 / 2" | awk '///{print $0}'

Incorrect use of a forward slash in a template

If it is needed, it must also be escaped:

$ echo "3 / 2" | awk '/\//{print $0}'

Escaping a forward slash
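The escaping rules from this section can be collected into one short run. This is a sketch with hypothetical variable names; the dot example is an addition made here to show what happens when a metacharacter is left unescaped.

```shell
# Escaped dollar sign, forward slash and backslash are matched literally.
dollar=$(echo 'There is 10$ on my pocket' | awk '/\$/{print $0}')
slash=$(echo '3 / 2' | awk '/\//{print $0}')
backslash=$(printf '%s\n' 'a \ b' | awk '/\\/{print $0}')

# An unescaped dot matches any character; an escaped one only a real dot.
any_char=$(printf '%s\n' '3.14' '3x14' | awk '/3.14/{print $0}')
real_dot=$(printf '%s\n' '3.14' '3x14' | awk '/3\.14/{print $0}')

echo "$dollar"
echo "$slash"
echo "$backslash"
echo "$any_char"
echo "$real_dot"
```

The awk programs are single-quoted on purpose: inside double quotes the shell itself would process the backslashes and expand $0 before awk ever sees the pattern.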

Anchor symbols

There are two special characters for anchoring a pattern to the beginning or end of a text string. The cap symbol - ^ allows you to describe sequences of characters that are at the beginning of text lines. If the pattern you are looking for appears elsewhere in the string, the regular expression will not respond to it. The use of this symbol looks like this:

$ echo "welcome to likegeeks website" | awk "/^likegeeks/(print $0)" $ echo "likegeeks website" | awk "/^likegeeks/(print $0)"

Search for a pattern at the beginning of a string

The ^ symbol is designed to search for a pattern at the beginning of a line, while the case of characters is also taken into account. Let's see how this will affect the processing of a text file:

$ awk "/^this/(print $0)" myfile


Search for a pattern at the beginning of a line in text from a file

When using sed, if you put an end anywhere inside a pattern, it will be treated like any other normal character:

$ echo "This ^ is a test" | sed -n "/s ^/p"

Cap not at start of pattern in sed

In awk, when using the same pattern, the given character must be escaped:

$ echo "This ^ is a test" | awk "/s \^/(print $0)"

A lid not at the beginning of a pattern in awk

With the search for text fragments at the beginning of the line, we figured it out. What if you need to find something at the end of a line?

The dollar sign - $ , which is the anchor character for the end of the line, will help us with this:

$ echo "This is a test" | awk "/test$/(print $0)"

Finding text at the end of a line

Both anchor characters can be used in the same pattern. Let's process the file myfile , the contents of which are shown in the figure below, using the following regular expression:

$ awk "/^this is a test$/(print $0)" myfile


A pattern that uses special characters for the beginning and end of a string

As you can see, the template reacted only to a string that fully corresponded to the given sequence of characters and their location.

Here's how to filter out empty lines using anchor characters:

$ awk "!/^$/(print $0)" myfile
In this template, I used the negation symbol, the exclamation mark - ! . Using this pattern searches for lines that contain nothing between the beginning and end of the line, and thanks to the exclamation mark, only lines that do not match this pattern are printed.

Dot symbol

The dot is used to search for any single character, except for the newline character. Let's pass the file myfile to such a regular expression, the contents of which are given below:

$ awk "/.st/(print $0)" myfile


Using dot in regular expressions

As can be seen from the output, only the first two lines from the file match the pattern, since they contain the sequence of characters "st" preceded by another character, while the third line does not contain a suitable sequence, and the fourth line does, but it is in at the very beginning of the line.

Character classes

A dot matches any single character, but what if you want to limit the set of characters you're looking for more flexibly? In such a situation, you can use character classes.

Thanks to this approach, you can organize a search for any character from a given set. To describe a character class, square brackets - are used:

$ awk "/th/(print $0)" myfile


Description of a character class in a regular expression

Here we are looking for a sequence of characters "th" preceded by the character "o" or the character "i".

Classes come in handy when looking for words that can start with either an uppercase or lowercase letter:

$ echo "this is a test" | awk "/his is a test/(print $0)" $ echo "This is a test" | awk "/his is a test/(print $0)"

Search for words that may start with a lowercase or uppercase letter

Character classes are not limited to letters. Other characters can be used here as well. It is impossible to say in advance in what situation the classes will be needed - it all depends on the problem being solved.

Negating character classes

Symbol classes can also be used to solve the reverse problem described above. Namely, instead of searching for symbols included in the class, you can organize a search for everything that is not included in the class. In order to achieve this behavior of a regular expression, you need to put a ^ sign in front of the list of class characters. It looks like this:

$ awk "/[^oi]th/(print $0)" myfile


Search for characters not in a class

In this case, sequences of characters "th" will be found, before which there is neither "o" nor "i".

Character ranges

In character classes, you can describe ranges of characters using dashes:

$ awk "/st/(print $0)" myfile


Describing a range of characters in a character class

In this example, the regular expression matches the character sequence "st" preceded by any character located, in alphabetical order, between the characters "e" and "p".

Ranges can also be created from numbers:

$ echo "123" | awk "//" $ echo "12a" | awk "//"

Regular expression for finding any three numbers

A character class can contain multiple ranges:

$ awk "/st/(print $0)" myfile


Character class consisting of multiple ranges

This regular expression will match all sequences of "st" preceded by characters from the ranges a-f and m-z .

Special character classes

BRE has special character classes that can be used when writing regular expressions:
  • [[:alpha:]] - matches any alphabetic character written in upper or lower case.
  • [[:alnum:]] - matches any alphanumeric character, namely characters in the ranges 0-9 , A-Z , a-z .
  • [[:blank:]] - Matches a space and a tab.
  • [[:digit:]] - any numeric character from 0 to 9 .
  • [[:upper:]] - upper case alphabetic characters - A-Z .
  • [[:lower:]] - lower case alphabetic characters - a-z .
  • [[:print:]] - matches any printable character.
  • [[:punct:]] - matches punctuation marks.
  • [[:space:]] - whitespace characters, in particular - space, tab, characters NL , FF , VT , CR .
You can use special classes in templates like this:

$ echo "abc" | awk "/[[:alpha:]]/(print $0)" $ echo "abc" | awk "/[[:digit:]]/(print $0)" $ echo "abc123" | awk "/[[:digit:]]/(print $0)"


Special character classes in regular expressions

Asterisk symbol

If you place an asterisk after a character in a pattern, this will mean that the regular expression will work if the character appears in the string any number of times - including the situation when the character is absent in the string.

$ echo "test" | awk "/tes*t/(print $0)" $ echo "tessst" | awk "/tes*t/(print $0)"


Using the * character in regular expressions

This wildcard character is usually used to work with words that are misspelled all the time, or for words that can be spelled differently:

$ echo "I like green color" | awk "/colou*r/(print $0)" $ echo "I like green color " | awk "/colou*r/(print $0)"

Finding a word that has different spellings

In this example, the same regular expression matches both the word "color" and the word "colour". This is due to the fact that the character "u", followed by an asterisk, can either be absent or occur several times in a row.

Another useful feature stemming from the asterisk character is to combine it with a dot. This combination allows the regular expression to respond to any number of any characters:

$ awk "/this.*test/(print $0)" myfile


Template that responds to any number of any characters

In this case, it does not matter how many and what characters are between the words "this" and "test".

The asterisk can also be used with character classes:

$ echo "st" | awk "/s*t/(print $0)" $ echo "sat" | awk "/s*t/(print $0)" $ echo "set" | awk "/s*t/(print $0)"


Using the asterisk with character classes

In all three examples, the regular expression works because the asterisk after the character class means that if any number of "a" or "e" characters are found, or if they are not found, the string will match the given pattern.

POSIX ERE regular expressions

The POSIX ERE templates that some Linux utilities support may contain additional characters. As already mentioned, awk supports this standard, but sed does not.

Here we will look at the most commonly used characters in ERE patterns, which will be useful for you when creating your own regular expressions.

▍Question mark

The question mark indicates that the preceding character may occur once or not at all in the text. This character is one of the repetition metacharacters. Here are some examples:

$ echo "tet" | awk "/tes?t/(print $0)" $ echo "test" | awk "/tes?t/(print $0)" $ echo "tesst" | awk "/tes?t/(print $0)"


Question mark in regular expressions

As you can see, in the third case, the letter “s” occurs twice, so the regular expression does not respond to the word “tesst”.

The question mark can also be used with character classes:

$ echo "tst" | awk "/t?st/(print $0)" $ echo "test" | awk "/t?st/(print $0)" $ echo "tast" | awk "/t?st/(print $0)" $ echo "taest" | awk "/t?st/(print $0)" $ echo "teest" | awk "/t?st/(print $0)"


Question mark and character classes

If there are no characters from the class in the string, or one of them occurs once, the regular expression works, but as soon as two characters appear in the word, the system no longer finds a match for the pattern in the text.

▍Plus symbol

The plus sign in the pattern indicates that the regular expression will match the match if the preceding character occurs one or more times in the text. At the same time, such a construction will not react to the absence of a symbol:

$ echo "test" | awk "/te+st/(print $0)" $ echo "teest" | awk "/te+st/(print $0)" $ echo "tst" | awk "/te+st/(print $0)"


Plus sign in regular expressions

In this example, if there is no “e” character in the word, the regular expression engine will not find matches in the text. The plus symbol also works with character classes - in this way it is similar to the asterisk and the question mark:

$ echo "tst" | awk "/t+st/(print $0)" $ echo "test" | awk "/t+st/(print $0)" $ echo "teast" | awk "/t+st/(print $0)" $ echo "teeast" | awk "/t+st/(print $0)"


Plus sign and character classes

In this case, if the string contains any character from the class, the text will be considered to match the pattern.

▍ Curly braces

Curly brackets that can be used in ERE patterns are similar to the characters discussed above, but they allow you to more precisely specify the required number of occurrences of the character that precedes them. You can specify a limit in two formats:
  • n - a number specifying the exact number of searched occurrences
  • n, m - two numbers that are interpreted as follows: "at least n times, but not more than m".
Here are examples of the first option:

$ echo "tst" | awk "/te(1)st/(print $0)" $ echo "test" | awk "/te(1)st/(print $0)"

Curly braces in patterns, finding the exact number of occurrences

In older versions of awk, you had to use the --re-interval command-line option in order for the program to recognize intervals in regular expressions, but in newer versions this is not necessary.

$ echo "tst" | awk '/te{1,2}st/{print $0}'
$ echo "test" | awk '/te{1,2}st/{print $0}'
$ echo "teest" | awk '/te{1,2}st/{print $0}'
$ echo "teeest" | awk '/te{1,2}st/{print $0}'


Interval specified in curly braces

In this example, the character "e" must occur once or twice in a row for the regular expression to match the text.

Curly braces can also be used with character classes. The principles already familiar to you apply here:

$ echo "tst" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "test" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teest" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teeast" | awk '/t[ae]{1,2}st/{print $0}'


Curly braces and character classes

The pattern matches the text if "t" is followed by one or two characters from the class (each of them "a" or "e") and then by "st".
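Intervals are handy for checking values of a known length. Here is a minimal sketch that assumes a made-up rule: a PIN must consist of 4 to 6 digits and nothing else. The function name is_pin is hypothetical, and grep -E is used instead of awk because it accepts intervals without any extra options.

```shell
# is_pin VALUE: succeed if VALUE consists of 4 to 6 digits and nothing else.
# The rule and the function name are assumptions made for this illustration.
is_pin() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{4,6}$'
}

for value in 123 1234 123456 1234567 12a4; do
    if is_pin "$value"; then
        echo "$value - accepted"
    else
        echo "$value - rejected"
    fi
done
```

The ^ and $ anchors matter here: without them, a longer string such as 1234567 would still contain a run of 4 to 6 digits and would be accepted.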

▍Logical “or” symbol

The | symbol, a vertical bar, means a logical "or" in regular expressions. When processing a regular expression that contains several fragments separated by this character, the engine treats the text as a match if it matches any of the fragments. Here is an example:

$ echo "This is a test" | awk '/test|exam/{print $0}'
$ echo "This is an exam" | awk '/test|exam/{print $0}'
$ echo "This is something else" | awk '/test|exam/{print $0}'


Boolean "or" in regular expressions

In this example, the regular expression searches the text for the words "test" or "exam". Note that there should be no spaces between the pattern fragments and the | symbol that separates them, because a space would be treated as part of the pattern.

Regular expression fragments can be grouped using parentheses. A grouped sequence of characters is treated by the engine as a single unit, so, for example, repetition metacharacters can be applied to it. Here's what it looks like:

$ echo "Like" | awk '/Like(Geeks)?/{print $0}'
$ echo "LikeGeeks" | awk '/Like(Geeks)?/{print $0}'


Grouping Regular Expression Fragments

In these examples, the word "Geeks" is enclosed in parentheses and followed by a question mark. Recall that the question mark means "0 or 1 repetition". As a result, the regular expression matches both the string "Like" and the string "LikeGeeks".
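Grouping also controls how far the "or" symbol reaches. The sketch below, with made-up sample lines and a hypothetical helper named sample, shows that without parentheses the ^ and $ anchors each belong to only one of the alternatives.

```shell
# sample: print four made-up test lines, one per line.
sample() {
    printf '%s\n' "test" "exam" "testing" "my exam"
}

# Without parentheses: "starts with test" OR "ends with exam".
sample | grep -E '^test|exam$'     # prints all four lines

# With parentheses: the whole line must be "test" or "exam".
sample | grep -E '^(test|exam)$'   # prints only "test" and "exam"
```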

Practical examples

Now that we've covered the basics of regular expressions, it's time to do something useful with them.

▍Counting the number of files

Let's write a bash script that counts the files located in the directories listed in the PATH environment variable. To do this, you first need to build a list of directory paths. Let's do this with sed, replacing colons with spaces:

$ echo $PATH | sed "s/:/ /g"
The substitute command of sed accepts regular expressions as search patterns. Here the pattern is as simple as it gets, a single colon, but nothing prevents you from using something more elaborate; it all depends on the task.
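As a small illustration of a less trivial pattern, suppose the list contains an empty entry, that is, two colons in a row. The value of sample below is made up, and the -E option (extended regular expressions) is assumed to be available in your sed, as it is in GNU sed.

```shell
# A made-up PATH-like value with an empty entry (two colons in a row)
sample="/usr/local/bin:/usr/bin::/bin"

# Literal colon: every colon becomes a space, so the empty entry leaves two spaces
echo "$sample" | sed 's/:/ /g'

# ERE with a plus sign: each run of colons becomes a single space
echo "$sample" | sed -E 's/:+/ /g'
```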
Now we need to go through the resulting list in a loop and perform the necessary actions to count the number of files there. The general scheme of the script will be as follows:

mypath=$(echo $PATH | sed 's/:/ /g')
for directory in $mypath
do
    : # the file-counting commands go here
done
Now let's write the full text of the script, using the ls command to get information about the number of files in each of the directories:

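#!/bin/bash
# Split PATH into a space-separated list of directories
mypath=$(echo $PATH | sed 's/:/ /g')
count=0
for directory in $mypath
do
    # List the directory and count its entries one by one
    check=$(ls $directory)
    for item in $check
    do
        count=$(( count + 1 ))
    done
    echo "$directory - $count"
    count=0
done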
When you run the script, some directories listed in PATH may turn out not to exist. In that case ls prints an error message for each of them, but this does not prevent the script from counting files in the directories that do exist.


File count

The main value of this example is that the same approach can be used to solve much more complex problems. Which ones you tackle depends on your needs.
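One possible variation on the same idea is sketched below. It is not the script from this article: the function name count_files is hypothetical, tr is swapped in for sed to put each directory on its own line, and missing directories are skipped instead of being reported with a count of 0. File names that contain newlines would still be miscounted.

```shell
# count_files [LIST]: print "<directory> - <count>" for every existing
# directory in a colon-separated LIST (PATH by default).
# The name and the skip-missing behavior are assumptions of this sketch.
count_files() {
    local list="${1:-$PATH}" directory count
    printf '%s\n' "$list" | tr ':' '\n' | while IFS= read -r directory; do
        # Skip empty entries and directories that do not exist
        [ -d "$directory" ] || continue
        # Count the entries that plain ls shows (hidden files are excluded)
        count=$(ls "$directory" | wc -l)
        echo "$directory - $((count))"
    done
}

count_files
```

Quoting "$directory" keeps directory names with spaces in one piece, which the word-splitting loop in the simpler script cannot do.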

▍Verifying email addresses

There are websites with huge collections of regular expressions for checking email addresses, phone numbers, and so on. However, it is one thing to take a ready-made expression, and quite another to create one yourself. So let's write a regular expression to validate email addresses. Let's start by analyzing the input data. For example, here is an address:

[email protected]
The username part, username, can consist of alphanumeric characters and a few other characters: a dot, a dash, an underscore, and a plus sign. The username is followed by the @ sign.

Armed with this knowledge, let's start building the regular expression from its left side, which serves to check the username. Here's what we got:

^([a-zA-Z0-9_.+-]+)@
This regular expression can be read as follows: "At the beginning of the line there must be at least one character from the set given in square brackets, followed by an @ sign."

Now it's the turn of the hostname part, hostname. Almost the same rules apply here as for the username (only the plus sign is dropped), so the pattern for it looks like this:

([a-zA-Z0-9_.-]+)
The top-level domain name is subject to special rules. It may contain only alphabetic characters, at least two of them (country-code domains, for example, have two letters) and no more than five. All this means that the pattern for checking the last part of the address looks like this:

\.([a-zA-Z]{2,5})$
You can read it like this: "First there must be a period, then from 2 to 5 alphabetic characters, and after that the line ends."

Having prepared the patterns for the individual parts of the regular expression, let's put them together:

^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$
All that remains is to test the result:

$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'
$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'


Validating an email address with regular expressions

The fact that the text passed to awk is displayed on the screen means that the system recognized it as an email address.
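In a script, the same check is more convenient as a function that returns a status instead of printing the line. Here is a minimal sketch built from the rules described in this section. The function name is_email, the variable email_re, and the sample addresses are hypothetical, and the character classes are written without backslashes, with the dash placed last so that it is taken literally.

```shell
# Pattern: username, @, hostname, a dot, and 2 to 5 letters at the end.
email_re='^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'

# is_email ADDRESS: succeed if ADDRESS matches the pattern.
is_email() {
    printf '%s\n' "$1" | grep -Eq "$email_re"
}

# Hypothetical sample addresses used only for this illustration
for address in user@example.com first.last+tag@mail.example.org user@example user@example.c; do
    if is_email "$address"; then
        echo "valid:   $address"
    else
        echo "invalid: $address"
    fi
done
```

The first two addresses are accepted. The third has no top-level domain, and the fourth has one that is too short, so both are rejected.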

Results

If the regular expression for checking email addresses that you encountered at the very beginning of the article seemed completely incomprehensible then, we hope that it no longer looks like a meaningless set of characters. If so, this material has served its purpose. Regular expressions are a topic you can keep studying all your life, but even the little that we have covered can already help you write scripts that process text in fairly advanced ways.

In this series of materials, we usually showed very simple examples of bash scripts that literally consisted of a few lines. Let's look at something bigger next time.

Dear readers! Do you use regular expressions when processing text in command line scripts?
