grep is one of the most frequently used commands in Unix. The name comes from the ed editor command g/re/p, which stands for “globally search for a regular expression and print.” In plain terms, grep prints the lines of its input that match a pattern supplied by the user.
In UNIX, the grep command outputs the lines that match a pattern. It can search for text in a single file or recursively through a directory tree. It also supports regular expressions and can show the context of a match by displaying the lines before and after it. grep can be used on its own or in conjunction with pipes.
Example when pipe is used: $ ls -l | grep rwxrwxrwx
Example when pipe isn’t used: $ grep -r "192.168.1.5" /etc/
The following are some of the scenarios in which you might want to use grep:
- Count the number of sequences in a multi-FASTA sequence file.
- Get the header lines of a FASTA sequence file.
- Look for a matching motif in a sequence file.
- Look for restriction sites in the sequence(s).
- Get all of the gene IDs from a multi-FASTA sequence file, among many other things.
Let’s now walk through some practical examples that show how grep handles basic tasks on sequence files:
1. Sequence counting:
In the FASTA format, every sequence has exactly one description line (defline) that begins with >, so the number of sequences in a file equals the number of deflines. You can therefore count the sequences by counting the lines that contain >, using grep's count option -c.
However, -c counts every line that contains a >, so a > on any line that is not a defline will inflate the count. To be safe, anchor the pattern to the start of the line with ^>.
To count the number of sequences in AT cDNA.fa and RefSeq.faa, we apply the following strategy:
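The counting step can be sketched as follows. This is a minimal illustration rather than the original commands: the files sample.fa and sample.faa and their contents are hypothetical stand-ins for real FASTA files, created in a temporary directory so the example runs on its own.

```shell
# Hypothetical sample files standing in for real nucleotide and protein FASTA files.
workdir=$(mktemp -d)
cat > "$workdir/sample.fa" <<'EOF'
>gene1 protein A -> product
ATGCGTACGT
>gene2 protein B
GGCCTTAAGG
>gene3 protein C
TTGACCGTAA
EOF
cat > "$workdir/sample.faa" <<'EOF'
>prot1 example protein
MKLVVAAGG
>prot2 example protein
MSTNPKPQR
EOF

# -c prints the number of matching lines; ^> only matches a ">" at the start of a line.
nt_count=$(grep -c '^>' "$workdir/sample.fa")
aa_count=$(grep -c '^>' "$workdir/sample.faa")
echo "nucleotide sequences: $nt_count"
echo "protein sequences: $aa_count"

# Given several files, grep -c reports one count per file, prefixed by the file name.
grep -c '^>' "$workdir/sample.fa" "$workdir/sample.faa"
```

Note that the first defline contains a second > and is still counted once, because -c counts lines rather than characters.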
2. Taking one list and subtracting it from another:
You can use grep to remove a short list of genes from a larger list by combining the options -F, -w, -v, and -f.
-F treats each pattern as a literal string rather than a regular expression, -w matches whole words only, -v inverts the match so that only the lines that do not match are printed, and -f filename.txt reads the patterns from filename.txt, one per line.
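A minimal sketch of the subtraction, assuming two hypothetical files: all_genes.txt (the bigger list) and remove.txt (the genes to drop). The gene IDs are made up for illustration.

```shell
workdir=$(mktemp -d)
# Hypothetical gene lists: all_genes.txt is the bigger list, remove.txt the IDs to drop.
printf 'AT1G01010\nAT1G01020\nAT1G01030\nAT1G010100\n' > "$workdir/all_genes.txt"
printf 'AT1G01010\nAT1G01030\n' > "$workdir/remove.txt"

# -F: literal strings, -w: whole words only, -v: print non-matching lines,
# -f: read the patterns from a file (one per line).
remaining=$(grep -F -w -v -f "$workdir/remove.txt" "$workdir/all_genes.txt")
echo "$remaining"
```

The ID AT1G010100 survives even though it starts with AT1G01010, which is the reason for including -w: without it, the shorter ID would match inside the longer one and remove it as well.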
3. Make a word count:
grep -c counts matching lines, not individual matches, so a word that appears more than once on a line is tallied only once. To count every occurrence, use the -o option, which prints each match on its own line, and count the output lines with wc -l.
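The difference between counting lines and counting occurrences can be seen in a small sketch. The file notes.txt and its contents are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical text file in which the search term appears twice on the first line.
printf 'GAATTC and GAATTC again\nno site here\none GAATTC\n' > "$workdir/notes.txt"

# -c counts lines that contain at least one match.
line_count=$(grep -c 'GAATTC' "$workdir/notes.txt")

# -o prints every match on its own line, so wc -l counts occurrences.
# tr strips the padding that some wc implementations add.
match_count=$(grep -o 'GAATTC' "$workdir/notes.txt" | wc -l | tr -d ' ')

echo "matching lines: $line_count, occurrences: $match_count"
```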
Printing only the matched pattern, instead of the complete line, can also give you a lot of useful information. For instance, to see how many times a term occurs on each line, combine -o with -n (which prefixes each match with its line number) and tally the repeated line numbers:
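One possible way to do this is sketched here; it is a reconstruction under assumptions, not necessarily the original command. The file reads.txt is hypothetical, and the final awk step is only there to print the result as line:count.

```shell
workdir=$(mktemp -d)
# Hypothetical input: ATG occurs twice on line 1, never on line 2, once on line 3.
printf 'ATGATGCC\nCCGGTT\nTTATGA\n' > "$workdir/reads.txt"

# grep -o -n emits one "LINE:MATCH" record per occurrence; cut keeps the line
# number, uniq -c tallies consecutive repeats, awk reorders to "LINE:COUNT".
per_line=$(grep -o -n 'ATG' "$workdir/reads.txt" | cut -d: -f1 | uniq -c | awk '{print $2 ":" $1}')
echo "$per_line"
```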
This will output the line number and the number of times the PATTERN appears in that line.
Let’s put grep to work. Look through the AT cDNA.fa file to see what kinds of sequences it contains. Do they all appear to come from the same organism? Which organism is it?
You may also use grep to find all the lines that contain the phrase you’re looking for. This is especially handy when searching through a large number of annotated sequences for a single gene.
You can also use grep to check whether a particular feature (restriction site, motif, etc.) occurs in your sequence of interest. The --color option highlights the matched text, which helps with this.
4. Look for a motif:
Use the --color option to look for the EcoRI site (GAATTC) in the NT21.fa file in the sequences directory. Also, check for a C2H2 zinc finger motif in the RefSeq.faa file (let’s say the zinc finger motif is CXXXCXXXXXXXXXXHXXXH). To create a more realistic pattern, you can either use dots to represent any amino acid or write a more elaborate regular expression with repetition counts.
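Both searches might look like the sketch below. The files nt_sample.fa and prot_sample.faa and their sequences are made up so that the example runs on its own. The regular expression is a direct translation of the simplified motif above, with each X written as a dot and runs of dots collapsed into counts. Because grep works line by line, a motif split across two lines of a wrapped FASTA record will not be found.

```shell
workdir=$(mktemp -d)
# Hypothetical sequences: nt1 contains GAATTC, p1 contains the simplified motif.
printf '>nt1\nTTGAATTCAA\n>nt2\nCCCCGGGG\n' > "$workdir/nt_sample.fa"
printf '>p1\nMKCPEGCGKAFSRSSDLHQRTHTG\n>p2\nMKLVVAAGG\n' > "$workdir/prot_sample.faa"

# Highlight the EcoRI site (colors appear only when printing to a terminal).
grep --color=auto 'GAATTC' "$workdir/nt_sample.fa"
ecori_lines=$(grep -c 'GAATTC' "$workdir/nt_sample.fa")

# C, any 3 residues, C, any 10 residues, H, any 3 residues, H.
# -E enables extended regular expressions, so {n} needs no backslashes.
grep -E --color=auto 'C.{3}C.{10}H.{3}H' "$workdir/prot_sample.faa"
znf_lines=$(grep -E -c 'C.{3}C.{10}H.{3}H' "$workdir/prot_sample.faa")

echo "lines with EcoRI site: $ecori_lines, lines with motif: $znf_lines"
```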
5. Identifying patterns that do not match:
You can also use grep to filter out the lines that contain your search term. For example, if you want to look at genes that are not on chromosome 1, use the -v option to exclude every line that mentions chromosome 1.
Compare the results of the same search with and without -v.
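A small sketch of the two searches, assuming a hypothetical tab-separated table genes.tsv with a gene name and a chromosome label on each line. The -w option is an addition here: it keeps chr1 from also matching chr10.

```shell
workdir=$(mktemp -d)
# Hypothetical gene table: gene name, then chromosome.
printf 'geneA\tchr1\ngeneB\tchr2\ngeneC\tchr10\ngeneD\tchr1\n' > "$workdir/genes.tsv"

# Lines that mention chr1 as a whole word.
on_chr1=$(grep -w 'chr1' "$workdir/genes.tsv")

# -v inverts the match: every line that does NOT mention chr1.
not_chr1=$(grep -v -w 'chr1' "$workdir/genes.tsv")

echo "On chr1:"
echo "$on_chr1"
echo "Not on chr1:"
echo "$not_chr1"
```

Together the two outputs account for every line of the file exactly once.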
6. Trying to find many patterns:
grep can search for several patterns in the same command. There are two cases:
- Any one of the three patterns (OR): join the patterns with | inside a single pattern, using extended regular expressions (grep -E).
- All three patterns are present (AND): run one grep per pattern and chain them with pipes.
In the OR case, the | inside the pattern means “or.” In the AND case, the | between commands is a shell pipe that passes the output of one grep to the next, so only the lines that pass every filter are printed.
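The two cases can be sketched like this, using a hypothetical annotation file annot.txt and three made-up search terms.

```shell
workdir=$(mktemp -d)
# Hypothetical annotation lines.
printf 'kinase zinc membrane\nkinase only\nzinc finger\nunrelated line\n' > "$workdir/annot.txt"

# OR: with -E, the | inside the quotes separates alternative patterns.
any_match=$(grep -E 'kinase|zinc|membrane' "$workdir/annot.txt")

# AND: each grep filters the output of the previous one through a shell pipe.
all_match=$(grep 'kinase' "$workdir/annot.txt" | grep 'zinc' | grep 'membrane')

echo "Any pattern:"
echo "$any_match"
echo "All patterns:"
echo "$all_match"
```

Without -E, GNU grep accepts the alternation written as \| instead, but that form is an extension and may not work in every grep implementation.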
Try to decipher the following command lines (and keep track of your findings if necessary):
To show how it works, try some regular expressions based on the nucleotide/protein sequences supplied before.
7. Identifying empty lines
Blank lines can also be found with grep. In a regular expression, ^ denotes the start of a line and $ denotes the end, so searching for ^$ finds lines with no content (blank lines).
Similarly, to get rid of blank lines, add the -v option so that only the non-blank lines are printed.
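Both operations in a minimal sketch, with a hypothetical file gappy.txt that contains three empty lines:

```shell
workdir=$(mktemp -d)
# Hypothetical file: three lines of text separated by three empty lines.
printf 'first\n\nsecond\n\n\nthird\n' > "$workdir/gappy.txt"

# ^$ matches lines where the end immediately follows the start.
blank_count=$(grep -c '^$' "$workdir/gappy.txt")

# -v keeps everything except the blank lines.
grep -v '^$' "$workdir/gappy.txt" > "$workdir/clean.txt"
clean_lines=$(wc -l < "$workdir/clean.txt" | tr -d ' ')

echo "blank lines: $blank_count, lines kept: $clean_lines"
```

Lines that contain only spaces or tabs are not matched by ^$. A pattern such as '^[[:space:]]*$' would catch those as well.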
8. Locating all files containing a term
When working with a large number of files, you may find yourself in a scenario where you only want to handle a subset of them. You may use grep to rapidly locate such files if you know a specific phrase that appears in them.
-r searches all files in subfolders recursively, and -l prints only the name of each file that contains a match, rather than the matching lines (grep stops reading a file at its first match). The “.” at the end tells grep to start the search in the current directory. The result is the subset of files that are relevant to you.
Replace -l with -L if you want the files that do not contain the term (the file-level counterpart of the -v option). This lists only the files with no match.
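As a sketch, assuming a hypothetical directory tree named project with three small files:

```shell
workdir=$(mktemp -d)
# Hypothetical directory tree: two files mention the term, one does not.
mkdir -p "$workdir/project/sub"
printf 'contains kinase\n' > "$workdir/project/a.txt"
printf 'nothing relevant\n' > "$workdir/project/b.txt"
printf 'kinase again\nkinase twice\n' > "$workdir/project/sub/c.txt"

# -r: descend into subdirectories, -l: list files with a match, ".": start here.
with_term=$(cd "$workdir/project" && grep -r -l 'kinase' . | sort)

# -L: list files without any match.
without_term=$(cd "$workdir/project" && grep -r -L 'kinase' . | sort)

echo "Files containing the term:"
echo "$with_term"
echo "Files not containing the term:"
echo "$without_term"
```

The file sub/c.txt is listed once even though it matches twice, because -l reports file names rather than matches.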
9. Printing lines before and after a match
A standard grep search returns only the line that contains the match. To understand the context of the match, it is sometimes necessary to print the lines before or after it. The -B (before) and -A (after) options set how many such lines to display.
For example, -B 10 prints the 10 lines before each match, followed by the matching line itself.
Likewise, -A 10 prints the matching line followed by the 10 lines after it.
You can also combine -B and -A in one command to print lines on both sides of the match.
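The context options can be sketched on a hypothetical seven-line file, log.txt. Smaller numbers than 10 are used here only so that the output stays short.

```shell
workdir=$(mktemp -d)
# Hypothetical file with the search term on line 4.
printf 'line1\nline2\nline3\nTARGET\nline5\nline6\nline7\n' > "$workdir/log.txt"

# -B N: N lines before the match, then the matching line.
before=$(grep -B 2 'TARGET' "$workdir/log.txt")

# -A N: the matching line, then N lines after it.
after=$(grep -A 1 'TARGET' "$workdir/log.txt")

# -B and -A together: context on both sides.
both=$(grep -B 1 -A 1 'TARGET' "$workdir/log.txt")

# -C N is shorthand for the same number of lines on both sides.
around=$(grep -C 1 'TARGET' "$workdir/log.txt")

echo "$before"
echo "--"
echo "$after"
echo "--"
echo "$both"
```

When a file has several matches that are far apart, grep separates the groups of context lines with a line containing --.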
In a nutshell, the grep command in the Linux operating system allows you to search through files. It may be used to search through a single file or a collection of files.
Now you’re ready to use the grep command like a pro on Linux!