All about Unix Shell

Thursday, August 21, 2008

Sed revamped

Sed reads its input from stdin (i.e., the console) or from files (or both), and sends the results to stdout (console or screen).

Sed's normal behavior is to print the entire file, including the parts that haven't been altered, unless you use the -n switch.

The "-n" stands for "no output".
This switch is almost always used in conjunction with a 'p' command somewhere, which says to print only the sections of the file that have been specified.

- When you want to review only a few lines of a file, use this. It will display the first 3 lines.
sed -n 1,3p file # the 'p' stands for print

- Likewise, sed can show everything BUT those particular lines, without changing the file on disk:
sed 1,3d file # the 'd' stands for delete

- An exclamation mark '!' after a regex ('/RE/!') or line number will select all lines that do NOT match that address.
. ^ $ [ ] { } ( ) ? + *
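
The -n/p, d and ! forms described above can be tried on a throwaway file. This is a minimal sketch; the five-line sample file and the variable names are assumptions made up for the illustration.

```shell
# Build a five-line sample file (hypothetical contents).
sample=$(mktemp)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$sample"

first_three=$(sed -n '1,3p' "$sample")     # print only lines 1-3
without_first=$(sed '1,3d' "$sample")      # print everything except lines 1-3
not_matching=$(sed -n '/o/!p' "$sample")   # print lines that do NOT contain "o"

printf '%s\n' "$first_three" '---' "$without_first" '---' "$not_matching"
rm -f "$sample"
```

In all three cases the file on disk is left untouched; sed only changes what is written to stdout.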

On the RHS of a substitution, three things are interpolated: ampersand (&), backreferences, and options for special seds.

An ampersand on the RHS is replaced by the entire expression matched on the LHS. There is never any reason to use grouping like this:
s/\(some-complex-regex\)/one two \1 three/

since you can do this instead:
s/some-complex-regex/one two & three/
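
A small sketch of the ampersand in action. The sample strings and variable names are invented for the example.

```shell
# & stands for the whole match, so no \( \) grouping is needed.
wrapped=$(printf 'order 66 shipped\n' | sed 's/[0-9][0-9]*/(&)/')
echo "$wrapped"

# A literal ampersand in the replacement has to be escaped as \&.
literal=$(printf 'salt pepper\n' | sed 's/ / \& /')
echo "$literal"
```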

- Substitution switches

Standard versions of sed support 4 main flags or switches which may be added to the end of an "s///" command. They are:
N - Replace the Nth match of the pattern on the LHS, where N is an integer between 1 and 512. If N is omitted, the default is to replace the first match only.
g - Global replace of all matches to the pattern.
p - Print the results to stdout, even if -n switch is used.
w file - Write the pattern space to 'file' if a replacement was done. If the file already exists when the script is executed, it is overwritten. During script execution, w appends to the file for each match.
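
The four flags can be compared side by side. This is only a sketch: the sample text is made up, and the w flag writes to a temporary file created with mktemp rather than to any real log.

```shell
line='foo foo foo foo'

second=$(printf '%s\n' "$line" | sed 's/foo/bar/2')   # replace the 2nd match only
all=$(printf '%s\n' "$line" | sed 's/foo/bar/g')      # replace every match

# With -n, the p flag prints only the lines where a replacement happened.
changed=$(printf 'foo\nbaz\nfoo\n' | sed -n 's/foo/bar/p')

# The w flag writes each changed line to the named file.
log=$(mktemp)
printf 'foo\nbaz\n' | sed -n "s/foo/bar/w $log"
written=$(cat "$log")
rm -f "$log"

printf '%s\n' "$second" "$all" "$changed" "$written"
```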

SED ONE LINERS

- sed G file # Double space a file
- sed 'G;G' file # Triple space a file
- sed 's/^[ \t]*//' file # Delete leading whitespace (spaces/tabs) from the front of each line.
- sed 's/[ \t]*$//' file # Delete trailing whitespace (spaces/tabs) from the end of each line.
- sed 's/^[ \t]*//;s/[ \t]*$//' file # Delete BOTH leading and trailing whitespace from each line
- sed 's/foo/bar/' file # replaces only 1st instance in a line
- sed 's/foo/bar/4' file # replaces only 4th instance in a line
- sed 's/foo/bar/g' file # replaces ALL instances within a line
- sed -e :a -e '/\\$/N; s/\\\n//; ta' file # If a line ends with a backslash, join the next line to it.
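
A few of the one-liners above, run against made-up input. One assumption to flag: not every sed understands \t inside brackets, so this sketch puts a literal tab into a shell variable instead, which should work with any sed.

```shell
# Double spacing: G appends a newline plus the (empty) hold space to each line.
doubled=$(printf 'a\nb\n' | sed G)

# Trim both ends, using a literal tab character for portability.
tab=$(printf '\t')
trimmed=$(printf '  \tpadded text \t \n' | sed "s/^[ $tab]*//;s/[ $tab]*\$//")

# Join a line ending in a backslash with the line that follows it.
joined=$(printf 'first \\\nsecond\nthird\n' | sed -e :a -e '/\\$/N; s/\\\n//; ta')

printf '%s\n' "$doubled" "[$trimmed]" "$joined"
```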

Addressing and address ranges
Sed commands may have an optional "address" or "address range" prefix. If there is no address or address range given, then the command is applied to all the lines of the input file or text stream.

5d # delete line 5 only
5!d # delete every line except line 5
/RE/s/LHS/RHS/g # substitute only if RE occurs on the line
/^$/b label # if the line is blank, branch to ':label'
/./!b label # ... another way to write the same command
\%.%!b label # ... yet another way to write this command
$!N # on all lines but the last, get the Next line
/tape$/ # matches the word 'tape' at the end of a line
/tape$deck/ # matches the word 'tape$deck' with a literal '$'
/tape\ndeck/ # matches 'tape' and 'deck' with a newline between
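
Some of these addresses are easier to follow when run. A rough sketch, with invented sample data:

```shell
input=$(mktemp)
printf 'l1\nl2\nl3\nl4\nl5\nl6\n' > "$input"

without_5=$(sed '5d' "$input")    # every line except line 5
only_5=$(sed '5!d' "$input")      # line 5 and nothing else
rm -f "$input"

# The substitution runs only on lines that match the /^cat/ address.
scoped=$(printf 'cat dog\nbird dog\n' | sed '/^cat/s/dog/fish/g')

# $!N appends the next line on every line but the last, pairing lines up.
paired=$(printf 'a\nb\nc\n' | sed '$!N;s/\n/ /')

printf '%s\n' "$only_5" "$scoped" "$paired"
```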

Monday, August 11, 2008

Regular Expression

Structure of RE :
Anchors are used to specify the position of the pattern in relation to a line of text.
Character Sets match one or more characters in a single position.
Modifiers specify how many times the previous character set is repeated.

There are also two types of regular expressions:
- "Basic" regular expression,
- "extended" regular expression.

Utility	Regular Expression Type
vi	Basic
sed	Basic
grep	Basic
csplit	Basic
dbx	Basic
dbxtool	Basic
more	Basic
ed	Basic
expr	Basic
lex	Basic
pg	Basic
nl	Basic
rdist	Basic
awk	Extended
nawk	Extended
egrep	Extended
EMACS	EMACS Regular Expressions
PERL	PERL Regular Expressions

Anchor Characters: ^ and $

Pattern	Matches
^A	"A" at the beginning of a line
A$	"A" at the end of a line
A^	"A^" anywhere on a line
$A	"$A" anywhere on a line
^^	"^" at the beginning of a line
$$	"$" at the end of a line
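
The anchor rules can be checked with grep, which uses basic regular expressions by default. A sketch with three made-up lines; the counts are the number of matching lines.

```shell
text=$(printf 'Apple\nbanana A\nA^B\n')

starts=$(printf '%s\n' "$text" | grep -c '^A')    # lines beginning with "A"
ends=$(printf '%s\n' "$text" | grep -c 'A$')      # lines ending with "A"
literal=$(printf '%s\n' "$text" | grep -c 'A^')   # "^" is literal when not at the start

echo "starts=$starts ends=$ends literal=$literal"
```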

Match any character with .
The character "." is one of those special meta-characters. By itself it will match any character, except the end-of-line character. The pattern that will match a line with a single character is
^.$

Regular Expression   Meaning
-----------------------------------------------------
. - A single character (except newline)
^ - Beginning of line
$ - End of line
[...] - Range of characters
* - Zero or more duplicates
\< - Beginning of word
\> - End of word
\(..\) - Remembers pattern
\1..\9 - Recalls pattern
+ - One or more duplicates
? - Zero or one duplicate
\{M,N\} - M to N duplicates
(...|...) - Shows alternation
\(...\|...\) - Shows alternation
\w - Matches a letter in a word
\W - Opposite of \w

---------------------------------------------------
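
A quick way to see the basic and extended forms next to each other is grep versus grep -E. This is a sketch over an invented word list; each variable holds the number of matching lines.

```shell
words=$(printf 'color\ncolour\ncolouur\ncat\ncart\n')

# Basic RE: \{1,2\} repeats the preceding "u" one or two times.
bre_interval=$(printf '%s\n' "$words" | grep -c 'colou\{1,2\}r')

# Extended RE: ? makes the preceding "u" optional.
ere_optional=$(printf '%s\n' "$words" | grep -Ec 'colou?r')

# Extended RE: (...|...) is alternation.
ere_alt=$(printf '%s\n' "$words" | grep -Ec '^(cat|cart)$')

# Basic RE: \(.\) remembers one character and \1 recalls it (a doubled letter).
bre_backref=$(printf '%s\n' "$words" | grep -c '\(.\)\1')

echo "$bre_interval $ere_optional $ere_alt $bre_backref"
```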

Sed (Stream Editor)

The slash as a delimiter
Say you want to change /usr/local/bin to /common/bin. You could use the backslash to quote the slash:
sed 's/\/usr\/local\/bin/\/common\/bin/' new
Gulp. Some call this a 'Picket Fence' and it's ugly. It is easier to read if you use an underline instead of a slash as a delimiter:
sed 's_/usr/local/bin_/common/bin_' new
Some people use colons:
sed 's:/usr/local/bin:/common/bin:' new
Others use the "|" character.
sed 's|/usr/local/bin|/common/bin|' new
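
Whatever delimiter is chosen, the result is the same. A small check, using a made-up PATH-style string as input:

```shell
path='PATH=/usr/local/bin:/bin'

with_slash=$(printf '%s\n' "$path" | sed 's/\/usr\/local\/bin/\/common\/bin/')
with_underscore=$(printf '%s\n' "$path" | sed 's_/usr/local/bin_/common/bin_')
with_pipe=$(printf '%s\n' "$path" | sed 's|/usr/local/bin|/common/bin|')

printf '%s\n' "$with_slash" "$with_underscore" "$with_pipe"
```

The only constraint is that the delimiter must be escaped if it also appears in the pattern or the replacement, so pick a character that does not occur in either.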

Using \1 to keep part of the pattern
To review, the escaped parentheses (that is, parentheses with backslashes before them) remember portions of the regular expression. You can use this to exclude part of the regular expression. The "\1" is the first remembered pattern, and the "\2" is the second remembered pattern. Sed has up to nine remembered patterns.

If you wanted to keep the first word of a line, and delete the rest of the line, mark the important part with the parentheses:
sed 's/\([a-z]*\).*/\1/'

The "\1" doesn't have to be in the replacement string (in the right hand side). It can be in the pattern you are searching for (in the left hand side). If you want to eliminate duplicated words, you can try:
sed 's/\([a-z]*\) \1/\1/'
You can have up to nine values: "\1" thru "\9."

"[^ ]*" matches a run of zero or more characters other than a space.

This next example keeps the first word on the line but deletes the second:
sed 's/\([a-zA-Z]*\) \([a-zA-Z]*\) /\1 /' new
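
The three remembered-pattern examples, run on sample strings invented for the purpose:

```shell
# Keep the first word, drop the rest of the line.
first_word=$(printf 'hello big world\n' | sed 's/\([a-z]*\).*/\1/')

# \1 on the left-hand side: collapse a word that is immediately repeated.
dedup=$(printf 'the the cat\n' | sed 's/\([a-z]*\) \1/\1/')

# Keep the first word but delete the second.
dropped=$(printf 'alpha beta gamma\n' | sed 's/\([a-zA-Z]*\) \([a-zA-Z]*\) /\1 /')

printf '%s\n' "$first_word" "$dedup" "$dropped"
```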

Tuesday, August 5, 2008

Shell Optimizations

- Check the loops in the script. Time consumed by repetitive operations adds up quickly. If at all possible, remove time-consuming operations from within loops.

- Use builtin commands in preference to system commands. Builtins execute faster and usually do not launch a subshell when invoked.

- Avoid unnecessary commands, particularly in a pipe.

- Use the time and times tools to profile computation-intensive commands.

- Try to minimize file I/O. Bash is not particularly efficient at handling files, so consider using more appropriate tools for this within the script, such as awk or Perl.

- Write your scripts in a modular and coherent form, so they can be reorganized and tightened up as necessary.
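
As an illustration of the loop and builtin advice, here is a sketch that strips the directory and extension from a few hypothetical paths in two ways. Both loops give the same answer, but the first starts two external processes per iteration while the second stays inside the shell.

```shell
paths='/tmp/a.txt /var/log/b.log /home/user/c.conf'

# Slower: basename and sed are external commands, run once per path.
slow=''
for p in $paths; do
    name=$(basename "$p" | sed 's/\.[^.]*$//')
    slow="$slow$name "
done

# Faster: parameter expansion is built into the shell, so nothing is forked.
fast=''
for p in $paths; do
    name=${p##*/}      # drop everything up to the last slash
    name=${name%.*}    # drop the trailing ".ext"
    fast="$fast$name "
done

echo "slow: $slow"
echo "fast: $fast"
```

Prefixing each loop with time on a much longer list is one way to measure the difference on a given machine.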

Gotchas

- Assigning reserved words or characters to variable names.

- Using a hyphen or other reserved characters in a variable name (or function name).

- Using the same name for a variable and a function. This can make a script difficult to understand.

- Using whitespace inappropriately. In contrast to other programming languages, Bash can be quite finicky about whitespace.

- Not terminating with a semicolon the final command in a code block within curly brackets.

- Assuming uninitialized variables (variables before a value is assigned to them) are "zeroed out". An uninitialized variable has a value of null, not zero.

- Mixing up = and -eq in a test. Remember, = is for comparing strings and -eq for integers.

- Misusing string comparison operators.

- Sometimes variables within "test" brackets ([ ]) need to be quoted (double quotes). Failure to do so may cause unexpected behavior.

- Attempting to use - as a redirection operator (which it is not) will usually result in an unpleasant surprise.

- Using Bash-specific functionality in a Bourne shell script (#!/bin/sh) may cause unexpected behavior. Some Linux systems link sh to bash, but others (Debian and Ubuntu, for example) link it to dash, and a generic UNIX machine may use yet another shell.

- Putting whitespace in front of the terminating limit string of a here document will cause unexpected behavior in a script.

- Putting more than one echo statement in a function whose output is captured.

- A script may not export variables back to its parent process, the shell, or to the environment.

- Output buffering can bite when writing the stdout of a tail -f piped to grep.
tail -f /var/log/messages | grep "$ERROR_MSG" >> error.log
# "error.log" may stay empty for a long time, because grep buffers its output when stdout is not a terminal.

- Using "suid" commands within scripts is risky, as it may compromise system security.

- Bash does not handle the double slash (//) string correctly.
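
Four of the gotchas above, reduced to a sketch that can be run as-is. The variable names are placeholders chosen for the example.

```shell
# An uninitialized variable is empty (null), not zero.
unset count
if [ -z "$count" ]; then unset_is_empty=yes; else unset_is_empty=no; fi

# = compares strings, -eq compares integers.
if [ "01" = "1" ]; then string_equal=yes; else string_equal=no; fi
if [ "01" -eq "1" ]; then int_equal=yes; else int_equal=no; fi

# Quoting keeps a value with spaces as a single word inside [ ].
name='two words'
if [ "$name" = 'two words' ]; then quoted_ok=yes; else quoted_ok=no; fi

# A child process cannot export a variable back to its parent.
flag=parent
sh -c 'flag=child; export flag'

echo "$unset_is_empty $string_equal $int_equal $quoted_ok $flag"
```

Removing the double quotes around $name in the fourth test makes the shell see two words where [ expects one, which typically produces an error instead of a comparison.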

Debug Shell Script

Tools for debugging non-working scripts include

1. Inserting echo statements at critical points in the script to trace the variables, and otherwise give a snapshot of what is going on.

2. Using the tee filter to check processes or data flows at critical points.

3. Setting option flags -n -v -x
sh -n scriptname checks for syntax errors without actually running the script.
sh -v scriptname echoes each command before executing it.
sh -x scriptname echoes the result of each command, but in an abbreviated manner.

4. Using an "assert" function to test a variable or condition at critical points in a script.

5. Using the $LINENO variable and the caller builtin.

6. Trapping at exit.
The exit command in a script triggers a signal 0, terminating the process, that is, the script itself.
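
Several of these tools fit in one short sketch. The assert function here is a hypothetical helper written for this example, not a shell builtin, and the scripts being checked are throwaway one-liners.

```shell
# assert DESCRIPTION COMMAND...: run COMMAND, complain on stderr if it fails.
assert() {
    desc=$1
    shift
    if "$@"; then
        return 0
    fi
    echo "assertion failed: $desc" >&2
    return 1
}

count=3
if assert "count is positive" [ "$count" -gt 0 ]; then assert_ok=yes; else assert_ok=no; fi
if assert "count is above ten" [ "$count" -gt 10 ] 2>/dev/null; then assert_bad=yes; else assert_bad=no; fi

# sh -n parses a deliberately broken script without running any of it.
broken=$(mktemp)
printf 'echo "should never run"\nif true; then\n' > "$broken"
if sh -n "$broken" 2>/dev/null; then syntax=ok; else syntax=broken; fi
rm -f "$broken"

# sh -x writes a trace of each command to stderr, prefixed with "+".
xtrace=$(sh -x -c 'x=5; echo "$x"' 2>&1)

# A trap on EXIT runs when the script exits, whatever the exit status.
status=0
exit_log=$(sh -c 'trap "echo cleanup ran" EXIT; echo working; exit 3') || status=$?

echo "$assert_ok $assert_bad $syntax $status"
echo "$exit_log"
```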

Monday, August 4, 2008

/dev/null

Uses of /dev/null
Think of /dev/null as a black hole. It is essentially the equivalent of a write-only file. Everything written to it disappears. Attempts to read or output from it result in nothing. All the same, /dev/null can be quite useful from both the command line and in scripts.
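
Three common uses, as a sketch. The nonexistent path and the temporary file are stand-ins picked for the demonstration.

```shell
# Discard both stdout and stderr; the exit status is still available.
if ls /nonexistent-path-for-demo > /dev/null 2>&1; then found=yes; else found=no; fi

# Reading from /dev/null gives immediate end-of-file, that is, zero bytes.
bytes=$(wc -c < /dev/null)

# Empty a file without deleting it.
log=$(mktemp)
echo 'old contents' > "$log"
cat /dev/null > "$log"
size=$(wc -c < "$log")
rm -f "$log"

echo "found=$found bytes=$bytes size=$size"
```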