Thursday, August 21, 2008

Sed revamped

Sed reads its input from stdin (i.e., the console) or from files (or both), and sends the results to stdout (console or screen).

Sed's normal behavior is to print the entire file, including the parts that haven't been altered, unless you use the -n switch.

The "-n" stands for "no output".
This switch is almost always used in conjunction with a 'p' command somewhere, which says to print only the sections of the file that have been specified.

- When you want to review only a few lines of a file, use this. It displays the first 3 lines.
sed -n 1,3p file # the 'p' stands for print

- Likewise, sed can show everything else BUT those particular lines, without physically changing the file on the disk:
sed 1,3d file # the 'd' stands for delete


- An exclamation mark '!' after a regex ('/RE/!') or line number will select all lines that do NOT match that address.
. ^ $ [ ] { } ( ) ? + *
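
The three forms above are easy to compare on a throwaway file. This is only a sketch: the temporary file and its five lines are made up for illustration.

```shell
# Build a five-line sample file (contents are illustrative only).
sample=$(mktemp)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$sample"

first_three=$(sed -n '1,3p' "$sample")   # print only lines 1-3
without_top=$(sed '1,3d' "$sample")      # print everything except lines 1-3
negated=$(sed -n '1,3!p' "$sample")      # '!' inverts the address, same result as 1,3d

echo "$first_three"
echo "$without_top"
rm -f "$sample"
```

The file on disk is never modified; sed only changes what is written to stdout.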

On the right-hand side (RHS) of a substitution, three things are interpolated: ampersand (&), backreferences, and options for special seds.

An ampersand on the RHS is replaced by the entire expression matched on the LHS. There is never any reason to use grouping like this:
s/\(some-complex-regex\)/one two \1 three/

since you can do this instead:
s/some-complex-regex/one two & three/
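
A small illustration of the ampersand, assuming a made-up input string:

```shell
# '&' in the replacement stands for whatever the pattern matched.
wrapped=$(echo "error 404 found" | sed 's/[0-9][0-9]*/<&>/')
echo "$wrapped"
```

The matched digits are kept and wrapped in angle brackets, with no \( \) grouping needed.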

- Substitution switches

Standard versions of sed support 4 main flags or switches which may be added to the end of an "s///" command. They are:
N - Replace the Nth match of the pattern on the LHS, where N is an integer between 1 and 512. If N is omitted, the default is to replace the first match only.
g - Global replace of all matches to the pattern.
p - Print the results to stdout, even if -n switch is used.
w file - Write the pattern space to 'file' if a replacement was done. If the file already exists when the script is executed, it is overwritten. During script execution, w appends to the file for each match.
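
The four flags can be tried side by side. This is a sketch with made-up input; the log file is a temporary file created only for the demonstration.

```shell
line="foo foo foo foo"
second=$(echo "$line" | sed 's/foo/bar/2')    # only the 2nd match
all=$(echo "$line" | sed 's/foo/bar/g')       # every match

# With -n, the 'p' flag prints only the lines where a substitution happened.
changed=$(printf 'foo\nbaz\n' | sed -n 's/foo/bar/p')

# The 'w' flag writes each changed line to a file (temporary file, for illustration).
log=$(mktemp)
printf 'foo\nbaz\nfoo\n' | sed -n "s/foo/bar/w $log"
written=$(cat "$log")
rm -f "$log"

echo "$second"; echo "$all"; echo "$changed"; echo "$written"
```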


SED ONE LINERS

- sed G file # Double space a file
- sed 'G;G' file # Triple space a file
- sed 's/^[ ^t]*//' file # Delete leading whitespace (spaces/tabs) from the front of each line. Here ^t stands for a literal tab character.
- sed 's/[ ^t]*$//' file # Delete trailing whitespace (spaces/tabs) from the end of each line.
- sed 's/^[ ^t]*//;s/[ ^t]*$//' file # Delete BOTH leading and trailing whitespace from each line
- sed 's/foo/bar/' file # replaces only 1st instance in a line
- sed 's/foo/bar/4' file # replaces only 4th instance in a line
- sed 's/foo/bar/g' file # replaces ALL instances within a line
- sed -e :a -e '/\\$/N; s/\\\n//; ta' file # If a line ends with a backslash, join the next line to it.
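
A few of these one-liners, run on made-up input. The tab variable is my own device for getting a literal tab into the bracket expression (the spot where the notes write ^t); it is not part of the original one-liners.

```shell
tab=$(printf '\t')   # a literal tab character

doubled=$(printf 'a\nb\n' | sed G)
trimmed=$(printf '  \tpadded text \t \n' | sed "s/^[ $tab]*//;s/[ $tab]*\$//")
joined=$(printf 'first \\\nsecond\nthird\n' | sed -e :a -e '/\\$/N; s/\\\n//; ta')

echo "$doubled"
echo "$trimmed"
echo "$joined"
```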


Addressing and address ranges
Sed commands may have an optional "address" or "address range" prefix. If there is no address or address range given, then the command is applied to all the lines of the input file or text stream.

5d # delete line 5 only
5!d # delete every line except line 5
/RE/s/LHS/RHS/g # substitute only if RE occurs on the line
/^$/b label # if the line is blank, branch to ':label'
/./!b label # ... another way to write the same command
\%.%!b label # ... yet another way to write this command
$!N # on all lines but the last, get the Next line
/tape$/ # matches the word 'tape' at the end of a line
/tape$deck/ # matches the word 'tape$deck' with a literal '$'
/tape\ndeck/ # matches 'tape' and 'deck' with a newline between
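
Some of these addresses in action, as a sketch on made-up input:

```shell
nums=$(printf '1\n2\n3\n4\n5\n6\n')

no_five=$(echo "$nums" | sed '5d')        # everything except line 5
only_five=$(echo "$nums" | sed '5!d')     # line 5 only

# Substitute only on lines that match the address /^#/.
commented=$(printf '# foo\nfoo\n' | sed '/^#/s/foo/bar/g')

# $!N: on every line but the last, append the next line, then join the pair.
pairs=$(echo "$nums" | sed '$!N;s/\n/,/')

echo "$only_five"; echo "$commented"; echo "$pairs"
```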

Monday, August 11, 2008

Regular Expression

Structure of RE :
Anchors are used to specify the position of the pattern in relation to a line of text.
Character Sets match one or more characters in a single position.
Modifiers specify how many times the previous character set is repeated.

There are also two types of regular expressions:
- "Basic" regular expression,
- "extended" regular expression.

Utility   Regular Expression Type
vi        Basic
sed       Basic
grep      Basic
csplit    Basic
dbx       Basic
dbxtool   Basic
more      Basic
ed        Basic
expr      Basic
lex       Basic
pg        Basic
nl        Basic
rdist     Basic
awk       Extended
nawk      Extended
egrep     Extended
EMACS     EMACS Regular Expressions
PERL      PERL Regular Expressions

Anchor Characters: ^ and $

Pattern   Matches
^A        "A" at the beginning of a line
A$        "A" at the end of a line
A^        "A^" anywhere on a line
$A        "$A" anywhere on a line
^^        "^" at the beginning of a line
$$        "$" at the end of a line

Match any character with .
The character "." is one of those special meta-characters. By itself it will match any character, except the end-of-line character. The pattern that will match a line with a single character is
^.$
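
The anchors and the dot can be checked with grep -c, which counts matching lines. The four input lines are made up for illustration.

```shell
lines=$(printf 'A\nAB\nBA\n^A\n')

starts=$(echo "$lines" | grep -c '^A')     # lines beginning with A: "A", "AB"
ends=$(echo "$lines" | grep -c 'A$')       # lines ending with A: "A", "BA", "^A"
single=$(echo "$lines" | grep -c '^.$')    # lines holding exactly one character: "A"

echo "$starts $ends $single"
```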

Regular Expression   Meaning
-----------------------------------------------------
.             - A single character (except newline)
^             - Beginning of line
$             - End of line
[...]         - Range of characters
*             - Zero or more duplicates
\<            - Beginning of word
\>            - End of word
\(..\)        - Remembers pattern
\1..\9        - Recalls pattern
+             - One or more duplicates
?             - Zero or one duplicate
\{M,N\}       - M to N duplicates
(...|...)     - Shows alternation
\(...\|...\)  - Shows alternation
\w            - Matches a letter in a word
\W            - Opposite of \w

---------------------------------------------------

Sed (Stream Editor)

The slash as a delimiter
Say you want to change /usr/local/bin to /common/bin. You could use the backslash to quote the slash:
sed 's/\/usr\/local\/bin/\/common\/bin/' new
Gulp. Some call this a 'Picket Fence' and it's ugly. It is easier to read if you use an underline instead of a slash as a delimiter:
sed 's_/usr/local/bin_/common/bin_' new
Some people use colons:
sed 's:/usr/local/bin:/common/bin:' new
Others use the "|" character.
sed 's|/usr/local/bin|/common/bin|' new
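
A quick check that the delimiter is purely a matter of readability. The path used here is a made-up example.

```shell
path="/usr/local/bin/tool"

with_slash=$(echo "$path" | sed 's/\/usr\/local\/bin/\/common\/bin/')
with_pipe=$(echo "$path" | sed 's|/usr/local/bin|/common/bin|')

echo "$with_slash"
echo "$with_pipe"
```

Both commands produce the same result; only the amount of escaping differs.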


Using \1 to keep part of the pattern
To review, the escaped parentheses (that is, parentheses with backslashes before them) remember portions of the regular expression. You can use this to exclude part of the regular expression. The "\1" is the first remembered pattern, and the "\2" is the second remembered pattern. Sed has up to nine remembered patterns.

If you wanted to keep the first word of a line, and delete the rest of the line, mark the important part with the parentheses:
sed 's/\([a-z]*\).*/\1/'

The "\1" doesn't have to be in the replacement string (in the right hand side). It can be in the pattern you are searching for (in the left hand side). If you want to eliminate duplicated words, you can try:
sed 's/\([a-z]*\) \1/\1/'
You can have up to nine values: "\1" thru "\9."

"[^ ]*" - matches a run of characters that contains no space.

This next example keeps the first word on the line but deletes the second:
sed 's/\([a-zA-Z]*\) \([a-zA-Z]*\) /\1 /' new
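
The three substitutions from this section, run on made-up input strings:

```shell
# Keep the first word, drop the rest of the line.
first=$(echo "hello world again" | sed 's/\([a-z]*\).*/\1/')

# \1 on the left-hand side: collapse a duplicated word.
dedup=$(echo "the the cat" | sed 's/\([a-z]*\) \1/\1/')

# Keep the first word, delete the second.
skip_second=$(echo "alpha beta gamma" | sed 's/\([a-zA-Z]*\) \([a-zA-Z]*\) /\1 /')

echo "$first"; echo "$dedup"; echo "$skip_second"
```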

Tuesday, August 5, 2008

Shell Optimizations

- Check the loops in the script. Time consumed by repetitive operations adds up quickly. If at all possible, remove time-consuming operations from within loops.

- Use builtin commands in preference to system commands. Builtins execute faster and usually do not launch a subshell when invoked.

- Avoid unnecessary commands, particularly in a pipe.

- Use the time and times tools to profile computation-intensive commands.

- Try to minimize file I/O. Bash is not particularly efficient at handling files, so consider using more appropriate tools for this within the script, such as awk or Perl.

- Write your scripts in a modular and coherent form, so they can be reorganized and tightened up as necessary.
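
As one illustration of preferring builtins, parameter expansion can often replace basename and dirname. This is a sketch with a made-up path; the two approaches agree for ordinary paths like this one but can differ on edge cases such as trailing slashes.

```shell
path="/var/log/messages.log"

# External commands: each call starts a new process.
name_ext=$(basename "$path")
dir_ext=$(dirname "$path")

# Parameter expansion: handled by the shell itself.
name_builtin=${path##*/}    # strip the longest prefix ending in '/'
dir_builtin=${path%/*}      # strip the shortest suffix starting with '/'

echo "$name_builtin $dir_builtin"
```

Inside a loop that runs thousands of times, avoiding the two external commands can make a noticeable difference.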

Gotchas

- Assigning reserved words or characters to variable names.

- Using a hyphen or other reserved characters in a variable name (or function name).

- Using the same name for a variable and a function. This can make a script difficult to understand.

- Using whitespace inappropriately. In contrast to other programming languages, Bash can be quite finicky about whitespace.

- Not terminating with a semicolon the final command in a code block within curly brackets.

- Assuming uninitialized variables (variables before a value is assigned to them) are "zeroed out". An uninitialized variable has a value of null, not zero.

- Mixing up = and -eq in a test. Remember, = compares strings and -eq compares integers.

- Misusing string comparison operators.

- Sometimes variables within "test" brackets ([ ]) need to be quoted (double quotes). Failure to do so may cause unexpected behavior. 

- Attempting to use - as a redirection operator (which it is not) will usually result in an unpleasant surprise.

- Using Bash-specific functionality in a Bourne shell script (#!/bin/sh) may cause unexpected behavior. Some Linux systems alias sh to bash, but this does not necessarily hold true for every distribution or for a generic UNIX machine.

- Putting whitespace in front of the terminating limit string of a here document will cause unexpected behavior in a script.

- Putting more than one echo statement in a function whose output is captured.

- A script may not export variables back to its parent process, the shell, or to the environment.

- A related problem occurs when trying to write the stdout of a tail -f piped to grep.
   tail -f /var/log/messages | grep "$ERROR_MSG" >> error.log
# The "error.log" file may stay empty: grep buffers its output when it is not writing to a terminal.

- Using "suid" commands within scripts is risky, as it may compromise system security.

- Bash does not handle the double slash (//) string correctly.
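
A few of these gotchas can be reproduced in a handful of lines. The variable names below are made up for illustration.

```shell
# An unset variable expands to the empty string, not to 0.
unset count
if [ -z "$count" ]; then null_is_empty=yes; else null_is_empty=no; fi

# '=' compares strings, '-eq' compares integers.
if [ "01" = "1" ]; then str_equal=yes; else str_equal=no; fi
if [ "01" -eq "1" ]; then int_equal=yes; else int_equal=no; fi

# A variable set in a child process never reaches the parent.
sh -c 'child_var=set_in_child'
seen_in_parent=${child_var:-unset}

echo "$null_is_empty $str_equal $int_equal $seen_in_parent"
```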


Debug Shell Script

Tools for debugging non-working scripts include

1. Inserting echo statements at critical points in the script to trace the variables, and otherwise give a snapshot of what is going on.

2. Using the tee filter to check processes or data flows at critical points.

3. Setting option flags -n -v -x
    sh -n scriptname checks for syntax errors without actually running the script.
    sh -v scriptname echoes each command before executing it.
    sh -x scriptname echoes the result of each command, but in an abbreviated manner.

4.  Using an "assert" function to test a variable or condition at critical points in a script. 

5. Using the $LINENO variable and the caller builtin.

6. Trapping at exit.
    The exit command in a script triggers a signal 0, terminating the process, that is, the script itself.
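
Two of these tools sketched in a few lines. The assert helper here is a hypothetical minimal version written for this note, not a standard command, and the trap runs inside a child sh so that it can be observed.

```shell
# Hypothetical helper: report when a condition is false.
assert() {
    # $1 = description, remaining arguments = the command to test
    desc=$1; shift
    if ! "$@"; then
        echo "Assertion failed: $desc" >&2
        return 1
    fi
}

value=5
assert "value is positive" [ "$value" -gt 0 ] && assert_ok=yes

# An EXIT trap runs when the shell terminates, which is handy for cleanup.
trap_out=$(sh -c 'trap "echo cleanup ran" EXIT; echo working')
echo "$trap_out"
```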



 

Monday, August 4, 2008

/dev/null

Uses of /dev/null
  Think of /dev/null as a black hole. It is essentially the equivalent of a write-only file. Everything written to it disappears. Attempts to read or output from it result in nothing. All the same, /dev/null can be quite useful from both the command line and in scripts.
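
Some typical uses, as a sketch. The nonexistent path and the scratch file are made up for the demonstration.

```shell
# Discard unwanted output: stdout and stderr both vanish.
status=0
ls /nonexistent-example-path >/dev/null 2>&1 || status=$?   # exit status is still available

# Reading from /dev/null yields immediate end-of-file.
bytes=$(wc -c < /dev/null | tr -d ' ')

# Truncate a file to zero length without deleting it.
scratch=$(mktemp)
echo "old contents" > "$scratch"
cat /dev/null > "$scratch"
size=$(wc -c < "$scratch" | tr -d ' ')
rm -f "$scratch"

echo "status=$status bytes=$bytes size=$size"
```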


/dev and /proc

/dev

- The /dev directory contains entries for the physical devices that may or may not be present in the hardware.

- Among other things, the /dev directory contains loopback devices, such as /dev/loop0. A loopback device is a gimmick that allows an ordinary file to be accessed as if it were a block device. This enables mounting an entire filesystem within a single large file.

- A few of the pseudo-devices in /dev have other specialized uses, such as /dev/null, /dev/zero, /dev/urandom, /dev/sda1, /dev/udp, and /dev/tcp.

- To mount a USB flash drive, append the following line to /etc/fstab.
 /dev/sda1     /mnt/flashdrive    auto   noauto,user,noatime   0 0

- When executing a command on a /dev/tcp/$host/$port pseudo-device file, Bash opens a TCP connection to the associated socket.

- A socket is a communications node associated with a specific I/O port. (This is analogous to a hardware socket, or receptacle, for a connecting cable.) It permits data transfer between hardware devices on the same machine, between machines on the same network, between machines across different networks, and, of course, between machines at different locations on the Internet.

/proc
The /proc directory is actually a pseudo-filesystem. The files in /proc mirror currently running system and kernel processes and contain information and statistics about them.

- It is even possible to control certain peripherals with commands sent to the /proc directory.


- The /proc directory contains subdirectories with unusual numerical names. Every one of these names maps to the process ID of a currently running process.

-   Within each of these subdirectories, there are a number of files that hold useful information about the corresponding process. 

- The stat and status files keep running statistics on the process.
- The cmdline file holds the command-line arguments the process was invoked with.
- The exe file is a symbolic link to the complete path name of the invoking process.
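
A short sketch that reads these three files for the current shell. It assumes a Linux-style /proc and simply reports when none is available; the variable names are made up.

```shell
pid=$$    # the current shell's own process ID

if [ -r "/proc/$pid/status" ]; then
    proc_name=$(sed -n 's/^Name:[[:space:]]*//p' "/proc/$pid/status")
    proc_args=$(tr '\0' ' ' < "/proc/$pid/cmdline")   # arguments are NUL-separated
    proc_exe=$(readlink "/proc/$pid/exe")
    echo "name=$proc_name args=$proc_args exe=$proc_exe"
else
    echo "/proc is not available on this system"
fi
```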



Shell - function

- Like "real" programming languages, Bash has functions, though in a somewhat limited implementation. A function is a subroutine, a code block that implements a set of operations, a "black box" that performs a specified task.

- Wherever there is repetitive code, when a task repeats with only slight variations in procedure, then consider using a function.

function function_name {
    command...
}
or
function_name () {
    command...
}

- A function may be "compacted" into a single line.
fun () { echo "This is a function"; echo; }

- Functions are called, triggered, simply by invoking their names.

Exit and Return
exit status
  Functions return a value, called an exit status. The exit status may be explicitly specified by a return statement, otherwise it is the exit status of the last command in the function (0 if successful, and a non-zero error code if not). This exit status may be used in the script by referencing it as $?. This mechanism effectively permits script functions to have a "return value" similar to C functions.

return
  Terminates a function. A return command optionally takes an integer argument, which is returned to the calling script as the "exit status" of the function, and this exit status is assigned to the variable $?.
   

The largest positive integer a function can return is 255.
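
A sketch of both ways to get a value out of a function: through the exit status for small integers, and through stdout for anything larger than 255. The function names max and square are made up for illustration.

```shell
# Return the larger of two small non-negative integers via the exit status.
max() {
    if [ "$1" -gt "$2" ]; then
        return "$1"
    else
        return "$2"
    fi
}

larger=0
max 33 34 || larger=$?     # a non-zero status counts as "failure", so catch it with ||

# Values above 255 do not fit in an exit status; print them and capture stdout.
square() { echo $(( $1 * $1 )); }
big=$(square 40)

echo "larger=$larger big=$big"
```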

Ex. 

#!/bin/bash
# realname.sh
#
# From username, gets "real name" from /etc/passwd.

ARGCOUNT=1                 # Expect one arg.
E_WRONGARGS=65
file=/etc/passwd
pattern=$1
if [ $# -ne "$ARGCOUNT" ]
then
  echo "Usage: $(basename "$0") USERNAME"
  exit $E_WRONGARGS
fi
file_excerpt ()                   # Scan file for pattern, then print relevant portion of line.
{
    while read -r line # "while" does not necessarily need "[ condition ]"
    do
        echo "$line" | grep "$1" | awk -F":" '{ print $5 }' # Have awk use ":" delimiter.
    done
} <"$file"                          # Redirect into function's stdin.

file_excerpt "$pattern"

What makes a variable local?
        A variable declared as local is one that is visible only within the block of code in which it appears. It has local "scope." In a function, a local variable has meaning only within that function block.

#!/bin/bash
# Global and local variables inside a function.
func ()
{
  local loc_var=23 # Declared as local variable.
  echo # Uses the 'local' builtin.
  echo "\"loc_var\" in function = $loc_var"
  global_var=999 # Not declared as local.
  # Defaults to global.
  echo "\"global_var\" in function = $global_var"
}

func

# Now, to see if local variable "loc_var" exists outside function.
   echo "\"loc_var\" outside function = $loc_var"
   echo "\"global_var\" outside function = $global_var"

- A function that calls itself is said to be recursive (function recursion).
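
A minimal recursive function, assuming a shell that supports the local builtin (Bash does). The factorial function is only an illustration.

```shell
# Recursive factorial; 'local' keeps n and rest out of the caller's scope.
factorial() {
    local n=$1
    if [ "$n" -le 1 ]; then
        echo 1
    else
        local rest
        rest=$(factorial $(( n - 1 )))   # the function calls itself
        echo $(( n * rest ))
    fi
}

result=$(factorial 5)
echo "$result"
```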

Subshells

- Running a shell script launches a new process, a subshell. "A subshell is a process launched by a shell (or shell script)."

- Each shell script running is, in effect, a subprocess (child process) of the parent shell.

- A shell script can itself launch subprocesses. These subshells let the script do parallel processing, in effect executing multiple subtasks simultaneously.

NOTE : In general, an external command in a script forks off a subprocess, whereas a Bash builtin does not. For this reason, builtins execute more quickly than their external command equivalents.

- Variables in a subshell are not visible outside the block of code in the subshell. They are not accessible to the parent process, to the shell that launched the subshell. 

#!/bin/bash
# subshell-test.sh
(
# Inside parentheses, and therefore a subshell . . .
while [ 1 ] # Endless loop.
do
  echo "Subshell running . . ."
done
)

Now, run the script:
sh subshell-test.sh
And, while the script is running, from a different xterm:
ps -ef | grep subshell-test.sh
UID PID PPID C STIME TTY TIME CMD
500 2698 2502 0   14:26 pts/4 00:00:00    sh subshell-test.sh
500 2699 2698 21 14:26 pts/4 00:00:24    sh subshell-test.sh

- A subshell may be used to set up a "dedicated environment" for a command group.

COMMAND1
COMMAND2
COMMAND3
(
  IFS=:
  PATH=/bin
  unset TERMINFO
  set -C
  shift 5
  COMMAND4
  COMMAND5
  exit 3 # Only exits the subshell!
)
# The parent shell has not been affected, and the environment is preserved.
COMMAND6
COMMAND7
As seen here, the exit command only terminates the subshell in which it is running, not the parent shell or script.

- Processes may execute in parallel within different subshells. This permits breaking a complex task into subcomponents processed concurrently.
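
A sketch of two subtasks running concurrently in backgrounded subshells. The jobs and file names are made up; sleep stands in for real work.

```shell
workdir=$(mktemp -d)

# Two independent jobs, each in its own backgrounded subshell.
( sleep 1; echo "job A done" > "$workdir/a.txt" ) &
( sleep 1; echo "job B done" > "$workdir/b.txt" ) &

wait    # block until every background job has finished

results=$(cat "$workdir/a.txt" "$workdir/b.txt")
rm -rf "$workdir"
echo "$results"
```

Both jobs sleep at the same time, so the whole thing takes about one second rather than two.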

- A command block between curly braces does not launch a subshell.
{ command1; command2; command3; . . . commandN; }
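
The difference between the two kinds of grouping shows up as soon as a variable is assigned inside them. The variable name is made up.

```shell
counter=1

{ counter=2; }            # braces: runs in the current shell
after_braces=$counter

( counter=3 )             # parentheses: runs in a subshell, the change is lost
after_parens=$counter

echo "after braces: $after_braces, after parentheses: $after_parens"
```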






Sunday, August 3, 2008

I/O Redirection.

There are always three default "files" open, stdin (the keyboard), stdout (the screen), and stderr (error messages output to the screen).

The file descriptors for stdin, stdout, and stderr are 0, 1, and 2, respectively. For opening additional files, there remain descriptors 3 to 9. It is sometimes useful to assign one of these additional file descriptors to stdin, stdout, or stderr as a temporary duplicate link.

# Single-line redirection commands (affect only the line they are on):
# --------------------------------------------------------------------
1>filename
# Redirect stdout to file "filename."
1>>filename
# Redirect and append stdout to file "filename."
2>filename
# Redirect stderr to file "filename."
2>>filename
# Redirect and append stderr to file "filename."
&>filename
# Redirect both stdout and stderr to file "filename."
2>&1
# Redirects stderr to stdout.
# Error messages get sent to same place as standard output.
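
A sketch that exercises these redirections, including descriptor 3 as a temporary duplicate of stdout. The emit helper and the file names are made up for illustration.

```shell
tmp=$(mktemp -d)
emit() { echo "to stdout"; echo "to stderr" >&2; }   # hypothetical helper

emit >"$tmp/out.txt" 2>"$tmp/err.txt"    # split the two streams
emit >"$tmp/both.txt" 2>&1               # stderr follows stdout into one file

# Descriptor 3 as a temporary duplicate of stdout.
exec 3>&1                 # save the current stdout
exec 1>"$tmp/saved.txt"   # from here on, stdout goes to the file
echo "captured"
exec 1>&3 3>&-            # restore stdout and close descriptor 3

out=$(cat "$tmp/out.txt"); err=$(cat "$tmp/err.txt")
both=$(cat "$tmp/both.txt"); saved=$(cat "$tmp/saved.txt")
rm -rf "$tmp"
echo "$out | $err | $saved"
```

Order matters in the second call: 2>&1 must come after the stdout redirection, otherwise stderr is pointed at the old stdout.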

Friday, August 1, 2008

Extended Regular Expression

The question mark -- ? -- matches zero or one of the previous RE. It is generally used for matching single characters.

The plus -- + -- matches one or more of the previous RE. It serves a role similar to the *, but does not match zero occurrences.
echo a111b | sed -ne '/a1\+b/p'
echo a111b | grep 'a1\+b'
echo a111b | gawk '/a1+b/'

Escaped "curly brackets" -- \{ \} -- indicate the number of occurrences of a preceding RE to match.
"[0-9]\{5\}" matches exactly five digits (characters in the range of 0 to 9).
bash$ echo 2222 | gawk --re-interval '/2{3}/'

2222
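
The same operators with grep -E, which selects extended syntax so that ?, + and {n} need no backslashes. The input lines are made up; -c counts matching lines.

```shell
plus=$(echo a111b | grep -cE 'a1+b')
optional=$(printf 'color\ncolour\ncolouur\n' | grep -cE '^colou?r$')   # color, colour
five=$(printf '1234\n12345\n123456\n' | grep -cE '^[0-9]{5}$')        # 12345 only

echo "$plus $optional $five"
```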

Regular Expressions.

The asterisk -- * -- matches any number of repeats of the character string or RE preceding it, including zero instances.

"1133*" matches 11 + one or more 3's: 113, 1133, 1133333, and so forth.

The dot -- . -- matches any one character, except a newline.

"13." matches 13 followed by any one character (including a space): it matches within 1133 and 11333, but not 13 (additional character missing).
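
The asterisk and the dot can be checked with grep -c on a few made-up lines. The patterns are anchored with ^ and $ so that each line must match as a whole.

```shell
# '*' allows zero or more of the preceding character.
star=$(printf '11\n113\n11333\n' | grep -c '^1133*$')     # 113 and 11333, not 11

# '.' needs exactly one character in that position.
dot=$(printf '13\n133\n13 \n' | grep -c '^13.$')          # 133 and "13 ", not 13

echo "$star $dot"
```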

The caret -- ^ -- matches the beginning of a line, but sometimes, depending on context, negates the meaning of a set of characters in an RE.

The dollar sign -- $ -- at the end of an RE matches the end of a line.

"XXX$" matches XXX at the end of a line.

Brackets -- [...] -- enclose a set of characters to match in a single RE.
"[xyz]" matches the characters x, y, or z.
"[c-n]" matches any of the characters in the range c to n.
"[B-Pk-y]" matches any of the characters in the ranges B to P and k to y.
"[a-z0-9]" matches any lowercase letter or any digit.

Escaped "angle brackets" -- \<...\> -- mark word boundaries.

"\<the\>" matches the word "the," but not the words "them," "there," "other," etc.