Monday, January 21, 2008

Unix cut

The external cut command displays selected columns or fields from each line of a file. It is the UNIX counterpart of the relational algebra projection operation. If its capabilities are not enough, the alternatives are AWK and Perl.

The most typical usage is cutting one of several columns from a file to create a new file. For example

cut -d ' ' -f 2-7

retrieves the second through seventh columns, which must be separated by single blanks. cut can work with positional columns (each starting at a fixed offset) or with separator-delimited fields (with the separator being a blank, comma, colon, etc.). The delimiter is set with the -d option, as in the example above; if -d is omitted, cut uses a tab. It does not consult the shell variable IFS.
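
As a quick illustration of both modes, here is a minimal sketch. The sample line and the variable names are made up for the example:

```shell
# Hypothetical sample line with single blanks between the words.
line='one two three four five six seven eight'

# Field mode: fields 2 through 7, with a blank as the delimiter.
fields=$(printf '%s\n' "$line" | cut -d ' ' -f 2-7)
echo "$fields"

# Column mode: characters 1 through 3, selected by position.
chars=$(printf '%s\n' "$line" | cut -c 1-3)
echo "$chars"
```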

cut is essentially a text parsing tool, and you can use other text parsing tools such as awk or perl for the same task, especially if the separator varies.


Sort and uniq
The other two UNIX commands I've found useful when parsing log files are sort and uniq. Say you want to look at all the pages requested from your site, in alphabetical order. The command would look something like this:

cat myapp_log.20031016 | cut -d' ' -f4 | sort

But that gives you all the pages requested. If you're not interested in all requests, but only the unique pages, whether they were requested once or a million times, then you would just filter through the uniq command:
cat myapp_log.20031016 | cut -d' ' -f4 | sort | uniq
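
To try this without a real log, the sketch below builds a small stand-in file. The log layout (requested page in the fourth blank-separated field) and the file contents are assumptions for illustration only:

```shell
# Hypothetical log: the requested page is the 4th blank-separated field.
log=$(mktemp)
printf '%s\n' \
  '10.0.0.1 - GET /index.html' \
  '10.0.0.2 - GET /about.html' \
  '10.0.0.3 - GET /index.html' > "$log"

# All requested pages in alphabetical order (duplicates kept).
all_pages=$(cut -d ' ' -f 4 "$log" | sort)

# Unique pages only; uniq needs sorted input to collapse duplicates.
unique_pages=$(cut -d ' ' -f 4 "$log" | sort | uniq)
echo "$unique_pages"

rm -f "$log"
```

Passing the file name to cut directly also avoids the extra cat process.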


"cut" lets you select just part of the information from each line of a file. If, for instance, you have a file called "file1" with data in this format:

0001 This is the first line
0002 This is the second

and so on, you can look at just the numbers by typing

cut -c1-4 file1

The "-c" flag selects by character position ("columns"); it will display the first four columns of each line of the file. You can also look at everything but the line numbers:

cut -c6-100 file1

will display the sixth through one hundredth column (if the line is shorter than a hundred characters -- and most will be -- you'll see up to the end of the line).
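
An open-ended range such as -c6- avoids guessing an upper bound like 100. A small sketch, recreating the sample data above in a temporary file (the variable names are just for the example):

```shell
# The sample data from the text, written to a temporary file.
file1=$(mktemp)
printf '%s\n' '0001 This is the first line' '0002 This is the second' > "$file1"

numbers=$(cut -c1-4 "$file1")   # the four-character line numbers
text=$(cut -c6- "$file1")       # column 6 through the end of each line
echo "$text"

rm -f "$file1"
```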
You can also use cut to look at fields instead of columns: for instance, if a file looks like this:

curran:Stuart Curran
jlynch:Jack Lynch
afilreis:Al Filreis
loh:Lucy Oh

you can use cut to find the full name of each person, even though it's not always in the same place on each line. Type

cut -f2 -d: file1

"-f2" means "the second field"; "-d:" means the delimiter (the character that separates the fields) is a colon. To use a space as a delimiter, put it in quotations:

cut -f2 -d" " file1
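
Here is a short sketch of both delimiters applied to the same data, using two of the sample lines above in a temporary file:

```shell
# Two of the sample lines from the text.
people=$(mktemp)
printf '%s\n' 'curran:Stuart Curran' 'jlynch:Jack Lynch' > "$people"

# Colon as delimiter: field 2 is the full name.
names=$(cut -f2 -d: "$people")

# Blank as delimiter: field 1 is "curran:Stuart", so field 2 is the surname.
surnames=$(cut -f2 -d" " "$people")
echo "$names"
echo "$surnames"

rm -f "$people"
```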

There are two modes of the classic Unix cut command:

Column mode -- selects columns of a file (a column is one character position). This variant can act as a generalized substr function. Classic Unix cut cannot count columns from the end of the line the way the Perl substr function can (but rcut can). This type of selection is specified with the -c option. The list can be open (from the beginning, as in -5, or to the end, as in 6-) or closed (as in 6-9).

cut -c 4,5,20 foo # cuts foo at columns 4, 5, and 20.
cut -c 1-5 a.dat | more # print the first 5 characters of every line in the file a.dat

Field mode -- selects fields of a file (by default a field is a group of characters separated by the delimiter, a tab; that can be changed with the -d option, see below). This type of selection is specified with the -f option ( -f [list] ).

cut -d ":" -f1,7 /etc/passwd # cuts fields 1 and 7 from /etc/passwd
cut -d ":" -f 1,6- /etc/passwd # cuts fields 1, 6 to the end from /etc/passwd
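
To see what these two commands print without depending on the local /etc/passwd, the sketch below uses a made-up passwd-style line. Note that the selected fields are joined with the same delimiter in the output:

```shell
# Hypothetical passwd-style entry (7 colon-separated fields).
entry='root:x:0:0:root:/root:/bin/sh'

# Fields 1 and 7: login name and shell.
name_shell=$(printf '%s\n' "$entry" | cut -d ":" -f1,7)

# Field 1 plus field 6 to the end of the line.
name_home_shell=$(printf '%s\n' "$entry" | cut -d ":" -f 1,6-)

echo "$name_shell"
echo "$name_home_shell"
```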

In field mode the delimiter can be specified with the option -d [character]. The default is TAB. If SPACE is the delimiter, be sure to put it in quotes (-d " ").
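Note: Another way to specify a blank (or another shell-sensitive character) is to escape it with a backslash. The following example prints the second blank-separated field of every line in the file /etc/passwd. Note the two spaces after the backslash: the first is the escaped delimiter, the second separates the option from the file name.

% cut -f2 -d\  /etc/passwd | more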

cut can suppress lines that contain no delimiter (the -s option). Unless -s is specified, lines with no delimiter are included in the output untouched.
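
The effect of -s is easiest to see on input where one line lacks the delimiter. A minimal sketch with made-up data:

```shell
# Hypothetical input: the middle line contains no colon.
data=$(printf '%s\n' 'a:b:c' 'no delimiter here' 'x:y:z')

# Without -s the delimiter-free line passes through untouched.
with_all=$(printf '%s\n' "$data" | cut -d: -f2)

# With -s it is dropped.
only_delimited=$(printf '%s\n' "$data" | cut -d: -s -f2)

echo "$with_all"
echo "$only_delimited"
```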

By using pipes and output shell redirection operators you can create new files with a subset of columns or fields contained in the first file.

Sometimes cut is used in shell programming to select certain substrings from a variable, for example:

echo Argument 1 = [$1]
c=`echo $1 | cut -c6-8`
echo Characters 6 to 8 = [$c]

Output:

Argument 1 = [1234567890]
Characters 6 to 8 = [678]

This is one of many ways to perform such a selection, and in many cases AWK is a better tool for the job. If you are selecting fields of a shell variable, you should probably use the set command and echo the desired positional parameter into a pipe.
For complex cases Perl is the way to go. Moreover, several Perl re-implementations of cut exist: see for example cut. Perl implementations are more flexible and less capricious than the C-written original Unix cut command.
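
As a rough sketch of the set approach mentioned above, the snippet below splits a variable into positional parameters and then, for comparison, takes a substring with cut. The variable names and values are made up, and the unquoted expansion assumes a Bourne-style shell with the default IFS:

```shell
# Hypothetical variable holding blank-separated fields.
record='alpha beta gamma delta'

# Split on whitespace into $1, $2, ...; "--" guards against a leading dash.
# The expansion is deliberately unquoted so the shell performs word splitting.
set -- $record
third=$3
echo "Field 3 = [$third]"

# Character selection still needs cut (or similar): characters 6 to 8.
chars=$(printf '%s\n' '1234567890' | cut -c6-8)
echo "Characters 6 to 8 = [$chars]"
```

Keep in mind that set replaces the script's own positional parameters, so save any you still need first.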


Syntax
As I mentioned before, there are two variants of cut: the first is character column cut and the second is delimiter-based (parsing) cut. In both cases the option can be separated from its value by a space, for example

-d ' '

In other words, POSIX and GNU implementations of cut use "almost" standard lexical parsing of arguments, although most examples in books use the "old style" with arguments "glued" to options. The "glued" style of specifying arguments is generally an anachronism. Still, quoting the delimiter does not always work as expected, even in modern versions: for example, most implementations of cut do not interpret the escape sequence \t, so a quoted '\t' is not accepted as a tab and a literal tab character must be supplied instead. You generally need to experiment with your particular implementation.

1. Character column cut

cut -c list [ file_list ]

Option:
-c list Display (cut) the columns specified in list from the input data. Columns are counted from one, not from zero, so the first column is column 1. The list can be separated from the option by space(s), but no spaces are allowed within the list. Multiple values must be comma (,) separated. The list defines the exact columns to display. For example, -c 1,4,7 cuts columns 1, 4, and 7 of the input, and -c -10,50- selects columns 1 through 10 and 50 through end-of-line (please remember that columns are counted from one).
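
A short sketch of column lists on a made-up ten-character line, so the selected positions are easy to check by eye:

```shell
# Hypothetical input: the letter at position N is the Nth letter of the alphabet.
line='abcdefghij'

# Individual columns 1, 4, and 7.
picked=$(printf '%s\n' "$line" | cut -c 1,4,7)

# Open ranges: columns 1 through 3, then 8 through end of line.
ends=$(printf '%s\n' "$line" | cut -c -3,8-)

echo "$picked"
echo "$ends"
```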

2. Delimiter-based (parsing) cut

cut -f list [ -d char ] [ -s ] [ file_list ]

Options:
-d char The character char is used as the field delimiter. It is usually quoted but can be escaped. The default delimiter is a tab character. To use a character that has special meaning to the shell, you must quote the character so the shell does not interpret it. For example, to use a single space as a delimiter, type -d' '.

-f list Selects (cuts) the fields specified in list from the input data. Fields are counted from one, not from zero. No spaces are allowed within the list. Multiple values must be comma (,) separated. The list defines the exact fields to display. The most practically important ranges are "open" ranges, where either the starting field or the last field is not specified explicitly (omitted). For example:

Selection from the beginning of the line to a certain field is specified as -N, where N is the number of the field. For example -f -5

Selection from a certain field to the end of the line (all fields starting from N) is specified as N-. For example -f 5-

Specifications can be complex and combine individual fields and ranges. For example, -f 1,4,7 selects fields 1, 4, and 7, and -f2,4-6,8 selects field 2, fields 4 through 6 (a range), and field 8.
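
The field lists described above can be tried on a made-up comma-separated row in which each field is named after its position:

```shell
# Hypothetical row: field N contains the text "fN".
row='f1,f2,f3,f4,f5,f6,f7,f8,f9'

head_fields=$(printf '%s\n' "$row" | cut -d, -f -3)      # beginning through field 3
tail_fields=$(printf '%s\n' "$row" | cut -d, -f 8-)      # field 8 through the end
mixed=$(printf '%s\n' "$row" | cut -d, -f2,4-6,8)        # field 2, fields 4 to 6, field 8

echo "$head_fields"
echo "$tail_fields"
echo "$mixed"
```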

Limitations

Cut is good only for simple cases. In complex cases AWK and Perl actually save you time.

Limitations are many. Among them:

Delimiters are single characters; they are not regular expressions. This leads to disappointment when you try to parse a blank-delimited file with cut: multiple blanks are counted as multiple field separators.

Syntax is irregular and sometimes tricky. For example, one-character delimiters can be quoted, but escaped delimiters cannot be quoted.

Semantics are the most basic. Cut is essentially a text parser and as such is suitable mainly for parsing colon-delimited and similar files. Its functionality does not even match the level of the Fortran IV FORMAT statement.
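
The first limitation is easy to reproduce. The sketch below uses a made-up line padded with several blanks and shows two common workarounds: squeezing the blanks with tr -s before cut, or letting awk split on runs of whitespace:

```shell
# Hypothetical line with runs of blanks between the values.
line='alice    42   admin'

# Naive cut: every single blank is a separator, so field 2 is empty.
naive=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

# Workaround 1: squeeze repeated blanks into one before cutting.
squeezed=$(printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f 2)

# Workaround 2: awk treats any run of whitespace as one separator by default.
with_awk=$(printf '%s\n' "$line" | awk '{ print $2 }')

echo "naive=[$naive] squeezed=[$squeezed] awk=[$with_awk]"
```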

1 comment:

Cory said...

The "cut" command is great... It allows me to do so many things.. What I am trying to figure out is how to do all the same tricks with perl that I can do with csh. Time to crack the books!