Monday, January 21, 2008

Unix sort

The Unix sort command sorts ASCII files. The input can be from files or the standard input. The output can be a file or the standard output.
Data can be sorted in several ways:

interpreting sorting keys alphabetically or numerically (-n option)
in ascending or descending order (-r option -- sort in reverse)

By default fields in each record (line of input) are delimited by blanks, but you can specify a different delimiter using option -t. You can also sort on character columns.

Selection of sorting keys is similar to selection of fields in cut and is extremely obscure.

There are two types of sorting key definitions in Unix sort:

new (with -k option)
old (+n -n).

New-style sort keys definitions

New-style definitions use the -k option:
-k field_start [type] [,field_end [type] ]

where:
field_start and field_end
define a key field restricted to a portion of the line.

type
is a modifier from the list of characters bdfiMnr. The b modifier behaves like the -b option, but applies only to the field_start or field_end to which it is attached and characters within a field are counted from the first non-blank character in the field. (This applies separately to first_character and last_character.) The other modifiers behave like the corresponding options, but apply only to the key field to which they are attached. They have this effect if specified with field_start, field_end or both. If any modifier is attached to a field_start or to a field_end, no option applies to either.

When there are multiple key fields, later keys are compared only after all earlier keys compare equal. Except when the -u option is specified, lines that otherwise compare equal are ordered as if none of the options -d, -f, -i, -n or -k were present (but with -r still in effect, if it was specified) and with all bytes in the lines significant to the comparison.
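The multi-key rule above can be sketched with a small example (the colon-delimited name:group:uid records below are hypothetical, not taken from any real file): the second key is consulted only where the first keys tie.

```shell
# Hypothetical name:group:uid records
printf 'carol:staff:20\nalice:staff:10\nbob:admin:10\n' > /tmp/users.txt

# Primary key: field 3, numeric; secondary key: field 1, alphabetic.
# alice and bob tie on uid 10, so field 1 decides their relative order.
sort -t ':' -k3,3n -k1,1 /tmp/users.txt
```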

The notation:
-k field_start[type][,field_end[type]]

defines a key field that begins at field_start and ends at field_end inclusive, unless field_start falls beyond the end of the line or after field_end, in which case the key field is empty. A missing field_end means the last character of the line.
A field comprises a maximal sequence of non-separating characters and, in the absence of option -t, any preceding field separator.

The field_start portion of the keydef option-argument has the form:

field_number[.first_character]

Fields and characters within fields are numbered starting with 1. field_number and first_character, interpreted as positive decimal integers, specify the first character to be used as part of a sort key. If .first_character is omitted, it refers to the first character of the field.

The field_end portion of the keydef option-argument has the form:

field_number[.last_character]

The field_number is as described above for field_start. last_character, interpreted as a non-negative decimal integer, specifies the last character to be used as part of the sort key. If last_character evaluates to zero or .last_character is omitted, it refers to the last character of the field specified by field_number.

If the -b option or b type modifier is in effect, characters within a field are counted from the first non-blank character in the field. (This applies separately to first_character and last_character.)
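The effect of the b modifier can be demonstrated on two lines whose second field carries different amounts of leading blanks (a minimal sketch; LC_ALL=C pins down byte-order comparison so the result is reproducible):

```shell
# Two lines: field 2 of the first has three leading blanks, the second has two
printf 'x   b\ny  a\n' > /tmp/pad.txt

# Without b: the key includes the field's leading blanks, and a blank
# compares lower than 'a', so the "x" line sorts first
LC_ALL=C sort -k2,2 /tmp/pad.txt

# With b: leading blanks are skipped, 'a' < 'b', so the "y" line sorts first
LC_ALL=C sort -k2b,2 /tmp/pad.txt
```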

Old-style keys definition
There is also an "old-style" key definition, [+pos1 [-pos2]], which is now formally obsolete but still extremely widely used. It provides functionality equivalent to the -k keydef option.

pos1 and pos2 each have the form m.n optionally followed by one or more of the flags bdfiMnr. A starting position specified by +m.n is interpreted to mean the n+1st character in the m+1st field. A missing .n means .0, indicating the first character of the m+1st field. If the b flag is in effect n is counted from the first non-blank in the m+1st field; +m.0b refers to the first non-blank character in the m+1st field.

A last position specified by -m.n is interpreted to mean the nth character (including separators) after the last character of the mth field. A missing .n means .0, indicating the last character of the mth field. If the b flag is in effect n is counted from the first non-blank in the m+1st field; -m.1b refers to the first non-blank in the m+1st field.

The fully specified +pos1 -pos2 form with type modifiers T and U: +w.xT -y.zU

is equivalent to:

undefined (z==0 & U contains b & -t is present)
-k w+1.x+1T,y.0U (z==0 otherwise)
-k w+1.x+1T,y+1.zU (z > 0)

Implementations support at least nine occurrences of the sort keys (the -k option and obsolescent +pos1 and -pos2) which are significant in command line order. If no sort key is specified, a default sort key of the entire line is used.
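As a sketch of the old/new equivalence, both commands below are meant to use field 2 as the key (the data file is hypothetical). Note that current POSIX and GNU sort implementations have dropped the +pos1 form, so only the -k form is left runnable here:

```shell
printf 'a 3\nb 1\nc 2\n' > /tmp/nums.txt

# Old style (obsolete; rejected by modern GNU sort):
#   sort -n +1 -2 /tmp/nums.txt
# New-style equivalent: "skip one field" means start at field 2
# (old style counts from 0, new style counts from 1)
sort -n -k2,2 /tmp/nums.txt
```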

If the ordering rule options precede the sort key options, they are globally applied to all sort keys. For example, sort -r +2 -3 infile
reverses the order based on field 3. If the ordering rule options are attached to a sort key option, they override the global ordering options for only that sort key. For example,

sort -r +2d -3d infile

sorts field 3 by dictionary comparison but sorts the rest of the line using reverse comparison.

The most common mistake is to forget the -n option when sorting numeric fields. Also, specifying a delimiter (option -t) with an unquoted character after it can be a source of problems; it's better to use single quotes around the character you plan to use as a delimiter, for example -t ':'


Here is a standard example of usage of the sort utility, sorting /etc/passwd file (user database) by UID (the third colon-separated field in the passwd file structure):

sort -t ':' +2 /etc/passwd # incorrect result, the field is numeric
sort -n -t ':' +2 /etc/passwd # order of the numbers is now correct

sort -t ':' -k 3,3n /etc/passwd
sort -t ':' +2 -3n /etc/passwd
sort -n -t ':' -k 3,3 /etc/passwd

See Sorting key definitions and Examples for more details.

Be careful and always test your sorting keys on a small sample before sorting the whole file. You will be surprised how often the result is not what you want.

By default sort sorts the file in ascending order using the entire line as the sorting key. Please note that many Web resources describe this default behavior incorrectly (most often they state that by default sorting is performed on the first field).

The three most important options of Unix sort are -n (numeric key sorting), +n (sorting on the n+1st field, counting from zero) and -r (sort in reverse). For example:
Sort the entire lines as a key: sort
Sort in numeric order: sort -n
Sort on n+1 st field (old style key definition): sort +n

Comparisons are based on one or more sort keys extracted from each line of input. Again, please remember that by default, there is one sort key, the entire input line.

Lines are ordered according to the collating sequence of the current locale. By changing locale you can change the behavior of the sort.
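A quick illustration of locale influence (a minimal sketch; the C locale is always available): in C collation every uppercase letter sorts before any lowercase letter.

```shell
printf 'B\na\nA\nb\n' > /tmp/mixed.txt

# Byte-order collation: the uppercase block precedes the lowercase block
LC_ALL=C sort /tmp/mixed.txt

# A typical UTF-8 locale interleaves case (roughly a A b B); the exact
# order depends on the collation tables installed, so it is left commented:
#   LC_ALL=en_US.UTF-8 sort /tmp/mixed.txt
```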

In Solaris there are two variants of sort: System V version and BSD version. Both have identical options:

/usr/bin/sort [ -cmu ] [ -o output ] [ -T directory ] [ -y [ kmem ]] [ -z recsz ] [ -dfiMnr ] [ -b ] [ -t char ] [ -k keydef ] [ +pos1 [ -pos2 ]] [ file...]

/usr/xpg4/bin/sort [ -cmu ] [ -o output ] [ -T directory ] [ -y [ kmem ]] [ -z recsz ] [ -dfiMnr ] [ -b ] [ -t char ] [ -k keydef ] [ +pos1 [ -pos2 ]] [ file...]

The sort command can (and should) be used in pipes or have its output redirected as desired. Here are some practically important examples that illustrate the use of this utility (for more examples please look into our sort examples collection page):

1. Sort the file and display the sorted file, one page at a time. (If you prefer the standard command "more", you can use this instead of "less". "less" is an enhanced version of "more" - for example, it allows you to move backwards and forwards in the file; to exit "less" , use "q".)
sort file | less

2. Read and sorts infile and display the result on standard output, which has been redirected to outfile:

sort infile > outfile # sorts infile and writes the results to outfile

3. Write sorted output directly to outfile.

sort -o outfile infile # same as in 2, but using the -o option

4. Read and sort infile "in place" writing to the same file

sort -o infile infile # sort file "in place"

5. Sort the file and pipe it to the uniq command, printing a count of identical lines (-c option of uniq)

sort infile | uniq -c

6. Pipe the output of the who command to the input of the sort command:

who | sort

7. Classic "number of instances" analysis of log files:

cat messages | awk '{"select the keywords"}' | sort | uniq -c | sort -nr

In simple cases cut can be used instead of AWK. For example, the following counts distinct visitors from HTTP logs (assuming this is the first field in the logs):

cat http.log | cut -d " " -f 1 | sort | uniq -c | sort -nr

8. Sort the file, then prepend line numbers to each line (not many Unix administrators know that cat can be used to number lines):

sort -n file | cat -n

This can be useful if you want to count the number of lines in which the first entry is in a given range: simply subtract the line numbers corresponding to the beginning and end of the range.

As I mentioned above, by default the sort command uses entire lines as the key. It compares the characters starting with the first, until a non-matching character or the end of the shortest line. Leading blanks (spaces and tabs) are considered valid characters to compare. Thus a line beginning with a space precedes a line beginning with the letter A. If you do not want this effect you need to delete leading spaces beforehand.

Multiple sort keys may be used on the same command line. If two lines have the same value in the same field, sort uses the next set of sort keys to resolve the equal comparison. For example,

sort +4 -5 +1 -2 infile

means to sort based on field 5 (+4 -5). If two lines have the same value in field 5, sort those two lines based on field 2 (+1 -2).

Besides sorting, Unix sort is useful for merging files (option -m). It can also check whether a file is sorted (option -c) and suppress duplicates (option -u):

-c Check whether the given files are already sorted: if they are not all sorted, print an error message and exit with a status of 1.

-m Merge the given files by sorting them as a group. Each input file should already be individually sorted. It always works to sort instead of merge; merging is provided because it is faster, in the case where it works.

-u Suppress all but one occurrence of matching keys.
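The three options can be sketched together on two tiny pre-sorted files (the file names are made up for illustration):

```shell
printf 'a\nc\ne\n' > /tmp/s1.txt
printf 'b\nc\nd\n' > /tmp/s2.txt

# -c: silent with exit status 0 when the file is already sorted
sort -c /tmp/s1.txt && echo 's1 is sorted'

# -m: merge two already-sorted files without a full re-sort
sort -m /tmp/s1.txt /tmp/s2.txt

# -m -u: merge and keep only one copy of the duplicated line 'c'
sort -m -u /tmp/s1.txt /tmp/s2.txt
```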

In case Unix sort does not produce the required results you might want to look into Perl's built-in sort function. If sort is too slow, more memory can be specified on invocation.


The most important options

The following list describes the options and their arguments that may be used to control how sort functions.

- Forces sort to read from the standard input. Useful for reading from pipes and files simultaneously.

-c Verifies that the input is sorted according to the other options specified on the command line. If the input is sorted correctly then no output is produced. If the input is not sorted then sort informs you of the situation. The message resembles this:

sort: disorder: This line not in sorted order.

-m Merges the sorted input. sort assumes the input is already sorted. sort normally merges input as it sorts. This option informs sort that the input is already sorted, thus sort runs much faster.

-o output Sends the output to file output instead of the standard output. The output file may be the same name as one of the input files.

-u Suppress all but one occurrence of matching keys. Normally, the entire line is the key. If field or character keys are specified, then the suppressing is done based on the keys.

-y kmem Use kmem kilobytes of main memory to initially start the sorting. If more memory is needed, sort automatically requests it from the operating system. The amount of memory allocated for the sort impacts the speed of the sort significantly. If no kmem is specified, sort starts with the default amount of memory (usually 32K). The maximum (usually 1 Megabyte) amount of memory may be allocated if needed. If 0 is specified for kmem, the minimum (usually 16K) amount of memory is allocated.

-z recsz Specifies the record size used to store each line. Normally the recsz is set to the longest line read during the sort phase. If the -c or -m options are specified, the sort phase is not performed and thus the record size defaults to a system size. If this default size is not large enough, sort may abort during the merge phase. To alleviate this problem you can specify a recsz that will allow the merge phase to run without aborting.


Ordering Options

-d Specifies dictionary sort. Only blanks, digits, and alphabetic characters are significant in the comparison.

-f Fold lowercase letters to uppercase. Ignores the difference between upper and lowercase ASCII characters.

-i Ignore characters outside the ASCII range of 040 (octal) and 0176 (octal). Only alphabetic characters, blanks, digits, and punctuation are used for comparison (printable characters). Control characters are ignored. This is only valid for nonnumeric sorts.

-M Compare fields as months. The first three nonblank characters are folded (see -i option) to uppercase and compared. Thus January is compared as JAN. JAN precedes FEB, and fields not containing months precede JAN. The -b option is placed in effect automatically.

-n Sorts the input numerically. The comparison is based on numerical value instead of alphabetic value. The number field used for comparison can contain optional blanks, optional minus signs, optional decimal point, and zero or more digits. The -b option is placed in effect automatically. Exponential numbers are not sorted correctly.

-r Reverse the order of the output.


Old-style Sort Key Options

+pos1
Specifies the beginning position of the input line used for field comparison. If pos1 is not specified then comparison begins at the beginning of the line. The pos1 position has the notation of f.c. The f specifies the number of fields to skip. The c specifies the number of characters to skip. For example, 3.2 is interpreted as skip three fields and two characters before performing comparisons. Omitting the .c portion is equivalent to specifying .0. Field one is referred to as position 0. If f is set to 0 then character positions are used for comparison.

-pos2
Specifies the ending position of the input line used for field comparison. If pos2 is not specified then comparison is done through the end of the line. The pos2 position has the notation of f.c. The f specifies to compare through field f. The c specifies the number of characters to compare through after field f. For example, -4.3 is interpreted as compare through three characters after the end of field four. Omitting the .c portion is equivalent to specifying .0.

-b
Ignores leading blanks when using the restricted sort key positions (+pos1 and -pos2). If the -b option is placed before the first +pos1 argument, then it applies to all +pos1 arguments. The -b option can be attached to each pos string to affect only that field.

-t c
Use character c as the field separator. Multiple adjacent c's in the records are interpreted as empty fields surrounded by separators. If you need to use multiple characters as a separator, you need to convert the records with sed or AWK so that the separator is represented by a single character. You can use a nonprintable character as the separator to ensure its uniqueness.
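A sketch of the multi-character separator workaround described above (the "::" separator and the data are hypothetical): squeeze the separator down to a single character with sed, then sort on it.

```shell
# Records with a hypothetical two-character separator "::"
printf 'x::30\ny::10\nz::20\n' > /tmp/multi.txt

# sort -t takes a single character, so reduce '::' to ':' first,
# then sort numerically on the second field
sed 's/::/:/' /tmp/multi.txt | sort -t ':' -k2,2n
```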


Old News ;-)
[Jul 14, 2007] My SysAd Blog -- UNIX Sort Files by Their Filesizes
Here's a convenient way of finding those space hogs in your home directory (can be any directory). For me, those large files are usually a result of mkfile event (testing purposes) and can be promptly deleted. Here's an example of its use.

# cd /export/home/esofthub
# ls -l | sort +4n | awk '{print $5 "\t" $9}'

Find recursively (a little awkward)
# ls -lR | sort +4n | awk '{print $5 "\t" $9}' | more



ps -ef | sort

This command pipeline sorts the output of the "ps -ef" command. Because no arguments are supplied to the sort command, the output is sorted in alphabetic order by the first column of the ps -ef output (i.e., the output is sorted alphabetically by username).

ls -al | sort +4n

This command performs a numeric sort on the fifth column of the "ls -al" output. This results in a file listing where the files are listed in ascending order, from smallest in size to largest in size.

ls -al | sort +4n | more

The same command as the previous, except the output is piped into the more command. This is useful when the output will not all fit on one screen.

ls -al | sort +4nr

This command reverses the order of the numeric sort, so files are listed in descending order of size, with the largest file listed first, and the smallest file listed last.

[Feb 15, 2007] UNIX Disk Usage Simplifying Analysis with sort
The output of du has been very informative, but it's difficult to scan a listing to ascertain the four or five largest directories, particularly as more and more directories and files are included in the output. The good news is that the Unix sort utility is just the tool we need to sidestep this problem.
# du -s * | sort -nr
13984 Lynx
10464 IBM
3092 Gator
412 bin
196 DEMO
84 etcpasswd
76 CBO_MAIL
48 elance
36 CraigsList
16 Exchange
4 gettermsheet.sh
4 getstocks.sh
4 getmodemdriver.sh
4 buckaroo
4 browse.sh
4 badjoke.rot13
4 badjoke
0 gif.gif

One final concept and we're ready to move along. If you want to only see the five largest files or directories in a specific directory, all that you'd need to do is pipe the command sequence to head:

# du -s * | sort -nr | head -5
13984 Lynx
10464 IBM
3092 Gator
412 bin
196 DEMO

The ! command (pronounced "bang") creates a temporary file to be used with a program that requires a filename in its command line. This is useful with shells that don't support process substitution. For example, to diff two files after sorting them, you might do:

diff `! sort file1` `! sort file2`


Unix cut

The external cut command displays selected columns or fields from each line of a file. It is the Unix equivalent of the relational algebra projection operation. If its capabilities are not enough, the alternatives are AWK and Perl.

The most typical usage is cutting one of several columns from a file to create a new file. For example

cut -d ' ' -f 2-7

retrieves columns 2 through 7, which should be separated by blanks. Cut can work with character-column positions (each column starts at a certain offset) or separator-delimited fields (with the field separator being a blank, comma, colon, etc.). By default, cut uses a tab as the delimiter; a different one can be specified with the -d option, as in the example above.

cut is essentially a text parsing tool and you can also use other text parsing tools like awk or perl for the same task especially if the separator varies.
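For instance, when fields are separated by a varying number of blanks, awk's default splitting handles what cut cannot (a minimal sketch with made-up data):

```shell
# awk splits on runs of whitespace, so the uneven spacing below
# does not shift the field numbering the way it would with cut
printf 'alice   10\nbob 20\n' | awk '{print $2}'
```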


Sort and uniq
The other two UNIX commands I've found useful when parsing log files are sort and uniq. Say you want to look at all the pages requested from your site, in alphabetical order. The command would look something like this:

cat myapp_log.20031016 | cut -d' ' -f4 | sort

But that gives you all the pages requested. If you're not interested in all requests, but only the unique pages, whether they were requested once or a million times, then you would just filter through the uniq command:
cat myapp_log.20031016 | cut -d' ' -f4 | sort | uniq


"cut" lets you select just part of the information from each line of a file. If, for instance, you have a file called "file1" with data in this format:

0001 This is the first line
0002 This is the second

and so on, you can look at just the numbers by typing

cut -c1-4 file1

The "-c" flag means "columns"; it will display the first four columns of each line of the file. You can also look at everything but the line numbers:

cut -c6-100 file1

will display the sixth through one hundredth column (if the line is less than a hundred characters -- and most will be -- you'll see up to the end of the line).
You can also use cut to look at fields instead of columns: for instance, if a file looks like this:

curran:Stuart Curran
jlynch:Jack Lynch
afilreis:Al Filreis
loh:Lucy Oh

you can use cut to find the full name of each person, even though it's not always in the same place on each line. Type

cut -f2 -d: file1

"-f2" means "the second field"; "-d:" means the delimiter (the character that separates the fields) is a colon. To use a space as a delimiter, put it in quotations:

cut -f2 -d" " file1

There are two modes for classic Unix cut command:

Column mode -- selects columns of a file (a column is one character position). This variant can act as a generalized substr function, although classic Unix cut cannot count columns from the end of the line the way the Perl substr function can (a Perl re-implementation such as rcut can). This type of selection is specified with the -c option. A list can be open (from the beginning, like -5, or to the end, like 6-) or closed (like 6-9).

cut -c 4,5,20 foo # cuts foo at columns 4, 5, and 20.
cut -c 1-5 a.dat | more # print the first 5 characters of every line in the file a.dat

Field mode -- selects fields of a file. (By default a field is a delimiter-separated group of characters, with tab as the delimiter; that can be changed using option -d, see below.) This type of selection is specified with the -f option ( -f list )

cut -d ":" -f1,7 /etc/passwd # cuts fields 1 and 7 from /etc/passwd
cut -d ":" -f 1,6- /etc/passwd # cuts fields 1, 6 to the end from /etc/passwd

In field mode the delimiter can be specified for fields with option -d [character] The default is TAB. If SPACE is the delimiter, be sure to put it in quotes (-d " ").
Note: Another way to specify blank (or other shell-sensitive character) is to use \ -- the following example prints the second field of every line in the file /etc/passwd

% cut -f2 -d\  /etc/passwd | more

cut can suppress lines that contain no delimiter (-s option). Unless this is specified, lines with no delimiters are passed through to the output untouched.
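A minimal sketch of the -s behavior on a mixed file (the data is hypothetical):

```shell
printf 'a:1\nplainline\nb:2\n' > /tmp/mix.txt

# Without -s the delimiter-free line is passed through untouched
cut -d ':' -f2 /tmp/mix.txt

# With -s such lines are suppressed entirely
cut -d ':' -f2 -s /tmp/mix.txt
```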

By using pipes and output shell redirection operators you can create new files with a subset of columns or fields contained in the first file.

Sometimes cut is used in shell programming to select certain substrings from a variable, for example:

echo Argument 1 = [$1]
c=`echo $1 | cut -c6-8`
echo Characters 6 to 8 = [$c]

Output:

Argument 1 = [1234567890]
Characters 6 to 8 = [678]

This is one of many ways to perform such a selection, and in many cases AWK is a better tool for the job. If you are selecting fields of a shell variable, you should probably use the set command and echo the desired positional parameter into the pipe.
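The set-based approach can be sketched as follows (the variable contents are made up for illustration):

```shell
# Word-split the variable into positional parameters, then pick the one needed
line='alice 10 admin'
set -- $line          # unquoted on purpose: let the shell split on IFS
echo "$2"             # the second blank-separated field
```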
For complex cases Perl is the way to go. Moreover, several Perl re-implementations of cut exist: see for example cut. Perl implementations are more flexible and less capricious than the C-written original Unix cut command.


Syntax
As I mentioned before, there are two variants of cut: the first is character-column cut and the second is delimiter-based (parsing) cut. In both cases the option can be separated from its value by a space, for example

-d ' '

In other words, POSIX and GNU implementations of cut use "almost" standard lexical parsing of arguments, although most examples in books use the "old style" with arguments "glued" to options. The "glued" style of specifying arguments is generally an anachronism. Still, quoting the delimiter might not always be possible even in modern versions; for example, most implementations of cut require that the tab delimiter (\t) be specified without quotes. You generally need to experiment with your particular implementation.

1. Character column cut

cut -c list [ file_list ]

Option:
-c list Display (cut) the columns, specified in list, from the input data. Columns are counted from one, not from zero, so the first column is column 1. The list can be separated from the option by space(s), but no spaces are allowed within the list. Multiple values must be comma (,) separated. The list defines the exact columns to display. For example, the -c 1,4,7 notation cuts columns 1, 4, and 7 of the input. The -c -10,50- notation would select columns 1 through 10 and 50 through end-of-line (please remember that columns are counted from one).

2. Delimiter-based (parsing) cut

cut -f list [ -d char ] [ -s ] [ file_list ]

Options:
-d char The character char is used as the field delimiter. It is usually quoted but can be escaped. The default delimiter is a tab character. To use a character that has special meaning to the shell, you must quote the character so the shell does not interpret it. For example, to use a single space as a delimiter, type -d' '.

-f list Selects (cuts) fields, specified in list, from the input data. Fields are counted from one, not from zero. No spaces are allowed within the list. Multiple values must be comma (,) separated. The list defines the exact fields to display. The most practically important ranges are "open" ranges, where either the starting field or the last field is not specified explicitly (omitted). For example:

Selection from the beginning of the line to a certain field is specified as -N, where N is the number of the field. For example -f -5

Selection from a certain field to the end of the line (all fields starting from N) is specified as N-. For example -f 5-

Specifications can be complex and include both selected fields and ranges. For example, -f 1,4,7 would select fields 1, 4, and 7. The -f 2,4-6,8 would select field 2, fields 4 to 6 (a range), and field 8.

Limitations

Cut is good only for simple cases. In complex cases AWK and Perl actually save you time.

Limitations are many. Among them:

Delimiters are single characters; they are not regular expressions. This leads to disappointment when you try to parse a blank-delimited file with cut: multiple blanks are counted as multiple field separators.

Syntax is irregular and sometimes tricky. For example, one-character delimiters can be quoted, but escaped delimiters cannot be quoted.

Semantics are the most basic. Cut is essentially a text parser and as such is suitable mainly for parsing colon-delimited and similar files. Its functionality does not even match the level of a Fortran IV FORMAT statement.
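The first limitation has a standard workaround: squeeze runs of blanks into a single separator before cutting (a minimal sketch using tr -s; the data is hypothetical):

```shell
# Two blanks between the fields: cut sees an empty second field
printf 'a  b\n' | cut -d ' ' -f2

# Squeezing repeated blanks first restores the intended field numbering
printf 'a  b\n' | tr -s ' ' | cut -d ' ' -f2
```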