Monday, January 21, 2008

Unix sort

The Unix sort command sorts ASCII files. The input can be from files or the standard input. The output can be a file or the standard output.
Data can be sorted in serveral ways:

interpreting sorting keys alphabetically or numerically (-n option)
in ascending or descending order (-r --sort in reverse option)

By default fields in each record(line of input) are delimited by blanks, but you can specify a different delimiter using option -t. You can also character columns.

Selection of sorting keys is similar to selection of fields in cut and is extremely obscure.

There are two types of sorting keys definition in Unix sort:

new (with -k option)
old (+n -n).

New-style sort keys definitions

New style definition use -k option:
-k field_start [type] [,field_end [type] ]

where:
field_start and field_end
define a key field restricted to a portion of the line.

type
is a modifier from the list of characters bdfiMnr. The b modifier behaves like the -b option, but applies only to the field_start or field_end to which it is attached and characters within a field are counted from the first non-blank character in the field. (This applies separately to first_character and last_character.) The other modifiers behave like the corresponding options, but apply only to the key field to which they are attached. They have this effect if specified with field_start, field_end or both. If any modifier is attached to a field_start or to a field_end, no option applies to either.

When there are multiple key fields, later keys are compared only after all earlier keys compare equal. Except when the -u option is specified, lines that otherwise compare equal are ordered as if none of the options -d, -f, -i, -n or -k were present (but with -r still in effect, if it was specified) and with all bytes in the lines significant to the comparison.

The notation:
-k field_start[type][,field_end[type]]

defines a key field that begins at field_start and ends at field_end inclusive, unless field_start falls beyond the end of the line or after field_end, in which case the key field is empty. A missing field_end means the last character of the line.
A field comprises a maximal sequence of non-separating characters and, in the absence of option -t, any preceding field separator.

The field_start portion of the keydef option-argument has the form:

field_number[.first_character]

Fields and characters within fields are numbered starting with 1. field_number and first_character, interpreted as positive decimal integers, specify the first character to be used as part of a sort key. If .first_character is omitted, it refers to the first character of the field.

The field_end portion of the keydef option-argument has the form:

field_number[.last_character]

The field_number is as described above for field_start. last_character, interpreted as a non-negative decimal integer, specifies the last character to be used as part of the sort key. If last_character evaluates to zero or .last_character is omitted, it refers to the last character of the field specified by field_number.

If the -b option or b type modifier is in effect, characters within a field are counted from the first non-blank character in the field. (This applies separately to first_character and last_character.)

Old-style keys definition
There is also "old-style" key definition[+pos1 [-pos2]] that are now formally obsolite, but still is extremly widely used. Provide functionality equivalent to the -kkeydef option.

pos1 and pos2 each have the form m.n optionally followed by one or more of the flags bdfiMnr. A starting position specified by +m.n is interpreted to mean the n+1st character in the m+1st field. A missing .n means .0, indicating the first character of the m+1st field. If the b flag is in effect n is counted from the first non-blank in the m+1st field; +m.0b refers to the first non-blank character in the m+1st field.

A last position specified by -m.n is interpreted to mean the nth character (including separators) after the last character of the mth field. A missing .n means .0, indicating the last character of the mth field. If the b flag is in effect n is counted from the last leading blank in the m+1st field; -m.1b refers to the first non-blank in the m+1st field.

The fully specified +pos1 -pos2 form with type modifiers T and U: +w.xT -y.zU

is equivalent to:

undefined (z==0 & U contains b & -t is present)
-k w+1.x+1T,y.0U (z==0 otherwise)
-k w+1.x+1T,y+1.zU (z > 0)

Implementations support at least nine occurrences of the sort keys (the -k option and obsolescent +pos1 and -pos2) which are significant in command line order. If no sort key is specified, a default sort key of the entire line is used.

If the ordering rule options precede the sort key options, they are globally applied to all sort keys. For example, sort -r +2 -3 infile
reverses the order based on field 3. If the ordering rule options are attached to a sort key option, they override the global ordering options for only that sort key. For example,

sort -r +2d -3d infile

sorts field 3 by dictionary comparison but sorts the rest of the line using reverse comparison.

The most common mistake is to forget to use -n option for sorting numeric fields. Also specifying delimiter (option -t) with an unquoted character after it can be a source of problems; it's better to use single quotes around the character that you plan to use as a delimiter. for example -t ':'

The most common mistake is to forget to use -n option for sorting numeric fields

Here is a standard example of usage of the sort utility, sorting /etc/passwd file (user database) by UID (the third colon-separated field in the passwd file structure):

sort -t ':' +2 /etc/passwd # incorrect result, the field is numeric
sort -n -t ':' +2 /etc/passwd # order of the numbers is now correct

sort -t ':' -k 3,3n /etc/passwd
sort -t ':' +2 -3n /etc/passwd
sort -n -t ':' -k 3,3 /etc/passwd

See Sorting key definitions and Examples for more details. Generally you will be surprised how often the result is not what you want due to the obscurity of the definitions

Be careful and always test your sorting keys on a small sample before sorting the whole file. You will be surprised how often the result is not what you want.

By default sort sorts the file in ascending order using the entire line as a sorting key. Please note that a lot of WEB resources interpret this sort utility behavior incorrectly (most often they state that by default sorting is performed on the first key).

The three most important options of Unix soft are -n (numeric keys sorting), +n (sorting using n-th field, counting from zero) and -r (sort in reverse). For example:
Sort the entire lines as a key: sort
Sort in numeric order: sort -n
Sort on n+1 st field (old style key definition): sort +n

Comparisons are based on one or more sort keys extracted from each line of input. Again, please remember that by default, there is one sort key, the entire input line.

Lines are ordered according to the collating sequence of the current locale. By changing locale you can change the behavior of the sort.

In Solaris there are two variants of sort: System V version and BSD version. Both have identical options:

/usr/bin/sort [ -cmu ] [ -o output ] [ -T directory ] [ -y [ kmem ]] [ -z recsz ] [ -dfiMnr ] [ - b ] [ t char ] [ -k keydef ] [ +pos1 [ -pos2 ]] [ file...]

/usr/xpg4/bin/sort [ -cmu ] [ -o output ] [ -T directory ] [ -y [ kmem ]] [ -z recsz ] [ - dfiMnr ] [ -b ] [ -t char ] [ -k keydef ] [ +pos1 [ -pos2 ]] [ file...]

The sort command can (and should) be used in pipes or have its output redirected as desired. Here are some practically important examples that illustrates using of this utility (for more examples please look into our sort examples collection page):

1. Sort the file and display the sorted file, one page at a time. (If you prefer the standard command "more", you can use this instead of "less". "less" is an enhanced version of "more" - for example, it allows you to move backwards and forwards in the file; to exit "less" , use "q".)
sort file less

2. Read and sorts infile and display the result on standard output, which has been redirected to outfile:

sort infile > outfile # sorts infile and writes the results to outfile

3. Write sorted output directly to outfile.

sort -o outfile infile # same as in 1, but using an option -o

4. Read and sort infile "in place" writing to the same file

sort -o infile infile # sort file "in place"

5. Sort the file and pipe it to uniq command with the number of identical keys counter printed (-c option in uniq)

sort infile uniq -c

6. Pipe the output of the who command to the input of the sort command:

who sort

7. Classic "number of instances" analysis of log files:

cat messages awk '{"select the keywords"}' sort uniq -c sort -nr

In simple cases cut can be used instead of AWK. For example the following example couts distinc visitors from HTTP logs (assuming this is the first field in the logs):

cat http.log cut -d " " -f 1 sort uniq -c sort -nr

8. Sort the file, then prepend line numbers to each line (not many Unix adminitrators know that cat can be used to number lines):

sort -n file cat -n

This can be useful if you want to count the number of lines in which the first entry is in a given range: simply subtract the line numbers corresponding to the beginning and end of the range.

As I mentioned about by default the sort command uses entire lines as a key. It compares the characters starting with the first, until non-matching character or the end of the shortest line. Leading blanks (spaces and tabs) are considered valid characters to compare. Thus a line beginning with a space precedes a line beginning with the letter A. If you do not want this effect you need to delete leading spaces beforehand.

Multiple sort keys may be used on the same command line. If two lines have the same value in the same field, sort uses the next set of sort keys to resolve the equal comparison. For example,

sort +4 -5 +1 -2 infile

means to sort based on field 5 (+4 -5). If two lines have the same value in field 5, sort those two lines based on field 2 (+1 -2).

Beside sorting Unix sort is useful for merging files (option -m). It can also checked whether the file is sorted or not (option -c). It can also suppress duplicates (option -u):

-c Check whether the given files are already sorted: if they are not all sorted, print an error message and exit with a status of 1.

-m Merge the given files by sorting them as a group. Each input file should already be individually sorted. It always works to sort instead of merge; merging is provided because it is faster, in the case where it works.

-u Suppress all but one occurrence of matching keys.

In case Unix sort does not produce the required results you might want to look into Perl built-in function. If it is too slow more memory can be specified on invocation.


The most important options

The following list describes the options and their arguments that may be used to control how sort functions.

- Forces sort to read from the standard input. Useful for reading from pipes and files simultaneously.

-c Verifies that the input is sorted according to the other options specified on the command line. If the input is sorted correctly then no output is provided. If the input is not sorted then sort informs you of the situation. The message resembles this.

sort: disorder: This line not in sorted order.

-m Merges the sorted input. sort assumes the input is already sorted. sort normally merges input as it sorts. This option informs sort that the input is already sorted, thus sort runs much faster.

-o output Sends the output to file output instead of the standard output. The output file may be the same name as one of the input files.

-u Suppress all but one occurrence of matching keys. Normally, the entire line is the key. If field or character keys are specified, then the suppressing is done based on the keys.

-y kmem Use kmem kilobytes of main memory to initially start the sorting. If more memory is needed, sort automatically requests it from the operating system. The amount of memory allocated for the sort impacts the speed of the sort significantly. If no kmem is specified, sort starts with the default amount of memory (usually 32K). The maximum (usually 1 Megabyte) amount of memory may be allocated if needed. If 0 is specified for kmem, the minimum (usually 16K) amount of memory is allocated.

-z recsz Specifies the record size used to store each line. Normally the recsz is set to the longest line read during the sort phase. If the -c or -m options are specified, the sort phase is not performed and thus the record size defaults to a system size. If this default size is not large enough, sort may abort during the merge phase. To alleviate this problem you can specify a recsz that will allow the merge phase to run without aborting.


Ordering Options

-d Specifies dictionary sort. Only blanks, digits, and alphabetic characters are significant in the comparison.

-f Fold lowercase letters to uppercase. Ignores the difference between upper and lowercase ASCII characters.

-i Ignore characters outside the ASCII range of 040 (octal) and 0176 (octal). Only alphabetic characters, blanks, digits, and punctuation are used for comparison (printable characters). Control characters are ignored. This is only valid for nonnumeric sorts.

-M Compare fields as months. The first three nonblank characters are folded (see -i option) to uppercase and compared. Thus January is compared as JAN. JAN precedes FEB, and fields not containing months precede JAN. The -b option is placed in effect automatically.

-n Sorts the input numerically. The comparison is based on numerical value instead of alphabetic value. The number field used for comparison can contain optional blanks, optional minus signs, optional decimal point, and zero or more digits. The -b option is placed in effect automatically. Exponential numbers are not sorted correctly.

-r Reverse the order of the output.


Old-style Sort Key Options

+pos1
Specifies the beginning position of the input line used for field comparison. If pos1 is not specified then comparison begins at the beginning of the line. The pos1 position has the notation of f.c. The f specifies the number of fields to skip. The c specifies the number of characters to skip. For example, 3.2 is interpreted as skip three fields and two characters before performing comparisons. Omitting the .c portion is equivalent to specifying .0. Field one is referred to as position 0. If f is set to 0 then character positions are used for comparison.

-pos2
Specifies the ending position of the input line used for field comparison. If pos2 is not specified then comparison is done through the end of the line. The pos2 position has the notation of f.c. The f specifies to compare through field f. The c specifies the number of characters to compare through after field f. For example, -4.3 is interpreted as compare through three characters after the end of field four. Omitting the .c portion is equivalent to specifying .0.

-b
Ignores leading blanks when using the restricted sort key positions (+pos1 and -pos2). If the -b option is placed before the first +pos1 argument, then it applies to all +pos1 arguments. The -b option can be attached to each pos string to affect only that field.

-t c
Use character c as the field separator. Multiple adjacent c's in the records are interpreted empty fields surrounded by separators. If you need to use multiple characters as a separator you need to convert record so that they are represented by a single character using sed or AWK. You can use nonprintable characters as a separator to ensure their uniqueness.


Old News ;-)
[Jul 14, 2007] My SysAd Blog -- UNIX Sort Files by Their Filesizes
Here's a convenient way of finding those space hogs in your home directory (can be any directory). For me, those large files are usually a result of mkfile event (testing purposes) and can be promptly deleted. Here's an example of its use.

#cd /export/home/esofthub
#ls -l sort +4n awk '{print $5 "\t" $9}'

Find recursively (a little awkward)
#ls -lR sort +4n awk '{print $5 "\t" $9}' more



ps -ef sort

This command pipeline sorts the output of the "ps -ef" command. Because no arguments are supplied to the sort command, the output is sorted in alphabetic order by the first column of the ps -ef output (i.e., the output is sorted alphabetically by username).

ls -al sort +4n

This command performs a numeric sort on the fifth column of the "ls -al" output. This results in a file listing where the files are listed in ascending order, from smallest in size to largest in size.

ls -al sort +4n more

The same command as the previous, except the output is piped into the more command. This is useful when the output will not all fit on one screen.

ls -al sort +4nr

This command reverses the order of the numeric sort, so files are listed in descending order of size, with the largest file listed first, and the smallest file listed last.

[Feb 15, 2007] UNIX Disk Usage Simplifying Analysis with sort
The output of du has been very informative, but it's difficult to scan a listing to ascertain the four or five largest directories, particularly as more and more directories and files are included in the output. The good news is that the Unix sort utility is just the tool we need to sidestep this problem.
# du -s * sort -nr
13984 Lynx
10464 IBM
3092 Gator
412 bin
196 DEMO
84 etcpasswd
76 CBO_MAIL
48 elance
36 CraigsList
16 Exchange
4 gettermsheet.sh
4 getstocks.sh
4 getmodemdriver.sh
4 buckaroo
4 browse.sh
4 badjoke.rot13
4 badjoke
0 gif.gif

One final concept and we're ready to move along. If you want to only see the five largest files or directories in a specific directory, all that you'd need to do is pipe the command sequence to head:

# du -s * sort -nr head -5
13984 Lynx
10464 IBM
3092 Gator
412 bin
196 DEMO

The ! command (pronounced "bang") creates a temporary file to be used with a program that requires a filename in its command line. This is useful with shells that don't support process substitution. For example, to diff two files after sorting them, you might do:

diff `! sort file1` `! sort file2`

This command pipeline sorts the output of the "ps -ef" command. Because no arguments are supplied to the sort command, the output is sorted in alphabetic order by the first column of the ps -ef output (i.e., the output is sorted alphabetically by username).

ls -al sort +4n

This command performs a numeric sort on the fifth column of the "ls -al" output. This results in a file listing where the files are listed in ascending order, from smallest in size to largest in size.

ls -al sort +4n more

The same command as the previous, except the output is piped into the more command. This is useful when the output will not all fit on one screen.

ls -al sort +4nr

This command reverses the order of the numeric sort, so files are listed in descending order of size, with the largest file listed first, and the smallest file listed last.

No comments: