This show has been flagged as Clean by the host.
This series is dedicated to exploring little-known—and occasionally useful—trinkets lurking in the dusty corners of UNIX-like operating systems.
Most users of UNIX-like systems are probably familiar with the diff utility. It is widely used with source code to compare two files and see what the differences are between them. Non-programmers, like me, also use it to examine what has changed in different versions of scripts or configuration files. Quite a few pieces of newer software can compare different versions of data and express changes in a format either identical to or similar to diff output.
However, there are two other long-standing tools for this purpose that are far less known and, in my view, deserve to be termed UNIX Curios. The first of these is cmp [1]. While diff is primarily intended to be used on text files and compares them line by line, cmp compares files byte by byte. In my experience, its main use is to see whether two binary files are in fact identical: if they are, cmp outputs nothing and returns an exit status of 0. Back when methods of transferring files were not as reliable as they are today, this was a tool I would sometimes reach for. For example, you could use it to confirm that the data on a CD-ROM you burned was the same as the original.
If there is a difference between the files, cmp will return an exit status of 1. By default, it will also print the location (byte and line number) of the first differing byte. When used with the -l option, it will print the location and value of every byte that differs. There is one exception to both behaviors: if the files are the same except that one is shorter than the other, it will print a message to that effect. The exit status will still be 1 in that case.
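To make those cases concrete, here is a minimal sketch that runs cmp against a few throwaway files. The file names are made up for illustration, and it assumes mktemp is available on your system. The exact wording of the messages varies between implementations, so the script only records the exit statuses.

```shell
#!/bin/sh
# Sketch: exercising cmp on small scratch files (all names are hypothetical).
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

printf 'hello world\n' > "$tmpdir/a"
printf 'hello world\n' > "$tmpdir/same"
printf 'hellO wOrld\n' > "$tmpdir/b"      # differs from a at two bytes
printf 'hello wor'     > "$tmpdir/short"  # a truncated copy of a

# Identical files: no output, exit status 0.
cmp "$tmpdir/a" "$tmpdir/same"
status_same=$?

# Different files: reports the first differing byte, exit status 1.
cmp "$tmpdir/a" "$tmpdir/b"
status_diff=$?

# -l lists every differing byte: its position, then the two byte values in octal.
cmp -l "$tmpdir/a" "$tmpdir/b"
status_list=$?

# One file is a prefix of the other: an end-of-file message, exit status still 1.
cmp "$tmpdir/a" "$tmpdir/short"
status_short=$?

echo "same=$status_same diff=$status_diff list=$status_list short=$status_short"
```

The last line should read same=0 diff=1 list=1 short=1.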
Using the -s option with cmp will cause it to be totally silent and output nothing. Only the exit status will indicate the result: 0 if the files are the same, 1 if they are different, or greater than 1 if an error occurred. This makes it useful for scripting, for example to confirm that a file copied to another location arrived fully intact.
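As a rough sketch of that scripting use, the following wraps cmp -s in a small function and maps the three possible exit statuses to a word. The function name verify_copy and the file names are my own inventions for illustration, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Sketch: checking a copy with cmp -s. verify_copy is a hypothetical helper.
verify_copy() {
    cmp -s "$1" "$2"
    case $? in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error" ;;      # status greater than 1, e.g. a missing file
    esac
}

tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

printf 'some payload\n' > "$tmpdir/original"
cp "$tmpdir/original" "$tmpdir/copy"
result_good=$(verify_copy "$tmpdir/original" "$tmpdir/copy")

# Simulate a damaged copy.
printf 'corrupted\n' > "$tmpdir/copy"
result_bad=$(verify_copy "$tmpdir/original" "$tmpdir/copy")

# Compare against a file that does not exist.
result_missing=$(verify_copy "$tmpdir/original" "$tmpdir/no-such-file")

echo "$result_good $result_bad $result_missing"
```

In a real script you would more likely act on the exit status directly, as in: if cmp -s src dest; then ... fi.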
It is worth noting that diff is also capable of comparing binary files; however, it is not required by POSIX to report what is actually different or where differences occur. The same exit status as in cmp is returned: 0 if the files are the same, 1 if they are different, or greater than 1 if an error occurred. While many implementations offer an option to suppress the output, this is not in the standard [2], so the most portable method would be to instead redirect output to /dev/null. On my system the diff utility is three times the size of cmp, so if you don't need its extra capabilities, it is a less efficient way of doing the job.
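The portable approach of discarding the output and looking only at the exit status might look like this. It is only a sketch with made-up file names, and it assumes mktemp is available.

```shell
#!/bin/sh
# Sketch: using diff purely for its exit status (file names are hypothetical).
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

printf 'line one\nline two\n' > "$tmpdir/a"
cp "$tmpdir/a" "$tmpdir/b"
printf 'line one\nline 2\n' > "$tmpdir/c"

# Redirecting to /dev/null hides whatever diff would have reported.
diff "$tmpdir/a" "$tmpdir/b" > /dev/null
diff_same=$?

diff "$tmpdir/a" "$tmpdir/c" > /dev/null
diff_changed=$?

echo "same=$diff_same changed=$diff_changed"
```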
The other UNIX Curio for today is comm, and this utility [3] is also intended to compare two files to see what is common between them. Ken Fallon briefly talked about it a few years ago in HPR episode 3889. Compared to the others, it has a much more specific use case. The two files are expected to be text files that are already sorted. What comm will do is print a tab-separated list of all the lines appearing in either or both files. Lines only in the first file will appear in the first column, lines only in the second file will be in the second column, and lines in both files will be in the third column.
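A small sketch of the three-column output follows. The two lists are invented for illustration, mktemp is assumed to be available, and LC_ALL=C is set so that the sort order of the files matches what comm expects.

```shell
#!/bin/sh
# Sketch: default comm output on two sorted lists (contents are hypothetical).
export LC_ALL=C
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

printf 'apple\nbanana\ncherry\n' > "$tmpdir/listA"
printf 'banana\ndate\n'          > "$tmpdir/listB"

# No leading tab:   line only in listA  (column 1)
# One leading tab:  line only in listB  (column 2)
# Two leading tabs: line in both files  (column 3)
out=$(comm "$tmpdir/listA" "$tmpdir/listB")
printf '%s\n' "$out"
```

Here apple and cherry come out with no leading tab, date with one tab, and banana with two.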
Any combination of the options -1, -2, and -3 can be used with comm to suppress printing of the first, second, or third column respectively. Using all three options at the same time is supported, but it results in no output, so that isn't very useful. Unlike the other utilities, the exit status of comm doesn't tell you anything about the two files. It will be 0 if the program ran successfully, and greater than 0 if it didn't.
I'm not sure if I have ever actually used comm for anything practical. I find its default output a bit difficult to meaningfully interpret, plus you need to ensure the two files are already sorted. It seems to be best suited to comparing lists, and one use case that Ken Fallon mentioned would be comparing two lists of files to see if any are missing. The command comm -3 listA listB would print files that only appear in listA in the first column and those only in listB in the second column. This would let you ignore all the filenames that appear in both and focus on those that were absent from one or the other. If on the other hand you only wanted to see the filenames that are on both lists, comm -12 listA listB would give you that.
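Here is a sketch of those two commands on a pair of invented file lists, plus comm -23 as one more combination that leaves only the lines unique to the first list. The file names are hypothetical, mktemp is assumed to be available, and the lists are passed through sort under LC_ALL=C to guarantee the ordering comm needs.

```shell
#!/bin/sh
# Sketch: column suppression with comm (list contents are hypothetical).
export LC_ALL=C
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

printf 'c.txt\na.txt\nb.txt\n' | sort > "$tmpdir/listA"
printf 'd.txt\nb.txt\n'        | sort > "$tmpdir/listB"

# Names missing from one list or the other (column 3 suppressed).
missing=$(comm -3 "$tmpdir/listA" "$tmpdir/listB")

# Names present on both lists (columns 1 and 2 suppressed).
both=$(comm -12 "$tmpdir/listA" "$tmpdir/listB")

# Names only on listA, with no leading tabs (columns 2 and 3 suppressed).
only_a=$(comm -23 "$tmpdir/listA" "$tmpdir/listB")

printf 'missing:\n%s\nboth:\n%s\nonly in listA:\n%s\n' "$missing" "$both" "$only_a"
```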
Some more frivolous potential uses also come to mind. If for some reason the cat utility is broken on your system, you could use comm listA /dev/null to print the file listA instead. If you want to insert tab characters before every line of a file but have an aversion to using sed or awk, then comm /dev/null listA would output listA with one tab before each line, and comm listA listA would insert two tabs. A bit silly, but it would work. The GNU implementation of comm even lets you choose something other than a tab to separate the columns [4], so you could go wild with that.
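For the curious, the silly tricks can be tried out like this. It is just a sketch on an invented sorted file, assuming mktemp is available; the GNU-only delimiter option is left out so the script stays portable.

```shell
#!/bin/sh
# Sketch: abusing comm as cat and as a tab inserter (file is hypothetical).
export LC_ALL=C
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

printf 'apple\nbanana\n' > "$tmpdir/listA"

# Every line is unique to the first file, so it lands in column 1 untouched.
as_cat=$(comm "$tmpdir/listA" /dev/null)

# Every line is unique to the second file: one leading tab.
one_tab=$(comm /dev/null "$tmpdir/listA")

# Every line is common to both files: two leading tabs.
two_tabs=$(comm "$tmpdir/listA" "$tmpdir/listA")

printf '%s\n%s\n%s\n' "$as_cat" "$one_tab" "$two_tabs"
```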
According to the POSIX specifications for cmp and comm, one of the two filenames given as arguments, but not both, can be a "-", in which case standard input will be used for that "file" in the comparison. Also, the results are undefined if both arguments are the same FIFO special, character special, or block special file. Some implementations might not have these limitations, but you shouldn't rely on that everywhere.
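Reading one side from standard input is handy when a list is not sorted yet, since it can be piped through sort on the way in. The following is a sketch with invented file names, assuming mktemp is available.

```shell
#!/bin/sh
# Sketch: using "-" for standard input with comm and cmp (names hypothetical).
export LC_ALL=C
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

printf 'cherry\napple\nbanana\n' > "$tmpdir/unsorted"
printf 'banana\ncherry\ndate\n'  > "$tmpdir/listB"

# Sort on the fly and feed the result to comm as its first "file".
common=$(sort "$tmpdir/unsorted" | comm -12 - "$tmpdir/listB")
printf '%s\n' "$common"

# cmp accepts "-" the same way; here the piped data matches listB exactly.
printf 'banana\ncherry\ndate\n' | cmp -s - "$tmpdir/listB"
stdin_status=$?
echo "cmp status: $stdin_status"
```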
All three of these were developed quite early. The cmp utility appeared in 1971's First Edition UNIX [5], while comm and diff seem to have made their debut in Fourth Edition UNIX [6,7] from 1973. The original versions might not have behaved exactly like their modern counterparts, and newer implementations (especially of the diff utility) have acquired additional options and capabilities, but the basic operation of each has stayed the same.
The next time you need to compare files against each other, consider whether cmp or comm might be appropriate before automatically reaching for diff. They all have their uses in different situations.
References:
[1] Cmp specification: https://pubs.opengroup.org/onlinepubs/009695399/utilities/cmp.html
[2] Diff specification: https://pubs.opengroup.org/onlinepubs/009695399/utilities/diff.html
[3] Comm specification: https://pubs.opengroup.org/onlinepubs/009695399/utilities/comm.html
[4] GNU coreutils manual, comm: https://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html
[5] First Edition UNIX cmp manual page: http://man.cat-v.org/unix-1st/1/cmp
[6] Fourth Edition UNIX comm manual page: https://www.tuhs.org/cgi-bin/utree.pl?file=V4/usr/man/man1/comm.1
[7] Fourth Edition UNIX diff source: https://www.tuhs.org/cgi-bin/utree.pl?file=V4/usr/source/s1/diff1.c