Motivation

A few times in the past month I found myself comparing files, especially .txt and .csv files with many rows. I usually needed to find out which rows were missing from one of the files, for example because I expected the two files to be identical.

How to do it?

  • Use grep! Yes, I used to think grep was only good for grabbing certain lines from a file, but it can also find the lines that are missing from one.
# to find what is in file2 but missing in file1
grep -Fxvf file1 file2

Note that the order of the files matters: if the positions of file1 and file2 are swapped, the command will instead find what is in file1 but missing from file2.
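A minimal example with two made-up files (names and contents are just for illustration):

```shell
# Two toy files; file2 has one extra row:
printf 'apple\nbanana\ncherry\n' > file1
printf 'apple\nbanana\ncherry\ndate\n' > file2

# Lines present in file2 but missing from file1:
grep -Fxvf file1 file2
# prints: date

# Swapped, this finds lines in file1 missing from file2 (none here):
grep -Fxvf file2 file1 || true   # grep exits 1 when nothing matches
```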

What do the flags mean? According to the official documentation for GNU grep:

  • -F: --fixed-strings, interpret patterns as fixed strings, not regular expressions (specifying this can save a lot of time on very large files; I experienced this when working with BLAST results)
  • -x: --line-regexp, select only those matches that exactly match the whole line. For regular expression patterns, this is like parenthesizing each pattern and then surrounding it with ‘^’ and ‘$’
  • -v: --invert-match, invert the sense of matching, to select non-matching lines
  • -f: --file=file, obtain patterns from file, one per line. If this option is used multiple times or is combined with the -e option, search for all patterns given. The empty file contains zero patterns, and therefore matches nothing
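To see why -F and -x both matter here, consider how substring and regex matching would misbehave (file names below are made up):

```shell
# Why -x matters: without it, a pattern matches anywhere inside a line.
printf 'ab\n' > patterns
printf 'ab\nabc\n' > data
grep -Fvf patterns data || true  # 'ab' also matches inside 'abc', so nothing survives -v
grep -Fxvf patterns data         # whole-line matching only; prints: abc

# Why -F matters: without it, regex metacharacters cause false matches.
printf 'a.c\n' > patterns
printf 'a.c\nabc\n' > data
grep -xvf patterns data || true  # '.' matches any character, so 'abc' matches too; no output
grep -Fxvf patterns data         # 'a.c' taken as a literal string; prints: abc
```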

Possible improvement

So I was trying to directly obtain the difference between two SQLite databases: I wanted to see which rows were missing from one of them, since I expected them to be the same. I considered the SQLite command-line tool sqldiff, but it outputs the SQL statements, including schema changes, that would transform one database into the other. While this can hint at the differences, it would be great to obtain the missing rows directly (a one-liner solution or something).
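One possible approach, sketched under assumptions: if both databases contain a table with the same name and columns, attaching one database to the other lets SQL's EXCEPT operator compute the missing rows directly. The database and table names below (db1.sqlite, db2.sqlite, t) are placeholders, not from my actual setup.

```shell
rm -f db1.sqlite db2.sqlite

# Two toy databases sharing a table 't'; db2 has one extra row:
sqlite3 db1.sqlite "CREATE TABLE t(id INTEGER); INSERT INTO t VALUES (1),(2);"
sqlite3 db2.sqlite "CREATE TABLE t(id INTEGER); INSERT INTO t VALUES (1),(2),(3);"

# Rows in db2's t that are missing from db1's t:
sqlite3 db2.sqlite "ATTACH 'db1.sqlite' AS other;
                    SELECT * FROM t EXCEPT SELECT * FROM other.t;"
# prints: 3
```

Alternatively, each table could be dumped to text with something like sqlite3 db1.sqlite "SELECT * FROM t;" > rows1.txt and the two dumps compared with the grep one-liner above.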