Text Processing in Linux

I originally created and posted this November 22, 2004.

Here are some examples of using the text-manipulation utilities found on Unix (available on some other platforms as well). awk and perl both allow writing full programs, but I primarily use them as short one-liners that can be piped to and from other Unix programs. Each of these programs has capabilities that make it better than the others in some situations, which I have attempted to demonstrate below. I don't claim any of these examples as original to me; references are at the bottom of the page.

I have collected this information over the course of several years, during which time I have used Sun Solaris and various flavors of Linux. Note that the versions of these tools included with Solaris don't entirely match the GNU versions, so some of what you see below may need tweaking to work.

The philosophy of Unix utilities is to develop a tool that is very good at doing a specific thing. The output of a tool can be sent to another tool via the pipe (i.e., the | character) as shown in several examples below. So, one program’s output becomes the next program’s input.
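As a small illustration of that philosophy, here is a sketch (the sample text is made up) that finds the most frequent word by chaining four single-purpose tools:

```shell
# Count word frequencies by chaining single-purpose tools:
# tr splits words onto lines, sort groups them, uniq -c counts,
# sort -rn ranks, head keeps the winner.
printf 'red blue red\ngreen red blue\n' \
    | tr ' ' '\n' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -1
```

No single tool in the chain knows about "word frequency"; the concept emerges from the composition.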

awk  cat  csplit  cut  find  fmt  fold  grep  head  join  nl  paste  perl  sdiff  sed  sort  split  tail  uniq  wc


sed, awk, and perl

awk — good for working with files that contain information in columns.

    1. Display only the first three columns of the file SOMEFILE, using tabs to separate the results:
awk '{print $1 "\t\t" $2 "\t" $3}' SOMEFILE
    2. Display the first and fifth columns of the password file with a tab between them
awk -F: '{print $1 "\t" $5}' /etc/passwd

-F: changes the column delimiter from spaces (the default) to a colon (:)

    3. Display the second column of the file using double colons as the field separator
awk -v 'FS=::' '{print $2}' ratings.dat
    4. Replace the first column with "ORACLE" in SOMEFILE
awk '{$1 = "ORACLE"; print }' SOMEFILE
    5. Print the last field of every input line:
awk '{ print $NF }' SOMEFILE
    6. Print the first 50 characters of each line. If a line has fewer than 50 characters, it is padded with spaces.
awk '{ printf("%-50.50s\n", $0) }' SOMEFILE
    7. Sum the values in column 1
awk 'BEGIN{total=0;} {total += $1;} END{print "total is ", total}' SOMEFILE
    8. Sum the values in columns 1, 2, and 4 in order to calculate precision and recall
awk -F ',' 'BEGIN{TP=0; FP=0; FN=0} {TP += $1; FP += $2; FN += $4} END{print "precision is ", TP/(FP+TP); print "recall is ", TP/(FN+TP)}' prec-recall-2states.txt
    9. Sum each row
awk '{sum=0; for(i=1; i<=NF; i++){sum+=$i}; print sum}' SOMEFILE
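A quick self-contained check of the column-summing idiom, using inline data rather than SOMEFILE:

```shell
# Sum column 1 of three inline rows; the END block runs after the last line.
printf '10 a\n20 b\n30 c\n' | awk 'BEGIN{total=0} {total += $1} END{print "total is", total}'
```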

 

sed — from the man page:

Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline). While in some ways similar to an editor which permits scripted edits (such as ed), sed works by making only one pass over the input(s), and is consequently more efficient. But it is sed’s ability to filter text in a pipeline which particularly distinguishes it from other types of editors.

    1. Double space infile and send the output to outfile
sed G < infile > outfile

I use the input/output redirection notation shown above. In many, if not all, cases it is fine to leave out the less-than sign, e.g., sed G infile > outfile

    2. Double space a file which already has blank lines in it. The output file will contain no more than one blank line between lines of text.
sed '/^$/d;G' < infile > outfile
    3. Triple space a file
sed 'G;G' < infile > outfile
    4. Undo double-spacing (assumes even-numbered lines are always blank)
sed 'n;d' < infile > outfile
    5. Insert a blank line above every line which matches regex ("regex" represents a regular expression)
sed '/regex/{x;p;x;}' < infile > outfile
    6. Print the line immediately before regex, but not the line containing regex
sed -n '/regex/{g;1!p;};h' < infile > outfile
    7. Print the line immediately after regex, but not the line containing regex
sed -n '/regex/{n;p;}' < infile > outfile
    8. Insert a blank line below every line which matches regex
sed '/regex/G' < infile > outfile
    9. Insert a blank line above and below every line which matches regex
sed '/regex/{x;p;x;G;}' < infile > outfile
    10. Convert DOS newlines (CR/LF) to Unix format
sed 's/^M$//' < infile > outfile # in bash/tcsh, to get ^M press Ctrl-V then Ctrl-M
    11. Print only those lines matching the regular expression (similar to grep)

sed -n '/some_word/p' infile
sed '/some_word/!d' infile

    12. Print those lines that do not match the regular expression (similar to grep -v)

sed -n '/regexp/!p' infile
sed '/regexp/d' infile

    13. Skip the first two lines (start at line 3) and then alternate between printing 5 lines and skipping 3 for the entire file
sed -n '3,${p;n;p;n;p;n;p;n;p;n;n;n;}' < infile > outfile

Notice that there are five p's in the sequence, representing the five lines to print. The three lines to skip between each set of printed lines are represented by the n;n;n; at the end of the sequence.
    14. Delete trailing whitespace (spaces, tabs) from the end of each line
sed 's/[ \t]*$//' < infile > outfile
    15. Substitute (find and replace) foo with bar on each line
sed 's/foo/bar/' < infile > outfile # replaces only 1st instance in a line
sed 's/foo/bar/4' < infile > outfile # replaces only 4th instance in a line
sed 's/foo/bar/g' < infile > outfile # replaces ALL instances in a line
    16. Replace each occurrence of the hexadecimal character 92 with an apostrophe:
sed "s/\x92/'/g" < old_file.txt > new_file.txt
    17. Print the section of the file between two regular expressions (inclusive)
sed -n '/regex1/,/regex2/p' < old_file.txt > new_file.txt
    18. Combine the line containing REGEX with the line that follows it
sed -e 'N' -e 's/REGEX\n/REGEX/' < old_file.txt > new_file.txt
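To see one of these tricks in action without creating files, the double-spacing one-liner can be fed inline text:

```shell
# G appends the (empty) hold space plus a newline after each line,
# which double-spaces the input: two lines in, four lines out.
printf 'one\ntwo\n' | sed G
```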

 

perl — can do anything sed and awk can do, but not always as easily as shown in the examples above.

    1. Replace OLDSTRING with NEWSTRING in the file(s) in FILELIST [e.g., file1 file2 or *.txt]
perl -pi.bak -e 's/OLDSTRING/NEWSTRING/g' FILELIST

The options used are:

      • -e — allows a one-line script to be run from the command line
      • -i — files are edited in place. In the example above, the .bak extension will be placed on the original files
      • -p — places the script inside a loop that reads and prints each line of the input files

 

    2. The full Perl program to do the same as the one-liner (without creating backup copies) is
#!/usr/bin/perl
# perl-example.pl
while (<>)
{
	s/OLDSTRING/NEWSTRING/g;
	print;
}

run using ./perl-example.pl FILELIST

    3. Remove the carriage returns required by DOS text files from files on the Unix system
perl -pi.bak -e 's/\r$//g' FILELIST

 

Assorted Utilities

Some of the examples below use the following files:

file1 file2
Tom 123 Main 
Dick 4787 West
Harry 98 North
Sue 1035 Cooper
Tom programmer
Dick lawyer
Harry artist

 

ga.txt
The Gettysburg Address
Gettysburg, Pennsylvania
November 19, 1863


Four score and seven years ago our fathers brought forth on this continent,
a new nation, conceived in Liberty, and dedicated to the proposition that
all men are created equal.
 
Now we are engaged in a great civil war, testing whether that nation, or any
nation so conceived and so dedicated, can long endure. We are met on a great
battle-field of that war. We have come to dedicate a portion of that field,
as a final resting place for those who here gave their lives that that nation
might live. It is altogether fitting and proper that we should do this.
 
But, in a larger sense, we can not dedicate -- we can not consecrate -- we
can not hallow -- this ground. The brave men, living and dead, who struggled
here, have consecrated it, far above our poor power to add or detract. The
world will little note, nor long remember what we say here, but it can never
forget what they did here. It is for us the living, rather, to be dedicated
here to the unfinished work which they who fought here have thus far so
nobly advanced. It is rather for us to be here dedicated to the great task
remaining before us -- that from these honored dead we take increased devotion
to that cause for which they gave the last full measure of devotion -- that we
here highly resolve that these dead shall not have died in vain -- that this
nation, under God, shall have a new birth of freedom -- and that government
of the people, by the people, for the people, shall not perish from the earth.
 
Source: The Collected Works of Abraham Lincoln, Vol. VII, edited by Roy
P. Basler.

 

In the examples using these files, the percent sign (%) at the beginning of the line represents the command prompt. Comments of what is happening follow the pound sign (#).

 

grep — prints the lines of a file that match a search string (string can be a regular expression)

grep -i string some_file # print the lines containing string regardless of case
grep -v string some_file # print the lines that don’t contain string
grep -E "string1|string2" some_file # print the lines that contain string1 or string2
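For instance, counting lines that match either alternative (inline input; -c makes grep print a count instead of the matching lines):

```shell
# Two of the three lines match the alternation cat|bird.
printf 'cat\ndog\nbird\n' | grep -cE 'cat|bird'
```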

find — find has many parameters for restricting what it finds, but I only demonstrate here how to use it to recursively search from the current location for files containing the_word. More examples of using find.

find . -type f -print | xargs grep the_word 2>/dev/null
find . -type f -exec grep 'the_word' {} \; -print

In the first example, the results of the find command are piped to grep; xargs passes the filenames to grep as arguments. Error messages (STDERR) are discarded with 2>/dev/null. The second example greps each file by using a command-line option of find.
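One caveat: plain xargs splits on whitespace, so filenames containing spaces break the first form. GNU find and xargs can pass NUL-separated names instead (a sketch; the /tmp paths are made up):

```shell
# -print0 / -0 pass NUL-terminated names, so spaces in filenames are safe;
# grep -l lists the matching files rather than the matching lines.
mkdir -p /tmp/find-demo
printf 'the_word\n' > '/tmp/find-demo/file with spaces.txt'
find /tmp/find-demo -type f -print0 | xargs -0 grep -l 'the_word'
```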

 

Operations on entire files

cat — concatenate files and print on the standard output

% cat -E file2  # display file2, showing $ at end of each line
Tom programmer$
Dick lawyer$
Harry artist$



cat -v somefile  # display somefile, showing nonprinting characters using ^ and M- notation, except for LFD and TAB
cat -e somefile  # display somefile, combining the effects of -v and -E

nl — Number lines of files

% nl file1
     1	Tom 123 Main 
     2	Dick 4787 West
     3	Harry 98 North
     4	Sue 1035 Cooper

wc — print the number of bytes, words, and lines in files

% wc -l file1  # print number of lines
      4 file1
% wc -w file1  # print number of words
     12 file1
% wc -m file1  # print number of characters
     60 file1
% wc file1     # print number of lines, words, and characters
      4      12      60 file1

 

Alter the format of a file

fmt — Reformat each paragraph of a file

% fmt -w 50 ga.txt # reformat to 50 characters per line
The Gettysburg Address Gettysburg, Pennsylvania
November 19, 1863

Four score and seven years ago our fathers
brought forth on this continent, a new nation,
conceived in Liberty, and dedicated to the
proposition that all men are created equal.

Now we are engaged in a great civil war, testing
whether that nation, or any nation so conceived
and so dedicated, can long endure. We are met on
a great battle-field of that war. We have come
to dedicate a portion of that field, as a final
resting place for those who here gave their lives
that that nation might live. It is altogether
fitting and proper that we should do this.

But, in a larger sense, we can not dedicate --
we can not consecrate -- we can not hallow --
this ground. The brave men, living and dead, who
struggled here, have consecrated it, far above
our poor power to add or detract. The world will
little note, nor long remember what we say here,
but it can never forget what they did here. It is
for us the living, rather, to be dedicated here
to the unfinished work which they who fought here
have thus far so nobly advanced. It is rather
for us to be here dedicated to the great task
remaining before us -- that from these honored
dead we take increased devotion to that cause for
which they gave the last full measure of devotion
-- that we here highly resolve that these dead
shall not have died in vain -- that this nation,
under God, shall have a new birth of freedom --
and that government of the people, by the people,
for the people, shall not perish from the earth.

Source: The Collected Works of Abraham Lincoln,
Vol. VII, edited by Roy P. Basler.

fold — wrap each input line to fit in specified width

% fold -w 50 ga.txt
The Gettysburg Address 
Gettysburg, Pennsylvania 
November 19, 1863

Four score and seven years ago our fathers brought
 forth on this continent,
a new nation, conceived in Liberty, and dedicated 
to the proposition that
all men are created equal.

Now we are engaged in a great civil war, testing w
hether that nation, or any
nation so conceived and so dedicated, can long end
ure. We are met on a great
battle-field of that war. We have come to dedicate
 a portion of that field,
as a final resting place for those who here gave t
heir lives that that nation
might live. It is altogether fitting and proper th
at we should do this.

But, in a larger sense, we can not dedicate -- we 
can not consecrate -- we
can not hallow -- this ground. The brave men, livi
ng and dead, who struggled
here, have consecrated it, far above our poor powe
r to add or detract. The
world will little note, nor long remember what we 
say here, but it can never
forget what they did here. It is for us the living
, rather, to be dedicated
here to the unfinished work which they who fought 
here have thus far so
nobly advanced. It is rather for us to be here ded
icated to the great task
remaining before us -- that from these honored dea
d we take increased devotion
to that cause for which they gave the last full me
asure of devotion -- that we
here highly resolve that these dead shall not have
 died in vain -- that this
nation, under God, shall have a new birth of freed
om -- and that government
of the people, by the people, for the people, shal
l not perish from the earth.

Source: The Collected Works of Abraham Lincoln, Vo
l. VII, edited by Roy
P. Basler.
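The difference between the two is that fmt refills whole paragraphs at word boundaries, while fold cuts at exactly the requested column, even mid-word; fold's -s option makes it break at spaces instead. A small sketch with made-up input:

```shell
# fold -w 15 cuts after exactly 15 characters, splitting klmnopqrst in two.
printf 'abcdefghij klmnopqrst uvwxyz\n' | fold -w 15
# fold -s breaks at the last blank within the width instead.
printf 'abcdefghij klmnopqrst uvwxyz\n' | fold -w 15 -s
```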

 

Output parts of files

head — Output the first part of files

% head -2 file1  # print the first two lines
Tom 123 Main
Dick 4787 West

tail — Output the last part of files

% tail -2 file1  # display the last 2 lines
Harry 98 North
Sue 1035 Cooper

split — Split a file into pieces (default is 1000 lines each)

split somefile         # create files of the form xaa, xab, and so on
split -l 500 somefile  # each new file will be at most 500 lines long

csplit — split a file into sections determined by context lines

csplit bigfile /The End/+4            # break at the line that is 4 lines below The End
csplit -k bigfile /The End/+1 "{99}"  # break at the line below each occurrence of The End, up to 99 times
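A round trip shows that split loses nothing; with a prefix argument the pieces sort back into order under a shell glob (scratch filenames in /tmp are made up):

```shell
# Split five lines into 2-line pieces, then reassemble and compare.
seq 1 5 > /tmp/split-demo.txt
split -l 2 /tmp/split-demo.txt /tmp/split-demo-part-
cat /tmp/split-demo-part-* > /tmp/split-demo-rejoined.txt
cmp /tmp/split-demo.txt /tmp/split-demo-rejoined.txt && echo 'files match'
```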

 

Operate on fields within a line

cut — print selected parts of lines from each file

% cut -c1-10 file2                  # cut characters 1 through 10 from file2
Tom progra
Dick lawye
Harry arti
 
% cut -d " " -f2 file1              # cut the second column (-f2); use a space as the delimiter (-d " ")
123
4787
98
1035

ls *.txt | cut -c1-3 | xargs mkdir  # create directories with the names of the first three letters of each .txt file

paste — merge lines of files, separated by tabs. The columns of the input files are placed side-by-side with each other.

% paste file1 file2 
Tom 123 Main 	Tom programmer
Dick 4787 West	Dick lawyer
Harry 98 North	Harry artist
Sue 1035 Cooper 

join — join lines of two files on a common field (files should be sorted by common field)

% join -a 2 -a 1 -o 1.1,1.2,2.2 -e " " file1 file2
Tom 123 programmer
Dick 4787 lawyer
Harry 98 artist
Sue 1035

join -a 2 -a 1 -o 1.1,1.2,2.2 -e " " -1 1 -2 3 file1 file2

-a list unpairable lines in file1 and file2
-o display fields 1 and 2 of file1 field 2 of file2
-e replace any empty output fields with blanks
-1 join on field 1 of file1
-2 join on field 3 of file2
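Note that GNU join refuses input that is not sorted on the join field, so a runnable version of the example above sorts copies of the sample files first (scratch /tmp paths assumed):

```shell
# Recreate file1/file2 sorted on the join field (column 1), then join as above.
printf 'Tom 123 Main\nDick 4787 West\nHarry 98 North\nSue 1035 Cooper\n' | sort > /tmp/file1.sorted
printf 'Tom programmer\nDick lawyer\nHarry artist\n' | sort > /tmp/file2.sorted
join -a 1 -a 2 -o 1.1,1.2,2.2 -e ' ' /tmp/file1.sorted /tmp/file2.sorted
```

The output is the same pairing as in the example, just in sorted order, with Sue's missing occupation replaced by a blank via -e.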

sdiff — print differences between files

sdiff -s file1 file2

-s suppress identical lines

 

Operate on sorted files

sort — sort lines of text files

% sort +1 file1     # sort on the second column (the count starts at zero)
Sue 1035 Cooper
Tom 123 Main 
Dick 4787 West
Harry 98 North


% sort -n +1 file1  # perform a numeric sort (-n) by the second column
Harry 98 North
Tom 123 Main 
Sue 1035 Cooper
Dick 4787 West

use lensort to sort by line length
use chunksort to sort paragraphs separated by a blank line
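The +N positional syntax shown above is the historical form; on modern systems the POSIX -k option expresses the same numeric sort on the second column:

```shell
# -k 2,2 restricts the sort key to field 2; -n compares numerically,
# so 98 sorts before 123, 1035, and 4787.
printf 'Tom 123 Main\nDick 4787 West\nHarry 98 North\nSue 1035 Cooper\n' | sort -n -k 2,2
```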


uniq
— displays unique lines from a sorted file

cat SOMEFILE | sort | uniq   # this could have been done more simply with  sort SOMEFILE | uniq
uniq -c filename             # prefix lines by the number of occurrences
uniq -d filename             # display one copy of each line that is repeated
uniq -D filename             # print all duplicate lines
uniq -i filename             # ignore differences in case when comparing
uniq -s N filename           # avoid comparing the first N characters
uniq -u filename             # only print unique lines
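Because uniq only collapses adjacent duplicates, the sort step matters. An inline demonstration of -c:

```shell
# Unsorted input: 'a' appears three times but never adjacently
# until sort groups the lines together.
printf 'b\na\nb\na\na\n' | sort | uniq -c
```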

 

 

To perform these operations on multiple files, it is often helpful to create a simple shell script to operate on the appropriate files.

 

Assorted Examples that Combine Tools

These examples don’t necessarily rely on the sample files given above.

    1. find all files beginning in the current directory and sum the number of lines in them
find . -exec wc -l {} \; | awk '{total = total+$1; print total " " $1 " " $2}'
    2. print the 4th, 3rd, and 2nd columns of SOMEFILE (in that order), and sort on the last column (the 2nd column of the original file)
cat SOMEFILE | awk '{ print $4 " " $3 " " $2 }' | sort +2
    3. print the total size of all files
find . -type f -name "*.*" -ls | awk 'BEGIN{ FILECNT = 0; T_SIZE = 0;} { T_SIZE += $7; FILECNT++} END{print "Total Files:", FILECNT, "Total Size:", T_SIZE, "Average Size:", T_SIZE / FILECNT;}'
    4. list all files with a size less than 100 bytes
ls -l | awk '{if ($5 < 100) {print $5 " " $8}}'

here $5 represents the column of file sizes produced by ls -l

    5. delete all files with a size less than 100 bytes
ls -l | awk '{if ($5 < 100) {print $8}}' | xargs -i -t rm \{}
    6. if the number in the second column is less than 1000, prefix it with a zero
awk '{if ($2 < 1000) {print $1 " 0" $2 " " $3} else {print $1 " " $2 " " $3}}' < dvd-titles2.sh > dvd-titles3.sh
    7. combine file1 and file2 and show TAB characters as ^I
% paste file1 file2 | cat -T
Tom 123 Main ^ITom programmer
Dick 4787 West^IDick lawyer
Harry 98 North^IHarry artist
Sue 1035 Cooper^I
    8. sort ratings.dat on column 2 and subsort on column 0 using : as the delimiter, redirecting the output to ratings-sorted.dat
sort -t : -n +2 +0 ratings.dat > ratings-sorted.dat
    9. cut the first and third columns of movies-ratings.dat, using the : as the delimiter, and count the unique lines
cut -d : -f 1,3 movies-ratings.dat | uniq -c
    10. In a file where each line begins with 'File' followed by one or more digits followed by '=', e.g., 'File23=', find the duplicates
awk -F = '{print $2}' untitled.pls | sort | uniq -c | sort
    11. Find all files from the current location with filenames of at least 50 characters
find . -exec basename {} \; | sed -n '/^.\{50\}/p'
    12. A file of closed captions needs to be cleaned up. Search for the blank lines and remove them as well as the two lines that follow each blank line. This works by not printing everything from the blank line (/^$/) to the line with the colons (/:/). Since the first section to clean up doesn't have a blank line to look for, begin on the 3rd line of the file.
% head -7 0273-mary_shelleys_frankenstein.cc
1
00:00:30,063 --> 00:00:33,066
[ Woman ]
“I BUSIED MYSELF
TO THINK OF A STORY…

2
00:00:33,066 --> 00:00:37,570
“WHICH WOULD SPEAK
TO THE MYSTERIOUS FEARS
OF OUR NATURE…

3
00:00:37,570 --> 00:00:39,572
“AND AWAKEN…
%
% sed -n '3,${/^$/,/:/!p}' < 0273-mary_shelleys_frankenstein.cc > 0273-mary_shelleys_frankenstein.cc.clean
%
% head -7 0273-mary_shelleys_frankenstein.cc.clean
[ Woman ]
“I BUSIED MYSELF
TO THINK OF A STORY…
“WHICH WOULD SPEAK
TO THE MYSTERIOUS FEARS
OF OUR NATURE…
“AND AWAKEN…
    13. Search for lines containing ::0038:: or ::0148:: or ::0187::, use sed to replace the :: field delimiters with a %, and then perform a numerical sort on the second column. Note that egrep is equivalent to grep -E
$ egrep "::0038::|::0148::|::0187::" ratings.dat | sed 's/::/%/g' | sort -t % +1 -n > match-ratings.txt
    14. determine the disk usage of each subdirectory of the current directory, sort in descending order, and format for readability
$ du -s * | sort -n -r | awk '{printf("%8.0fKB %s\n", $1, $2)}'
29223820KB bob
23038660KB tom
19999376KB sue
11010288KB andy
    15. for columns 3-6125, find those columns that have some value other than '0,' and count the number of occurrences
#!/bin/sh

for col in $(seq 3 6125); do
	echo "column $col"
	awk '{print $'$col'}' allshots2nd10minutes.shots | grep -vc "0,"
done
    16. print column 51 followed by the line number for this value, sorted by the values from column 51
$ awk '{print $51 "\t" FNR}' allshots2nd5-10thIframes-sparse.shots | sort
    17. extract the 6th column from all but the last line of somefile
$ head -n -1 somefile | awk '{print $6}'
    18. print all but the first column of somefile, where the file remove_first_column.awk consists of the following:
$ awk -f remove_first_column.awk somefile
# remove_first_column.awk
BEGIN {
	ORS=""
}
{
	for (i = 2; i <= NF; i++)
		if (i == NF)
			print $i "\n"
		else
			print $i " "
}
    19. The first line of file1 contains header information, which we don't want. file2 lacks the column headers and therefore contains one less line than file1. Extract all but the first line of file1 and combine with the columns of file2 to create file3 with the vertical bar (|) as the delimiter between the columns of each.
$ tail -n+2 file1 | paste -d '|' - file2 > file3
    20. delete the lines up to and including the regular expression (REGEX)
$ sed '1,/REGEX/d;' somefile.txt
    21. delete the lines up to, but not including, the regular expression (REGEX)
$ sed -e '/REGEX/p' -e '1,/REGEX/d;' somefile.txt
    22. delete all newlines (this turns the entire document into a single line)
$ tr -d '\n' < somefile.txt
    23. combine groups of nonblank lines into a single line, where each group is separated by a single blank line. This works by first changing each blank line to XXXXX; second, each newline is replaced by a space; third, each XXXXX is replaced with a newline in order to separate the original groups into lines.
$ cat somefile.txt
this is the
first section of
the file

this is the
second section of
the file

this is the
third section of
the file
$ sed 's/^$/XXXXX/' somefile.txt | tr '\n' ' ' | sed 's/XXXXX/\n/g' | sed 's/^ //'
this is the first section of the file
this is the second section of the file
this is the third section of the file
    24. remove non-alphabetic characters and convert uppercase to lowercase
$ tr -cs "[:alpha:]" " " < somefile.txt | tr "[:upper:]" "[:lower:]"
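The last pipeline can be tried on inline text; -c complements the character set and -s squeezes each run of replaced characters down to a single space:

```shell
# Punctuation and digits become single spaces, then everything is lowercased.
printf 'Hello, World 123!\n' | tr -cs '[:alpha:]' ' ' | tr '[:upper:]' '[:lower:]'
```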

 

References

  1. GNU core utilities
  2. Using the GNU text utilities
  3. awk one-liners
  4. The GNU Awk User’s Guide
  5. Awk: Dynamic Variables
  6. How to Use Awk (Hartigan)
  7. sed one-liners
  8. sed scripts
  9. Sed – An Introduction
  10. Perl one-liners
  11. Perl one-liners
  12. Perl regular expressions
  13. Unix Power Tools, 2nd Ed., O’Reilly
  14. Linux Cookbook, 2nd Ed., No Starch Press
  15. Unix in a Nutshell, 3rd Ed., O’Reilly
  16. John & Ed’s Miscellaneous Unix Tips
  17. Classic Shell Scripting, O’Reilly — great overview of the Unix philosophy of combining small tools that are each very good at a specific thing