sed & gawk Survival Guide: Practical Unix Stream Processing
Unix systems treat text as the universal interface.
Logs, configuration files, command outputs, and pipelines are all text streams.
Two classic tools dominate command-line text processing:
- sed — stream editor for modifying text
- awk / gawk — pattern processing language for structured text
Understanding how these tools think about data makes everyday debugging much easier.
Stream Processing Philosophy
Unix tools operate on streams rather than entire files.
Instead of loading everything into memory, each tool performs a small transformation and passes the result downstream.
Typical pipeline:
cat file | sed ... | awk ... | sort | uniq
A useful mental shortcut:
sed → modify text
awk → analyze structured text
Examples:
sed 's/foo/bar/'
awk '{print $1}'
sed Execution Model
sed processes input one line at a time.
Conceptually, each line passes through the same loop:
read line → pattern space
apply commands
print result
repeat
Two internal buffers control sed behaviour.
Pattern Space
The pattern space contains the current line being processed.
Example:
sed 's/foo/bar/'
Processing:
pattern space = "foo hello"
apply substitution
pattern space = "bar hello"
output
Hold Space
The hold space is persistent storage between cycles.
Important commands:
h → copy pattern space to hold space
H → append pattern space to hold space
g → copy hold space to pattern space
G → append hold space to pattern space
x → swap pattern and hold spaces
Mental model:
pattern space = working memory
hold space = persistent memory
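A minimal demonstration of h, H, and g: accumulate every line in the hold space, then print the whole buffer once at the end (the input lines here are arbitrary placeholders).

```shell
# 1h  → seed the hold space with the first line
# 1!H → append every later line to the hold space
# ${g; p} → on the last line, copy the hold space back and print it
printf 'alpha\nbeta\ngamma\n' | sed -n '1h; 1!H; ${g; p}'
```

The output is the original three lines, but they all emerge in a single final cycle, which shows that the hold space really does persist across line-by-line cycles.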
sed Primitives
Most sed scripts rely on a small set of primitive operations.
s substitution
d delete pattern space
p print pattern space
N append next line to pattern space
D delete up to the first newline, then restart the cycle without reading input
P print up to the first newline
: define label
b unconditional branch
t branch if substitution succeeded
q quit early
Complex one-liners are usually combinations of these primitives.
sed Patterns
Certain patterns appear repeatedly in real sed scripts.
Loop Pattern
sed supports simple loops using labels and conditional branching.
sed ':a; s/foo/bar/; ta'
Breakdown:
:a → define label
s/foo/bar/ → replace first occurrence
ta → jump back if substitution succeeded
Conceptually:
repeat substitution
until no more matches exist
For this simple case, s/foo/bar/g gives the same result; the loop pattern matters when one substitution can create the text for the next match.
Input:
foo foo foo
Output:
bar bar bar
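A case where the loop is genuinely needed, because each substitution enables the next one, is the classic thousands-separator one-liner (shown here as a sketch with a sample number):

```shell
# Each pass inserts one comma before the last bare group of three
# digits; `ta` branches back to :a until s/// finally fails.
printf '1234567\n' | sed ':a; s/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/; ta'
```

Here s///g would not work: the replacement changes what the next match should be, so the script must re-scan the line after every substitution.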
Reverse Stream Pattern
Classic example:
sed '1!G; h; $!d'
Input:
A
B
C
Output:
C
B
A
Conceptually:
1!G → on every line but the first, append the hold space (the previously seen lines, already in reverse order)
h → copy the combined pattern space back into the hold space
$!d → delete the pattern space on every line but the last, so only the final cycle prints
The hold space accumulates lines in reverse order.
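The one-liner can be verified directly; it behaves like GNU coreutils tac:

```shell
# Reverse the stream: the hold space grows newest-first, and only
# the last cycle is allowed to print.
printf 'A\nB\nC\n' | sed '1!G; h; $!d'
```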
Sliding Window Pattern
sed can maintain a rolling buffer.
Example: print everything except the last five lines.
sed -n -e ':a; 1,5!{P; N; D}; N; ba' text.txt
Once the pattern space holds six lines (the current line plus five lines of look-ahead):
- print the oldest line
- append the next input line
- remove the oldest line
The final five lines remain buffered and are never printed.
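A quick check with synthetic input: given eight lines, only the first three should come out, with lines 4-8 held back in the buffer when input runs out.

```shell
# Sliding window: P prints the oldest buffered line, N pulls in the
# next input line, D drops the oldest and restarts the cycle.
seq 8 | sed -n -e ':a; 1,5!{P; N; D}; N; ba'
```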
Multi-Line Merge Pattern
Example: merge lines ending with a continuation character.
sed ':a; /\\$/N; s/\\\n//; ta'
Input:
hello world \
continued line
Output:
hello world continued line
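The merge can be exercised with a two-line continuation (printf doubles the backslash so a literal one reaches sed):

```shell
# /\\$/ matches a line ending in a backslash; N appends the next
# line, and s/\\\n// deletes the backslash-newline pair to join them.
printf 'hello world \\\ncontinued line\n' | sed ':a; /\\$/N; s/\\\n//; ta'
```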
awk Execution Model
awk treats input as structured records.
Each line becomes a record split into fields.
Processing loop:
read record
split into fields
evaluate pattern
execute action
Important variables:
$0 → full record
$1 → first field
$2 → second field
NF → number of fields
NR → record number
FS → field separator
awk Primitives
Core awk operations include:
- field extraction
- conditional filtering
- substitution
- associative arrays
- record control (getline)
Example:
awk '{print $1}'
Real Example: FASTQ to FASTA
A FASTQ record contains four lines:
@read_id
SEQUENCE
+
QUALITY
FASTA requires:
>read_id
SEQUENCE
sed (step addressing)
sed -n '1~4s/^@/>/p; 2~4p' input.fastq > output.fasta
Explanation:
1~4 → header lines (every fourth line, starting at line 1)
2~4 → sequence lines
The first~step address form is a GNU sed extension.
sed (pattern matching)
sed -e '/^@/!d; s//>/; N' input.fastq > output.fasta
Logic:
keep header lines
replace @ with >
append next line (sequence)
Caveat: FASTQ quality strings may legally begin with @, so this variant can mistake a quality line for a header; the step-addressing version avoids that ambiguity.
awk implementation
awk 'NR%4==1 {sub(/^@/, ">", $0); print; getline; print}' input.fastq
Explanation:
NR%4==1 → header
getline → read sequence
awk expresses record structure more directly.
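A sketch of the awk conversion on a tiny synthetic FASTQ stream (the read IDs and sequences below are made up for illustration):

```shell
# NR%4==1 selects header lines; sub() swaps @ for >, and getline
# consumes the following sequence line so it can be printed too.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' \
  | awk 'NR%4==1 {sub(/^@/, ">", $0); print; getline; print}'
```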
awk Aggregation
awk associative arrays allow quick aggregation.
Example input:
user1 200
user2 150
user1 300
Command:
awk '{sum[$1] += $2} END {for (u in sum) print u, sum[u]}' file.txt
Output (awk does not guarantee array iteration order, so the lines may appear in any order):
user1 500
user2 150
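Because for (u in sum) visits keys in an unspecified order, it is common to pipe the result through sort for stable output:

```shell
# Sum the second field per key in the first field, then sort
# the aggregated lines for deterministic ordering.
printf 'user1 200\nuser2 150\nuser1 300\n' \
  | awk '{sum[$1] += $2} END {for (u in sum) print u, sum[u]}' \
  | sort
```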
Deduplicating While Preserving Order
awk can remove duplicates without sorting.
awk '!seen[$0]++' file.txt
This prints only the first occurrence of each line.
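The idiom relies on post-increment: seen[$0]++ is 0 (falsy) the first time a line appears, so !seen[$0]++ is true exactly once per distinct line.

```shell
# Duplicates 'b' and 'a' are dropped; first occurrences survive
# in their original order.
printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++'
```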
Log Analysis Pipelines
These pipelines appear constantly during debugging.
Top IP addresses:
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head
Count errors per service:
grep ERROR log | awk '{print $3}' | sort | uniq -c
Slow requests:
awk '$NF > 1000' access.log
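The top-IP pipeline can be sketched on synthetic input (the addresses below are placeholders, and the log format is assumed to put the client IP in field 1):

```shell
# Extract field 1, count identical values, and rank by count.
printf '1.1.1.1 a\n2.2.2.2 b\n1.1.1.1 c\n' \
  | awk '{print $1}' | sort | uniq -c | sort -nr | head
```

The first output line carries the highest count; uniq -c left-pads the count, so downstream tools should split on whitespace rather than fixed columns.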
sed and awk Together
In practice these tools are often chained so that each stage performs a small transformation.
Example: count log entries per minute.
Input log lines may look like:
[2026-03-06 14:12:33] INFO request completed
[2026-03-06 14:12:40] ERROR timeout
Pipeline:
sed 's/^\[\(....-..-.. ..:..\):..]/\1/' application.log \
| awk '{count[$1" "$2]++} END {for (t in count) print t, count[t]}' \
| sort
Pipeline logic:
sed → remove seconds from timestamp
awk → count entries per minute
sort → order results chronologically
Example output:
2026-03-06 14:12 34
2026-03-06 14:13 27
2026-03-06 14:14 31
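The whole pipeline can be tested end to end on the two sample lines from above (both fall in the same minute, so the expected count is 2):

```shell
# sed strips the seconds and brackets; awk counts per minute;
# sort orders the minutes chronologically.
printf '[2026-03-06 14:12:33] INFO request completed\n[2026-03-06 14:12:40] ERROR timeout\n' \
  | sed 's/^\[\(....-..-.. ..:..\):..]/\1/' \
  | awk '{count[$1" "$2]++} END {for (t in count) print t, count[t]}' \
  | sort
```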
Here the roles are clear:
sed → normalize text structure
awk → perform aggregation
Each stage performs a single transformation, making the pipeline easy to reason about.
Common Pitfalls
Shell quoting
Prefer single quotes.
sed 's/foo/bar/'
Double quotes allow shell expansion, so characters such as $ and \ may be rewritten by the shell before sed or awk ever sees them.
Field separators
CSV parsing requires specifying the delimiter; note that -F, alone does not handle quoted fields that contain commas.
awk -F, '{print $2}'
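A sketch on a hypothetical two-column CSV (names and ages invented for illustration):

```shell
# -F, sets the field separator to a comma, so $2 is the second column.
printf 'alice,30\nbob,25\n' | awk -F, '{print $2}'
```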
Regex expectations
sed uses basic regular expressions by default.
.* matches the longest possible sequence, because regex quantifiers are greedy.
Overly complex sed
sed excels at simple stream transformations.
However, once scripts rely heavily on multi-line buffers and control flow, readability drops quickly.
For example, this sed command prints everything except the last five lines of a file:
sed -n -e ':a; 1,5!{P; N; D}; N; ba' text.txt
The logic is correct, but difficult to understand at first glance.
The same task expressed in awk is often clearer:
awk '
{
buffer[NR % 6] = $0
if (NR > 5)
print buffer[(NR - 5) % 6]
}
' text.txt
Explanation:
NR % 6 → store lines in a circular buffer
NR > 5 → wait until at least five lines have been read
(NR - 5)%6 → print the line that is five lines behind
Why % 6?
To print everything except the last five lines, the program must keep five lines of look-ahead.
The buffer therefore needs N + 1 slots.
skip last N lines → buffer size = N + 1
So for five lines:
buffer size = 6
This creates a circular buffer that always holds the most recent lines while printing older ones safely outside the tail.
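The same eight-line check used for the sed version confirms the awk buffer behaves identically: lines 1-3 are printed and the last five are held back.

```shell
# Six-slot circular buffer: store the current line, and once five
# lines of look-ahead exist, print the line five positions behind.
seq 8 | awk '{buffer[NR % 6] = $0; if (NR > 5) print buffer[(NR - 5) % 6]}'
```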
Both commands solve the same problem.
The sed version relies on pattern space manipulation and cycle control.
The awk version expresses the logic directly using state and indexing.
When transformations become stateful or algorithmic, awk is usually easier to read and maintain.
Closing Thoughts
The real power of sed and awk comes from thinking in streams.
Instead of writing large programs, engineers compose pipelines that progressively transform data.
This approach makes it possible to:
- inspect large log files quickly
- reshape command output
- prototype data processing workflows
- debug production systems directly from the shell
These tools remain fundamental to Unix systems because they solve real problems with minimal overhead.