Most people won't think twice about writing thousands of lines of code to accomplish what a line or two of bash script will handle.
Some anti-patterns to avoid come to mind:
NIH (Not Invented Here)
Golden Hammer (Treat every new problem like it's a nail.)
Re-inventing the wheel
Text processing is composed of four primary activities: filtering, transforming, sorting, and joining.
To achieve the fastest processing, group all of your filtering, transforming, and joining tasks together in one pipeline.
Stream processing tasks (filtering, transforming, joining) are limited only by disk I/O, so take advantage of that single scan of the disk and apply all of the operations as co-routines at the time you read the file.
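As a rough sketch of that idea (the file names raw.log and users.tsv, and the field layout, are hypothetical), one pipeline can filter with grep, transform with awk, sort, and join against a lookup table, so the raw data is only read from disk once:
# Hypothetical one-pass pipeline: filter, transform, sort, then join.
# users.tsv is assumed to be tab-separated and already sorted on field 1.
grep -v '^#' raw.log \
|awk -F'\t' '{print $1 "\t" $3}' \
|sort -t$'\t' -k1,1 \
|join -t$'\t' -1 1 -2 1 - users.tsv \
>report.tsv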
Let's say I need to apply five regular expressions to a file.
Example (as co-routines; two equally fast ways):
time cat bigfile \
|grep -vE "[^a-z0-9 ][^a-z0-9 ]|[^a-z0-9] [^a-z0-9]|…|\. |[a-z]' " \
>bigfile.clean
OR
time cat bigfile \
|grep -v '[^a-z0-9 ][^a-z0-9 ]' \
|grep -v '[^a-z0-9] [^a-z0-9]' \
|grep -v '…' \
|grep -v '\. ' \
|grep -v "[a-z]' " \
>bigfile.clean
Versus using temp files (one full pass per pattern):
time cat bigfile|grep -v '[^a-z0-9 ][^a-z0-9 ]'>tmpfile1
time cat tmpfile1|grep -v '[^a-z0-9] [^a-z0-9]'>tmpfile2
time cat tmpfile2|grep -v '…'>tmpfile3
time cat tmpfile3|grep -v '\. '>tmpfile4
time cat tmpfile4|grep -v "[a-z]' " >bigfile.clean
Using temp files here causes the equivalent of five full scans over the data, when you should really only be reading the data once.
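The same single-scan cleanup can also be done with one sed process instead of several chained greps; this is just a sketch using the patterns shown above, where each /pattern/d expression deletes matching lines exactly as grep -v does:
time sed -e '/[^a-z0-9 ][^a-z0-9 ]/d' \
         -e '/[^a-z0-9] [^a-z0-9]/d' \
         -e '/\. /d' \
         -e "/[a-z]' /d" \
         bigfile >bigfile.clean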