
It doesn't matter for a 1 KB file. For a 10 GB file, every process matters.

For the downvoters: please time how long it takes to do something like `cat $file | awk '{print $1}'` versus `awk <$file '{print $1}'`.



Not exactly convincing:

    ~/desktop$ du -h c.dat
    11G     c.dat
    ~/desktop$ time cat c.dat | awk '{ print $1 }' > /dev/null
    
    real    0m53.997s
    user    0m52.930s
    sys     0m7.986s
    ~/desktop$ time < c.dat awk '{ print $1 }' > /dev/null
    
    real    0m53.898s
    user    0m51.074s
    sys     0m2.807s
cat CPU usage didn't exceed 1.6% at any time. The biggest cost is in redundant copying, so the more actual work you're doing on the data, the less and less it matters.


I was curious, so here goes. 'foo' was a file of ~1G containing lines made up of 999 'x's and one '\n'.

    $ ls -lh foo
    -rw-r--r-- 1 ori ori 954M Sep  5 22:57 foo

    $ time cat foo | awk '{print $1}' > /dev/null

    real	0m1.631s
    user	0m1.452s
    sys 	0m0.540s

    $ time awk <foo '{print $1}' > /dev/null 

    real	0m1.541s
    user	0m1.376s
    sys 	0m0.160s
This was run from a warm cache, so that the overhead of the extra IO from a pipe would dominate.
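For anyone who wants to repeat this, here is a minimal sketch of how a file like 'foo' could be generated, followed by a sanity check that both invocations hand awk the same bytes. The helper name `make_test_file` and the small line count are my own choices for illustration, not what was actually used above.

```shell
#!/bin/sh
# make_test_file PATH LINES  (hypothetical helper, not from the thread)
# Writes LINES lines, each made of 999 'x' characters plus a newline.
make_test_file() {
    awk -v n="$2" 'BEGIN {
        line = sprintf("%999s", "")   # 999 spaces...
        gsub(/ /, "x", line)          # ...turned into 999 x characters
        for (i = 0; i < n; i++) print line
    }' > "$1"
}

f=$(mktemp)
make_test_file "$f" 1000      # about 1 MB; 1000000 lines gives roughly 1 GB

# Both forms feed awk the same data, so the checksums should match.
a=$(cat "$f" | awk '{print $1}' | cksum)
b=$(awk '{print $1}' < "$f" | cksum)
[ "$a" = "$b" ] && echo "same output"
rm -f "$f"
```

With the larger line count, prefixing each of the two pipelines with `time` should give numbers comparable to the ones above, though they will vary by machine and awk implementation.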


Both invocations take similar amounts of "real" time because the task is IO-bound and it takes roughly 1.5s on your machine to read the file.

But if you add up the "user" and "sys" time in the cat example, you see that it took 1.992s of actual CPU time, versus 1.536s for the redirect. That is about a 30% increase in CPU time spent.
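To spell out the arithmetic, here is a small sketch that sums user and sys for each run and reports the relative increase. The figures are the ones posted above; the function name `cpu_overhead` is just an illustrative choice.

```shell
#!/bin/sh
# cpu_overhead USER1 SYS1 USER2 SYS2  (hypothetical helper name)
# Run 1 is the `cat foo | awk` form, run 2 is the `awk <foo` form.
# Prints total CPU seconds for each and the percent increase of run 1 over run 2.
cpu_overhead() {
    awk -v u1="$1" -v s1="$2" -v u2="$3" -v s2="$4" 'BEGIN {
        with_cat = u1 + s1
        redirect = u2 + s2
        printf "with cat: %.3fs  redirect: %.3fs  increase: %.1f%%\n",
               with_cat, redirect, (with_cat - redirect) / redirect * 100
    }'
}

# Timings quoted from the 954M test above.
cpu_overhead 1.452 0.540 1.376 0.160
```

Most of the gap is in the sys column (0.540s against 0.160s), which is where the extra copying through the pipe is accounted for.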

The perf decrease wasn't visible because you have multiple cores parallelizing the extra cpu-time, but it was there.


So the two are different because awk's call to read() is effectively the same as a read directly from a file, whereas copying is taking place through the pipe with the pipeline approach?


Basically you see a linear increase in time. If it was going to take a coffee break's worth of time one way, it will take a slightly longer coffee break's worth the other. It is fairly rare that the additional time matters and there isn't something else that you should be doing anyway.


The difference between

    cat file | foo
    foo <file
assuming foo only reads stdin, so `foo file` isn't possible, is that with the latter the shell opens file for reading on file descriptor 0 (stdin) before exec'ing foo, and the only cost is the read(2)s that foo does directly from file.

With the needless cat, cat has to read(2) the bytes and then write(2) them to the pipe, whereupon foo reads them as before. So the number of system calls goes from R to R+W+R, assuming all reads and writes use the same block size, and more byte copying may be required.
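As a rough illustration of R versus R+W+R, here is a sketch that estimates the call counts for a given file size. The 64 KiB block size is purely an assumption for the example (real programs pick their own buffer sizes), the final zero-byte read at EOF is ignored, and `syscall_estimate` is a made-up name.

```shell
#!/bin/sh
# syscall_estimate SIZE_BYTES BLOCK_BYTES  (hypothetical helper)
# Assumes every read and write moves exactly BLOCK_BYTES, as in the
# simplification above; ignores the last zero-byte read that signals EOF.
syscall_estimate() {
    r=$(( ($1 + $2 - 1) / $2 ))   # reads needed to cover the file, rounded up
    # With cat in the way: cat reads (R), cat writes (W = R), foo reads (R).
    echo "foo <file: $r  cat file | foo: $(( 3 * r ))"
}

# 1,000,000 lines of 1000 bytes each, with an assumed 64 KiB block size.
syscall_estimate 1000000000 65536
```

On Linux, something like `strace -f -c -e trace=read,write sh -c 'cat file | foo >/dev/null'` should report the actual counts, which will differ from this estimate depending on the buffer sizes cat and foo really use.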



