
Here is how it should be done with PADDB: http://pastebin.com/MY9tENpW On my computer this is about 20% faster than the author's version: 0.210 sec vs. 0.260 sec to process 1 GiB. The tight loop is simple:

  400710:	66 0f fc 04 07       	paddb  (%rdi,%rax,1),%xmm0
  400715:	48 83 c0 10          	add    $0x10,%rax
  400719:	48 39 c6             	cmp    %rax,%rsi
  40071c:	77 f2                	ja     400710 <sum_array+0x10>
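
For readers who prefer C to disassembly, here is a minimal sketch of a loop that a compiler would typically turn into something like the PADDB listing above. It is not the pastebin code: the function name `sum_bytes_paddb` is hypothetical, and it assumes the buffer is 16-byte aligned and its length is a multiple of 16 (the memory-operand form of `paddb` needs that alignment).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of all bytes, modulo 256.
   Assumptions (not from the original comment): data is 16-byte aligned
   and len is a multiple of 16. */
uint8_t sum_bytes_paddb(const uint8_t *data, size_t len)
{
    assert(((uintptr_t)data % 16) == 0);
    assert(len % 16 == 0);

    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < len; i += 16) {
        /* Aligned load plus lane-wise 8-bit add; each of the 16 lanes
           wraps around modulo 256. */
        __m128i chunk = _mm_load_si128((const __m128i *)(data + i));
        acc = _mm_add_epi8(acc, chunk);
    }

    /* Horizontal reduction: fold the 16 byte lanes into one byte. */
    uint8_t lanes[16];
    _mm_storeu_si128((__m128i *)lanes, acc);
    uint8_t total = 0;
    for (int i = 0; i < 16; i++)
        total = (uint8_t)(total + lanes[i]);
    return total;
}
```

Note that every lane is only 8 bits wide, so this yields just the low 8 bits of the byte sum. That is enough for an 8-bit checksum, but it is not a drop-in replacement if the wider total is needed.
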

Compare this to the author's complex version:

  400720:	66 0f 6f 14 07       	movdqa (%rdi,%rax,1),%xmm2
  400725:	48 83 c0 10          	add    $0x10,%rax
  400729:	48 39 c6             	cmp    %rax,%rsi
  40072c:	66 0f 6f c2          	movdqa %xmm2,%xmm0
  400730:	66 0f 68 d4          	punpckhbw %xmm4,%xmm2
  400734:	66 0f 60 c4          	punpcklbw %xmm4,%xmm0
  400738:	66 0f f5 d1          	pmaddwd %xmm1,%xmm2
  40073c:	66 0f f5 c1          	pmaddwd %xmm1,%xmm0
  400740:	66 0f fe c3          	paddd  %xmm3,%xmm0
  400744:	66 0f fe c2          	paddd  %xmm2,%xmm0
  400748:	66 0f 6f d8          	movdqa %xmm0,%xmm3
  40074c:	77 d2                	ja     400720 <sum_array+0x20>
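
A rough C equivalent of the longer loop, for comparison. This is a sketch reconstructed from the disassembly, not the author's source: it assumes `%xmm4` holds zeros and `%xmm1` holds 16-bit ones (neither is shown being set up in the listing), and the name `sum_bytes_widening` is hypothetical. It also assumes a 16-byte-aligned buffer whose length is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of all bytes, modulo 2^32.
   Assumptions: data is 16-byte aligned, len is a multiple of 16. */
uint32_t sum_bytes_widening(const uint8_t *data, size_t len)
{
    assert(((uintptr_t)data % 16) == 0);
    assert(len % 16 == 0);

    const __m128i zero = _mm_setzero_si128();  /* assumed role of %xmm4 */
    const __m128i ones = _mm_set1_epi16(1);    /* assumed role of %xmm1 */
    __m128i acc = _mm_setzero_si128();         /* four 32-bit partial sums */

    for (size_t i = 0; i < len; i += 16) {
        __m128i v  = _mm_load_si128((const __m128i *)(data + i));
        /* Zero-extend bytes to 16 bits (punpcklbw / punpckhbw). */
        __m128i lo = _mm_unpacklo_epi8(v, zero);
        __m128i hi = _mm_unpackhi_epi8(v, zero);
        /* Multiply by 1 and add adjacent pairs into 32-bit lanes (pmaddwd). */
        lo = _mm_madd_epi16(lo, ones);
        hi = _mm_madd_epi16(hi, ones);
        /* Accumulate in 32-bit lanes (paddd). */
        acc = _mm_add_epi32(acc, lo);
        acc = _mm_add_epi32(acc, hi);
    }

    /* Horizontal reduction of the four 32-bit lanes. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The extra unpack and multiply-add steps are what make this loop longer: they widen each byte before accumulating, so the total keeps 32 bits instead of wrapping at 8.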

