Performance testing suggests that:
+ for dense text, this is 2x faster.
+ for 'hello world' text, this is 1.7x faster.
Change-Id: I4ff940663c44d0b22c9187deb4ee397a9d9953b0
Signed-off-by: Michael Meeks <michael.meeks@collabora.com>
Simply calculate our loop variables from the iteration we're on.
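
A minimal sketch of the idea, not the code from this change: every
per-iteration value is derived from the block index i instead of being
carried and incremented between passes. The helper sum_row(), the block
size of 8 and the summing body are all hypothetical, chosen only to keep
the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: walks a row of pixels in blocks of 8 and sums them. */
static uint64_t sum_row(const uint32_t *row, size_t width)
{
    assert(row != NULL || width == 0);

    uint64_t total = 0;
    size_t blocks = (width + 7) / 8; /* round up to cover a partial tail */

    for (size_t i = 0; i < blocks; ++i) {
        /* Each value below is a pure function of i; nothing is carried
         * over from the previous iteration. */
        size_t start = i * 8;
        size_t left = width - start;
        size_t count = left < 8 ? left : 8;
        const uint32_t *src = row + start;

        for (size_t j = 0; j < count; ++j)
            total += src[j];
    }
    return total;
}
```

With no state carried between iterations, the loop body can be read, or
reordered, one block at a time.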
Change-Id: I0bb73302fb09963b2a1f5b3d93ef302316ef1d4f
Signed-off-by: Michael Meeks <michael.meeks@collabora.com>
Remove the special case for the first pixel, and instead have a
previous pixel run initialized to zero.
AVX2 has no effective shift for the whole si256, so use permutation
to shift the last pixel of the previous run into the right place,
mask it and combine.
Saves a second un-aligned load of the same data, and a branch.
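
A hedged sketch of the permute, mask and combine step described above,
assuming 32-bit pixels handled 8 at a time. The function names, the
prev[8] array interface and the scalar fallback are assumptions made for
illustration, not taken from the patch. Bit i of the result is set when
pixel i equals the pixel before it; the pixel before px[0] is prev[7],
which is zero for the first run.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_HAVE_AVX2 1
#endif

/* Scalar reference for the same mask. */
static uint8_t same_as_prev_scalar(const uint32_t px[8], uint32_t prev_last)
{
    uint8_t mask = 0;
    for (int i = 0; i < 8; ++i) {
        uint32_t before = (i == 0) ? prev_last : px[i - 1];
        if (px[i] == before)
            mask |= (uint8_t)(1u << i);
    }
    return mask;
}

#ifdef SKETCH_HAVE_AVX2
__attribute__((target("avx2")))
static uint8_t same_as_prev_avx2(const uint32_t px[8], const uint32_t prev[8])
{
    __m256i curr = _mm256_loadu_si256((const __m256i *)px);
    __m256i last = _mm256_loadu_si256((const __m256i *)prev);

    /* Rotate lanes up by one: lane 0 takes old lane 7, lane i takes i-1. */
    const __m256i rot = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    const __m256i lane0 = _mm256_setr_epi32(-1, 0, 0, 0, 0, 0, 0, 0);

    __m256i shifted = _mm256_permutevar8x32_epi32(curr, rot);
    __m256i carried = _mm256_permutevar8x32_epi32(last, rot);

    /* Keep only prev[7] in lane 0, drop the wrapped px[7], then combine. */
    __m256i before = _mm256_or_si256(_mm256_and_si256(carried, lane0),
                                     _mm256_andnot_si256(lane0, shifted));

    __m256i eq = _mm256_cmpeq_epi32(curr, before);
    return (uint8_t)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
}
#endif

/* Uses AVX2 when the CPU has it, otherwise the scalar reference. */
static uint8_t same_as_prev(const uint32_t px[8], const uint32_t prev[8])
{
    assert(px != 0 && prev != 0);
#ifdef SKETCH_HAVE_AVX2
    if (__builtin_cpu_supports("avx2"))
        return same_as_prev_avx2(px, prev);
#endif
    return same_as_prev_scalar(px, prev[7]);
}
```

Because the first run compares against an all-zero previous run, the
first pixel needs no branch of its own.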
Change-Id: I77c9cdead13d37aaf4d9f31d98cbd5c4a9c5ce24
Signed-off-by: Michael Meeks <michael.meeks@collabora.com>
just enough to get the same results as before
https://github.com/CollaboraOnline/online/issues/7165
Signed-off-by: Caolán McNamara <caolan.mcnamara@collabora.com>
Change-Id: I109c9b8f1e7935782c72e0179aa0ed48712eadb6
Split it out as a C file, to avoid accidental C++ header inclusion,
and C is a cross-platform assembler anyway so a good match.
Change-Id: I6c042781713aecaf143b9663af8377659a7deaf1
Signed-off-by: Michael Meeks <michael.meeks@collabora.com>