
[RFC] Whether RGB <-> XYZ conversion should be more faster or be kept simple #351

Closed
kimikage opened this issue Sep 15, 2019 · 14 comments · Fixed by #410

Comments

@kimikage
Collaborator

kimikage commented Sep 15, 2019

This is not a bug report, and I have not yet settled on a vision of what the conversion should be.

I was interested in the following comment in another issue.

But here it doesn't matter because most of the time is taken up by the v^(1/2.4) operation in srgb_compand.

Originally posted by @timholy in #349 (comment)

I then tried eliminating `^` at Float32 precision. (I bet there's a better way.)

function srgb_compand(v::Float32)
    v <= 0.0031308f0 && return 12.92v
    x = sqrt(Float64(v))
    if v >= 0.25f0
        return  -0.027068702898623787 + x*(1.3721685639381944 + x*(
                -0.8182691289384268   + x*(1.0130285191962123 + x*(
                -0.9480751093798352   + x*(0.5864252237648636 + x*(
                -0.2120914317979767   + x*(0.03388206630456692)))))))
    elseif v >= 0.015625f0
        return  -0.04681200154477308  + x*(1.7595676106892333 + x*(
                -4.800335523711974    + x*(28.963281530269672 + x*(
                -144.84684738192738   + x*(551.4601017130509  + x*(
                -1559.1686261466277   + x*(3207.764496252879  + x*(
                -4656.012229629203    + x*(4512.997609562684  + x*(
                -2619.2703123846018   + x*(687.9857377729561)))))))))))
    else
        return  -0.050769586670801364 + x*(2.004280478094793  + x*(
                -11.749002783289283   + x*(146.84612431888658 + x*(
                -1443.135076339118    + x*(9983.922783293772  + x*(
                -45370.79700636592    + x*(121264.00163436493 + x*(
                -144305.83989718332))))))))
    end
end
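For reference, the exact companding pair that the polynomial above approximates can be written directly from the sRGB specification. This is a sketch for comparison; the `_ref` names are illustrative, not the package's actual function names:

```julia
# Forward sRGB companding (linear -> gamma-encoded), per IEC 61966-2-1.
srgb_compand_ref(v::Float64) = v <= 0.0031308 ? 12.92v : 1.055 * v^(1/2.4) - 0.055

# Inverse companding (gamma-encoded -> linear).
invert_srgb_compand_ref(v::Float64) = v <= 0.04045 ? v / 12.92 : ((v + 0.055) / 1.055)^2.4
```

Measuring the maximum deviation of the Float32 polynomial against `srgb_compand_ref` over a fine grid is a straightforward way to validate the approximation.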

Since the sRGB matrix transformation is effectively always done in Float64, I changed the type of the matrix elements to Float32. (Formally, it may be better to keep them in Float64 for XYZ{Float64}.)

It then certainly worked as I expected.

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> using BenchmarkTools

julia> @benchmark convert(RGB{Float64}, x) setup=(x=XYZ{Float64}(rand(),rand(),rand()))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.185 ns (0.00% GC)
  median time:      156.340 ns (0.00% GC)
  mean time:        168.913 ns (0.00% GC)
  maximum time:     1.117 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     970

julia> @benchmark convert(RGB{Float32}, x) setup=(x=XYZ{Float32}(rand(),rand(),rand()))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.427 ns (0.00% GC)
  median time:      19.438 ns (0.00% GC)
  mean time:        20.656 ns (0.00% GC)
  maximum time:     101.503 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark convert(RGB{Float64}, x) setup=(x=rand(Lab{Float64}))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     91.615 ns (0.00% GC)
  median time:      169.392 ns (0.00% GC)
  mean time:        177.020 ns (0.00% GC)
  maximum time:     446.016 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     954

julia> @benchmark convert(RGB{Float32}, x) setup=(x=rand(Lab{Float32}))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     36.995 ns (0.00% GC)
  median time:      46.673 ns (0.00% GC)
  mean time:        48.788 ns (0.00% GC)
  maximum time:     131.955 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

Since the RGB <-> XYZ conversion is used frequently, I think it is worth specializing. However, I also think that such an ad hoc function is inelegant.

Do you have any thoughts?

@kimikage kimikage changed the title [RFC] RGB <-> XYZ conversion should be more faster or be kept simple? [RFC] Whether RGB <-> XYZ conversion should be more faster or be kept simple Sep 15, 2019
@johnnychen94
Member

Are generated functions designed for this situation?

@kimikage
Collaborator Author

kimikage commented Sep 16, 2019

Am I correct in understanding that generating the approximate expressions (e.g. polynomials) can be done at compile time?
From my understanding, a generated function may be helpful for supporting various gamma values, but in the case of the constant gamma 2.4 (the effective overall gamma is ≈2.2), the meta implementation may be as complicated as a concrete implementation like the one shown above.

Edit:
In general, I don't mind the first execution being slow. I'm disgusted by the sluggish test/conversion.jl, though.

@timholy
Member

timholy commented Sep 16, 2019

generating the approximate expressions (e.g. polynomials) can be done at compile time?

I agree, I wouldn't do this with a generated function.

I really like the speedup. I might consider using a continued fraction representation instead. Some such representations are here. Since 2.4 == 12//5 there may be even faster approaches.

I am not certain one has to get full mathematical precision, since our ability to discriminate colors has limits. But of course precision is convenient, especially for tests, so we shouldn't give it up too lightly.

I'm disgusted by the sluggish test/conversion.jl, though.

Yeah, tests have long been a sore point. The JLD-based tests were an attempt to improve the situation, but more could be done.

@kimikage
Collaborator Author

Interestingly, exp(1/2.4 * log(x)) is faster than x^(1/2.4). And curiously, it is also faster than exp2(1/2.4 * log2(x)).
This may change in future releases of Julia (or LLVM), but currently it looks like a useful (and maybe-not-so-ugly) hack.
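The hack relies on the identity x^y == exp(y*log(x)) for x > 0, which holds exactly in real arithmetic and to within a few ulps in floating point. A minimal agreement check (a sketch; `pow_inv2_4` is an illustrative name):

```julia
# x^(1/2.4) rewritten via exp/log; valid only for x > 0, which the
# 0.0031308 threshold in srgb_compand guarantees.
pow_inv2_4(x) = exp(1/2.4 * log(x))

# Compare against the library `^` at a few sample points.
for x in (0.0031308, 0.1, 0.5, 1.0)
    @assert isapprox(pow_inv2_4(x), x^(1/2.4); rtol=1e-12)
end
```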

julia> @benchmark x^(1/2.4) setup=(x=max(rand(),0.0031308))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     57.274 ns (0.00% GC)
  median time:      57.884 ns (0.00% GC)
  mean time:        60.855 ns (0.00% GC)
  maximum time:     77.416 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     983

julia> @benchmark exp(1/2.4 * log(x)) setup=(x=max(rand(),0.0031308))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.910 ns (0.00% GC)
  median time:      15.917 ns (0.00% GC)
  mean time:        18.759 ns (0.00% GC)
  maximum time:     38.437 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark exp2(1/2.4 * log2(x)) setup=(x=max(rand(),0.0031308))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.874 ns (0.00% GC)
  median time:      25.476 ns (0.00% GC)
  mean time:        25.388 ns (0.00% GC)
  maximum time:     47.442 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

@kimikage
Collaborator Author

I merged PR #352 since there is almost no loss in accuracy.
However, if we allow some loss, there are various techniques for further speedups; the polynomial approximation shown above is even faster than PR #352.

So, I would like to continue this discussion for a while.

@kimikage
Collaborator Author

kimikage commented Oct 7, 2019

I know that 5/12 ≈ 1/2, but I had not noticed at all that 5/12 == 1/2 + 1/4 - 1/3.
In short, we can use this:

pow3_4(x) = (y = sqrt(x); y*sqrt(y))
pow5_12(x) = pow3_4(x) / cbrt(x)

cbrt(x) is not so fast, but pow5_12(x) is faster than exp(1/2.4 * log(x)).
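The exponent arithmetic can be verified directly: sqrt(x)*sqrt(sqrt(x)) == x^(1/2 + 1/4) == x^(3/4), and dividing by cbrt(x) subtracts 1/3 from the exponent, giving 3/4 - 1/3 == 5/12. A self-contained check (restating the definitions above):

```julia
# x^(3/4) from a square root and a fourth root: 3/4 == 1/2 + 1/4.
pow3_4(x) = (y = sqrt(x); y * sqrt(y))
# x^(5/12), since 5/12 == 3/4 - 1/3.
pow5_12(x) = pow3_4(x) / cbrt(x)

# Agreement with the library `^` at a few sample points.
for x in (0.01, 0.25, 0.9)
    @assert isapprox(pow5_12(x), x^(5/12); rtol=1e-12)
end
```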

On the other hand, as 5 is a prime, I think this technique is not effective for invert_srgb_compand().
12/5 == 2 + 2/5 == 2 + 1/2 - 1/10 may help to accelerate the rate of convergence, but I don't know what algorithm is the best. The best way may possibly be using MKL, though.

Edit:
The convergence of a continued fraction representation of pow is not so fast. Especially in the case of z ≈ -1 (i.e. RGB value x ≈ 0), it is not a good approximation. I think we have to use some ingenuity.

Edit2:
12/5 == 4/5 * 3 may be suitable for Newton's method because 4/5 ≈ 1.

function pow4_5(x::Float32)
    t = 0.2f0x + 0.8f0x/sqrt(sqrt(x))
    abs(1-t) > 0.02f0 && (t = 0.2f0t + 0.8f0x/sqrt(sqrt(t)))
    abs(1-t) > 0.52f0 && (t = 0.2f0t + 0.8f0x/sqrt(sqrt(t)))
    t = Float64(t)
    0.2t + 0.8x/sqrt(sqrt(t))
end
pow12_5(x::Float32) = (y = pow4_5(x); y^3)

However, the speedup effect is not so great.

Edit3:
I added the lookup table for N0f8.

const invert_srgb_compand_n0f8 = [invert_srgb_compand(v/255) for v = 0:255] # LUT

So, we can use quadratic/cubic splines by means of the LUT.

const invert_srgb_compand_n0f8 = [invert_srgb_compand(v/255.0) for v = 0:257] # +2

function invert_srgb_compand(v::Float32)
    i = unsafe_trunc(Int, v * 255)
    (i < 12 || i > 255) && return invert_srgb_compand(Float64(v))
    y0,y1,y2,y3 = view(invert_srgb_compand_n0f8, i:i+3)
    dv = v * 255.0 - i
    dv == 0.0 && return y1
    if v < 0.38857287f0 # `≈srgb_compand(1/8)`
        return y1+0.5*dv*(-1/3*y3+2y2- y1-2/3*y0 + 
                      dv*(         y2-2y1+    y0 +
                      dv*(+1/3*y3- y2+ y1-1/3*y0)))
    else
        return y1+0.5*dv*(-y3+4y2-3y1+dv*(y3-2y2+y1))
    end
end

As far as I can tell, this spline technique with the LUT is almost the fastest option for Float32. My concern is that it is not much simpler than the high-order polynomial approximation without LUTs, although it has fewer magic numbers.
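As a sanity check on the interpolation formula itself: the short branch above is the Newton forward-difference quadratic through three equally spaced table entries, so it must reproduce any quadratic exactly. A sketch, with `quad` as an illustrative stand-alone version of that branch:

```julia
# Quadratic through samples y1, y2, y3 at t = 0, 1, 2, evaluated at
# fractional offset dv past y1 (same form as the `else` branch above).
quad(y1, y2, y3, dv) = y1 + 0.5*dv*(-y3 + 4y2 - 3y1 + dv*(y3 - 2y2 + y1))

# Exact on any quadratic, e.g. f(t) = 3t^2 - 2t + 1.
f(t) = 3t^2 - 2t + 1
for dv in (0.0, 0.25, 0.5, 1.0)
    @assert isapprox(quad(f(0), f(1), f(2), dv), f(dv); atol=1e-12)
end
```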

@kimikage
Collaborator Author

kimikage commented Oct 10, 2019

Benchmark (snip.)

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

before (43d93ff)

julia> @benchmark convert(RGB{Float64}, x) setup=(x=XYZ{Float64}(rand(),rand(),rand()))
  minimum time:     19.658 ns (0.00% GC)
  median time:      45.737 ns (0.00% GC)
  mean time:        50.024 ns (0.00% GC)
  maximum time:     176.429 ns (0.00% GC)

julia> @benchmark convert(RGB{Float32}, x) setup=(x=XYZ{Float32}(rand(),rand(),rand()))
  minimum time:     21.465 ns (0.00% GC)
  median time:      45.135 ns (0.00% GC)
  mean time:        48.481 ns (0.00% GC)
  maximum time:     147.642 ns (0.00% GC)

julia> @benchmark convert(XYZ{Float64}, x) setup=(x=RGB{Float64}(rand(),rand(),rand()))
  minimum time:     20.583 ns (0.00% GC)
  median time:      54.618 ns (0.00% GC)
  mean time:        55.775 ns (0.00% GC)
  maximum time:     143.372 ns (0.00% GC)

julia> @benchmark convert(XYZ{Float32}, x) setup=(x=RGB{Float32}(rand(),rand(),rand()))
  minimum time:     23.193 ns (0.00% GC)
  median time:      59.839 ns (0.00% GC)
  mean time:        61.497 ns (0.00% GC)
  maximum time:     153.012 ns (0.00% GC)

after (kimikage@6860191 see below)

julia> @benchmark convert(RGB{Float64}, x) setup=(x=XYZ{Float64}(rand(),rand(),rand()))
  minimum time:     19.357 ns (0.00% GC)
  median time:      30.792 ns (0.00% GC)
  mean time:        33.945 ns (0.00% GC)
  maximum time:     92.979 ns (0.00% GC)

julia> @benchmark convert(RGB{Float32}, x) setup=(x=XYZ{Float32}(rand(),rand(),rand()))
  minimum time:     20.762 ns (0.00% GC)
  median time:      31.293 ns (0.00% GC)
  mean time:        35.580 ns (0.00% GC)
  maximum time:     122.969 ns (0.00% GC)

julia> @benchmark convert(XYZ{Float64}, x) setup=(x=RGB{Float64}(rand(),rand(),rand()))
  minimum time:     19.559 ns (0.00% GC)
  median time:      54.162 ns (0.00% GC)
  mean time:        56.013 ns (0.00% GC)
  maximum time:     131.896 ns (0.00% GC)

julia> @benchmark convert(XYZ{Float32}, x) setup=(x=RGB{Float32}(rand(),rand(),rand()))
  minimum time:     13.213 ns (0.00% GC)
  median time:      20.521 ns (0.00% GC)
  mean time:        21.990 ns (0.00% GC)
  maximum time:     89.189 ns (0.00% GC)

@kimikage
Collaborator Author

kimikage commented Oct 11, 2019

Umm... :confused:
I wish the (implicit) SIMD optimization were a little smarter.

function invert_srgb_compand(v::Float32)
    i = unsafe_trunc(Int32, v * 255)
    (i < 13 || i > 255) && return invert_srgb_compand(Float64(v))
    @inbounds y = view(invert_srgb_compand_n0f8, i:i+3)
    dv = v * 255.0 - i
    dv == 0.0 && @inbounds return y[2]
    if v < 0.38857287f0
        return @fastmath(y[2]+0.5*dv*((-2/3*y[1]- y[2])+(2y[3]-1/3*y[4])+
                                  dv*((     y[1]-2y[2])+  y[3]-
                                  dv*(( 1/3*y[1]- y[2])+( y[3]-1/3*y[4]) ))))
    else
        return @fastmath(y[2]+0.5*dv*((4y[3]-3y[2])-y[4]+dv*((y[4]-y[3])+(y[2]-y[3]))))
    end
end

Edit:
kimikage@f789000

julia> @benchmark convert(XYZ{Float32}, x) setup=(x=RGB{Float32}(rand(),rand(),rand()))
  minimum time:     13.713 ns (0.00% GC)
  median time:      17.317 ns (0.00% GC)
  mean time:        18.173 ns (0.00% GC)
  maximum time:     59.859 ns (0.00% GC)

@johnnychen94
Member

Will this PR be influenced by JuliaMath/FixedPointNumbers.jl#125?

@kimikage
Collaborator Author

The answer is probably no, because the RGB <--> XYZ conversions mainly use Floats, not FixedPointNumbers. Of course, if FixedPointNumbers such as Q7f24 were to have much greater performance than Float32, using them would be yet another strategy for improving the conversions.

@kimikage
Collaborator Author

kimikage commented Nov 2, 2019

I learned a little about Julia's inlining. cf. JuliaMath/FixedPointNumbers.jl#129 (comment)
FYI, usually, inlining pow12_5 is effective for the caller (invert_srgb_compand(v::Fractional)), but it may have the opposite effect for the caller's caller (invert_srgb_compand(v::Float32)).
kimikage@f789000

# `pow12_5` is called from `invert_srgb_compand`.
# x^y ≈ exp(y*log(x)) ≈ exp2(y*log2(x)); the middle form is faster
@noinline pow12_5(x) = x^2 * exp(0.4 * log(x)) # 12/5 == 2.4 == 2 + 0.4
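The decomposition can be checked independently of inlining: since 12/5 == 2 + 2/5, x^2.4 factors into an exact x^2 times a small fractional power. A sketch restating the definition without `@noinline` (the `_sketch` name is illustrative):

```julia
# x^2.4 == x^2 * x^0.4; the fractional part uses the exp/log form,
# which is valid for x > 0.
pow12_5_sketch(x) = x^2 * exp(0.4 * log(x))

# Agreement with the library `^` at a few sample points.
for x in (0.04045, 0.5, 1.0)
    @assert isapprox(pow12_5_sketch(x), x^2.4; rtol=1e-12)
end
```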

@kimikage
Collaborator Author

I'm going to change the significant digits of the numbers in the sRGB matrices from ∼7 to ∼16. (cf. #355 (comment)) Paradoxically, this supports the case that accuracy can be relaxed for precision types lower than Float64.
Strictly speaking, this will be a breaking change (cf. #361), but I think it is negligible because the current numbers do not follow the sRGB specification either.

Is there a problem?

@timholy
Member

timholy commented Nov 27, 2019

Sounds good to me. Once we've made one breaking change (and recent changes in FixedPointNumbers probably qualify, I have a PR I should submit for ColorVectorSpace that fixes one of the tests), there's little reason to hold back. Might as well get things "right."

@kimikage
Collaborator Author

BTW, even without full color gamut support, the conversion matrices can be obtained dynamically (during precompilation) once the color primaries are defined. (cf. #372 (comment))
Although some errors occur in the calculation of the inverse matrix, there is no problem perceptually.
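For illustration, deriving the linear-RGB -> XYZ matrix from chromaticities takes only a few lines; the sRGB primaries and D65 white point are used below, and the function names are hypothetical, not the Colors.jl API:

```julia
using LinearAlgebra

# Chromaticity (x, y) to an XYZ vector with Y normalized to 1.
xy_to_XYZ(x, y) = [x / y, 1.0, (1 - x - y) / y]

# Scale each primary so that equal RGB input maps to the white point.
function rgb_to_xyz_matrix(r, g, b, w)
    P = hcat(xy_to_XYZ(r...), xy_to_XYZ(g...), xy_to_XYZ(b...))
    S = P \ xy_to_XYZ(w...)
    return P * Diagonal(S)
end

# sRGB primaries and the D65 white point.
M = rgb_to_xyz_matrix((0.64, 0.33), (0.30, 0.60), (0.15, 0.06), (0.3127, 0.3290))
```

`M * [1, 1, 1]` then reproduces the D65 white point, and `inv(M)` gives the XYZ -> RGB direction; computing the inverse in Float64 keeps the round-trip error well below perceptual thresholds.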
