
[RFC] Whether RGB <-> XYZ conversion should be more faster or be kept simple #351

Closed
kimikage opened this issue Sep 15, 2019 · 14 comments · Fixed by #410

Comments

@kimikage
Collaborator

kimikage commented Sep 15, 2019

This is not a bug report, and I have not yet settled on a vision of what the conversion should be.

I was interested in the following comment in another issue.

But here it doesn't matter because most of the time is taken up by the v^(1/2.4) operation in srgb_compand.

Originally posted by @timholy in #349 (comment)

I then tried eliminating `^` at Float32 precision. (I bet there's a better way.)

function srgb_compand(v::Float32)
    v <= 0.0031308f0 && return 12.92v
    x = sqrt(Float64(v))
    if v >= 0.25f0
        return  -0.027068702898623787 + x*(1.3721685639381944 + x*(
                -0.8182691289384268   + x*(1.0130285191962123 + x*(
                -0.9480751093798352   + x*(0.5864252237648636 + x*(
                -0.2120914317979767   + x*(0.03388206630456692)))))))
    elseif v >= 0.015625f0
        return  -0.04681200154477308  + x*(1.7595676106892333 + x*(
                -4.800335523711974    + x*(28.963281530269672 + x*(
                -144.84684738192738   + x*(551.4601017130509  + x*(
                -1559.1686261466277   + x*(3207.764496252879  + x*(
                -4656.012229629203    + x*(4512.997609562684  + x*(
                -2619.2703123846018   + x*(687.9857377729561)))))))))))
    else
        return  -0.050769586670801364 + x*(2.004280478094793  + x*(
                -11.749002783289283   + x*(146.84612431888658 + x*(
                -1443.135076339118    + x*(9983.922783293772  + x*(
                -45370.79700636592    + x*(121264.00163436493 + x*(
                -144305.83989718332))))))))
    end
end
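For reference, the exact companding pair that the polynomial above approximates can be written directly from the sRGB specification. This is a sketch for comparison; the `_ref` names are illustrative, not the package's actual function names:

```julia
# Forward sRGB companding (linear -> gamma-encoded), per IEC 61966-2-1.
srgb_compand_ref(v::Float64) = v <= 0.0031308 ? 12.92v : 1.055 * v^(1/2.4) - 0.055

# Inverse companding (gamma-encoded -> linear).
invert_srgb_compand_ref(v::Float64) = v <= 0.04045 ? v / 12.92 : ((v + 0.055) / 1.055)^2.4
```

Measuring the maximum deviation of the Float32 polynomial against `srgb_compand_ref` over a fine grid is a straightforward way to validate the approximation.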

Since the sRGB matrix transformation is effectively always done in Float64, I changed the type of the matrix elements to Float32. (Formally, it may be better to keep them in Float64 for XYZ{Float64}.)

It then certainly worked as I expected.

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> using BenchmarkTools

julia> @benchmark convert(RGB{Float64}, x) setup=(x=XYZ{Float64}(rand(),rand(),rand()))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.185 ns (0.00% GC)
  median time:      156.340 ns (0.00% GC)
  mean time:        168.913 ns (0.00% GC)
  maximum time:     1.117 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     970

julia> @benchmark convert(RGB{Float32}, x) setup=(x=XYZ{Float32}(rand(),rand(),rand()))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.427 ns (0.00% GC)
  median time:      19.438 ns (0.00% GC)
  mean time:        20.656 ns (0.00% GC)
  maximum time:     101.503 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark convert(RGB{Float64}, x) setup=(x=rand(Lab{Float64}))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     91.615 ns (0.00% GC)
  median time:      169.392 ns (0.00% GC)
  mean time:        177.020 ns (0.00% GC)
  maximum time:     446.016 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     954

julia> @benchmark convert(RGB{Float32}, x) setup=(x=rand(Lab{Float32}))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     36.995 ns (0.00% GC)
  median time:      46.673 ns (0.00% GC)
  mean time:        48.788 ns (0.00% GC)
  maximum time:     131.955 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

Since the RGB <-> XYZ conversion is used frequently, I think it is worth specializing. However, I also think that such an ad hoc function is inelegant.

Do you have any thoughts?

@kimikage kimikage changed the title [RFC] RGB <-> XYZ conversion should be more faster or be kept simple? [RFC] Whether RGB <-> XYZ conversion should be more faster or be kept simple Sep 15, 2019
@johnnychen94
Member

Are generated functions designed for this situation?

@kimikage
Collaborator Author

kimikage commented Sep 16, 2019

Am I correct in understanding that generating the approximate expressions (e.g. polynomials) can be done at compile time?
From my understanding, a generated function may be helpful for supporting various gamma values, but in the case of the constant gamma 2.4 (the effective overall gamma is ≈2.2), the meta implementation may be as complicated as a concrete implementation like the one shown above.

Edit:
In general, I don't mind the first execution being slow. I'm disgusted by the sluggish test/conversion.jl, though.

@timholy
Member

timholy commented Sep 16, 2019

generating the approximate expressions (e.g. polynomials) can be done at compile time?

I agree, I wouldn't do this with a generated function.

I really like the speedup. I might consider using a continued fraction representation instead. Some such representations are here. Since 2.4 == 12//5 there may be even faster approaches.

I am not certain one has to get full mathematical precision, since our ability to discriminate colors has limits. But of course precision is convenient, especially for tests, so we shouldn't give it up too lightly.

I'm disgusted by the sluggish test/conversion.jl, though.

Yeah, tests have long been a sore point. The JLD-based tests were an attempt to improve the situation, but more could be done.

@kimikage
Collaborator Author

Interestingly, exp(1/2.4 * log(x)) is faster than x^(1/2.4). And curiously, it is also faster than exp2(1/2.4 * log2(x)).
This may change in future releases of Julia (or LLVM), but currently it looks like a useful (and maybe-not-so-ugly) hack.
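The hack relies on the identity x^y == exp(y*log(x)) for x > 0, which holds exactly in real arithmetic and to within a few ulps in floating point. A minimal agreement check (a sketch; `pow_inv2_4` is an illustrative name):

```julia
# x^(1/2.4) rewritten via exp/log; valid only for x > 0, which the
# 0.0031308 threshold in srgb_compand guarantees.
pow_inv2_4(x) = exp(1/2.4 * log(x))

# Compare against the library `^` at a few sample points.
for x in (0.0031308, 0.1, 0.5, 1.0)
    @assert isapprox(pow_inv2_4(x), x^(1/2.4); rtol=1e-12)
end
```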

julia> @benchmark x^(1/2.4) setup=(x=max(rand(),0.0031308))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     57.274 ns (0.00% GC)
  median time:      57.884 ns (0.00% GC)
  mean time:        60.855 ns (0.00% GC)
  maximum time:     77.416 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     983

julia> @benchmark exp(1/2.4 * log(x)) setup=(x=max(rand(),0.0031308))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.910 ns (0.00% GC)
  median time:      15.917 ns (0.00% GC)
  mean time:        18.759 ns (0.00% GC)
  maximum time:     38.437 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark exp2(1/2.4 * log2(x)) setup=(x=max(rand(),0.0031308))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.874 ns (0.00% GC)
  median time:      25.476 ns (0.00% GC)
  mean time:        25.388 ns (0.00% GC)
  maximum time:     47.442 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

@kimikage
Collaborator Author

I merged PR #352 since there is almost no loss in accuracy.
However, if we allow some loss, there are various techniques for further speedups; the polynomial approximation shown above is even faster than PR #352.

So, I would like to continue this discussion for a while.

@kimikage
Collaborator Author

kimikage commented Oct 7, 2019

I know that 5/12 ≈ 1/2, but I had not noticed at all that 5/12 == 1/2 + 1/4 - 1/3.
In short, we can use this:

pow3_4(x) = (y = sqrt(x); y*sqrt(y))
pow5_12(x) = pow3_4(x) / cbrt(x)

cbrt(x) is not so fast, but pow5_12(x) is faster than exp(1/2.4 * log(x)).
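The exponent arithmetic can be verified directly: sqrt(x)*sqrt(sqrt(x)) == x^(1/2 + 1/4) == x^(3/4), and dividing by cbrt(x) subtracts 1/3 from the exponent, giving 3/4 - 1/3 == 5/12. A self-contained check (restating the definitions above):

```julia
# x^(3/4) from a square root and a fourth root: 3/4 == 1/2 + 1/4.
pow3_4(x) = (y = sqrt(x); y * sqrt(y))
# x^(5/12), since 5/12 == 3/4 - 1/3.
pow5_12(x) = pow3_4(x) / cbrt(x)

# Agreement with the library `^` at a few sample points.
for x in (0.01, 0.25, 0.9)
    @assert isapprox(pow5_12(x), x^(5/12); rtol=1e-12)
end
```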

On the other hand, as 5 is a prime, I think this technique is not effective for invert_srgb_compand().
12/5 == 2 + 2/5 == 2 + 1/2 - 1/10 may help to accelerate the rate of convergence, but I don't know what algorithm is the best. The best way may possibly be using MKL, though.

Edit:
The convergence of a continued fraction representation of pow is not so fast. Especially in the case of z ≈ -1 (i.e. RGB value x ≈ 0), it is not a good approximation. I think we have to use some ingenuity.

Edit2:
12/5 == 4/5 * 3 may be suitable for Newton's method because 4/5 ≈ 1.

function pow4_5(x::Float32)
    t = 0.2f0x + 0.8f0x/sqrt(sqrt(x))
    abs(1-t) > 0.02f0 && (t = 0.2f0t + 0.8f0x/sqrt(sqrt(t)))
    abs(1-t) > 0.52f0 && (t = 0.2f0t + 0.8f0x/sqrt(sqrt(t)))
    t = Float64(t)
    0.2t + 0.8x/sqrt(sqrt(t))
end
pow12_5(x::Float32) = (y = pow4_5(x); y^3)

However, the speedup effect is not so great.

Edit3:
I added the lookup table for N0f8.

const invert_srgb_compand_n0f8 = [invert_srgb_compand(v/255) for v = 0:255] # LUT

So, we can use quadratic/cubic splines by means of the LUT.

const invert_srgb_compand_n0f8 = [invert_srgb_compand(v/255.0) for v = 0:257] # +2

function invert_srgb_compand(v::Float32)
    i = unsafe_trunc(Int, v * 255)
    (i < 12 || i > 255) && return invert_srgb_compand(Float64(v))
    y0,y1,y2,y3 = view(invert_srgb_compand_n0f8, i:i+3)
    dv = v * 255.0 - i
    dv == 0.0 && return y1
    if v < 0.38857287f0 # `≈srgb_compand(1/8)`
        return y1+0.5*dv*(-1/3*y3+2y2- y1-2/3*y0 + 
                      dv*(         y2-2y1+    y0 +
                      dv*(+1/3*y3- y2+ y1-1/3*y0)))
    else
        return y1+0.5*dv*(-y3+4y2-3y1+dv*(y3-2y2+y1))
    end
end

As far as I can tell, this spline technique with the LUT is almost the fastest option for Float32. My concern is that it is not much simpler than the high-order polynomial approximation without LUTs, although it has fewer magic numbers.
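As a sanity check on the interpolation formula itself: the short branch above is the Newton forward-difference quadratic through three equally spaced table entries, so it must reproduce any quadratic exactly. A sketch, with `quad` as an illustrative stand-alone version of that branch:

```julia
# Quadratic through samples y1, y2, y3 at t = 0, 1, 2, evaluated at
# fractional offset dv past y1 (same form as the `else` branch above).
quad(y1, y2, y3, dv) = y1 + 0.5*dv*(-y3 + 4y2 - 3y1 + dv*(y3 - 2y2 + y1))

# Exact on any quadratic, e.g. f(t) = 3t^2 - 2t + 1.
f(t) = 3t^2 - 2t + 1
for dv in (0.0, 0.25, 0.5, 1.0)
    @assert isapprox(quad(f(0), f(1), f(2), dv), f(dv); atol=1e-12)
end
```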

@kimikage
Collaborator Author

kimikage commented Oct 10, 2019

Benchmark (snip.)

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

before (43d93ff)

julia> @benchmark convert(RGB{Float64}, x) setup=(x=XYZ{Float64}(rand(),rand(),rand()))
  minimum time:     19.658 ns (0.00% GC)
  median time:      45.737 ns (0.00% GC)
  mean time:        50.024 ns (0.00% GC)
  maximum time:     176.429 ns (0.00% GC)

julia> @benchmark convert(RGB{Float32}, x) setup=(x=XYZ{Float32}(rand(),rand(),rand()))
  minimum time:     21.465 ns (0.00% GC)
  median time:      45.135 ns (0.00% GC)
  mean time:        48.481 ns (0.00% GC)
  maximum time:     147.642 ns (0.00% GC)

julia> @benchmark convert(XYZ{Float64}, x) setup=(x=RGB{Float64}(rand(),rand(),rand()))
  minimum time:     20.583 ns (0.00% GC)
  median time:      54.618 ns (0.00% GC)
  mean time:        55.775 ns (0.00% GC)
  maximum time:     143.372 ns (0.00% GC)

julia> @benchmark convert(XYZ{Float32}, x) setup=(x=RGB{Float32}(rand(),rand(),rand()))
  minimum time:     23.193 ns (0.00% GC)
  median time:      59.839 ns (0.00% GC)
  mean time:        61.497 ns (0.00% GC)
  maximum time:     153.012 ns (0.00% GC)

after (kimikage@6860191 see below)

julia> @benchmark convert(RGB{Float64}, x) setup=(x=XYZ{Float64}(rand(),rand(),rand()))
  minimum time:     19.357 ns (0.00% GC)
  median time:      30.792 ns (0.00% GC)
  mean time:        33.945 ns (0.00% GC)
  maximum time:     92.979 ns (0.00% GC)

julia> @benchmark convert(RGB{Float32}, x) setup=(x=XYZ{Float32}(rand(),rand(),rand()))
  minimum time:     20.762 ns (0.00% GC)
  median time:      31.293 ns (0.00% GC)
  mean time:        35.580 ns (0.00% GC)
  maximum time:     122.969 ns (0.00% GC)

julia> @benchmark convert(XYZ{Float64}, x) setup=(x=RGB{Float64}(rand(),rand(),rand()))
  minimum time:     19.559 ns (0.00% GC)
  median time:      54.162 ns (0.00% GC)
  mean time:        56.013 ns (0.00% GC)
  maximum time:     131.896 ns (0.00% GC)

julia> @benchmark convert(XYZ{Float32}, x) setup=(x=RGB{Float32}(rand(),rand(),rand()))
  minimum time:     13.213 ns (0.00% GC)
  median time:      20.521 ns (0.00% GC)
  mean time:        21.990 ns (0.00% GC)
  maximum time:     89.189 ns (0.00% GC)

@kimikage
Collaborator Author

kimikage commented Oct 11, 2019

Umm... :confused:
I wish the (implicit) SIMD optimization were a little smarter.

function invert_srgb_compand(v::Float32)
    i = unsafe_trunc(Int32, v * 255)
    (i < 13 || i > 255) && return invert_srgb_compand(Float64(v))
    @inbounds y = view(invert_srgb_compand_n0f8, i:i+3)
    dv = v * 255.0 - i
    dv == 0.0 && @inbounds return y[2]
    if v < 0.38857287f0
        return @fastmath(y[2]+0.5*dv*((-2/3*y[1]- y[2])+(2y[3]-1/3*y[4])+
                                  dv*((     y[1]-2y[2])+  y[3]-
                                  dv*(( 1/3*y[1]- y[2])+( y[3]-1/3*y[4]) ))))
    else
        return @fastmath(y[2]+0.5*dv*((4y[3]-3y[2])-y[4]+dv*((y[4]-y[3])+(y[2]-y[3]))))
    end
end

Edit:
kimikage@f789000

julia> @benchmark convert(XYZ{Float32}, x) setup=(x=RGB{Float32}(rand(),rand(),rand()))
  minimum time:     13.713 ns (0.00% GC)
  median time:      17.317 ns (0.00% GC)
  mean time:        18.173 ns (0.00% GC)
  maximum time:     59.859 ns (0.00% GC)

@johnnychen94
Member

Will this PR be influenced by JuliaMath/FixedPointNumbers.jl#125?

@kimikage
Collaborator Author

The answer is probably no, because the RGB <--> XYZ conversions mainly use Floats, not FixedPointNumbers. Of course, if FixedPointNumbers such as Q7f24 were to have much greater performance than Float32, using them would be yet another strategy for improving the conversions.

@kimikage
Collaborator Author

kimikage commented Nov 2, 2019

I learned a little about Julia's inlining. cf. JuliaMath/FixedPointNumbers.jl#129 (comment)
FYI, usually, inlining pow12_5 is effective for the caller (invert_srgb_compand(v::Fractional)), but it may have the opposite effect for the caller's caller (invert_srgb_compand(v::Float32)).
kimikage@f789000

# `pow12_5` is called from `invert_srgb_compand`.
# x^y ≈ exp(y*log(x)) ≈ exp2(y*log2(x)); the middle form is faster
@noinline pow12_5(x) = x^2 * exp(0.4 * log(x)) # 12/5 == 2.4 == 2 + 0.4
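The decomposition can be checked independently of inlining: since 12/5 == 2 + 2/5, x^2.4 factors into an exact x^2 times a small fractional power. A sketch restating the definition without `@noinline` (the `_sketch` name is illustrative):

```julia
# x^2.4 == x^2 * x^0.4; the fractional part uses the exp/log form,
# which is valid for x > 0.
pow12_5_sketch(x) = x^2 * exp(0.4 * log(x))

# Agreement with the library `^` at a few sample points.
for x in (0.04045, 0.5, 1.0)
    @assert isapprox(pow12_5_sketch(x), x^2.4; rtol=1e-12)
end
```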

@kimikage
Collaborator Author

I'm going to change the significant digits of the numbers in the sRGB matrices from ∼7 to ∼16. (cf. #355 (comment)) Paradoxically, this supports the case that accuracy can be relaxed for precision types lower than Float64.
Strictly speaking, this will be a breaking change (cf. #361), but I think it is negligible because the current numbers do not follow the sRGB specification either.

Is there a problem?

@timholy
Member

timholy commented Nov 27, 2019

Sounds good to me. Once we've made one breaking change (and recent changes in FixedPointNumbers probably qualify, I have a PR I should submit for ColorVectorSpace that fixes one of the tests), there's little reason to hold back. Might as well get things "right."

@kimikage
Collaborator Author

BTW, even without full color gamut support, the conversion matrices can be obtained dynamically (during precompilation) once the color primaries are defined. (cf. #372 (comment))
Although some errors occur in the calculation of the inverse matrix, there is no problem perceptually.
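For illustration, deriving the linear-RGB -> XYZ matrix from chromaticities takes only a few lines; the sRGB primaries and D65 white point are used below, and the function names are hypothetical, not the Colors.jl API:

```julia
using LinearAlgebra

# Chromaticity (x, y) to an XYZ vector with Y normalized to 1.
xy_to_XYZ(x, y) = [x / y, 1.0, (1 - x - y) / y]

# Scale each primary so that equal RGB input maps to the white point.
function rgb_to_xyz_matrix(r, g, b, w)
    P = hcat(xy_to_XYZ(r...), xy_to_XYZ(g...), xy_to_XYZ(b...))
    S = P \ xy_to_XYZ(w...)
    return P * Diagonal(S)
end

# sRGB primaries and the D65 white point.
M = rgb_to_xyz_matrix((0.64, 0.33), (0.30, 0.60), (0.15, 0.06), (0.3127, 0.3290))
```

`M * [1, 1, 1]` then reproduces the D65 white point, and `inv(M)` gives the XYZ -> RGB direction; computing the inverse in Float64 keeps the round-trip error well below perceptual thresholds.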
