Skip to content
This repository has been archived by the owner on Jun 1, 2022. It is now read-only.

Extremely poor performance on /api/importSourceLocations #646

Open
simonw opened this issue Jun 8, 2021 · 17 comments
Open

Extremely poor performance on /api/importSourceLocations #646

simonw opened this issue Jun 8, 2021 · 17 comments
Labels
performance Site and API performance optimizations

Comments

@simonw
Copy link
Collaborator

simonw commented Jun 8, 2021

@bryanculbertson says:

I think there is a severe performance degradation with importSourceLocations. I was able to load source locations without match actions by reducing batch size from 500 -> 50, but I cannot load source locations when there are match actions.

https://discord.com/channels/799147121357881364/813861006718926848/851940797959110706

@simonw simonw added the performance Site and API performance optimizations label Jun 8, 2021
@simonw
Copy link
Collaborator Author

simonw commented Jun 8, 2021

vial/vaccinate/api/views.py

Lines 300 to 302 in 1a797e4

if safe_to_match and matched_location is not None:
source_location.matched_location = matched_location
source_location.save()

That can use update_fields: https://docs.djangoproject.com/en/3.2/ref/models/instances/#specifying-which-fields-to-save

simonw added a commit that referenced this issue Jun 8, 2021
@simonw
Copy link
Collaborator Author

simonw commented Jun 8, 2021

From Discord:

Here is an a trace of one upload where it takes 7.5s to import 50 source locations: https://ui.honeycomb.io/vaccinateca/datasets/vial-production/result/bLxZa41fDZ/trace/h2HXqorB5XX
Slowest query looks to be an existence check on concordances that is taking ~200ms!

Perhaps there are missing indices? Or this table needs some vacuuming?

SELECT *
FROM concordance_identifier
WHERE authority = %s AND identifier = %s
LIMIT 21

@simonw
Copy link
Collaborator Author

simonw commented Jun 8, 2021

That query running slow makes no sense, since it's against an index:

CREATE UNIQUE INDEX core_concordanceidentifier_source_identifier_0ccc1cfe_uniq
  ON public.concordance_identifier USING btree (authority, identifier)

@simonw
Copy link
Collaborator Author

simonw commented Jun 8, 2021

I built this dashboard to see if we need to vaccum - but it seems to indicate that the tables in question here have been auto-vacuumed recently: https://vial.calltheshots.us/dashboard/show-database-bloat/

@simonw
Copy link
Collaborator Author

simonw commented Jun 8, 2021

Maybe we need to run ANALYZE on a bunch of tables to update some statistics?

@simonw
Copy link
Collaborator Author

simonw commented Jun 8, 2021

@bryanculbertson
Copy link

bryanculbertson commented Jun 8, 2021

These 3 sections of code are affected when there is a match action:

  1. matched_location = Location.objects.get(public_id=record.match.id)
  2. vial/vaccinate/api/views.py

    Lines 301 to 302 in 1a797e4

    source_location.matched_location = matched_location
    source_location.save()
  3. vial/vaccinate/api/views.py

    Lines 324 to 327 in 1a797e4

    new_concordances = source_location.concordances.difference(
    matched_location.concordances.all()
    )
    matched_location.concordances.add(*new_concordances)

@simonw
Copy link
Collaborator Author

simonw commented Jun 8, 2021

I'm going to run these:

vacuum analyse public.api_log;
vacuum analyse public.source_location; -- took 15s
vacuum analyse public.source_location_match_history; -- took 154ms
vacuum analyse public.concordance_identifier; -- took 1.1s
vacuum analyse public.concordance_location; -- took 628ms
vacuum analyse public.concordance_source_location; -- took 526ms

@bryanculbertson
Copy link

bryanculbertson commented Jun 8, 2021

Are you running the UK version of Postgres? :)

VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ table_name [ (column_name [, ...] ) ] ]

https://www.postgresql.org/docs/12/sql-vacuum.html

@simonw
Copy link
Collaborator Author

simonw commented Jun 8, 2021

Hah, TIL PostgreSQL speaks US and UK English!

@bryanculbertson
Copy link

bryanculbertson commented Jun 8, 2021

Since Velma matching still works, I suspect the problem is concordance updating.

Velma does not update concordances (a separate problem), and so that is the biggest difference between import and update.

@simonw
Copy link
Collaborator Author

simonw commented Jun 8, 2021

This is the code most likely to be performing poorly:

vial/vaccinate/api/views.py

Lines 311 to 327 in d90d512

for link in links:
identifier, _ = ConcordanceIdentifier.objects.get_or_create(
authority=link.authority, identifier=link.id
)
identifier.source_locations.add(source_location)
if safe_to_match:
if record.match is not None and record.match.action == "new":
matched_location = build_location_from_source_location(
source_location, None
)
if matched_location is not None:
new_concordances = source_location.concordances.difference(
matched_location.concordances.all()
)
matched_location.concordances.add(*new_concordances)

@simonw
Copy link
Collaborator Author

simonw commented Jun 9, 2021

I'm suspicious of that source_location.concordances.difference(...) bit. https://docs.djangoproject.com/en/3.2/ref/models/querysets/#difference

@simonw
Copy link
Collaborator Author

simonw commented Jun 9, 2021

I tried this in the debugger:

print(source_location.concordances.difference(matched_location.concordances.all()).query)

And got back this SQL:

SELECT
  "concordance_identifier"."id",
  "concordance_identifier"."created_at",
  "concordance_identifier"."authority",
  "concordance_identifier"."identifier"
FROM
  "concordance_identifier"
  INNER JOIN "concordance_source_location" ON (
    "concordance_identifier"."id" = "concordance_source_location"."concordanceidentifier_id"
  )
WHERE
  "concordance_source_location"."sourcelocation_id" = 5

EXCEPT
  (
    SELECT
      "concordance_identifier"."id",
      "concordance_identifier"."created_at",
      "concordance_identifier"."authority",
      "concordance_identifier"."identifier"
    FROM
      "concordance_identifier"
      INNER JOIN "concordance_location" ON (
        "concordance_identifier"."id" = "concordance_location"."concordanceidentifier_id"
      )
    WHERE
      "concordance_location"."location_id" = 199
  )

@simonw
Copy link
Collaborator Author

simonw commented Jun 9, 2021

Explain: https://vial.calltheshots.us/dashboard/?sql=explain+SELECT%0D%0A++%22concordance_identifier%22.%22id%22%2C%0D%0A++%22concordance_identifier%22.%22created_at%22%2C%0D%0A++%22concordance_identifier%22.%22authority%22%2C%0D%0A++%22concordance_identifier%22.%22identifier%22%0D%0AFROM%0D%0A++%22concordance_identifier%22%0D%0A++INNER+JOIN+%22concordance_source_location%22+ON+(%0D%0A++++%22concordance_identifier%22.%22id%22+%3D+%22concordance_source_location%22.%22concordanceidentifier_id%22%0D%0A++)%0D%0AWHERE%0D%0A++%22concordance_source_location%22.%22sourcelocation_id%22+%3D+5%0D%0A%0D%0AEXCEPT%0D%0A++(%0D%0A++++SELECT%0D%0A++++++%22concordance_identifier%22.%22id%22%2C%0D%0A++++++%22concordance_identifier%22.%22created_at%22%2C%0D%0A++++++%22concordance_identifier%22.%22authority%22%2C%0D%0A++++++%22concordance_identifier%22.%22identifier%22%0D%0A++++FROM%0D%0A++++++%22concordance_identifier%22%0D%0A++++++INNER+JOIN+%22concordance_location%22+ON+(%0D%0A++++++++%22concordance_identifier%22.%22id%22+%3D+%22concordance_location%22.%22concordanceidentifier_id%22%0D%0A++++++)%0D%0A++++WHERE%0D%0A++++++%22concordance_location%22.%22location_id%22+%3D+199%0D%0A++)%3A6sjJVzKgmUCqhzyLFJUGTa5eo_RS7ac8lofQJDWd7cs

HashSetOp Except  (cost=0.85..90.17 rows=3 width=372)
  ->  Append  (cost=0.85..90.10 rows=7 width=372)
"        ->  Subquery Scan on ""*SELECT* 1""  (cost=0.85..36.79 rows=3 width=52)"
              ->  Nested Loop  (cost=0.85..36.76 rows=3 width=48)
                    ->  Index Scan using concordance_source_location_sourcelocation_id_7edfcb49 on concordance_source_location  (cost=0.42..11.44 rows=3 width=4)
                          Index Cond: (sourcelocation_id = 5)
                    ->  Index Scan using core_concordanceidentifier_pkey on concordance_identifier  (cost=0.42..8.44 rows=1 width=48)
                          Index Cond: (id = concordance_source_location.concordanceidentifier_id)
"        ->  Subquery Scan on ""*SELECT* 2""  (cost=0.84..53.28 rows=4 width=52)"
              ->  Nested Loop  (cost=0.84..53.24 rows=4 width=48)
                    ->  Index Scan using concordance_location_location_id_ec52acad on concordance_location  (cost=0.42..19.48 rows=4 width=4)
                          Index Cond: (location_id = 199)
                    ->  Index Scan using core_concordanceidentifier_pkey on concordance_identifier concordance_identifier_1  (cost=0.42..8.44 rows=1 width=48)
                          Index Cond: (id = concordance_location.concordanceidentifier_id)

@simonw
Copy link
Collaborator Author

simonw commented Jun 9, 2021

From https://docs.djangoproject.com/en/3.2/ref/models/relations/#django.db.models.fields.related.RelatedManager.add

For many-to-many relationships add() accepts either model instances or field values, normally primary keys, as the *objs argument.

Using primary keys could speed things up a lot. I need an efficient way to find the primary keys of the concordance identifiers which are on source_location but not on matched_location.

@simonw
Copy link
Collaborator Author

simonw commented Jun 9, 2021

This is slightly more efficient:

            if matched_location is not None:
                new_concordances = list(
                    source_location.concordances.values_list(
                        "pk", flat=True
                    ).difference(matched_location.concordances.all())
                )
                matched_location.concordances.add(*new_concordances)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
performance Site and API performance optimizations
Projects
None yet
Development

No branches or pull requests

2 participants