Discrepancy to RKI data of yesterday while today is accurate. #491

Open
steg123 opened this issue Jan 27, 2021 · 7 comments

Comments

@steg123

steg123 commented Jan 27, 2021

I have been using your data and realized that the accumulated figure of your RKI data for cases of today is in line with the reports from the RKI.

However, the number from yesterday differs substantially. Hence, if I am looking at differences I get other results than what is officially reported.

Why?

Number of cases total from your csv:
https://raw.githubusercontent.com/jgehrcke/covid-19-germany-gae/master/cases-rki-by-state.csv
2151198 (25.1.)
2161279 (26.1.)
Difference 10081

Official numbers from RKI
2148077 (26.1. 0:00)
2161275 (27.1. 0:00)
Difference 13198

I don't care about the date attribution. However, t0 and t-1 should be equal in both datasets.

Thank you very much for your work! I really appreciate it and just want to find out why the numbers don't add up for me. Thanks!
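
For anyone reproducing this comparison, the day-over-day difference can be derived from the cumulative column of the CSV. The following is a minimal sketch, assuming the file has a timestamp column named `time_iso8601` and a cumulative total column named `sum_cases`. Both column names and the helper `daily_differences` are assumptions made for illustration, so check them against the actual CSV header. The two sample rows reuse the totals quoted above.

```python
import csv
import io

def daily_differences(csv_text, time_col="time_iso8601", total_col="sum_cases"):
    """Return (timestamp, cumulative total, difference to previous row) tuples.

    The column names are assumptions; adjust them to the real CSV header.
    """
    result = []
    previous = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        total = int(row[total_col])
        diff = None if previous is None else total - previous
        result.append((row[time_col], total, diff))
        previous = total
    return result

# Inline sample holding the two totals quoted in this comment.
# The date format is simplified for illustration.
sample = """time_iso8601,sum_cases
2021-01-25,2151198
2021-01-26,2161279
"""

for timestamp, total, diff in daily_differences(sample):
    print(timestamp, total, diff)
```

The first row has no predecessor, so its difference is `None`; the second row yields the 10081 reported above.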

@jgehrcke
Owner

@steg123 thank you so much for this report! I wanted to, and should have, investigated immediately, but I have been moving to another city this week and am super occupied with that.

I'll try to get to this asap, but it might not be before next week. Until then: can you do more investigation? That would help! Is the same systematic problem happening every day now?

@steg123
Author

steg123 commented Jan 29, 2021

Yes, I think this is an ongoing issue.

I checked the values for today and yesterday and can find a similar deviation.

Data from your csv https://raw.githubusercontent.com/jgehrcke/covid-19-germany-gae/master/cases-rki-by-state.csv
27.1. 2183451
28.1. 2192850
Difference 9399

Official RKI Data from webpage https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html
28.1. 2178828
29.1. 2192850
Difference 14022

Difference of difference is 4623. (It was 3117 for the other example.)

Is the grand total in your csv calculated from individual values? Maybe there is an error in this calculation?

I appreciate you looking into this matter. Thanks!

@steg123
Author

steg123 commented Jan 30, 2021

RKI Website
29. 2192850
30. 2205171
Difference 12321

Your csv
29. 2197128
30. 2205171
Difference 8043

Diff of diff 4278
It's not increasing, which makes it unlikely to be a fixed one time error. My bet is on an aggregation calculation.
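
The three comparisons reported so far can be checked with a few lines of arithmetic. This sketch only restates the totals quoted in this thread; the labels, variable names, and the helper `diff_of_diff` are made up for illustration.

```python
# Each entry: label, (previous day, current day) totals from the CSV,
# and (previous day, current day) totals from Fallzahlen.html, as quoted above.
comparisons = [
    ("reported Jan 27", (2151198, 2161279), (2148077, 2161275)),
    ("reported Jan 29", (2183451, 2192850), (2178828, 2192850)),
    ("reported Jan 30", (2197128, 2205171), (2192850, 2205171)),
]

def diff_of_diff(csv_pair, website_pair):
    """Return (CSV daily difference, website daily difference, gap between them)."""
    csv_diff = csv_pair[1] - csv_pair[0]
    website_diff = website_pair[1] - website_pair[0]
    return csv_diff, website_diff, website_diff - csv_diff

for label, csv_pair, website_pair in comparisons:
    csv_diff, website_diff, gap = diff_of_diff(csv_pair, website_pair)
    print(f"{label}: csv {csv_diff}, website {website_diff}, gap {gap}")
```

The gaps come out as 3117, 4623 and 4278. In all three comparisons the current-day totals agree (exactly, or to within 4 cases in the first one), so the gap stems almost entirely from the previous-day total being higher in the CSV than on the website.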

@jgehrcke
Owner

I wonder about this Fallzahlen.html source and what to expect from it, compared to Risklayer and ArcGIS.

Official RKI Data from webpage https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html

This source (let's call it Fallzahlen.html) could be the "problem" here. I have never really paid much attention to this specific website so far which is why this is a little new to me.

I think a more interesting and worthwhile RKI data source than this Fallzahlen.html is their ArcGIS system (or: I wonder in which case one would look at this website instead of at their Corona dashboard which is driven by the ArcGIS system).

I also assume that the ArcGIS system is their primary data source, and everything else (including Fallzahlen.html) is derived from it.

The RKI data in this repository are based on said ArcGIS system. The ArcGIS database is continuously being updated with RKI data. Notably, cases and deaths are continuously being added for days that are long ago.

It's that ArcGIS database that drives the RKI Corona dashboard.

I think it's expected that these two time series can be pretty different when you compare them:

  1. a time series built from looking at Fallzahlen.html every day (where the data for individual days in the past never changes)
  2. the time series read out of the RKI ArcGIS system (where the entire time series may change every day, as data gets better over time)

Instead of (1), I think you might actually want to look at (2), which is what this repo simplifies :-). I.e., this issue might be about the very reason why this repository is valuable!

In other words:

  1. 2192850 (Fallzahlen.html, value for the 29th)
  2. 2197128 (CSV in this repository, value for the 29th)

This means the RKI added ~5000 cases for the 29th, but only on the 30th, a day after you took the value from the website. That update, however, is only reflected in the ArcGIS database.

Two more aspects:

Does this in any way help and make sense? Or is this confusing and does not make sense? The last thing I want to do is talk an actual issue down. ;-)

@steg123
Author

steg123 commented Feb 1, 2021

"I also assume that the ArcGIS system is their primary data source, and everything else (including Fallzahlen.html) is derived from it."

If this were true, the numbers would add up.

In addition, most (every?) current media article on new infections will refer to the difference reported explicitly in Fallzahlen.html.

I do understand numbers are updated by RKI in retrospect. However, it makes no sense to me to provide two figures which claim to be up to date "official" RKI data, but differ substantially. This does not necessarily mean there is any calculation error on your part.

The numbers commonly used in media are not the ones you provide here in the csv. This may be as you explained. However, I find the naming strangely misleading and using your csv will lead to counterintuitive differences with the most up-to-date and official announcement of the RKI.

"This means the RKI added ~5000 cases for the 29th, but only on the 30th, a day after you took the value from the website."

All values "I took" are taken at the exact same time.

Fallzahlen.html reports both the new cumulative number of infections and the difference to yesterday's number. Hence both numbers are up to date and official (and as such are used by many media outlets).

On your two more aspects.
If the data you provide is not in line with what it suggests it should not be published.
I will read up on #227. :-)

Thanks for replying. I really appreciate it.

I will stop using the RKI Data in your csv as either we are on to something big (RKI knowingly reporting artificially inflated new infection numbers) or I fail to understand what's going on.

@jgehrcke
Owner

jgehrcke commented Feb 9, 2021

"If the data you provide is not in line with what it suggests it should not be published."

Well. Strong words.

What we know: there are two RKI sources:

  • the RKI ArcGIS system (which I think I understand rather well)
  • Fallzahlen.html (which is a bit of a mystery to me)

I think the hypothesis is that there is a difference between the two of them, affecting the last 1..N days. Where does this difference come from -- what does it mean? Is it good or bad? Well, @steg123 what do you think about these questions? It looks like you're mainly interested in what the RKI publishes on Fallzahlen.html.

I could guess that the RL (Risklayer) data set is very close to that, or even precisely reflects the RKI's Fallzahlen.html. That is, it could be that the RKI publishes the most recent unverified data on Fallzahlen.html, based on the RL crowdsourcing effort (or a comparable effort they implemented themselves), which is after all the fastest information flow path from the Gesundheitsaemter to a central entity. Can you maybe follow up on that? I would appreciate it if you could shed a little more light on this. Thanks!

Can you maybe also have a look at #369?

And if @mathiasflick or @alexgit2k would like to chime in here: that would also be appreciated.

I can only repeat that I do not fully understand how https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html relates to the other data sources that I know much more about.

@mathiasflick

When I began to analyse the COVID-19 situation almost a year ago, I started with the famous "Fallzahlen.html" of the RKI. And the problem described by @steg123 (but most likely not fully understood, unless I am wrong ...) was exactly the reason that brought me to this GitHub repository (thanks again to @jgehrcke!).
Fallzahlen.html just gives the count and net difference in "Meldungen" (reports) that reached the RKI within the last 24 hours (for Germany and the "Bundesländer", regarding cases and deaths), regardless of the timestamps provided by the "Gesundheitsämter" (i.e. the "Meldedatum"). By the way: this information is also depicted in the dashboard.
But: every single "Meldung" will be reflected in the (ArcGIS) database, i.e. registered with the correct timestamp, and therefore builds valid time series with respect to the "Meldedatum". And there are latencies, corrections, etc.! Just to remind you of the reporting chain: individual case -> ("Gesundheitsamt" <-> "Labor") -> "Gesundheitsministerium Bundesland" -> RKI. These "registered" entries in the database form the basis for all timelines depicted in the dashboard. And, of course, these timelines give more detailed information regarding the "real" course of the infection (including computation of dependent indicators, etc.).
Just to sum it up: the "Fallzahlen" values of a single report will actually be spread over several days (usually the last few, but in rare cases even weeks back for single cases ...).
Greetings from Cologne
Mathias
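
The mechanism described in this comment can be illustrated with a toy model. This is only a sketch with invented numbers, not the RKI's actual pipeline: each hypothetical report carries the day it reached the RKI and its Meldedatum, and the two functions (names made up here) build the cumulative total once by day of receipt, as a daily published figure would, and once by Meldedatum, as a database read-out would.

```python
from datetime import date

# Hypothetical reports: (day received by RKI, Meldedatum, number of cases).
# All values are invented for illustration.
reports = [
    (date(2021, 1, 28), date(2021, 1, 28), 900),
    (date(2021, 1, 29), date(2021, 1, 29), 700),
    (date(2021, 1, 29), date(2021, 1, 28), 300),  # late report for the 28th
    (date(2021, 1, 30), date(2021, 1, 30), 600),
    (date(2021, 1, 30), date(2021, 1, 29), 400),  # late report for the 29th
    (date(2021, 1, 30), date(2021, 1, 27), 50),   # correction reaching further back
]

def cumulative_by_receipt(day):
    """Total as it would be published on `day`, never revised afterwards."""
    return sum(n for received, _, n in reports if received <= day)

def cumulative_by_meldedatum(day, as_of):
    """Total up to Meldedatum `day`, read from the database state on `as_of`."""
    return sum(n for received, melde, n in reports
               if received <= as_of and melde <= day)

latest = date(2021, 1, 30)
for day in (date(2021, 1, 29), date(2021, 1, 30)):
    print(day, cumulative_by_receipt(day), cumulative_by_meldedatum(day, as_of=latest))
```

In this toy data both totals agree for the latest day (2950), while the Meldedatum-based total for the previous day (2350) is higher than the figure published on that day (1900). The day-over-day difference is therefore 1050 by receipt but only 600 by Meldedatum, which is the same pattern as in the numbers reported earlier in this thread.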
