Discrepancy to RKI data of yesterday while today is accurate. #491
Comments
@steg123 thank you so much for this report! I wanted to, and should have, investigated immediately, but I have been moving to another city this week and am fully occupied with that. I'll try to get to this asap, but it might not be before next week. Until then: can you investigate further? That would help! Is the same systematic problem happening every day now?
Yes, I think this is an ongoing issue. I checked the values for today and yesterday and can find a similar deviation.
Data from your csv:
https://raw.githubusercontent.com/jgehrcke/covid-19-germany-gae/master/cases-rki-by-state.csv
Official RKI data from webpage:
https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html
Difference of difference is 4623. (It was 3117 for the other example.)
Is the grand total in your csv calculated from individual values? Maybe there is an error in this calculation?
I appreciate you looking into this matter. Thanks!
RKI website vs. your csv: diff of diff is 4278.
I wonder about this Fallzahlen.html source and what to expect from it, compared to Risklayer and ArcGIS.
This source (let's call it Fallzahlen.html) could be the "problem" here. I have never paid much attention to this specific website, which is why this is a little new to me. I think the RKI's ArcGIS system is a more interesting and worthwhile data source than Fallzahlen.html (or: I wonder in which case one would look at this website instead of at their Corona dashboard, which is driven by the ArcGIS system). I also assume that the ArcGIS system is their primary data source, and everything else (including Fallzahlen.html) is derived from it.
The RKI data in this repository are based on said ArcGIS system. The ArcGIS database is continuously being updated with RKI data. Notably, cases and deaths are continuously being added for days that are long past. It's that ArcGIS database that drives the RKI Corona dashboard.
I think it's expected that these two time series can be pretty different when you compare them:
Instead of (1), I think you might actually want to look at (2), which is what this repo simplifies :-). I.e., this issue might be about the very reason why this repository is valuable! In other words:
This means the RKI added ~5000 cases for the 29th, but only on the 30th, a day after you took the value from the website. That update, however, is only reflected in the ArcGIS database. Two more aspects:
Does this help in any way and make sense? Or is it confusing? The last thing I want to do is talk an actual issue down. ;-)
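To make the back-dating idea concrete, here is a minimal Python sketch. It uses only the four totals from the original report (26.1./27.1.), not the 29th/30th figures, and it rests on an assumption that has not been verified in this thread: that the csv value for a day is a later, revised version of what Fallzahlen.html published for that same day. The class and variable names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DayTotal:
    """Cumulative case count for one reporting day, seen at two points in time."""

    as_published: int  # total shown on Fallzahlen.html (assumed: the first publication)
    as_revised: int    # total for the same day in the csv (assumed: read after revisions)

    @property
    def backdated(self) -> int:
        # Cases attributed to this day only after the first publication.
        return self.as_revised - self.as_published


# Totals quoted in the original report; pairing them by day is an assumption.
day_before = DayTotal(as_published=2148077, as_revised=2151198)
latest_day = DayTotal(as_published=2161275, as_revised=2161279)

increase_published = latest_day.as_published - day_before.as_published
increase_revised = latest_day.as_revised - day_before.as_revised
gap = increase_published - increase_revised

# The gap between the two "new cases" figures is exactly the back-dated
# cases of the earlier day minus those of the latest day.
assert gap == day_before.backdated - latest_day.backdated

print("increase from published totals:", increase_published)
print("increase from revised totals:  ", increase_revised)
print("back-dated cases, day before:  ", day_before.backdated)
print("back-dated cases, latest day:  ", latest_day.backdated)
```

Under that assumption, almost the whole gap comes from cases added to the earlier day after it was first published, while the latest day has barely been revised yet.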
"I also assume that the ArcGIS system is their primary data source, and everything else (including Fallzahlen.html) is derived from it."
If this were true, the numbers would add up. In addition, most (every?) current media article on new infections refers to the difference reported explicitly in Fallzahlen.html. I do understand that numbers are updated by the RKI in retrospect. However, it makes no sense to me to provide two figures which both claim to be up-to-date "official" RKI data, but differ substantially. This does not necessarily mean there is any calculation error on your part. The numbers commonly used in the media are not the ones you provide here in the csv. This may be as you explained. However, I find the naming strangely misleading, and using your csv will lead to counterintuitive differences from the most up-to-date and official announcement of the RKI.
"This means the RKI added ~5000 cases for the 29th, but only on the 30th, a day after you took the value from the website."
All values "I took" were taken at the exact same time. Fallzahlen.html reports both the new cumulative number of infections and the difference to yesterday's number. Hence both numbers are up to date and official (and as such are used by many media outlets).
On your two more aspects.
Thanks for replying. I really appreciate it. I will stop using the RKI data in your csv, as either we are on to something big (the RKI knowingly reporting artificially inflated new infection numbers) or I fail to understand what's going on.
Well. Strong words. What we know: there are two RKI sources:
I think the hypothesis is that there is a difference between the two of them, affecting the last 1..N days. Where does this difference come from -- what does it mean? Is it good or bad?
Well, @steg123, what do you think about these questions? It looks like you're mainly interested in what the RKI publishes on Fallzahlen.html. My guess is that the RL data set is very close to that, or even precisely reflects the RKI's Fallzahlen.html. That is, it could be that the RKI publishes the most recent unverified data on Fallzahlen.html, based on the RL crowdsourcing effort (or a comparable effort they implemented themselves), which is after all the fastest information flow path from the Gesundheitsaemter to a central entity. Can you maybe follow up on that? I would appreciate it if you could shed a little more light on this. Thanks! Can you maybe also have a look at #369?
And if @mathiasflick or @alexgit2k would like to chime in here: that would also be appreciated. I can only repeat that I do not fully understand how https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html relates to the other data sources that I know much more about.
When I began to analyse the COVID-19 situation almost a year ago, I started with the famous "Fallzahlen.html" of the RKI. And the problem described (but most likely not fully understood - or I am wrong ...) by @steg123 was exactly the reason that brought me to this GitHub repo (Thanks again to @jgehrcke !).
I have been using your data and realized that the accumulated figure of your RKI data for cases of today is in line with the reports from the RKI.
However, the number from yesterday differs substantially. Hence, if I look at day-over-day differences, I get different results than what is officially reported.
Why?
Number of cases total from your csv:
https://raw.githubusercontent.com/jgehrcke/covid-19-germany-gae/master/cases-rki-by-state.csv
2151198 (25.1.)
2161279 (26.1.)
Difference 10081
Official numbers from RKI:
2148077 (26.1. 0:00)
2161275 (27.1. 0:00)
Difference 13198
I don't care about the date attribution. However, t0 and t-1 should be equal in both datasets.
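The comparison above can be written down as a short Python sketch. It uses only the four totals quoted in this report; the labels `t-1` and `t0` follow the wording above, and pairing the csv dates (25.1., 26.1.) with the RKI timestamps (26.1. 0:00, 27.1. 0:00) is my assumption, since the date attribution differs between the two sources. The dictionary and function names are made up for illustration.

```python
# Cumulative case totals quoted in this report.
csv_totals = {"t-1": 2151198, "t0": 2161279}      # cases-rki-by-state.csv: 25.1., 26.1.
website_totals = {"t-1": 2148077, "t0": 2161275}  # RKI website: 26.1. 0:00, 27.1. 0:00


def daily_increase(totals):
    """New cases between the two consecutive cumulative totals."""
    return totals["t0"] - totals["t-1"]


csv_diff = daily_increase(csv_totals)
website_diff = daily_increase(website_totals)
diff_of_diff = website_diff - csv_diff

# How far apart the two sources are at each of the two points in time.
gap_t0 = csv_totals["t0"] - website_totals["t0"]
gap_t_minus_1 = csv_totals["t-1"] - website_totals["t-1"]

print("increase according to csv:    ", csv_diff)
print("increase according to website:", website_diff)
print("difference of differences:    ", diff_of_diff)
print("gap between sources at t0:    ", gap_t0)
print("gap between sources at t-1:   ", gap_t_minus_1)
```

The totals for t0 are nearly identical in both sources, so practically all of the difference of differences comes from the gap at t-1.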
Thank you very much for your work! I really appreciate it and just want to find out why the numbers don't add up for me. Thanks!