Huge difference in Total deaths with other RKI sources. #227
Comments
Hey. Thanks for the inquiry. In short, I am pretty confident that Maybe you can try doing that yourself, by looking at other data sources providing 'time series'? Otherwise, I'll try to get back to you soon, looking again at other places (such as at https://github.com/CSSEGISandData/COVID-19 and also at the Risklayer-provided time series data about deaths).
Thank you for the quick response. Could you maybe link your data on the reported deaths? I can't find it in your documentation, and the deaths are not listed per date on the RKI's ArcGIS. Thanks so much in advance.
I just reviewed the data presented by the JHU again, which is, to my knowledge, said to be the most accurate source in terms of taking reporting delays into account.
Okay. Thanks for doing that. That of course motivated me to invest a little more time towards the re-re-validation I described above :-).

Key difference between the RL and RKI data sets

The JHU data set, as far as I remember, is largely based on the Risklayer GmbH-initiated crowd-sourcing effort. So, let's ignore JHU for now and look at the RL (Risklayer) data. I tend to explain the advantage of the RL data set as "most credible for now", because they have the fastest pipeline from the individual Gesundheitsamt to their aggregation spreadsheet. In the main README here in this project I therefore describe this (Risklayer) data set as
In other words, the Risklayer data set is a reasonably good source for media to state things about today/yesterday. Now, what's the decisive difference between the Risklayer data set and the RKI data set? Risklayer do not seem to post-process data from the past to the same extent the RKI does. My impression has always been that the RKI constantly updates history based on new insights. They apply corrections to historical data as they come up, and as they have time and resources. These amendments can reach back far into the past (weeks, months). That is, an individual data point (say, the total number of deaths for all Germany for the specific day 2020-03-30) evolves in the RKI data set over time, as they implement more and more corrections to their time series data. For that reason, I describe the RKI data as
in the main README of this project here. About:
"Yes", in the sense that RL/JHU data is good "for today". However, this statement is not true for historical data.

Total number of deaths for all Germany for 2020-03-30

Now, let's look at the specific discrepancy you've pointed out. Let's call the metric of interest (the total number of deaths for all Germany for the specific day 2020-03-30)
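To make the idea of an "evolving data point" concrete, here is a minimal Python sketch. It is not the tooling used for the analysis in this thread, and the names `value_history` and `revision` are made up for illustration. It assumes that several published versions ("snapshots") of a cumulative time series are available, and shows how the value for one fixed day can be tracked across them. Only the two values for 2020-03-31 are taken from this thread (583 from the situation report of that day, 2662 from the time series in this project); the publication date of the later snapshot is a placeholder.

```python
from datetime import date


def value_history(snapshots, day):
    """Return (published_on, value) pairs for one day, ordered by publication date.

    `snapshots` maps a publication date to a mapping {day: cumulative deaths}.
    Snapshots that do not contain `day` are skipped.
    """
    return [
        (published_on, series[day])
        for published_on, series in sorted(snapshots.items())
        if day in series
    ]


def revision(history):
    """Difference between the latest and the earliest published value."""
    if len(history) < 2:
        return 0
    return history[-1][1] - history[0][1]


day = date(2020, 3, 31)

# Hypothetical snapshots. The values 583 and 2662 are the ones quoted in
# this thread; the second publication date is a placeholder, not a fact.
snapshots = {
    date(2020, 3, 31): {day: 583},
    date(2020, 6, 1): {day: 2662},
}

history = value_history(snapshots, day)
for published_on, value in history:
    print(f"as published on {published_on}: {value}")
print("revision:", revision(history))
```

With real data, each snapshot would be one downloaded version of the full time series. A large revision for a day far in the past would then point to retroactive corrections rather than to a plain reporting delay.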
For the record: I added the tooling I used for the analysis/plot above here: #234
Wow! Thank you so much for all the work and help!
Thanks for the response, and the kind words!
Thanks for acking this explicitly -- popularity of data is rarely correlated with its quality :P especially in times of the "AI/ML/data science" hype.
Ha, I guess that's for you to find out then :) Good luck! Maybe -- for the purpose of your thesis -- why don't you reach out to the RKI, asking about the shift, maybe pointing them to this discussion thread here? :) If you do that: please report back -- I'm super curious, and I'd also like to keep the dots connected. About the "quite big impact" -- I understand that certain models for certain dynamics could be rather sensitive to this shift. But (being a physicist myself) I think that it's important to be skeptical here -- to consider that this then might be an over-sensitivity. But yeah, just a superficial intuition. Please keep coming back with good questions when you have them. Cheers!
I'm working on my bachelor's thesis and I am searching for the right parameters for a SIR-like model my professor created for COVID-19.
While your data is really helpful for my research, I noticed huge differences in the total deaths for Germany between this data and the data the RKI released in their daily situation report. For example, for March 31st your data states 2662 deaths in Germany, while the attached report mentions 583 deaths.
2020-03-31-de.pdf
I've noticed that there might be some delay in the reports, as the total number of reported cases in these situation reports is about 2-3 days behind your data. But this gap is too large to be explained by a delay. The total deaths also don't add up to this much until 2 or 3 weeks later. Do you know, or can you think of, any reason why the difference is this large?
If you want to look at all those daily reports (I guess you already know these, but here is the link anyways):
https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Situationsberichte/Gesamt.html
Again thank you very much for your work and also thanks in advance for any answer you might give.
Cheers Jonas