Issue with harvest_dois #8

Open
jameshowison opened this issue May 7, 2024 · 2 comments

@jameshowison

I'm running into an issue where harvest_pmcids works but harvest_dois does not. For pmcids the PDFs are gathered, but for harvest_dois they are not.

I first ran into this with arXiv DOIs, then reproduced it with the DOIs in the test folder of this project.

The symptom is that harvester.diagnostic(full=True) shows "total invalid PDF: 7" when I run with the test DOIs.

Any chance that something is broken in the doi list approach, but not in the pmcids approach?

@kermitt2
Owner

kermitt2 commented May 9, 2024

Hi @jameshowison !

The reason is that arXiv DOIs are not Crossref DOIs but DataCite DOIs. This module only resolves Crossref ones, so it finds 0 PDFs. This is the problem with the multiple new DOI providers, and with the fact that preprint services now use these free DOIs.

I made something specific for arXiv, https://github.com/kermitt2/arxiv_harvester, but it is for creating a full arXiv mirror, not for fetching just a few arXiv PDFs.
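
To check which registration agency issued a DOI before handing it to harvest_dois, one option is the "which RA" lookup at doi.org. The sketch below is not part of this module. It assumes the endpoint https://doi.org/doiRA/<doi> returns a JSON list of objects carrying an "RA" field; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def parse_registration_agency(payload: str) -> Optional[str]:
    """Return the "RA" value from a doiRA JSON response, or None if absent.

    Assumption: the response is a JSON list with one object per DOI,
    and that object holds an "RA" key when the DOI is registered.
    """
    records = json.loads(payload)
    if not records:
        return None
    return records[0].get("RA")


def registration_agency(doi: str, timeout: float = 10.0) -> Optional[str]:
    """Ask doi.org which agency registered `doi` (needs network access)."""
    url = "https://doi.org/doiRA/" + urllib.parse.quote(doi, safe="/")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_registration_agency(response.read().decode("utf-8"))
```

A DOI whose agency is not Crossref could then be skipped or routed elsewhere instead of being counted as a failed PDF download.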

@jameshowison
Author

Hmmm. Two things then,

  1. The DOIs in https://github.com/kermitt2/article_dataset_builder/blob/master/test/dois.txt are also not working for me. Those aren't arXiv DOIs, are they?
  2. Where should the documentation mention the issue with non-Crossref DOIs? Maybe the method should be renamed harvest_crossref_dois? Is there some way to detect DOIs that the module can't obtain?

Looks like the arXiv DOIs work using arxiv_base from the config.harvester file if you strip off everything up to and including arxiv. from the front of the DOIs. E.g.

doi:10.48550/arxiv.1808.06161

works to get direct PDF via

https://arxiv.org/pdf/1808.06161
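
The mapping above can be sketched as a small helper. This is only an illustration, not code from the module: the function name and constants are hypothetical, and the base URL is hard-coded from the example rather than read from arxiv_base in the config file.

```python
from typing import Optional

# Assumptions for this sketch: arXiv DataCite DOIs start with this prefix,
# and the PDF base URL matches the example above.
ARXIV_DOI_PREFIX = "10.48550/arxiv."
ARXIV_PDF_BASE = "https://arxiv.org/pdf/"


def arxiv_pdf_url(doi: str) -> Optional[str]:
    """Map an arXiv DOI to a direct PDF URL, or return None for other DOIs."""
    cleaned = doi.strip()
    # Drop an optional "doi:" scheme prefix.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[len("doi:"):]
    # Compare case-insensitively, since DOIs are case-insensitive.
    if not cleaned.lower().startswith(ARXIV_DOI_PREFIX):
        return None
    arxiv_id = cleaned[len(ARXIV_DOI_PREFIX):]
    if not arxiv_id:
        return None
    return ARXIV_PDF_BASE + arxiv_id


print(arxiv_pdf_url("doi:10.48550/arxiv.1808.06161"))
```

DOIs that return None here are not arXiv DOIs and would still need the Crossref path.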
