```console
$ ./ilri/update_orcids.py -i /tmp/2023-03-14-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
## 2023-03-15
- Jawoo was asking about the possibility of harvesting PDFs from CGSpace for some kind of AI chatbot integration
  - I see we have about 45,000 PDFs (format ID 2)
```console
localhost/dspacetest= ☘ SELECT COUNT(*) FROM bitstream WHERE NOT deleted AND bitstream_format_id=2;
 count
───────
 45281
(1 row)
```
- Rework some of my Python scripts to use a common `db_connect` function from util (a rough sketch of such a helper follows this list)
- I reworked my `post_bitstreams.py` script to be able to overwrite bitstreams if requested
  - The use case is to upload thumbnails for all the journal articles where we have these horrible pixelated journal covers
  - I replaced JPEG thumbnails for ~896 ILRI publications by exporting a list of DOIs from the 10568/3 collection that were CC-BY, getting their PDFs from Sci-Hub, and then posting them with my new script
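Roughly what that shared helper looks like (a sketch assuming psycopg2; the name and defaults here are illustrative, not the exact code in util):

```python
# Sketch of a common db_connect() helper shared by the scripts,
# assuming psycopg2 for PostgreSQL access
import psycopg2


def db_connect(database, user, password, host="localhost", port=5432):
    """Return a PostgreSQL connection, or exit with a clear error."""
    try:
        return psycopg2.connect(
            database=database, user=user, password=password, host=host, port=port
        )
    except psycopg2.OperationalError as e:
        raise SystemExit(f"Could not connect to database: {e}")
```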
## 2023-03-16
- Continue working on the ILRI publication thumbnails
  - There were about sixty-four that had existing PNG "journal cover" thumbnails that didn't get replaced because I only overwrote the JPEG ones yesterday
  - I generated a list of those bitstream UUIDs and deleted them with a shell script via the REST API (a rough sketch of that loop follows the Elsevier example below)
- I made a [pull request on DSpace 7 to update the bitstream format registry for PNG, WebP, and AVIF](https://github.com/DSpace/DSpace/pull/8722)
- Export CGSpace to perform mappings to Initiatives collections
  - I also used this export to find CC-BY items with DOIs that had JPEGs or PNGs in their provenance, meaning that the submitter likely uploaded a low-quality "journal cover" for the item
  - I found about 330 of them, got most of their PDFs from Sci-Hub, and replaced the bad thumbnails with real ones for the ~245 items where Sci-Hub had the PDF
- In related news, I realized you can get an [API key from Elsevier and download the PDFs from their API](https://stackoverflow.com/questions/59202176/python-download-papers-from-sciencedirect-by-doi-with-requests):
```python
import requests

api_key = 'fuuuuuuuuu'
doi = "10.1016/j.foodqual.2021.104362"
request_url = f'https://api.elsevier.com/content/article/doi:{doi}'

# Asking for application/pdf returns the full text as a PDF, if the key has access
headers = {
    'X-ELS-APIKEY': api_key,
    'Accept': 'application/pdf'
}

# Stream the response to disk in 1 MiB chunks instead of buffering it all in memory
with requests.get(request_url, stream=True, headers=headers) as r:
    if r.status_code == 200:
        with open("article.pdf", "wb") as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                f.write(chunk)
```
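The bitstream deletion I mentioned above was a quick shell script, but a rough Python equivalent of the loop would look something like this (a sketch assuming the DSpace 6 REST API, a file with one bitstream UUID per line, and a valid JSESSIONID from `/rest/login`; the file path is hypothetical):

```python
# Sketch: delete a list of bitstreams via the DSpace REST API
import requests

rest_base = "https://dspacetest.cgiar.org/rest"
session = "2B40C7C4E34CEFCF5AFAE4B75A8C52E2"  # JSESSIONID from POST /rest/login

# /tmp/bitstream-uuids.txt is a hypothetical file with one UUID per line
with open("/tmp/bitstream-uuids.txt") as f:
    for uuid in (line.strip() for line in f if line.strip()):
        r = requests.delete(
            f"{rest_base}/bitstreams/{uuid}",
            cookies={"JSESSIONID": session},
        )
        print(f"{uuid}: HTTP {r.status_code}")
```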
- The question is, how do we know if a DOI is Elsevier or not...
- CGIAR Repositories Working Group meeting
  - We discussed controlled vocabularies for funders
  - I suggested checking our combined lists against Crossref and ROR
- Export a list of donors from `cg.contributor.donor` on CGSpace:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=248) to /tmp/2023-03-16-donors.txt;
COPY 1521
```
- Then resolve them against Crossref's funders API:
```console
$ ./ilri/crossref_funders_lookup.py -e fuuuu@cgiar.org -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv -d
$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
472
$ sed 1d ~/Downloads/2023-03-16-cgspace-crossref-funders-results.csv | wc -l
1521
```
- That's a 31% hit rate (471 of 1,521 matched; the csvgrep output above includes the CSV header row), but I see some easy near-misses like "Bill and Melinda Gates Foundation" instead of "Bill & Melinda Gates Foundation"
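Some light normalization before matching would probably catch those, for example (hypothetical — not something `crossref_funders_lookup.py` does):

```python
# Harmonize "and" vs "&" in funder names before querying Crossref,
# which would catch cases like the Gates Foundation example above
def normalize_funder(name):
    return name.replace(" and ", " & ").strip()


print(normalize_funder("Bill and Melinda Gates Foundation"))
# Bill & Melinda Gates Foundation
```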
## 2023-03-17
- I did the same lookup of CGSpace donors on ROR's 2022-12-01 data dump:
```console
$ ./ilri/ror_lookup.py -i /tmp/2023-03-16-donors.txt -o ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv -r v1.15-2022-12-01-ror-data.json
$ csvgrep -c matched -m true ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l
407
$ sed 1d ~/Downloads/2023-03-16-cgspace-ror-funders-results.csv | wc -l
1521
```
- That's a 26.7% hit rate (406 of 1,521)
- As for the number of funders in each dataset:
  - Crossref has about 34,000
  - ROR has 15,000 if "FundRef" data is a proxy for that:
```console
$ grep -c -rsI FundRef v1.15-2022-12-01-ror-data.json
15162
```
- On a related note, I remembered that DOI.org has a list of DOI prefixes and publishers: https://doi.crossref.org/getPrefixPublisher
- In Python I can look up publishers by prefix easily, here with a list comprehension:
```console
In [10]: [publisher for publisher in publishers if '10.3390' in publisher['prefixes']]
Out[10]:
[{'prefixes': ['10.1989', '10.32545', '10.20944', '10.3390', '10.35995'],
  'name': 'MDPI AG',
  'memberId': 1968}]
```
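Since that scans the whole list for every lookup, a dict keyed by prefix is a better fit for bulk use; a minimal sketch, assuming the JSON has been saved locally as `publisher-doi-prefixes.json` (the same file I use in the OpenRefine step below):

```python
# Build a prefix → publisher name dict for O(1) lookups, assuming the
# JSON is an array of objects with "prefixes" and "name" keys
import json

with open("publisher-doi-prefixes.json") as f:
    publishers = json.load(f)

publisher_by_prefix = {
    prefix: publisher["name"]
    for publisher in publishers
    for prefix in publisher["prefixes"]
}

print(publisher_by_prefix.get("10.3390"))  # MDPI AG
```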
- And in OpenRefine, if I create a new column based on the DOI using Jython:
```python
import json

# Note: this expression runs once per row, so the file is re-read and
# re-parsed for every single cell
with open("/home/aorth/src/git/DSpace/publisher-doi-prefixes.json", "rb") as f:
    publishers = json.load(f)

# Assumes the cell value is a DOI URL like https://doi.org/10.3390/foo,
# where the prefix is the fourth path component
doi_prefix = value.split("/")[3]

publisher = [publisher for publisher in publishers if doi_prefix in publisher['prefixes']]

# Guard against prefixes that aren't in the list (leaves the cell blank)
if publisher:
    return publisher[0]['name']
```
- ... though this is very slow and hung OpenRefine when I tried it, presumably because the expression re-reads and re-scans the JSON for every row (the dict lookup sketched above would avoid the repeated scan)
- I added the ability to overwrite multiple bitstream formats at once in `post_bitstreams.py`:
```console
$ ./ilri/post_bitstreams.py -i test.csv -u https://dspacetest.cgiar.org/rest -e fuuu@example.com -p 'fffnjnjn' -d -s 2B40C7C4E34CEFCF5AFAE4B75A8C52E2 --overwrite JPEG --overwrite PNG -n
Session valid: 2B40C7C4E34CEFCF5AFAE4B75A8C52E2
Opened test.csv
384142cb-58b9-4e64-bcdc-0a8cc34888b3: checking for existing bitstreams in THUMBNAIL bundle
> (DRY RUN) Deleting bitstream: IFPRI Malawi_Maize Market Report_February_202_anonymous.pdf.jpg (16883cb0-1fc8-4786-a04f-32132e0617d4)
> (DRY RUN) Deleting bitstream: AgroEcol_Newsletter_2.png (7e9cd434-45a6-4d55-8d56-4efa89d73813)
> (DRY RUN) Uploading file: 10568-129666.pdf.jpg
```
- I learned how to use Python's built-in `logging` module, and it simplifies all my debug and info printing (a minimal example follows this list)
- I refactored a few scripts to use the new logging
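The basic pattern, as a minimal sketch (the exact format strings and levels in my scripts may differ):

```python
# Minimal logging setup: INFO and above go to the console by default,
# and switching level to DEBUG turns on the debug output everywhere
import logging

logging.basicConfig(format="[%(levelname)s] %(message)s", level=logging.INFO)
logger = logging.getLogger(__name__)

logger.debug("Only shown when the level is DEBUG")
logger.info("Replaces ad-hoc print() calls for status output")
```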
## 2023-03-18
- I applied the publisher changes to 16,000 items in batches of 5,000
- While working on my `post_bitstreams.py` script I realized that the Tomcat Crawler Session Manager valve, which groups bot user agents into sessions, was causing my login to fail on the first attempt, every time
  - I've disabled it for now and will check the Munin session graphs after some time to see if it makes a difference
  - In any case, I have much better spider user agent lists in DSpace now than I did years ago when I started using the Crawler Session Manager valve
<!-- vim: set sw=2 ts=2: -->