mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-24 23:50:17 +01:00
120 lines
6.1 KiB
Markdown
120 lines
6.1 KiB
Markdown
---
|
||
title: "June, 2024"
|
||
date: 2024-06-03T14:14:00+03:00
|
||
author: "Alan Orth"
|
||
categories: ["Notes"]
|
||
---
|
||
|
||
## 2024-06-03
|
||
|
||
- Working on IFPRI datasets
|
||
- I noticed the licenses were missing from Nilam's original file so I found a way to check [Dataverse's API for a persistent identifier](https://guides.dataverse.org/en/latest/api/native-api.html#export-metadata-of-a-dataset-in-various-formats)
|
||
- We have both Handles and DOIs for these datasets, both from Harvard's Dataverse
|
||
|
||
<!--more-->
|
||
|
||
- I used this GREL in OpenRefine to create a new column based on URLs using the DOI (uppercasing the DOI for Dataverse):
|
||
|
||
```
|
||
"https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:" + value.split('https://doi.org/')[-1].toUppercase()
|
||
```
|
||
|
||
- Then I was able to extract the license text from the JSON response using:
|
||
|
||
```
|
||
value.parseJson()['datasetVersion']['termsOfUse']
|
||
```
|
||
|
||
- Similar for the Handle...
|
||
|
||
## 2024-06-04
|
||
|
||
- Some Dataverse entries have the license in `['datasetVersion']['license']` instead...
|
||
- I finalized cleaning the 722 IFPRI datasets and uploaded them to CGSpace
|
||
|
||
## 2024-06-14
|
||
|
||
- Minor cleanups on IFPRI's 2016–2019 batch migration file
|
||
- I will start with duplicates on unique identifiers like DOIs
|
||
|
||
## 2026-06-18
|
||
|
||
- Merge and upload metadata for duplicates in IFPRI's 2016–2019 set:
|
||
- 144 exact match on CGSpace via DOI, type, and date
|
||
- 32 with CGSpace handles
|
||
- I also spent some time converting the `ilri/post_bitstreams.py` script to use the DSpace 7 REST API via dspace-rest-client
|
||
- There are 28 PDFs specified for these 176 duplicates, and a handful of them do not already exist on CGSpace so I will upload them
|
||
|
||
## 2024-06-19
|
||
|
||
- Spent some time checking the remaining 3312 IFPRI 2016–2019 migration set for duplicates on CGSpace
|
||
- There seem to be about 50 exact matches of title, type, and issue date
|
||
|
||
## 2024-06-20
|
||
|
||
- Finalize merging and uploading metadata for 48 duplicates from the IFPRI 2016–2019 migration set
|
||
- Heavy load on both CGSpace and DSpace 7 Test this afternoon
|
||
- Took me a while to figure out it was due to someone / something hammering `/search` for a bunch of facets
|
||
- The `pm2 logs` command was more useful than the nginx logs to see the requests at least, for example:
|
||
|
||
```
|
||
0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&spc.page=1&f.accessRights=Open%20Access,equals&f.dateIssued.min=2023&f.dateIssued.max=2024&f.country=Colombia,equals&f.subject=climate%20change,equals&f.region=Latin%20America%20and%20the%20Caribbean,equals&f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
|
||
1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&spc.page=1&f.sponsorship=CGIAR%20Trust%20Fund,equals&f.impactArea=Climate%20adaptation%20and%20mitigation,equals&f.region=Eastern%20Africa,equals&f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
|
||
3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&spc.page=1&f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&f.dateIssued.min=2020&f.dateIssued.max=2021&f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
|
||
```
|
||
|
||
- Still difficult to find the client, because the logs are all [coming from Angular's user agent](https://github.com/DSpace/dspace-angular/issues/2902) and IP
|
||
- I changed the nginx logging to use the `X-Forwarded-For` header, as the default `combined` log format uses `$remote_addr` by default, which is only accurate if the request doesn't come from Angular (ie directly to the API)
|
||
- From what I can see now the IPs are all coming from Huawei Cloud and Tencent
|
||
- The ASNs are AS136907 (Huawei) and AS132203 (Tencent)
|
||
- For now I will just add those to the list of bot networks
|
||
|
||
## 2024-06-21
|
||
|
||
- Update the nginx logging to use [nginx's `real_ip` module](http://nginx.org/en/docs/http/ngx_http_realip_module.html) to log the correct client IP
|
||
- I think this means we will start sending 'bot' to the Angular / Express frontend because bot IPs will be properly classified now...
|
||
- I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in https://github.com/DSpace/dspace-angular/issues/2902 is to pass on the client's user-agent
|
||
- Then I updated the list of bot networks:
|
||
|
||
```console
|
||
$ wget https://asn.ipinfo.app/api/text/list/AS12876 \
|
||
https://asn.ipinfo.app/api/text/list/AS132203 \
|
||
https://asn.ipinfo.app/api/text/list/AS13238 \
|
||
https://asn.ipinfo.app/api/text/list/AS136907 \
|
||
https://asn.ipinfo.app/api/text/list/AS14061 \
|
||
https://asn.ipinfo.app/api/text/list/AS14618 \
|
||
https://asn.ipinfo.app/api/text/list/AS16276 \
|
||
https://asn.ipinfo.app/api/text/list/AS16509 \
|
||
https://asn.ipinfo.app/api/text/list/AS203020 \
|
||
https://asn.ipinfo.app/api/text/list/AS204287 \
|
||
https://asn.ipinfo.app/api/text/list/AS21859 \
|
||
https://asn.ipinfo.app/api/text/list/AS23576 \
|
||
https://asn.ipinfo.app/api/text/list/AS24940 \
|
||
https://asn.ipinfo.app/api/text/list/AS396982 \
|
||
https://asn.ipinfo.app/api/text/list/AS45102 \
|
||
https://asn.ipinfo.app/api/text/list/AS50245 \
|
||
https://asn.ipinfo.app/api/text/list/AS55286 \
|
||
https://asn.ipinfo.app/api/text/list/AS6939 \
|
||
https://asn.ipinfo.app/api/text/list/AS8075
|
||
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
|
||
$ wc -l /tmp/networks.txt
|
||
8675 /tmp/networks.txt
|
||
```
|
||
|
||
- Update list of ORCID identifiers with new ones from Alliance and IFPRI
|
||
- Finalize uploading the remaining 3,264 items from IFPRI's 2016–2019 batch migration to CGSpace
|
||
|
||
## 2024-06-24
|
||
|
||
- Minor updates to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) and [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) to normalize a few more invalid DOI formats
|
||
|
||
## 2024-06-25
|
||
|
||
- Work on uploading some missing PDFs from the IFPRI 2016–2019 batch migration
|
||
|
||
## 2024-06-26
|
||
|
||
- Did a big cleanup of several thousand journal articles based on metadata from Crossref
|
||
|
||
<!-- vim: set sw=2 ts=2: -->
|