cgspace-notes/content/posts/2024-06.md

108 lines
5.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: "June, 2024"
date: 2024-06-03T14:14:00+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-06-03
- Working on IFPRI datasets
- I noticed the licenses were missing from Nilam's original file so I found a way to check [Dataverse's API for a persistent identifier](https://guides.dataverse.org/en/latest/api/native-api.html#export-metadata-of-a-dataset-in-various-formats)
- We have both Handles and DOIs for these datasets, both from Harvard's Dataverse
<!--more-->
- I used this GREL in OpenRefine to create a new column based on URLs using the DOI (uppercasing the DOI for Dataverse):
```
"https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:" + value.split('https://doi.org/')[-1].toUppercase()
```
- Then I was able to extract the license text from the JSON response using:
```
value.parseJson()['datasetVersion']['termsOfUse']
```
- Similar for the Handle...
## 2024-06-04
- Some Dataverse entries have the license in `['datasetVersion']['license']` instead...
- I finalized cleaning the 722 IFPRI datasets and uploaded them to CGSpace
## 2024-06-14
- Minor cleanups on IFPRI's 20162019 batch migration file
- I will start with duplicates on unique identifiers like DOIs
## 2026-06-18
- Merge and upload metadata for duplicates in IFPRI's 20162019 set:
- 144 exact match on CGSpace via DOI, type, and date
- 32 with CGSpace handles
- I also spent some time converting the `ilri/post_bitstreams.py` script to use the DSpace 7 REST API via dspace-rest-client
- There are 28 PDFs specified for these 176 duplicates, and a handful of them do not already exist on CGSpace so I will upload them
## 2024-06-19
- Spent some time checking the remaining 3312 IFPRI 20162019 migration set for duplicates on CGSpace
- There seem to be about 50 exact matches of title, type, and issue date
## 2024-06-20
- Finalize merging and uploading metadata for 48 duplicates from the IFPRI 20162019 migration set
- Heavy load on both CGSpace and DSpace 7 Test this afternoon
- Took me a while to figure out it was due to someone / something hammering `/search` for a bunch of facets
- The `pm2 logs` command was more useful than the nginx logs to see the requests at least, for example:
```
0|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&spc.page=1&f.accessRights=Open%20Access,equals&f.dateIssued.min=2023&f.dateIssued.max=2024&f.country=Colombia,equals&f.subject=climate%20change,equals&f.region=Latin%20America%20and%20the%20Caribbean,equals&f.publisher=CGIAR%20FOCUS%20Climate%20Security,equals - - ms - -
1|dspace-ui | GET /search?f.accessRights=Open%20Access,equals&spc.page=1&f.sponsorship=CGIAR%20Trust%20Fund,equals&f.impactArea=Climate%20adaptation%20and%20mitigation,equals&f.region=Eastern%20Africa,equals&f.publisher=International%20Institute%20of%20Tropical%20Agriculture,equals - - ms - -
3|dspace-ui | GET /search?f.sdg=SDG%2013%20-%20Climate%20action,equals&f.sdg=SDG%2012%20-%20Responsible%20consumption%20and%20production,equals&spc.page=1&f.affiliation=CGIAR%20Research%20Program%20on%20Climate%20Change,%20Agriculture%20and%20Food%20Security,equals&f.affiliation=Alliance%20of%20Bioversity%20International%20and%20CIAT,equals&f.dateIssued.min=2020&f.dateIssued.max=2021&f.impactArea=Environmental%20health%20and%20biodiversity,equals - - ms - -
```
- Still difficult to find the client, because the logs are all [coming from Angular's user agent](https://github.com/DSpace/dspace-angular/issues/2902) and IP
- I changed the nginx logging to use the `X-Forwarded-For` header, as the default `combined` log format uses `$remote_addr` by default, which is only accurate if the request doesn't come from Angular (ie directly to the API)
- From what I can see now the IPs are all coming from Huawei Cloud and Tencent
- The ASNs are AS136907 (Huawei) and AS132203 (Tencent)
- For now I will just add those to the list of bot networks
## 2024-06-21
- Update the nginx logging to use [nginx's `real_ip` module](http://nginx.org/en/docs/http/ngx_http_realip_module.html) to log the correct client IP
- I think this means we will start sending 'bot' to the Angular / Express frontend because bot IPs will be properly classified now...
- I will have to re-work or at least re-think that nginx configuration for requests going to the frontend because the proposed fix in https://github.com/DSpace/dspace-angular/issues/2902 is to pass on the client's user-agent
- Then I updated the list of bot networks:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS132203 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS136907 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS14618 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS21859 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS396982 \
https://asn.ipinfo.app/api/text/list/AS45102 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS8075
$ cat AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
8675 /tmp/networks.txt
```
- Update list of ORCID identifiers with new ones from Alliance and IFPRI
- Finalize uploading the remaining 3,264 items from IFPRI's 20162019 batch migration to CGSpace
<!-- vim: set sw=2 ts=2: -->