---
title: "August, 2022"
date: 2022-08-01T10:22:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-08-01
- Our request to add [CC-BY-3.0-IGO to SPDX](https://github.com/spdx/license-list-XML/issues/1525) was approved a few weeks ago
<!--more-->
## 2022-08-02
- Resume working on the MARLO Innovations
- Last week Jose had sent me an updated CSV with UTF-8 formatting, which was missing the filename column
- I joined it with the older file (stripped down to just the `cg.number` and `filename` columns) and then did the same cleanups I had done last week
- I noticed there are six unused PDFs, so I asked Jose about them
- Spent some time trying to understand the REST API submission issues that Rafael from CIAT is having with tip-approve and tip-submit
- First, according to my notes in 2020-10, a user must be a *collection admin* in order to submit via the REST API
- Second, a collection must have an "Accept/Reject/Edit Metadata" step defined in the workflow
- Also, I referenced my notes from this gist I had made for exactly this purpose! https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a
## 2022-08-03
- I came up with an interesting idea to add missing countries and AGROVOC terms to the MARLO Innovation metadata
- I copied the abstract column to two new fields: `countrytest` and `agrovoctest` and then used this Jython code as a transform to drop terms that don't match (using CGSpace's country list and list of 1,400 AGROVOC terms):
```python
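# Runs as the body of an OpenRefine Jython expression, so `value` is the cell
# value and the top-level `return` is valid there
with open(r"/tmp/cgspace-countries.txt", "r") as f:
    countries = [name.rstrip().lower() for name in f]

return "||".join([x for x in value.split(" ") if x.lower() in countries])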
```
- Then I joined them with the other country and AGROVOC columns
- I had originally tried to use csv-metadata-quality to look up and drop invalid AGROVOC terms but it was timing out every dozen or so requests
- Then I briefly tried to use lightrdf to export a text file of labels from AGROVOC's RDF, but I couldn't figure it out
- I just realized this will not match countries with spaces in their names because the transform splits the cell value on spaces, ugh... Jython also has weird syntax and errors, and I can't get normal Python code to work here, so I must be missing something
- Then I extracted the titles, dates, and types and added IDs, then ran them through `check-duplicates.py` to find the existing items on CGSpace so I can add them as `dcterms.relation` links
```console
$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-08-03-Innovations-Cleaned.csv | sed '1s/line_number/id/' > /tmp/innovations-temp.csv
$ ./ilri/check-duplicates.py -i /tmp/innovations-temp.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/ccafs-duplicates.csv
```
- There were about 115 with existing items on CGSpace
- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
```console
$ csvjoin --left -c dc.title ~/Downloads/2022-08-03-Innovations-Cleaned.csv ~/Downloads/2022-08-03-Innovations-relations.csv > /tmp/innovations-with-relations.csv
```
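A minimal Python sketch of what a left join on `dc.title` does, mainly to show why titles that appear more than once matter: every matching row on the right produces another output row. This is not csvkit's implementation, and the sample rows and the `left_join` helper are hypothetical.

```python
import csv
import io
from collections import defaultdict


def left_join(left_rows, right_rows, key):
    """Left join two lists of dict rows on `key`, keeping every left row."""
    matches = defaultdict(list)
    for row in right_rows:
        matches[row[key]].append(row)

    joined = []
    for row in left_rows:
        # A left row without a match is kept once, with no extra columns
        for match in matches.get(row[key]) or [{}]:
            joined.append({**match, **row})
    return joined


# Hypothetical sample data standing in for the two CSV files
left = list(csv.DictReader(io.StringIO(
    "dc.title,dcterms.type\nA,Report\nB,Brief\n"
)))
right = list(csv.DictReader(io.StringIO(
    "dc.title,dcterms.relation\nA,relation-1\nA,relation-2\n"
)))

rows = left_join(left, right, "dc.title")
for row in rows:
    print(row["dc.title"], row.get("dcterms.relation"))
```

Title `A` comes out twice because it matches two rows on the right, which is the kind of thing worth checking for before joining.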
- Then I used SAFBuilder to create a Simple Archive Format package and import it to DSpace Test:
```console
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-03-innovations.map
```
- Meeting with Mohammed Salem about harmonizing MEL and CGSpace metadata fields
- I still need to share our results and recommendations with Peter, Enrico, Sara, Svetlana, et al
- I made some minor fixes to csv-metadata-quality while working on the MARLO CRP Innovations
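Regarding the country transform from earlier in the day, which splits the cell on spaces and therefore misses multi-word names: a minimal sketch in plain Python of searching the text for each known term instead. The sample list, the abstract, and the `find_terms` helper are hypothetical, and I have not tried this inside OpenRefine's Jython.

```python
import re


def find_terms(text, terms):
    """Return the known terms found in `text`, joined with "||"."""
    found = []
    for term in terms:
        # Word boundaries keep "Niger" from matching inside "Nigeria"
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return "||".join(found)


# Hypothetical sample data; the real list would come from a file like
# /tmp/cgspace-countries.txt
countries = ["Burkina Faso", "Kenya", "Niger", "Nigeria", "Viet Nam"]
abstract = "Trials in Burkina Faso, Nigeria and Viet Nam were scaled out."

print(find_terms(abstract, countries))
```

Looping over the term list rather than over the words in the cell is what lets names with spaces match, at the cost of one regex search per term.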
## 2022-08-05
- I discussed issues with the DSpace 7 submission forms on Slack and Mark Wood found that the migration tool creates a non-working submission form
- After updating the class name of the collection step and removing the "complete" and "sample" steps the submission form was working
- Now the issue is that the controlled vocabularies show up like this:
![Controlled vocabulary bug in DSpace 7](/cgspace-notes/2022/08/dspace7-submission.png)
- I think we need to add IDs, I will have to check what the implications of that are
- Emilio contacted me last week to say they have re-worked their harvester on Hetzner to use a new user agent: `AICCRA website harvester`
- I verified that I see it in the REST API logs, but I don't see any new stats hits for it
- I do see 11,000 hits from that IP last month when I had the incorrect nginx configuration that was sending a literal `$http_user_agent` so I purged those
- It is lucky that we have `harvest` in the DSpace spider agent example file so Solr doesn't log these hits, nothing needed to be done in nginx
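A rough sketch of why the new user agent is caught by the `harvest` pattern. It assumes the agent patterns are applied as case-insensitive regular expressions searched anywhere in the user agent string; DSpace's actual matching logic may differ. Only `harvest` comes from the notes above, the other patterns and the `is_spider` helper are illustrative.

```python
import re

# Assumption: patterns in the style of the spider agents file; only
# "harvest" is taken from the notes, the rest are made up for illustration
agent_patterns = ["harvest", "bot", "crawl"]


def is_spider(user_agent, patterns=agent_patterns):
    """Return True if any pattern is found anywhere in the user agent."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


print(is_spider("AICCRA website harvester"))
print(is_spider("Mozilla/5.0 (X11; Linux x86_64)"))
```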
## 2022-08-13
- I noticed there was high load on CGSpace, around 9 or 10
- Looking at the Munin graphs it seems to just be the last two hours or so, with a slight increase in PostgreSQL connections, firewall traffic, and a more noticeable increase in CPU
- DSpace sessions are normal
- The number of unique hosts making requests to nginx is pretty low, though it's only 6AM in the server's time
- I see one IP in Sweden making a lot of requests with a normal user agent: 80.248.237.167
- This host is on Internet Vikings (INTERNETBOLAGET), and I see 140,000 requests from them in Solr
- I see reports of excessive scraping on AbuseIPDB.com
- I'm gonna add their 80.248.224.0/20 to the bot-networks.conf in nginx
- I will also purge all the hits from this IP in Solr statistics
- I also see the core.ac.uk bot making tens of thousands of requests today, but we are already tagging that as a bot in Tomcat's Crawler Session Manager valve, so they should be sharing a Tomcat session with other bots and not creating too many sessions
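As a quick sanity check before blocking a whole network, Python's standard `ipaddress` module can confirm that the offending host falls inside the range and show how large the range is. This only uses the address and the /20 from the notes above.

```python
import ipaddress

network = ipaddress.ip_network("80.248.224.0/20")
host = ipaddress.ip_address("80.248.237.167")

# True if the host is covered by the network being blocked
print(host in network)

# First address, last address, and total size of the range
print(network[0], network[-1], network.num_addresses)
```

A /20 covers 4,096 addresses, so it is worth confirming the range belongs to the same provider before adding it to the bot networks list.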
## 2022-08-15
- Start indexing on AReS
- Add CONSERVATION to ILRI subjects on CGSpace
- I see that AGROVOC has `conservation agriculture` and I suggested that we use that instead
## 2022-08-17
- Peter and Jose sent more feedback about the CRP Innovation records from MARLO
- We expanded the CRP names in the citation and removed the `cg.identifier.url` URLs because they are ugly and will stop working eventually
- The mappings of MARLO links will be done internally with the `cg.number` IDs like "IN-1119" and the Handle URIs
## 2022-08-18
- I talked to Jose about the CCAFS MARLO records
- He still hasn't finished re-processing the PDFs to update the internal MARLO links
- I started looking at the other records (MELIAs, OICRs, Policies) and found some minor issues in the MELIAs so I sent feedback to Jose
- On second thought, I opened the MELIAs file in OpenRefine and it looks OK, so this must have been a parsing issue in LibreOffice when I was checking the file (or perhaps I didn't use the correct quoting when importing)
- Import the original MELIA v2 CSV file into OpenRefine to fix encoding before processing with csvcut/csvjoin
- Then extract the IDs and filenames from the original V2 file and join with the UTF-8 file:
```console
$ csvcut -c 'cg.number (series/report No.)',File ~/Downloads/MELIA-Metadata-v2-csv.csv > MELIA-v2-IDs-Files.csv
$ csvjoin -c 'cg.number (series/report No.)' MELIAs\ metadata\ utf8\ 20220816_JM.csv MELIA-v2-IDs-Files.csv > MELIAs-UTF-8-with-files.csv
```
- Then I imported them into OpenRefine to start metadata cleaning and enrichment
- Make some minor changes to [cgspace-submission-guidelines](https://github.com/ilri/cgspace-submission-guidelines)
- Upgrade to Bootstrap v5.2.0
- Dedupe value pairs and controlled vocabularies before writing them
- Sort the controlled vocabularies before writing them (we don't do this for value pairs because some are added in specific order, like CRPs)
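A minimal sketch of the dedupe and sort behavior described above, not the actual code from cgspace-submission-guidelines. The sample values and the `dedupe` helper are hypothetical.

```python
def dedupe(values):
    """Drop repeated values while keeping first-seen order."""
    return list(dict.fromkeys(values))


# Hypothetical sample values
value_pairs = ["CRP B", "CRP A", "CRP B"]
vocabulary = ["MAIZE", "CONSERVATION", "MAIZE"]

# Value pairs: dedupe only, so any deliberate ordering survives
value_pairs_out = dedupe(value_pairs)

# Controlled vocabularies: dedupe, then sort
vocabulary_out = sorted(dedupe(vocabulary))

print(value_pairs_out)
print(vocabulary_out)
```

Using `dict.fromkeys` rather than `set` matters for the value pairs, because a set would lose the order they were added in.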
<!-- vim: set sw=2 ts=2: -->