Mirror of https://github.com/alanorth/cgspace-notes.git (synced 2025-01-27 05:49:12 +01:00)
Add notes for 2022-08-03
@@ -224,7 +224,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2022-06-28-cgspace-subjects.txt -o /tmp/2022-
- I had to try in different Python versions because 3.10.x is apparently too new
- For future reference I was able to search with lightrdf:

-```console
+```python
import lightrdf
parser = lightrdf.Parser()
# prints millions of lines

@@ -11,4 +11,56 @@ categories: ["Notes"]

<!--more-->

## 2022-08-02

- Resume working on the MARLO Innovations
- Last week Jose had sent me an updated CSV with UTF-8 formatting, which was missing the filename column
- I joined it with the older file (stripped down to just the `cg.number` and `filename` columns) and then did the same cleanups I had done last week
- I noticed there are six PDFs unused, so I asked Jose
- Spent some time trying to understand the REST API submission issues that Rafael from CIAT is having with tip-approve and tip-submit
- First, according to my notes in 2020-10, a user must be a *collection admin* in order to submit via the REST API
- Second, a collection must have an "Accept/Reject/Edit Metadata" step defined in the workflow
- Also, I referenced my notes from this gist I had made for exactly this purpose! https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a

## 2022-08-03

- I came up with an interesting idea to add missing countries and AGROVOC terms to the MARLO Innovation metadata
- I copied the abstract column to two new fields, `countrytest` and `agrovoctest`, and then used this Jython code as a transform to drop terms that don't match (using CGSpace's country list and a list of 1,400 AGROVOC terms):

```python
# Load the country list, lowercased for case-insensitive matching
with open(r"/tmp/cgspace-countries.txt", "r") as f:
    countries = [name.rstrip().lower() for name in f]

# `value` is the current cell; keep only its space-separated words found in the list
return "||".join([x for x in value.split(" ") if x.lower() in countries])
```

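The transform above splits each cell on single spaces, so a multi-word name can never equal a vocabulary entry. A minimal sketch of phrase-based matching follows, written as plain Python rather than as an OpenRefine Jython expression. The function name `match_terms`, the sample abstract, and the sample vocabulary are hypothetical, and this has not been tested inside OpenRefine.

```python
import re


def match_terms(text, vocabulary):
    """Return the vocabulary terms found in text as whole words or phrases."""
    remaining = text.lower()
    found = []
    # Longest names first, so "Papua New Guinea" is consumed before "Guinea"
    for term in sorted(vocabulary, key=len, reverse=True):
        # Lookarounds stop "Niger" from matching inside "Nigeria"
        pattern = re.compile(r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)")
        # Blank out every hit so shorter names cannot match inside it again
        remaining, hits = pattern.subn(" ", remaining)
        if hits:
            found.append(term)
    return found


# Hypothetical sample data, for illustration only
abstract = "Trials in Papua New Guinea and Kenya, with partners from Costa Rica."
countries = ["Kenya", "Guinea", "Papua New Guinea", "Costa Rica", "Niger", "Nigeria"]
print("||".join(match_terms(abstract, countries)))
```

Matching whole phrases against the full text, instead of testing each word, is what lets names with spaces through; the trade-off is one regular expression search per vocabulary entry per cell.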
- Then I joined them with the other country and AGROVOC columns
- I had originally tried to use csv-metadata-quality to look up and drop invalid AGROVOC terms, but it was timing out every dozen or so requests
- Then I briefly tried to use lightrdf to export a text file of labels from AGROVOC's RDF, but I couldn't figure it out
- I just realized this will not match country names that contain spaces, because the cell value is split on spaces, ugh... and Jython has weird syntax and errors, and I can't get normal Python code to work here, so I'm missing something
- Then I extracted the titles, dates, and types and added IDs, then ran them through `check-duplicates.py` to find the existing items on CGSpace so I can add them as `dcterms.relation` links

```console
$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-08-03-Innovations-Cleaned.csv | sed '1s/line_number/id/' > /tmp/innovations-temp.csv
$ ./ilri/check-duplicates.py -i /tmp/innovations-temp.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/ccafs-duplicates.csv
```

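On the unfinished idea of exporting AGROVOC labels to a text file: one possible approach is to filter a stream of triples for `skos:prefLabel` literals in a single language. The sketch below is only that, a sketch. It assumes the triples arrive as `(subject, predicate, object)` string tuples with N-Triples style literals such as `"maize"@en`, and that the dump carries plain `skos:prefLabel` triples; neither has been checked against the actual AGROVOC file. The function name `english_labels` and the sample triples are hypothetical.

```python
import re

PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
# An N-Triples style literal with a language tag, e.g. "maize"@en
LITERAL = re.compile(r'^"(?P<text>.*)"@(?P<lang>[A-Za-z0-9-]+)$', re.DOTALL)


def english_labels(triples, lang="en"):
    """Yield label text from (subject, predicate, object) string triples."""
    for _subject, predicate, obj in triples:
        # Accept the predicate with or without surrounding angle brackets
        if predicate.strip("<>") != PREF_LABEL:
            continue
        match = LITERAL.match(obj)
        if match and match.group("lang").lower() == lang:
            yield match.group("text")


# Hypothetical sample triples, for illustration only
sample = [
    ("<http://example.org/c_1>", "<" + PREF_LABEL + ">", '"maize"@en'),
    ("<http://example.org/c_1>", "<" + PREF_LABEL + ">", '"maïs"@fr'),
    ("<http://example.org/c_1>", "<http://www.w3.org/2004/02/skos/core#broader>", "<http://example.org/c_2>"),
]
print(list(english_labels(sample)))

# Assumed (untested) way to feed it from lightrdf; the file name is hypothetical:
# import lightrdf
# parser = lightrdf.Parser()
# with open("/tmp/agrovoc-labels.txt", "w") as out:
#     for label in english_labels(parser.parse("agrovoc.nt", base_iri=None)):
#         out.write(label + "\n")
```

Keeping the filter separate from the parser means it can be checked on a handful of hand-written triples before being pointed at a dump with millions of lines.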
- About 115 of them had existing items on CGSpace
- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined it with the other file (left join):

```console
$ csvjoin --left -c dc.title ~/Downloads/2022-08-03-Innovations-Cleaned.csv ~/Downloads/2022-08-03-Innovations-relations.csv > /tmp/innovations-with-relations.csv
```

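The note about titles appearing more than once matters for the join: in a SQL-style left join, a key that occurs several times on the right-hand side produces several output rows. The following sketch illustrates that behavior with in-memory rows; it is not csvjoin's implementation, and the function name `left_join` and the sample rows are hypothetical.

```python
def left_join(left_rows, right_rows, key):
    """Left-join two lists of dicts on key (SQL-style semantics)."""
    # Index the right-hand rows by key; a key may map to several rows
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    # Columns contributed by the right-hand side, excluding the join key
    extra = [col for col in (right_rows[0] if right_rows else {}) if col != key]
    joined = []
    for row in left_rows:
        # Unmatched left rows are kept, with empty strings for the right columns
        for match in index.get(row[key], [dict.fromkeys(extra, "")]):
            merged = dict(row)
            merged.update((col, match[col]) for col in extra)
            joined.append(merged)
    return joined


# Hypothetical sample rows, for illustration only
innovations = [
    {"dc.title": "Title A", "dcterms.type": "Report"},
    {"dc.title": "Title B", "dcterms.type": "Brief"},
]
relations = [
    {"dc.title": "Title A", "dcterms.relation": "handle-1"},
    {"dc.title": "Title A", "dcterms.relation": "handle-2"},
]
for row in left_join(innovations, relations, "dc.title"):
    print(row)
```

Here "Title A" comes out twice, once per relation, which is why deduplicating or merging the relations file before joining keeps the row count equal to the number of innovations.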
- Then I used SAFBuilder to create a Simple Archive Format package and imported it to DSpace Test:

```console
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-03-innovations.map
```

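For context, a Simple Archive Format package is a directory of item subdirectories, each holding a `dublin_core.xml` file, a `contents` file listing bitstream filenames, and the bitstreams themselves. The sketch below writes one such item by hand. It is not how SAFBuilder is implemented, the function name `write_saf_item` and the sample values are hypothetical, and fields in other schemas (which would go in separate `metadata_<schema>.xml` files) are omitted.

```python
import os
import tempfile
from xml.sax.saxutils import escape, quoteattr


def write_saf_item(archive_dir, item_name, dc_values, bitstreams=()):
    """Write a minimal item directory: dublin_core.xml plus a contents file."""
    item_dir = os.path.join(archive_dir, item_name)
    os.makedirs(item_dir)
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<dublin_core>"]
    for element, qualifier, value in dc_values:
        # Unqualified fields use the literal qualifier "none"
        lines.append(
            "  <dcvalue element=%s qualifier=%s>%s</dcvalue>"
            % (quoteattr(element), quoteattr(qualifier or "none"), escape(value))
        )
    lines.append("</dublin_core>")
    with open(os.path.join(item_dir, "dublin_core.xml"), "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    # One bitstream filename per line; the files must also be copied into item_dir
    with open(os.path.join(item_dir, "contents"), "w", encoding="utf-8") as f:
        for name in bitstreams:
            f.write(name + "\n")
    return item_dir


# Hypothetical sample item, written to a temporary directory
archive = tempfile.mkdtemp()
item = write_saf_item(
    archive,
    "item_000",
    [("title", None, "Drought & heat tolerant maize")],
    ["report.pdf"],
)
print(sorted(os.listdir(item)))
```

Escaping values with `xml.sax.saxutils` keeps characters such as `&` in titles from producing invalid XML, which is an easy way for an import to fail.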
- Meeting with Mohammed Salem about harmonizing MEL and CGSpace metadata fields
- I still need to share our results and recommendations with Peter, Enrico, Sara, Svetlana, et al
- I made some minor fixes to csv-metadata-quality while working on the MARLO CRP Innovations

<!-- vim: set sw=2 ts=2: -->