cgspace-notes/2022-08.md at e7d7d4af89efaab985613337ecf933dfcbca0261

alanorth/cgspace-notes

Fork 0

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-12-18 11:12:18 +01:00

Alan Orth ba6f826201

content/posts/2022-08.md: syntax fix

2023-02-22 11:59:48 +03:00

19 KiB

Raw Blame History

title

date

author

2022-08-01

Our request to add CC-BY-3.0-IGO to SPDX was approved a few weeks ago

2022-08-02

Resume working on the MARLO Innovations
- Last week Jose had sent me an updated CSV with UTF-8 formatting, which was missing the filename column
- I joined it with the older file (stripped down to just the cg.number and filename columns and then did the same cleanups I had done last week
- I noticed there are six PDFs unused, so I asked Jose
Spent some time trying to understand the REST API submission issues that Rafael from CIAT is having with tip-approve and tip-submit
- First, according to my notes in 2020-10, a user must be a collection admin in order to submit via the REST API
- Second, a collection must have a "Accept/Reject/Edit Metadata" step defined in the workflow
- Also, I referenced my notes from this gist I had made for exactly this purpose! https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a

2022-08-03

I came up with an interesting idea to add missing countries and AGROVOC terms to the MARLO Innovation metadata
- I copied the abstract column to two new fields: countrytest and agrovoctest and then used this Jython code as a transform to drop terms that don't match (using CGSpace's country list and list of 1,400 AGROVOC terms):

with open(r"/tmp/cgspace-countries.txt",'r') as f :
    countries = [name.rstrip().lower() for name in f]

return "||".join([x for x in value.split(' ') if x.lower() in countries])

Then I joined them with the other country and AGROVOC columns
- I had originally tried to use csv-metadata-quality to look up and drop invalid AGROVOC terms but it was timing out ever dozen or so requests
- Then I briefly tried to use lightrdf to export a text file of labels from AGROVOC's RDF, but I couldn't figure it out
- I just realized this will not match countries with spaces in our cell value, ugh... and Jython has weird syntax and errors and I can't get normal Python code to work here, I'm missing something
Then I extracted the titles, dates, and types and added IDs, then ran them through check-duplicates.py to find the existing items on CGSpace so I can add them as dcterm.relation links

$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-08-03-Innovations-Cleaned.csv | sed '1s/line_number/id/' > /tmp/innovations-temp.csv
$ ./ilri/check-duplicates.py -i /tmp/innovations-temp.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/ccafs-duplicates.csv

There were about 115 with existing items on CGSpace
Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):

$ csvjoin --left -c dc.title ~/Downloads/2022-08-03-Innovations-Cleaned.csv ~/Downloads/2022-08-03-Innovations-relations.csv > /tmp/innovations-with-relations.csv

Then I used SAFBuilder to create a SimpleItemArchive and import to DSpace Test:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=fuuu@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-03-innovations.map

Meeting with Mohammed Salem about harmonizing MEL and CGSpace metadata fields
- I still need to share our results and recommendations with Peter, Enrico, Sara, Svetlana, et al
I made some minor fixes to csv-metadata-quality while working on the MARLO CRP Innovations

2022-08-05

I discussed issues with the DSpace 7 submission forms on Slack and Mark Wood found that the migration tool creates a non-working submission form
- After updating the class name of the collection step and removing the "complete" and "sample" steps the submission form was working
- Now the issue is that the controlled vocabularies show up like this:

I think we need to add IDs, I will have to check what the implications of that are
- Note (2022-09-27): see this related change from DSpace 7.3: https://github.com/DSpace/DSpace/pull/8174
Emilio contacted me last week to say they have re-worked their harvester on Hetzner to use a new user agent: AICCRA website harvester
- I verified that I see it in the REST API logs, but I don't see any new stats hits for it
- I do see 11,000 hits from that IP last month when I had the incorrect nginx configuration that was sending a literal $http_user_agent so I purged those
- It is lucky that we have harvest in the DSpace spider agent example file so Solr doesn't log these hits, nothing needed to be done in nginx

2022-08-13

I noticed there was high load on CGSpace, around 9 or 10
- Looking at the Munin graphs it seems to just be the last two hours or so, with a slight increase in PostgreSQL connections, firewall traffic, and a more noticeable increase in CPU
- DSpace sessions are normal
- The number of unique hosts making requests to nginx is pretty low, though it's only 6AM in the server's time
I see one IP in Sweden making a lot of requests with a normal user agent: 80.248.237.167
- This host is on Internet Vikings (INTERNETBOLAGET), and I see 140,000 requests from them in Solr
- I see reports of excessive scraping on AbuseIPDB.com
- I'm gonna add their 80.248.224.0/20 to the bot-networks.conf in nginx
- I will also purge all the hits from this IP in Solr statistics
I also see the core.ac.uk bot making tens of thousands of requests today, but we are already tagging that as a bot in Tomcat's Crawler Session Manager valve, so they should be sharing a Tomcat session with other bots and not creating too many sessions

2022-08-15

Start indexing on AReS
Add CONSERVATION to ILRI subjects on CGSpace
- I see that AGROVOC has conservation agriculture and I suggested that we use that instead

2022-08-17

Peter and Jose sent more feedback about the CRP Innovation records from MARLO
- We expanded the CRP names in the citation and removed the cg.identifier.url URLs because they are ugly and will stop working eventually
- The mappings of MARLO links will be done internally with the cg.number IDs like "IN-1119" and the Handle URIs

2022-08-18

I talked to Jose about the CCAFS MARLO records
- He still hasn't finished re-processing the PDFs to update the internal MARLO links
- I started looking at the other records (MELIAs, OICRs, Policies) and found some minor issues in the MELIAs so I sent feedback to Jose
- On second thought, I opened the MELIAs file in OpenRefine and it looks OK, so this must have been a parsing issue in LibreOffice when I was checking the file (or perhaps I didn't use the correct quoting when importing)
Import the original MELIA v2 CSV file into OpenRefine to fix encoding before processing with csvcut/csvjoin
- Then extract the IDs and filenames from the original V2 file and join with the UTF-8 file:

$ csvcut -c 'cg.number (series/report No.)',File ~/Downloads/MELIA-Metadata-v2-csv.csv > MELIA-v2-IDs-Files.csv
$ csvjoin -c 'cg.number (series/report No.)' MELIAs\ metadata\ utf8\ 20220816_JM.csv MELIA-v2-IDs-Files.csv > MELIAs-UTF-8-with-files.csv

Then I imported them into OpenRefine to start metadata cleaning and enrichment
Make some minor changes to cgspace-submission-guidelines
- Upgrade to Bootstrap v5.2.0
- Dedupe value pairs and controlled vocabularies before writing them
- Sort the controlled vocabularies before writing them (we don't do this for value pairs because some are added in specific order, like CRPs)

2022-08-19

Peter Ballantyne sent me metadata for 311 Gender items that need to be duplicate checked on CGSpace before uploading
- I spent a half an hour in OpenRefine to fix the dates because they only had YYYY, but most abstracts and titles had more specific information about the date
- Then I checked for duplicates:

$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv

I sent the list of ~130 possible duplicates to Peter to check
Jose sent new versions of the MARLO Innovation/MELIA/OICR/Policy PDFs
- The idea was to replace tinyurl links pointing to MARLO, but I still see many tinyurl links, some of which point to CGIAR Sharepoint and require a login
- I asked them why they don't just use the original links in the first place in case tinyurl.com disappears
I continued working on the MARLO MELIA v2 UTF-8 metadata
- I did the same metadata enrichment exercise to extract countries and AGROVOC subjects from the abstract field that I did earlier this month, using a Jython expression to match terms in copies of the abstract field
- It helps to replace some characters with spaces first with this GREL: value.replace(/[.\/;(),]/, " ")
- This caught some extra AGROVOC terms, but unfortunately we only check for single-word terms
- Then I checked for existing items on CGSpace matching these MELIA using my duplicate checker:

$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv

Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):

$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv

I had to use xsv because csvcut was throwing an error detecting the dialect of the input CSVs (?)
I created a SAF bundle and imported the 749 MELIAs to DSpace Test
I found thirteen items on CGSpace with dates in format "DD/MM/YYYY" so I fixed those

2022-08-20

Peter sent me back the results of the duplicate checking on the Gender presentations
- There were only a handful of duplicates, so I used the IDs in the spreadsheet to flag and delete them in OpenRefine
I had a new idea about matching AGROVOC subjects and countries in OpenRefine
- I was previously splitting up the text value field (title/abstract/etc) by spaces and searching for each word in the list of terms/countries like this:

with open(r"/tmp/cgspace-countries.txt",'r') as f:
    countries = [name.rstrip().lower() for name in f]

return "||".join([x for x in value.split(' ') if x.lower() in countries])

But that misses multi-word terms/countries with spaces, so we can search the other way around by using a regex for each term/country and checking if it appears in the text value field:

import re

with open(r"/tmp/agrovoc-subjects.txt",'r') as f:
    terms = [name.rstrip().lower() for name in f]

return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])

Now we are only limited by our small (~1,400) list of AGROVOC subjects, so I did an export from PostgreSQL of all dcterms.subjects values and am looking them up against AGROVOC's API right now:

localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2022-08-20-agrovoc.csv WITH CSV HEADER;
COPY 21685
$ csvcut -c 1 /tmp/2022-08-20-agrovoc.csv | sed 1d > /tmp/all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/all-subjects.txt -o 2022-08-20-all-subjects-results.csv
$ csvgrep -c 'number of matches' -m 0 -i /tmp/2022-08-20-all-subjects-results.csv.bak | csvcut -c 1 | sed 1d > /tmp/agrovoc-subjects.txt
$ wc -l /tmp/agrovoc-subjects.txt
11834 /tmp/agrovoc-subjects.txt

Then I created a new column joining the title and abstract, and ran the Jython expression above against this new file with 11,000 AGROVOC terms
- Then I joined that column with Peter's dcterms.subject column and then deduplicated it with this Jython:

res = []

[res.append(x) for x in value.split("||") if x not in res]

return "||".join(res)

This is way better, but you end up getting a bunch of countries, regions, and short words like "gates" matching in AGROVOC that are inappropriate (we typically don't tag these in AGROVOC) or incorrect (gates, as in windows or doors, not the funding agency)
- I did a text facet in OpenRefine and removed a bunch of these by eye
Then I finished adding the dcterms.relation and CRP metadata flagged by Peter on the Gender presentations
- I'm waiting for him to send me the PDFs and then I will upload them to DSpace Test

2022-08-21

Start indexing on AReS
The load on CGSpace was around 5.0 today, and now that I started the harvesting it's over 10 for an hour now, sigh...
- I'm going to try an experiment to block Googlebot, bingbot, and Yandex for a week to see if the load goes down

2022-08-22

I tried to re-generate the SAF bundle for the MARLO Innovations after improving the AGROVOC subjects and the v3 PDFs, but six are missing from the v3 zip that are present in the original zip:
- ProjectInnovationSummary-WLE-P500-I78.pdf
- ProjectInnovationSummary-WLE-P452-I699.pdf
- ProjectInnovationSummary-WLE-P518-I696.pdf
- ProjectInnovationSummary-WLE-P442-I740.pdf
- ProjectInnovationSummary-WLE-P516-I647.pdf
- ProjectInnovationSummary-WLE-P438-I585.pdf
I downloaded them manually using the URLs in the original CSV
I also uploaded a new version of the MELIAs to DSpace Test

2022-08-23

Checking the number of items on CGSpace so we can keep an eye on the 100,000 number:

dspace=# SELECT COUNT(uuid) FROM item WHERE in_archive='t';
 count 
-------
 95716
(1 row)

If I check OAI I see more, but perhaps that counts mapped items multiple times
Peter said the 303 Gender PPTs were good to go, so I updated the collection mappings and IDs in OpenRefine and then uploaded them to CGSpace:

$ dspace import --add --eperson=fuu@fuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-23-gender-ppts.map

I created a GitHub issue for OpenRXV compatibility issues with DSpace 7

2022-08-24

Start working on the MARLO OICRs
- First I extracted the filenames and IDs from the v2 metadata file, then joined it with the UTF-8 version:

$ xsv select 'cg.number (series/report No.),File' OICRS\ Metadata\ v2.csv > /tmp/OICR-files.csv
$ xsv join --left 'cg.number (series/report No.)' OICRS\ metadata\ utf8\ 20220816_JM.csv 'cg.number (series/report No.)' /tmp/OICR-files.csv > OICRs-UTF-8-with-files.csv

After that I imported it into OpenRefine for data cleaning
- To enrich the metadata I combined the title and abstract into a new field and then checked my list of 11,000 AGROVOC terms against it
- First, create a new column with this GREL:

cells["dc.title"].value + " " + cells["dcterms.abstract"].value

Then use this Jython:

import re

with open(r"/tmp/agrovoc-subjects.txt",'r') as f : 
    terms = [name.rstrip().lower() for name in f]

return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])

After that I de-duplicated the terms using this Jython:

res = []

[res.append(x) for x in value.split("||") if x not in res]

return "||".join(res)

Then I split the multi-values on "||" and used a text facet to remove some countries and other nonsense terms that matched, like "gates" and "al" and "s"
- Then I did the same for countries
Then I exported the CSV and started searching for duplicates so that I can add them as relations:

$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-24-OICRs.csv -u dspace -db dspace -p 'omg' -o /tmp/oicrs-matches.csv

Oh wow, I actually found one OICR already uploaded to CGSpace... I have to ask Jose about that

2022-08-25

I started processing the MARLO Policies in OpenRefine, similar to the Innovations, MELIAs, and OICRs above
- I also re-ran the AGROVOC matching on Innovations because my technique has improved since I ran it a few weeks ago

2022-08-29

Start a harvest on AReS
Meeting with Peter and Abenet about CGSpace issues
I mapped the one MARLO OICR duplicate from the CCAFS Reports collection and deleted it from the OICRs CSV

2022-08-30

Manuel from the "Alianza SIDALC" in South America contacted me asking for permission to harvest CGSpace and include our content in their system
- I responded that we would be glad if they harvested us, and that they should use a useful user agent so we can contact them incase of any issues or changes on the server
I emailed ILRI ICT to ask how Abenet and I can use the CGSpace Support email address in our email applications because we haven't checked that account in years
- I tried to log in on office365.com but it gave an error
- I got access to the account and cleaned up the inbox, unsubscribed from a bunch of Microsoft and Yammer feeds, etc
Remind Dani, Tariku, and Andrea about the legacy links that we want to update on ILRI's website:
Join the MARLO OICRs with their relations that I processed a few days ago (minus the second id column and some others):

$ xsv join --left id ~/Downloads/2022-08-24-OICRs.csv id ~/Downloads/oicrs-matches-csv.csv | xsv select '!id[1],Your Title,Their Title,Similarity,Your Date,Their Date,datediff' > /tmp/oicrs-with-relations.csv

Then I cleaned them with csv-metadata-quality to catch some duplicates, add regions, etc and re-imported to OpenRefine
- I flagged a few duplicates for Jose and he'll let me know what to do with them
I imported the OICRs to DSpace Test:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=fuuuu@fuuu.com --source /tmp/SimpleArchiveFormat-oicrs --mapfile=./2022-08-30-OICRs.map

Meeting with Marie-Angelique, Abenet, Valentina, Sara, and Margarita about Types
I am testing the org.apache.cocoon.uploads.autosave=false setting for XMLUI so that files posted via multi-part forms get memory mapped instead of written to disk
Check the MARLO Policies for relations and join them with the main CSV file:

$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-25-Policies-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuui' -o /tmp/policies-matches.csv
$ xsv join --left id ~/Downloads/2022-08-25-Policies-UTF-8-With-Files.csv id /tmp/policies-matches.csv | xsv select '!id[1],Your Title,Their Title,Similarity,Your Date,Their Date' > /tmp/policies-with-relations.csv

19 KiB Raw Blame History

2022-08-01

2022-08-02

2022-08-03

2022-08-05

2022-08-13

2022-08-15

2022-08-17

2022-08-18

2022-08-19

2022-08-20

2022-08-21

2022-08-22

2022-08-23

2022-08-24

2022-08-25

2022-08-29

2022-08-30

19 KiB

Raw Blame History