2022-08-01
2022-08-02
- Resume working on the MARLO Innovations
- Last week Jose had sent me an updated CSV with UTF-8 formatting, which was missing the filename column
- I joined it with the older file (stripped down to just the
cg.number
and filename
columns and then did the same cleanups I had done last week
- I noticed there are six PDFs unused, so I asked Jose
- Spent some time trying to understand the REST API submission issues that Rafael from CIAT is having with tip-approve and tip-submit
- First, according to my notes in 2020-10, a user must be a collection admin in order to submit via the REST API
- Second, a collection must have a “Accept/Reject/Edit Metadata” step defined in the workflow
- Also, I referenced my notes from this gist I had made for exactly this purpose! https://gist.github.com/alanorth/40fc3092aefd78f978cca00e8abeeb7a
2022-08-03
- I came up with an interesting idea to add missing countries and AGROVOC terms to the MARLO Innovation metadata
- I copied the abstract column to two new fields:
countrytest
and agrovoctest
and then used this Jython code as a transform to drop terms that don’t match (using CGSpace’s country list and list of 1,400 AGROVOC terms):
with open(r"/tmp/cgspace-countries.txt",'r') as f :
countries = [name.rstrip().lower() for name in f]
return "||".join([x for x in value.split(' ') if x.lower() in countries])
- Then I joined them with the other country and AGROVOC columns
- I had originally tried to use csv-metadata-quality to look up and drop invalid AGROVOC terms but it was timing out ever dozen or so requests
- Then I briefly tried to use lightrdf to export a text file of labels from AGROVOC’s RDF, but I couldn’t figure it out
- I just realized this will not match countries with spaces in our cell value, ugh… and Jython has weird syntax and errors and I can’t get normal Python code to work here, I’m missing something
- Then I extracted the titles, dates, and types and added IDs, then ran them through
check-duplicates.py
to find the existing items on CGSpace so I can add them as dcterm.relation
links
$ csvcut -l -c dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-08-03-Innovations-Cleaned.csv | sed '1s/line_number/id/' > /tmp/innovations-temp.csv
$ ./ilri/check-duplicates.py -i /tmp/innovations-temp.csv -u dspacetest -db dspacetest -p 'dom@in34sniper' -o /tmp/ccafs-duplicates.csv
- There were about 115 with existing items on CGSpace
- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
$ csvjoin --left -c dc.title ~/Downloads/2022-08-03-Innovations-Cleaned.csv ~/Downloads/2022-08-03-Innovations-relations.csv > /tmp/innovations-with-relations.csv
- Then I used SAFBuilder to create a SimpleItemArchive and import to DSpace Test:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-03-innovations.map
- Meeting with Mohammed Salem about harmonizing MEL and CGSpace metadata fields
- I still need to share our results and recommendations with Peter, Enrico, Sara, Svetlana, et al
- I made some minor fixes to csv-metadata-quality while working on the MARLO CRP Innovations
2022-08-05
- I discussed issues with the DSpace 7 submission forms on Slack and Mark Wood found that the migration tool creates a non-working submission form
- After updating the class name of the collection step and removing the “complete” and “sample” steps the submission form was working
- Now the issue is that the controlled vocabularies show up like this:
- I think we need to add IDs, I will have to check what the implications of that are
- Emilio contacted me last week to say they have re-worked their harvester on Hetzner to use a new user agent:
AICCRA website harvester
- I verified that I see it in the REST API logs, but I don’t see any new stats hits for it
- I do see 11,000 hits from that IP last month when I had the incorrect nginx configuration that was sending a literal
$http_user_agent
so I purged those
- It is lucky that we have
harvest
in the DSpace spider agent example file so Solr doesn’t log these hits, nothing needed to be done in nginx
2022-08-13
- I noticed there was high load on CGSpace, around 9 or 10
- Looking at the Munin graphs it seems to just be the last two hours or so, with a slight increase in PostgreSQL connections, firewall traffic, and a more noticeable increase in CPU
- DSpace sessions are normal
- The number of unique hosts making requests to nginx is pretty low, though it’s only 6AM in the server’s time
- I see one IP in Sweden making a lot of requests with a normal user agent: 80.248.237.167
- This host is on Internet Vikings (INTERNETBOLAGET), and I see 140,000 requests from them in Solr
- I see reports of excessive scraping on AbuseIPDB.com
- I’m gonna add their 80.248.224.0/20 to the bot-networks.conf in nginx
- I will also purge all the hits from this IP in Solr statistics
- I also see the core.ac.uk bot making tens of thousands of requests today, but we are already tagging that as a bot in Tomcat’s Crawler Session Manager valve, so they should be sharing a Tomcat session with other bots and not creating too many sessions
2022-08-15
- Start indexing on AReS
- Add CONSERVATION to ILRI subjects on CGSpace
- I see that AGROVOC has
conservation agriculture
and I suggested that we use that instead
2022-08-17
- Peter and Jose sent more feedback about the CRP Innovation records from MARLO
- We expanded the CRP names in the citation and removed the
cg.identifier.url
URLs because they are ugly and will stop working eventually
- The mappings of MARLO links will be done internally with the
cg.number
IDs like “IN-1119” and the Handle URIs
2022-08-18
- I talked to Jose about the CCAFS MARLO records
- He still hasn’t finished re-processing the PDFs to update the internal MARLO links
- I started looking at the other records (MELIAs, OICRs, Policies) and found some minor issues in the MELIAs so I sent feedback to Jose
- On second thought, I opened the MELIAs file in OpenRefine and it looks OK, so this must have been a parsing issue in LibreOffice when I was checking the file (or perhaps I didn’t use the correct quoting when importing)
- Import the original MELIA v2 CSV file into OpenRefine to fix encoding before processing with csvcut/csvjoin
- Then extract the IDs and filenames from the original V2 file and join with the UTF-8 file:
$ csvcut -c 'cg.number (series/report No.)',File ~/Downloads/MELIA-Metadata-v2-csv.csv > MELIA-v2-IDs-Files.csv
$ csvjoin -c 'cg.number (series/report No.)' MELIAs\ metadata\ utf8\ 20220816_JM.csv MELIA-v2-IDs-Files.csv > MELIAs-UTF-8-with-files.csv
- Then I imported them into OpenRefine to start metadata cleaning and enrichment
- Make some minor changes to cgspace-submission-guidelines
- Upgrade to Bootstrap v5.2.0
- Dedupe value pairs and controlled vocabularies before writing them
- Sort the controlled vocabularies before writing them (we don’t do this for value pairs because some are added in specific order, like CRPs)
2022-08-19
- Peter Ballantyne sent me metadata for 311 Gender items that need to be duplicate checked on CGSpace before uploading
- I spent a half an hour in OpenRefine to fix the dates because they only had YYYY, but most abstracts and titles had more specific information about the date
- Then I checked for duplicates:
$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv
- I sent the list of ~130 possible duplicates to Peter to check
- Jose sent new versions of the MARLO Innovation/MELIA/OICR/Policy PDFs
- The idea was to replace tinyurl links pointing to MARLO, but I still see many tinyurl links, some of which point to CGIAR Sharepoint and require a login
- I asked them why they don’t just use the original links in the first place in case tinyurl.com disappears
- I continued working on the MARLO MELIA v2 UTF-8 metadata
- I did the same metadata enrichment exercise to extract countries and AGROVOC subjects from the abstract field that I did earlier this month, using a Jython expression to match terms in copies of the abstract field
- It helps to replace some characters with spaces first with this GREL:
value.replace(/[.\/;(),]/, " ")
- This caught some extra AGROVOC terms, but unfortunately we only check for single-word terms
- Then I checked for existing items on CGSpace matching these MELIA using my duplicate checker:
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv
- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv
- I had to use
xsv
because csvcut
was throwing an error detecting the dialect of the input CSVs (?)
- I created a SAF bundle and imported the 749 MELIAs to DSpace Test
- I found thirteen items on CGSpace with dates in format “DD/MM/YYYY” so I fixed those
2022-08-20
- Peter sent me back the results of the duplicate checking on the Gender presentations
- There were only a handful of duplicates, so I used the IDs in the spreadsheet to flag and delete them in OpenRefine
- I had a new idea about matching AGROVOC subjects and countries in OpenRefine
- I was previously splitting up the text value field (title/abstract/etc) by spaces and searching for each word in the list of terms/countries like this:
with open(r"/tmp/cgspace-countries.txt",'r') as f :
countries = [name.rstrip().lower() for name in f]
return "||".join([x for x in value.split(' ') if x.lower() in countries])
- But that misses multi-word terms/countries with spaces, so we can search the other way around by using a regex for each term/country and checking if it appears in the text value field:
import re
with open(r"/tmp/agrovoc-subjects.txt",'r') as f :
terms = [name.rstrip().lower() for name in f]
return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
- Now we are only limited by our small (~1,400) list of AGROVOC subjects, so I did an export from PostgreSQL of all
dcterms.subjects
values and am looking them up against AGROVOC’s API right now:
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2022-08-20-agrovoc.csv WITH CSV HEADER;
COPY 21685
$ csvcut -c 1 /tmp/2022-08-20-agrovoc.csv | sed 1d > /tmp/all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/all-subjects.txt -o 2022-08-20-all-subjects-results.csv
$ csvgrep -c 'number of matches' -m 0 -i /tmp/2022-08-20-all-subjects-results.csv.bak | csvcut -c 1 | sed 1d > /tmp/agrovoc-subjects.txt
$ wc -l /tmp/agrovoc-subjects.txt
11834 /tmp/agrovoc-subjects.txt
- Then I created a new column joining the title and abstract, and ran the Jython expression above against this new file with 11,000 AGROVOC terms
- Then I joined that column with Peter’s
dcterms.subject
column and then deduplicated it with this Jython:
res = []
[res.append(x) for x in value.split("||") if x not in res]
return "||".join(res)
- This is way better, but you end up getting a bunch of countries, regions, and short words like “gates” matching in AGROVOC that are inappropriate (we typically don’t tag these in AGROVOC) or incorrect (gates, as in windows or doors, not the funding agency)
- I did a text facet in OpenRefine and removed a bunch of these by eye
- Then I finished adding the
dcterms.relation
and CRP metadata flagged by Peter on the Gender presentations
- I’m waiting for him to send me the PDFs and then I will upload them to DSpace Test
2022-08-21
- Start indexing on AReS
- The load on CGSpace was around 5.0 today, and now that I started the harvesting it’s over 10 for an hour now, sigh…
- I’m going to try an experiment to block Googlebot, bingbot, and Yandex for a week to see if the load goes down
2022-08-22
- I tried to re-generate the SAF bundle for the MARLO Innovations after improving the AGROVOC subjects and the v3 PDFs, but six are missing from the v3 zip that are present in the original zip:
- ProjectInnovationSummary-WLE-P500-I78.pdf
- ProjectInnovationSummary-WLE-P452-I699.pdf
- ProjectInnovationSummary-WLE-P518-I696.pdf
- ProjectInnovationSummary-WLE-P442-I740.pdf
- ProjectInnovationSummary-WLE-P516-I647.pdf
- ProjectInnovationSummary-WLE-P438-I585.pdf
- I downloaded them manually using the URLs in the original CSV
- I also uploaded a new version of the MELIAs to DSpace Test
2022-08-23
- Checking the number of items on CGSpace so we can keep an eye on the 100,000 number:
dspace=# SELECT COUNT(uuid) FROM item WHERE in_archive='t';
count
-------
95716
(1 row)
- If I check OAI I see more, but perhaps that counts mapped items multiple times
- Peter said the 303 Gender PPTs were good to go, so I updated the collection mappings and IDs in OpenRefine and then uploaded them to CGSpace:
$ dspace import --add --eperson=fuu@fuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-08-23-gender-ppts.map
2022-08-24
- Start working on the MARLO OICRs
- First I extracted the filenames and IDs from the v2 metadata file, then joined it with the UTF-8 version:
$ xsv select 'cg.number (series/report No.),File' OICRS\ Metadata\ v2.csv > /tmp/OICR-files.csv
$ xsv join --left 'cg.number (series/report No.)' OICRS\ metadata\ utf8\ 20220816_JM.csv 'cg.number (series/report No.)' /tmp/OICR-files.csv > OICRs-UTF-8-with-files.csv
- After that I imported it into OpenRefine for data cleaning
- To enrich the metadata I combined the title and abstract into a new field and then checked my list of 11,000 AGROVOC terms against it
- First, create a new column with this GREL:
cells["dc.title"].value + " " + cells["dcterms.abstract"].value
import re
with open(r"/tmp/agrovoc-subjects.txt",'r') as f :
terms = [name.rstrip().lower() for name in f]
return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
- After that I de-duplicated the terms using this Jython:
res = []
[res.append(x) for x in value.split("||") if x not in res]
return "||".join(res)
- Then I split the multi-values on “||” and used a text facet to remove some countries and other nonsense terms that matched, like “gates” and “al” and “s”
- Then I did the same for countries
- Then I exported the CSV and started searching for duplicates so that I can add them as relations:
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-24-OICRs.csv -u dspace -db dspace -p 'omg' -o /tmp/oicrs-matches.csv
- Oh wow, I actually found one OICR already uploaded to CGSpace… I have to ask Jose about that
2022-08-25
- I started processing the MARLO Policies in OpenRefine, similar to the Innovations, MELIAs, and OICRs above
- I also re-ran the AGROVOC matching on Innovations because my technique has improved since I ran it a few weeks ago
2022-08-29