mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-16 20:07:03 +01:00
531 lines
26 KiB
Markdown
531 lines
26 KiB
Markdown
---
|
||
title: "November, 2022"
|
||
date: 2022-11-01T09:11:36+03:00
|
||
author: "Alan Orth"
|
||
categories: ["Notes"]
|
||
---
|
||
|
||
## 2022-11-01
|
||
|
||
- Last night I re-synced DSpace 7 Test from CGSpace
|
||
- I also updated all my local `7_x-dev` branches on the latest upstreams
|
||
- I spent some time updating the authorizations in Alliance collections
|
||
- I want to make sure they use groups instead of individuals where possible!
|
||
- I reverted the Cocoon autosave change because it was more of a nuissance that Peter can't upload CSVs from the web interface and is a very low severity security issue
|
||
|
||
<!--more-->
|
||
|
||
- I ran FixLowQualityThumbnails from cgspace-java-helpers on some large collections on CGSpace and ended up fixing 194 items!
|
||
- I did some minor checking and uploaded twenty-four IFPRI outputs for the Initiatives to DSpace Test
|
||
- Tim merged my [pull request to override the ImageMagick PDF density in DSpace 7](https://github.com/DSpace/DSpace/pull/8553)
|
||
- I ported it to DSpace 6.x and submitted a pull request: https://github.com/DSpace/DSpace/pull/8560
|
||
|
||
## 2022-11-02
|
||
|
||
- I joined the FAO–CGIAR AGROVOC results sharing meeting
|
||
- From June to October, 2022 we suggested 39 new keywords, added 27 to AGROVOC, 4 rejected, and 9 still under discussion
|
||
- Doing duplicate check on IFPRI's batch upload and I found one duplicate uploaded by IWMI earlier this year
|
||
- I will update the metadata of that item and map it to the correct Initiative collection
|
||
|
||
## 2022-11-03
|
||
|
||
- I added countries to the twenty-three IFPRI items in OpenRefine based on their titles and abstracts (using the Jython trick I learned a few months ago), then added regions using csv-metadata-quality, and uploaded them to CGSpace
|
||
- I exported a list of collections from CGSpace so I can run the thumbnail fixes on each, as we seem to have issues when doing it on (some) large communities like the CRP community:
|
||
|
||
```console
|
||
localhost/dspace= ☘ \COPY (SELECT ds6_collection2collectionhandle(uuid) AS collection FROM collection) to /tmp/collections.txt
|
||
COPY 1268
|
||
```
|
||
|
||
- Then I started a test run on DSpace Test:
|
||
|
||
```console
|
||
$ while read -r collection; do chrt -b 0 dspace dsrun io.github.ilri.cgspace.scripts.FixLowQualityThumbnails $collection | tee -a /tmp/FixLowQualityThumbnails.log; done < /tmp/collections.txt
|
||
```
|
||
|
||
- I'll be curious to check the log after it's all done.
|
||
- After a few hours I see:
|
||
|
||
```console
|
||
$ grep -c 'Action: remove' /tmp/FixLowQualityThumbnails.log
|
||
626
|
||
```
|
||
|
||
- Not bad, because last week I did a more manual selection of collections and deleted ~200
|
||
- I will replicate this on CGSpace soon, and also try the FixJpgJpgThumbnails tool
|
||
- I see that the CIAT Library is still up, so I should really grab all the PDFs before they shut that old server down
|
||
- Export a list of items with PDFs linked there:
|
||
|
||
```console
|
||
localhost/dspacetest= ☘ \COPY (SELECT dspace_object_id,text_value FROM metadatavalue WHERE metadata_field_id=219 AND text_value LIKE '%ciat-library%') to /tmp/ciat-library-items.csv;
|
||
COPY 4621
|
||
```
|
||
|
||
- After stripping the page numbers off I see there are only about 2,700 unique files, and we have to filter the dead JSPUI ones...
|
||
|
||
```console
|
||
$ csvcut -c url 2022-11-03-CIAT-Library-items.csv | sed 1d | grep -v jspui | sort -u | wc -l
|
||
2752
|
||
```
|
||
|
||
- I'm not sure how we'll handle the duplicates because many items are book chapters or something where they share a PDF
|
||
|
||
## 2022-11-04
|
||
|
||
- I decided to check for old pre-ImageMagick thumbnails on CGSpace by finding any bitstreams with the description "Generated Thumbnail":
|
||
|
||
```console
|
||
localhost/dspacetest= ☘ \COPY (SELECT ds6_bitstream2itemhandle(dspace_object_id) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND text_value='Generated Thumbnail') to /tmp/old-thumbnails.txt;
|
||
COPY 1147
|
||
$ grep -v '\\N' /tmp/old-thumbnails.txt > /tmp/old-thumbnail-handles.txt
|
||
$ wc -l /tmp/old-thumbnail-handles.txt
|
||
987 /tmp/old-thumbnail-handles.txt
|
||
```
|
||
|
||
- A bunch of these have `\N` for some reason when I use the `ds6_bitstream2itemhandle` function to get their handles so I had to exclude those...
|
||
- I forced the media-filter for these items on CGSpace:
|
||
|
||
```console
|
||
$ while read -r handle; do JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -i $handle -f -v; done < /tmp/old-thumbnail-handles.txt
|
||
```
|
||
|
||
- Upload some batch records via CSV for Peter
|
||
- Update the about page on CGSpace with new text from Peter
|
||
- Add a few more ORCID identifiers and names to my growing file `2022-09-22-add-orcids.csv`
|
||
- I tagged fifty-four new authors using this list
|
||
- I deleted and mapped one duplicate item for Maria Garruccio
|
||
- I updated the CG Core website from Bootstrap v4.6 to v5.2
|
||
|
||
## 2022-11-07
|
||
|
||
- I did a harvest on AReS last night but it seems that MELSpace's sitemap is broken again because we have 10,000 fewer records
|
||
- I filed [an issue](https://github.com/ecrmnn/iso-3166-1/issues/10) on the iso-3166-1 npm package to update the name of Turkey to Türkiye
|
||
- I also filed [an issue](https://github.com/flyingcircusio/pycountry/issues/148) and [a pull request](https://github.com/flyingcircusio/pycountry/pull/149) on the pycountry package
|
||
- I also filed [an issue](https://github.com/konstantinstadler/country_converter/issues/121) and [a pull request](https://github.com/konstantinstadler/country_converter/pull/122) on the country-converter package
|
||
- I also changed one item on CGSpace that had been submitted since the name was changed
|
||
- I also imported the new iso-codes 4.12.0 into cgspace-java-helpers
|
||
- I also updated it in the DSpace `input-forms.xml`
|
||
- I also forked the iso-3166-1 package from npm and updated Swaziland, Macedonia, and Turkey in my fork
|
||
- I submitted a [pull request](https://github.com/ecrmnn/iso-3166-1/pull/11) to update this upstream
|
||
- Since I was making all these pull requests I also made [one on country-converter for the UN M.49 region "South-eastern Asia"](https://github.com/konstantinstadler/country_converter/pull/123)
|
||
- Port the [ImageMagick PDF cropbox fix](https://github.com/DSpace/DSpace/pull/8550) to DSpace 6.x
|
||
- I deployed it on CGSpace, ran all updates, and rebooted the host
|
||
- I ran the filter-media script on one large collection where many of these PDFs with cropbox issues exist:
|
||
|
||
```console
|
||
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/78 >& /tmp/filter-media-cropbox.log
|
||
```
|
||
|
||
- But looking at the items it processed, I'm not sure it's working as expected
|
||
- I looked at a few dozen
|
||
- I found some links to the Bioversity website on CGSpace that are not redirecting properly:
|
||
|
||
```console
|
||
$ http --print Hh http://www.bioversityinternational.org/nc/publications/publication/issue/geneflow_2004.html
|
||
GET /nc/publications/publication/issue/geneflow_2004.html HTTP/1.1
|
||
Accept: */*
|
||
Accept-Encoding: gzip, deflate
|
||
Connection: keep-alive
|
||
Host: www.bioversityinternational.org
|
||
User-Agent: HTTPie/3.2.1
|
||
|
||
HTTP/1.1 302 Found
|
||
Connection: Keep-Alive
|
||
Content-Length: 275
|
||
Content-Type: text/html; charset=iso-8859-1
|
||
Date: Mon, 07 Nov 2022 16:35:21 GMT
|
||
Keep-Alive: timeout=15, max=100
|
||
Location: https://www.bioversityinternational.orgnc/publications/publication/issue/geneflow_2004.html
|
||
Server: Apache
|
||
```
|
||
|
||
- The `Location` header is clearly wrong, and if I try https directly I get an HTTP 500
|
||
|
||
## 2022-11-08
|
||
|
||
- Looking at the Solr statistics hits on CGSpace for 2022-11
|
||
- I see 221.219.100.42 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
|
||
- I see 122.10.101.60 is in Hong Kong and making thousands of requests to XMLUI handles in a few hours, using a normal user agent
|
||
- I see 135.125.21.38 on OVH is making thousands of requests trying to do SQL injection
|
||
- I see 163.237.216.11 is somewhere in California making thousands of requests with a normal user agent
|
||
- I see 51.254.154.148 on OVH is making thousands of requests trying to do SQL injection
|
||
- I see 221.219.103.211 is on China Unicom and was making thousands of requests to XMLUI in a few hours, using a normal user agent
|
||
- I see 216.218.223.53 on Hurricane Electric making thousands of requests to XMLUI in a few minutes using a normal user agent
|
||
- I will purge all these hits and proably add China Unicom's subnet mask to my nginx `bot-network.conf` file to tag them as bots since there are SO many bad and malicious requests coming from there
|
||
|
||
```console
|
||
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
|
||
Purging 8975 hits from 221.219.100.42 in statistics
|
||
Purging 7577 hits from 122.10.101.60 in statistics
|
||
Purging 6536 hits from 135.125.21.38 in statistics
|
||
Purging 23950 hits from 163.237.216.11 in statistics
|
||
Purging 4093 hits from 51.254.154.148 in statistics
|
||
Purging 2797 hits from 221.219.103.211 in statistics
|
||
Purging 2618 hits from 216.218.223.53 in statistics
|
||
|
||
Total number of bot hits purged: 56546
|
||
```
|
||
|
||
- Also interesting to see a few new user agents:
|
||
- `RStudio Desktop (2022.7.1.554); R (4.2.1 x86_64-w64-mingw32 x86_64 mingw32)`
|
||
- `rstudio.cloud R (4.2.1 x86_64-pc-linux-gnu x86_64 linux-gnu)`
|
||
- `MEL`
|
||
- `Gov employment data scraper ([[your email]])`
|
||
- `RStudio Desktop (2021.9.0.351); R (4.1.1 x86_64-w64-mingw32 x86_64 mingw32)`
|
||
- I will purge all these:
|
||
|
||
```console
|
||
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
|
||
Purging 6155 hits from RStudio in statistics
|
||
Purging 1929 hits from rstudio in statistics
|
||
Purging 1454 hits from MEL in statistics
|
||
Purging 1094 hits from Gov employment data scraper in statistics
|
||
|
||
Total number of bot hits purged: 10632
|
||
```
|
||
|
||
- Work on the CIAT Library items a bit again in OpenRefine
|
||
- I flagged items with:
|
||
- URL containing "#page" at the end (these are linking to book chapters, but we don't want to upload the PDF multiple times)
|
||
- Same URL used by more than one item ("Duplicates" facet in OpenRefine, these are some corner case I don't want to handle right now)
|
||
- URL containing ":8080" to CIAT's old DSpace (this server is no longer live)
|
||
- I want to try to handle the simple cases that should cover most of the items first
|
||
|
||
## 2022-11-09
|
||
|
||
- Continue working on the Python script to upload PDFs from CIAT Library to the relevant item on CGSpace
|
||
- I got the basic functionality working
|
||
|
||
## 2022-11-12
|
||
|
||
- Start a harvest on AReS
|
||
|
||
## 2022-11-15
|
||
|
||
- Meeting with Marie-Angelique, Sara, and Valentina about CG Core types
|
||
- We agreed to continue adding the feedback for each of the proposed actions
|
||
- The others will start filling in definitions for the types
|
||
- Sara had some good questions about duplicates on CGSpace and how we can possibly prevent them now that several systems are submitting items directly into the repository
|
||
- We need to be careful especially with regards to author's outputs that will be reported in the PRMS
|
||
|
||
## 2022-11-16
|
||
|
||
- Maria asked if we can extend the timeout for XMLUI sessions
|
||
- According to [this issue](https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44) it seems to be 30 minutes by default, as a Tomcat default
|
||
- I think we could extend this to an hour, as there is no real security risk (we're not a bank) and most user's lock screens would have activated after ten minutes or so anyways
|
||
|
||
## 2022-11-20
|
||
|
||
- Start a harvest on AReS
|
||
|
||
## 2022-11-22
|
||
|
||
- Check and upload some items to CGSpace for Peter
|
||
- I am waiting for some feedback from him about some duplicates and metadata issues for the rest
|
||
|
||
## 2022-11-23
|
||
|
||
- Fix some authorization issues for ABC's TIP submit tool on DSpace Test (the groups were correct on CGSpace, but not on test)
|
||
- Peter sent me feedback about the duplicates and metadata questions from yesterday
|
||
- I uploaded the eight items for COHESA and sixty-two for Gender
|
||
- I ran the script to tag ORCID identifiers with my `2022-09-22-add-orcids.csv` file and tagged twenty-seven
|
||
- Maria asked for help uploading a large PDF to CGSpace
|
||
- The PDF is only two pages, but it is 139MB!
|
||
- I decided to compress it with GhostScript, first with the screen profile (72dpi), then with the ebook profile (150dpi):
|
||
|
||
```console
|
||
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-screen.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
|
||
$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=Key\ facts\ from\ a\ traditional\ colombian\ food\ market-ebook.pdf Key\ facts\ from\ a\ traditional\ colombian\ food\ market.pdf
|
||
```
|
||
|
||
- The ebook one looks really good and is only 2.4MB...
|
||
- But for reference, this free Adobe tool seems to work: https://www.adobe.com/acrobat/online/compress-pdf.html
|
||
|
||
## 2022-11-24
|
||
|
||
- My script finished downloading the CIAT Library PDFs
|
||
- I did some more work on my `post-ciat-pdfs.py` script and tested uploading the items to my local DSpace and DSpace Test
|
||
- Then I ran the script on CGSpace, uploading ~1,500 PDFs to to existing items
|
||
|
||
## 2022-11-25
|
||
|
||
- Tony Murray, who is working on IFPRI's CGSpace integration, emailed me to ask some questions about the REST API
|
||
- Oh no, I realized there is a logic issue with the PDFbox cropbox code I added a few weeks ago:
|
||
|
||
```console
|
||
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v -f -i 10568/77010
|
||
The following MediaFilters are enabled:
|
||
Full Filter Name: org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
|
||
org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter
|
||
Loading @mire database changes for module MQM
|
||
Changes have been processed
|
||
IM Thumbnail tropentag2016_marshall.pdf is replacable.
|
||
File: tropentag2016_marshall.pdf.jpg
|
||
ERROR filtering, skipping bitstream:
|
||
|
||
Item Handle: 10568/77010
|
||
Bundle Name: ORIGINAL
|
||
File Size: 1486580
|
||
Checksum: 1ad66d918a56a5e84667386e1a32e352 (MD5)
|
||
Asset Store: 0
|
||
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
|
||
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
|
||
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:325)
|
||
at org.apache.pdfbox.pdmodel.PDPageTree.get(PDPageTree.java:248)
|
||
at org.apache.pdfbox.pdmodel.PDDocument.getPage(PDDocument.java:1543)
|
||
at org.dspace.app.mediafilter.ImageMagickThumbnailFilter.getImageFile(ImageMagickThumbnailFilter.java:167)
|
||
at org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.getDestinationStream(ImageMagickPdfThumbnailFilter.java:27)
|
||
at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
|
||
at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
|
||
at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
|
||
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
|
||
at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:232)
|
||
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
|
||
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
|
||
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
|
||
at java.lang.reflect.Method.invoke(Method.java:498)
|
||
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
|
||
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
|
||
```
|
||
|
||
- Salem gave me a list of CGSpace collections that have double spaces in the names
|
||
- Normally this would only be a minor annoyance, but he discovered that the REST API seems to trim the spaces, which causes an issue when trying to reference them!
|
||
- He sent me a list of about ten collection UUIDs so I fixed them
|
||
- I found a bunch of LIVES presentations on CGSpace that have presentations on SlideShare with incorrect licenses... I updated about fifty of them
|
||
|
||
## 2022-11-26
|
||
|
||
- Sync DSpace Test with CGSpace
|
||
- I increased the session timeout in Tomcat from thirty minutes to sixty, as requested by Maria a few weeks ago
|
||
- See: https://gitlab.inf.unibz.it/commul/docker/clarin-dspace/-/issues/44
|
||
- I re-built DSpace on CGSpace, ran all updates, and rebooted the machine
|
||
- Then after coming back up the handle server won't start
|
||
- The `handle-server.log` file shows:
|
||
|
||
```console
|
||
Shutting down...
|
||
"2022/11/26 02:12:17 CET" 25 Rotating log files
|
||
Error: null
|
||
(see the error log for details.)
|
||
```
|
||
|
||
- In the `error.log` file I see:
|
||
|
||
```console
|
||
"2022/11/26 02:12:18 CET" 25 Started new run.
|
||
java.lang.UnsupportedOperationException
|
||
at java.lang.Runtime.runFinalizersOnExit(Runtime.java:287)
|
||
at java.lang.System.runFinalizersOnExit(System.java:1059)
|
||
at net.handle.server.Main.initialize(Main.java:124)
|
||
at net.handle.server.Main.main(Main.java:75)
|
||
Shutting down...
|
||
```
|
||
|
||
- Ah, it seems to be due to an [issue in OpenJDK 1.8.0_352](https://groups.google.com/g/dspace-tech/c/PqjfA5mqG4w/m/FhxI5oXhFwAJ?pli=1)
|
||
- I see the server upgraded to the new JDK version on 2022-11-10:
|
||
|
||
```console
|
||
Upgrade: openjdk-8-jdk-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04), openjdk-8-jre-headless:amd64 (8u342-b07-0ubuntu1~20.04, 8u352-ga-1~20.04)
|
||
End-Date: 2022-11-10 04:10:45
|
||
```
|
||
|
||
- As highlighted in the dspace-tech mailing list thread above, [this OpenJDK release deprecated `Runtime.runFinalizersOnExit`](https://mail.openjdk.org/pipermail/jdk8u-dev/2022-October/015706.html):
|
||
|
||
```console
|
||
- JDK-8287132: Retire Runtime.runFinalizersOnExit so that it always throws UOE
|
||
```
|
||
|
||
- I downloaded the previous versions of the packages from Launchpad:
|
||
|
||
```console
|
||
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jdk-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
|
||
# wget https://launchpad.net/~openjdk-security/+archive/ubuntu/ppa/+build/24195357/+files/openjdk-8-jre-headless_8u342-b07-0ubuntu1~20.04_amd64.deb
|
||
# dpkg -i openjdk-8-j*8u342-b07*.deb
|
||
```
|
||
|
||
- Then the handle-server process starts up fine, so I held these OpenJDK versions for now:
|
||
|
||
```console
|
||
# apt-mark hold openjdk-8-jdk-headless:amd64 openjdk-8-jre-headless:amd64
|
||
openjdk-8-jdk-headless set on hold.
|
||
openjdk-8-jre-headless set on hold.
|
||
```
|
||
|
||
- Start a harvest on AReS
|
||
|
||
## 2022-11-27
|
||
|
||
- I realized I made a mistake in the PDF CropBox code I wrote for dspace-api a few weeks ago
|
||
- For PDFs with only one page I was seeing this in the filter-media output:
|
||
|
||
```console
|
||
java.lang.IndexOutOfBoundsException: 1-based index out of bounds: 2
|
||
```
|
||
|
||
- It turns out that [PDDocument's getPage() is zero-based](https://javadoc.io/static/org.apache.pdfbox/pdfbox/2.0.27/org/apache/pdfbox/pdmodel/PDDocument.html#getPage-int-)
|
||
- I also updated PDFBox from 2.0.24 to 2.0.27
|
||
- I synced DSpace 7 Test with CGSpace
|
||
- I had to follow my notes from 2022-03 to delete the missing Atmire migrations
|
||
|
||
## 2022-11-28
|
||
|
||
- Update `ilri/fix-metadata-values.py` to update the `last_modified` date for items when it updates metadata
|
||
- This should allow us to use the normal `index-discovery` (with out `-b`) as well as having REST API responses showing a correct last modified date
|
||
- Maria asked me to add some ORCID identifiers for Alliance staff to the controlled vocabulary
|
||
- I also updated the `add-orcid-identifiers-csv.py` to update the `last_modified` timestamp of the item
|
||
- I re-factored my CGSpace Python scripts to use a helper `util.py` module with common functions
|
||
- For now it only has the one for updating an item's `last_modified` timestamp but I will gradually add more
|
||
- I also ran our list of ORCID identifiers against ORCID's API to see if anyone changed their name format
|
||
- Then I ran them on CGSpace with `ilri/update-orcids.py` to fix them
|
||
- Normalize the `text_lang` values for CGSpace metadata again:
|
||
|
||
```console
|
||
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
|
||
text_lang │ count
|
||
───────────┼─────────
|
||
en_US │ 2912429
|
||
│ 108387
|
||
en │ 12457
|
||
fr │ 2
|
||
vi │ 2
|
||
es │ 1
|
||
␀ │ 0
|
||
(7 rows)
|
||
|
||
Time: 624.651 ms
|
||
localhost/dspacetest= ☘ BEGIN;
|
||
BEGIN
|
||
Time: 0.130 ms
|
||
localhost/dspacetest= ☘ UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '');
|
||
UPDATE 120844
|
||
Time: 4074.879 ms (00:04.075)
|
||
localhost/dspacetest= ☘ SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
|
||
text_lang │ count
|
||
───────────┼─────────
|
||
en_US │ 3033273
|
||
fr │ 2
|
||
vi │ 2
|
||
es │ 1
|
||
␀ │ 0
|
||
(5 rows)
|
||
|
||
Time: 346.913 ms
|
||
localhost/dspacetest= ☘ COMMIT;
|
||
```
|
||
|
||
- Discussing the UN M.49 regions on CGSpace with Valentina and Abenet
|
||
- The PRMS team is confused about our regions, which are mostly UN M.49 with some legacy stuff using different ones
|
||
- I think we can fix all the stuff for Initiatives from this year very easily, then work on the legacy stuff later
|
||
- Also, I noticed that that [country_converter was using the wrong UN M.49 region for Myanmar](https://github.com/konstantinstadler/country_converter/issues/124)
|
||
- I submitted a [pull request](https://github.com/konstantinstadler/country_converter/pull/125)
|
||
- I exported a CSV of the Initiatives and ran the csv-metadata-quality script to add missing UN M.49 regions
|
||
- To make sure everything was correct I got a list of the changes from csv-metadata-quality and checked them all manually on the UN M.49 site, just in case there was another bug in country_converter
|
||
- This fixed regions for about fifty items
|
||
- I dumped the UN M.49 regions from the CSV on the UNSD website:
|
||
|
||
```console
|
||
$ csvcut -d";" -c 'Region Name,Sub-region Name,Intermediate Region Name' ~/Downloads/UNSD\ —\ Methodology.csv | sed -e 1d -e 's/,/\n/g' | sort -u
|
||
|
||
Africa
|
||
Americas
|
||
Asia
|
||
Australia and New Zealand
|
||
Caribbean
|
||
Central America
|
||
Central Asia
|
||
Channel Islands
|
||
Eastern Africa
|
||
Eastern Asia
|
||
Eastern Europe
|
||
Europe
|
||
Latin America and the Caribbean
|
||
Melanesia
|
||
Micronesia
|
||
Middle Africa
|
||
Northern Africa
|
||
Northern America
|
||
Northern Europe
|
||
Oceania
|
||
Polynesia
|
||
South America
|
||
South-eastern Asia
|
||
Southern Africa
|
||
Southern Asia
|
||
Southern Europe
|
||
Sub-Saharan Africa
|
||
Western Africa
|
||
Western Asia
|
||
Western Europe
|
||
```
|
||
|
||
- For now I will combine it with our existing list, which contains a few legacy regions, while we discuss about a long-term plan with Peter and Abenet
|
||
- Peter wrote to ask me to change the PIM CRP's full name from `Policies, Institutions and Markets` to `Policies, Institutions, and Markets`
|
||
- It's apparently the only CRP with an Oxford comma...?
|
||
- I updated them all on CGSpace
|
||
- Also, I ran an `index-discovery` without the `-b` since now my metadata update scripts update the `last_modified` timestamp as well and it finished in fifteen minutes, and I see the changes in the Discovery search and facets
|
||
|
||
## 2022-11-29
|
||
|
||
- Meeting with Marie-Angelique, Abenet, Sara, Valentina, and Margarita about `dcterms.type` for CG Core
|
||
- We discussed some of the feedback from Peter
|
||
- Peter and Abenet and I agreed to update some of our metadata in response to the PRMS feedback
|
||
- I updated Pacific to Oceania, and Central Africa to Middle Africa, and removed the old ones from the submission form
|
||
- These are UN M.49 regions
|
||
|
||
## 2022-11-30
|
||
|
||
- I ran csv-metadata-quality on an export of the ILRI community on CGSpace, but only with title, country, and region fields
|
||
- It fixed some whitespace issues and added missing regions to about 1,200 items
|
||
- I thought of a way to delete duplicate metadata values, since the CSV upload method can't detect them correctly
|
||
- First, I wrote a [SQL query](https://chartio.com/learn/databases/how-to-find-duplicate-values-in-a-sql-table/) to identify metadata falues with the same `text_value`, `metadata_field_id`, and `dspace_object_id`:
|
||
|
||
```console
|
||
\COPY (SELECT a.text_value, a.metadata_value_id, a.metadata_field_id, a.dspace_object_id
|
||
FROM metadatavalue a
|
||
JOIN (
|
||
SELECT dspace_object_id, text_value, metadata_field_id, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id NOT IN (11, 12, 28, 136, 159) GROUP BY dspace_object_id, text_value, metadata_field_id HAVING COUNT(*) > 1
|
||
) b
|
||
ON a.dspace_object_id = b.dspace_object_id
|
||
AND a.text_value = b.text_value
|
||
AND a.metadata_field_id = b.metadata_field_id
|
||
ORDER BY a.text_value) TO /tmp/duplicates.txt
|
||
```
|
||
|
||
- (This query excludes metadata for accession and available dates, provenance, format, etc)
|
||
- Then, I sorted the file by fields four and one (`dspace_object_id` and `text_value`) so that the duplicate metadata for each item were next to each other, used awk to print the second field (`metadata_field_id`) from every _other_ line, and created a SQL script to delete the metadata
|
||
|
||
```console
|
||
$ sort -k4,1 /tmp/duplicates.txt | \
|
||
awk -F'\t' 'NR%2==0 {print $2}' | \
|
||
sed 's/^\(.*\)$/DELETE FROM metadatavalue WHERE metadata_value_id=\1;/' > /tmp/delete-duplicates.sql
|
||
```
|
||
|
||
- This worked very well, but there were some metadata values that were tripled or quadrupled, so it only deleted the first duplicate
|
||
- I just ran it again two more times to find the last duplicates, now we have none!
|
||
- I also generated another SQL file with commands to update the last modified timestamps of these items:
|
||
|
||
```console
|
||
$ awk -F'\t' '{print $4}' /tmp/duplicates.txt | sort -u | sed "s/^\(.*\)$/UPDATE item SET last_modified=NOW() WHERE uuid='\1';/" > /tmp/update-timestamp.sql
|
||
```
|
||
|
||
- Tezira said she was having trouble archiving submissions
|
||
- In the afternoon I looked and found a high number of locks:
|
||
|
||
```console
|
||
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c | sort -n
|
||
60 dspaceCli
|
||
176 dspaceApi
|
||
1194 dspaceWeb
|
||
```
|
||
|
||
![PostgreSQL database locks](/cgspace-notes/2022/11/postgres_locks_cgspace-day.png)
|
||
|
||
- The timing looks suspiciously close to when I was running the batch updates on the ILRI community this morning.
|
||
- I restarted Tomcat and PostgreSQL and everything was back to normal
|
||
- I found some items on CGSpace in Dinka, Ndogo, and Bari languages, but the `dcterms.language` field was "other"
|
||
- That's so unfortunate! These languages are not in ISO 639-1, but they are in ISO 639-3, which uses Alpha 3 and has more space for languages
|
||
- I changed them from other to use the three-letter codes, and I will suggest to the CG Core group that we use ISO 639-3 in the future
|
||
- Send feedback to Salem about some metadata issues with MEL submissions to CGSpace
|
||
|
||
<!-- vim: set sw=2 ts=2: -->
|