CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

February, 2024

2024-02-05

dspace=# BEGIN;
BEGIN
dspace=*# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 180
dspace=*# COMMIT;
COMMIT

2024-02-06

  • Discuss IWMI using the CGSpace REST API for their new website
  • Export the IWMI community to extract their ORCID identifiers:
$ dspace metadata-export -i 10568/16814 -f /tmp/iwmi.csv
$ csvcut -c 'cg.creator.identifier,cg.creator.identifier[en_US]' ~/Downloads/2024-02-06-iwmi.csv \
  | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' \
  | sort -u \
  | tee /tmp/iwmi-orcids.txt \
  | wc -l
353
$ ./ilri/resolve_orcids.py -i /tmp/iwmi-orcids.txt -o /tmp/iwmi-orcids-names.csv -d
  • I noticed some similar looking names in our list so I clustered them in OpenRefine and manually checked a dozen or so to update our list
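  • As a rough illustration of what the clustering step does, here is a minimal shell sketch of a key-collision "fingerprint": lowercase the name, strip punctuation, then sort and de-duplicate the tokens. The helper names fingerprint and cluster_candidates are made up for this note, the sketch only handles ASCII, and OpenRefine's own implementation does more normalization than this:

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of OpenRefine or the ilri scripts.

# Print an order-insensitive, punctuation-insensitive key for one name.
fingerprint() {
    printf '%s\n' "$1" \
        | LC_ALL=C tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C tr -c '[:alnum:]\n' ' ' \
        | tr -s ' ' '\n' \
        | sed '/^$/d' \
        | LC_ALL=C sort -u \
        | paste -sd ' ' -
}

# Read one name per line on stdin and print each group of distinct
# spellings that share a fingerprint, separated by " | ".
cluster_candidates() {
    local name
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        printf '%s\t%s\n' "$(fingerprint "$name")" "$name"
    done \
        | LC_ALL=C sort -u \
        | awk -F '\t' '
            { count[$1]++; names[$1] = names[$1] (names[$1] ? " | " : "") $2 }
            END { for (key in count) if (count[key] > 1) print names[key] }
        ' \
        | LC_ALL=C sort
}
```

    • Anything this prints is only a candidate for merging, so each group still needs a manual check as described above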

2024-02-07

  • Maria asked me about the “missing” item from last week again
    • I can see it when I use the Admin search, but not in her workflow
    • It was submitted by TIP so I checked that user’s workspace and found it there
    • After depositing, it went into the workflow so Maria should be able to see it now

2024-02-09

  • Minor edits to CGSpace submission form
  • Upload 55 ISNAR book chapters to CGSpace from Peter

2024-02-19

2024-02-20

  • Minor work on OpenRXV to fix a bug in the ng-select drop downs
  • Minor work on the DSpace 7 nginx configuration to allow requesting robots.txt and sitemaps without hitting rate limits

2024-02-21

  • Minor updates on OpenRXV, including one bug fix for missing mapped collections
    • Salem had to re-work the harvester for DSpace 7 since the mapped collections and parent collection list are separate!

2024-02-22

  • Discuss tagging of datasets and re-work the submission form to encourage use of the DOI field for any item that has a DOI, and the normal URL field if not
    • The “cg.identifier.dataurl” field will be used for “related” datasets
    • I still have to check and move some metadata for existing datasets

2024-02-23

  • This morning Tomcat died due to an OOM kill from the kernel:
kernel: Out of memory: Killed process 698 (java) total-vm:14151300kB, anon-rss:9665812kB, file-rss:320kB, shmem-rss:0kB, UID:997 pgtables:20436kB oom_score_adj:0
  • I don’t see any abnormal pattern in my Grafana graphs, for JVM or system load… very weird
  • I updated the submission form on CGSpace to include the new changes to URLs for datasets
    • I also updated about 80 datasets to move the URLs to the correct field

2024-02-25

  • This morning Tomcat died while I was doing a CSV export, with an OOM kill from the kernel:
kernel: Out of memory: Killed process 720768 (java) total-vm:14079976kB, anon-rss:9301684kB, file-rss:152kB, shmem-rss:0kB, UID:997 pgtables:19488kB oom_score_adj:0
  • I don’t know why this is happening so often recently…
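  • To compare the OOM kills it helps to pull the PID and resident memory out of the kernel messages. A minimal sketch, assuming the message format shown above; oom_summary is a hypothetical helper name, and the input would come from wherever the kernel log is kept on the host:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log lines on stdin and print the PID,
# process name and anonymous resident memory (in GiB) of each OOM kill.
oom_summary() {
    sed -nE 's/.*Killed process ([0-9]+) \(([^)]+)\).*anon-rss:([0-9]+)kB.*/\1 \2 \3/p' \
        | awk '{ printf "pid=%s name=%s anon_rss_gib=%.2f\n", $1, $2, $3 / 1048576 }'
}
```

    • Fed the two messages from 2024-02-23 and 2024-02-25, this reports roughly 9.22 GiB and 8.87 GiB of anon-rss for the java process at the time it was killed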

2024-02-27

  • IFPRI sent me a list of authors to add to our list for now, until we can find a better way of doing it
    • I extracted the existing authors from our controlled vocabulary and combined them with IFPRI’s:
$ xmllint --xpath '//node/isComposedBy/node()' dspace/config/controlled-vocabularies/dc-contributor-author.xml \
  | grep -oE 'label=".*"' \
  | sed -e 's/label="//' -e 's/"$//' > /tmp/authors
$ cat /tmp/authors /tmp/ifpri-authors | sort -u > /tmp/new-authors

2024-02-28

  • I figured out a way to add a new Angular component to handle all our relation fields

2024-02-29

  • Clean up a bunch of metadata on CGSpace
$ csvcut -c dcterms.publisher ~/Downloads/2024-01-09-publishers4.csv | sed -e 1d -e 's/"//g' > /tmp/top-publishers.txt
localhost/dspace7= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2024-01-09-orcid-identifiers.txt;
localhost/dspace7= ☘ \q
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2024-01-09-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-09-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-09-orcids.txt -o /tmp/2024-01-09-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2024-01-09-orcids-names.txt -db dspace -u dspace -p bahhhh
2024-01-09 06:23:35,893 ERROR unknown unknown org.dspace.authorize.AuthorizeServiceImpl @ Failed getting getting community/collection admin status for bahhhhh@cgiar.org The search error is: Error from server at http://localhost:8983/solr/search: org.apache.solr.search.SyntaxError: Cannot parse 'search.resourcetype:Community AND (admin:eef481147-daf3-4fd2-bb8d-e18af8131d8c OR admin:g80199ef9-bcd6-4961-9512-501dea076607 OR admin:g4ac29263-cf0c-48d0-8be7-7f09317d50ec OR admin:g0e594148-a0f6-4f00-970d-6b7812f89540 OR admin:g0265b87a-2183-4357-a971-7a5b0c7add3a OR admin:g371ae807-f014-4305-b4ec-f2a8f6f0dcfa OR admin:gdc5cb27c-4a5a-45c2-b656-a399fded70de OR admin:ge36d0ece-7a52-4925-afeb-6641d6a348cc OR admin:g15dc1173-7ddf-43cf-a89a-77a7f81c4cfc OR admin:gc3a599d3-c758-46cd-9855-c98f6ab58ae4 OR admin:g3d648c3e-58c3-4342-b500-07cba10ba52d OR admin:g82bf5168-65c1-4627-8eb4-724fa0ea51a7 OR admin:ge751e973-697d-419c-b59b-5a5644702874 OR admin:g44dd0a80-c1e6-4274-9be4-9f342d74928c OR admin:g4842f9c2-73ed-476a-a81a-7167d8aa7946 OR admin:g5f279b3f-c2ce-4c75-b151-1de52c1a540e OR admin:ga6df8adc-2e1d-40f2-8f1e-f77796d0eecd OR admin:gfdfc1621-382e-437a-8674-c9007627565c OR admin:g15cd114a-0b89-442b-a1b4-1febb6959571 OR admin:g12aede99-d018-4c00-b4d4-a732541d0017 OR admin:gc59529d7-002a-4216-b2e1-d909afd2d4a9 OR admin:gd0806714-bc13-460d-bedd-121bdd5436a4 OR admin:gce70739a-8820-4d56-b19c-f191855479e4 OR admin:g7d3409eb-81e3-4156-afb1-7f02de22065f OR admin:g54bc009e-2954-4dad-8c30-be6a09dc5093 OR admin:gc5e1d6b7-4603-40d7-852f-6654c159dec9 OR admin:g0046214d-c85b-4f12-a5e6-2f57a2c3abb0 OR admin:g4c7b4fd0-938f-40e9-ab3e-447c317296c1 OR admin:gcfae9b69-d8dd-4cf3-9a4e-d6e31ff68731 OR ... admin:g20f366c0-96c0-4416-ad0b-46884010925f)': too many boolean clauses The search resourceType filter was: search.resourcetype:Community
$ dspace dsrun org.dspace.eperson.Groomer -a -b 01/09/2018 -d
$ dspace user -L > /tmp/users-before.txt
$ wc -l /tmp/users-before.txt
8943 /tmp/users-before.txt

2024-01-10

localhost/dspace7= ☘ SELECT DISTINCT text_value AS "cg.identifier.ciatproject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 232 GROUP BY "cg.identifier.ciatproject" ORDER BY count DESC;
 cg.identifier.ciatproject │ count
───────────────────────────┼───────
 D145                      │     4
 LAM_LivestockPlus         │     2
 A215                      │     1
 A217                      │     1
 A220                      │     1
 A223                      │     1
 A224                      │     1
 A227                      │     1
 A229                      │     1
 A230                      │     1
 CLIMATE CHANGE MITIGATION │     1
 LIVESTOCK                 │     1
(12 rows)

Time: 240.041 ms

2024-01-12

localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 GROUP BY "cg.contributor.affiliation" ORDER BY count DESC) to /tmp/2024-01-affiliations.csv WITH CSV HEADER;
COPY 11719
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"-id:/.{36}/",
      "rows":"0"}},
  "response":{"numFound":800167,"start":0,"numFoundExact":true,"docs":[]
  }}
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2010-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1YEAR&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":13,
    "params":{
      "facet.range":"time",
      "q":"-id:/.{36}/",
      "facet.range.gap":"+1YEAR",
      "rows":"0",
      "facet":"true",
      "facet.range.start":"2010-01-01T00:00:00Z",
      "facet.range.end":"NOW"}},
  "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "time":{
        "counts":[
          "2010-01-01T00:00:00Z",0,
          "2011-01-01T00:00:00Z",0,
          "2012-01-01T00:00:00Z",0,
          "2013-01-01T00:00:00Z",0,
          "2014-01-01T00:00:00Z",0,
          "2015-01-01T00:00:00Z",89,
          "2016-01-01T00:00:00Z",11,
          "2017-01-01T00:00:00Z",0,
          "2018-01-01T00:00:00Z",0,
          "2019-01-01T00:00:00Z",0,
          "2020-01-01T00:00:00Z",1339,
          "2021-01-01T00:00:00Z",0,
          "2022-01-01T00:00:00Z",0,
          "2023-01-01T00:00:00Z",653736,
          "2024-01-01T00:00:00Z",144993],
        "gap":"+1YEAR",
        "start":"2010-01-01T00:00:00Z",
        "end":"2025-01-01T00:00:00Z"}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2023-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1MONTH&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":196,
    "params":{
      "facet.range":"time",
      "q":"-id:/.{36}/",
      "facet.range.gap":"+1MONTH",
      "rows":"0",
      "facet":"true",
      "facet.range.start":"2023-01-01T00:00:00Z",
      "facet.range.end":"NOW"}},
  "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "time":{
        "counts":[
          "2023-01-01T00:00:00Z",1,
          "2023-02-01T00:00:00Z",0,
          "2023-03-01T00:00:00Z",0,
          "2023-04-01T00:00:00Z",0,
          "2023-05-01T00:00:00Z",0,
          "2023-06-01T00:00:00Z",0,
          "2023-07-01T00:00:00Z",0,
          "2023-08-01T00:00:00Z",27621,
          "2023-09-01T00:00:00Z",59165,
          "2023-10-01T00:00:00Z",115338,
          "2023-11-01T00:00:00Z",96147,
          "2023-12-01T00:00:00Z",355464,
          "2024-01-01T00:00:00Z",125429],
        "gap":"+1MONTH",
        "start":"2023-01-01T00:00:00Z",
        "end":"2024-02-01T00:00:00Z"}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

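The flat counts array in the Solr responses above alternates bucket start and count, which is hard to scan. A minimal sketch that turns a pretty-printed response like those into one line per non-empty bucket plus a total. facet_buckets is a hypothetical helper, and it assumes one bucket per line as in the output above, so it is not a general JSON parser:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a pretty-printed Solr range-facet response on
# stdin and print "YYYY-MM-DD count" for each non-empty bucket, then a total.
facet_buckets() {
    sed -nE 's/^[[:space:]]*"([0-9]{4}-[0-9]{2}-[0-9]{2})T[0-9:]+Z",([0-9]+).*/\1 \2/p' \
        | awk '$2 > 0 { print $1, $2; total += $2 } END { print "total", total + 0 }'
}
```

Piping the output of one of the curl commands above into facet_buckets should be enough to see at a glance which buckets are non-empty.
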
2024-01-13

2024-01-15

0|dspace-ui  | 1 rules skipped due to selector errors:
0|dspace-ui  |   .custom-file-input:lang(en)~.custom-file-label -> unmatched pseudo-class :lang
# zcat -f /var/log/nginx/*access.log /var/log/nginx/*access.log.1 /var/log/nginx/*access.log.2.gz /var/log/nginx/*access.log.3.gz /var/log/nginx/*access.log.4.gz /var/log/nginx/*access.log.5.gz /var/log/nginx/*access.log.6.gz | awk '{print $1}' | sort -u | tee /tmp/ips.txt | wc -l
196493

2024-01-17

2024-01-18

$ curl http://localhost:8983/solr/statistics/update -H "Content-type: text/xml" --data-binary '<delete><query>uid:3b4eefba-a302-4172-a286-dcb25d70129e</query></delete>'

2024-01-22

2024-01-23

$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml ~/Downloads/IFPRI\ ORCiD\ All.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2024-01-23-orcids.txt
$ ./ilri/resolve_orcids.py -i /tmp/2024-01-23-orcids.txt -o /tmp/2024-01-23-orcids-names.txt -d
$ ./ilri/update_orcids.py -i /tmp/2024-01-23-orcids-names.txt -db dspace -u dspace -p fuuu
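
The grep pattern above also matches strings that merely look like ORCID iDs. One optional sanity check before resolving them is the ISO 7064 MOD 11-2 check digit that ORCID iDs carry in their last character. A minimal sketch; orcid_is_valid is a hypothetical helper and not part of resolve_orcids.py:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the last character of an ORCID iD
# matches its ISO 7064 MOD 11-2 check digit.
orcid_is_valid() {
    local id="${1//-/}"                           # drop the hyphens
    [[ "$id" =~ ^[0-9]{15}[0-9X]$ ]] || return 1  # 15 digits plus a digit or X
    local total=0 i
    for (( i = 0; i < 15; i++ )); do
        total=$(( (total + ${id:i:1}) * 2 ))
    done
    local check=$(( (12 - total % 11) % 11 ))
    if (( check == 10 )); then check="X"; fi      # a remainder of 10 is written as X
    [[ "${id:15:1}" == "$check" ]]
}

# Possible use: list entries whose check digit does not match.
# while IFS= read -r id; do orcid_is_valid "$id" || echo "suspect: $id"; done < /tmp/2024-01-23-orcids.txt
```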

2024-01-26

2024-01-29