CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

November, 2023

2023-11-01

  • Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
    • I improved the filtering and wrote some Python using pandas to merge my sources more reliably
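The notes do not describe the sources or how they were merged, so the following is only a minimal sketch of one way to make a pandas merge more reliable: normalize the join key first, then use `indicator` and `validate` to surface unmatched or duplicated records. The DataFrames, the `doi` key, and the other column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the real sources and columns are not given in the notes.
source_a = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/b", "10.1000/c"],
    "title": ["Paper A", "Paper B", "Paper C"],
})
source_b = pd.DataFrame({
    "doi": ["10.1000/B ", "10.1000/d"],
    "citations": [12, 3],
})


def normalize_doi(series: pd.Series) -> pd.Series:
    """Strip whitespace and lowercase so equivalent keys compare equal."""
    return series.str.strip().str.lower()


source_a["doi"] = normalize_doi(source_a["doi"])
source_b["doi"] = normalize_doi(source_b["doi"])

# indicator=True adds a "_merge" column (left_only / right_only / both);
# validate raises MergeError if the key is duplicated on either side.
merged = source_a.merge(
    source_b, on="doi", how="outer", indicator=True, validate="one_to_one"
)
print(merged[["doi", "_merge"]])
```

Rows marked `left_only` or `right_only` are the ones worth inspecting by hand before trusting the merged result.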

2023-11-02

  • Export CGSpace to check missing Initiative collection mappings
  • Start a harvest on AReS
  • IFPRI contacted us about importing their Slideshare presentations to CGSpace
    • There are ~1,700 of them, dating back as far as 2008
    • I did a quick cleanup of the metadata export from Slideshare (including tagging with some AGROVOC in OpenRefine) and uploaded to DSpace Test

2023-11-03

  • A little bit of work on the CGIAR Climate Change Synthesis
  • Discuss some CGSpace migration plans with Leigh from IFPRI
    • For their Slideshare content we agreed:
      • Exclude private
      • Exclude deleted
      • Exclude non-presentation types
      • Exclude duplicates within the collection for now until we can sort them out
    • That leaves about 1,500 items out of the 1,700
  • I did a duplicate check against CGSpace and found 44 items with 1.0 similarity so I removed those
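The exclusion rules agreed above could be applied to a metadata export roughly as follows. This is a sketch under stated assumptions: the field names (`privacy`, `status`, `type`) and their values are hypothetical, since the actual Slideshare export format is not shown, and "duplicates within the collection" is interpreted here as keeping only the first item with a given normalized title.

```python
# Hypothetical records; the real export columns are not given in the notes.
records = [
    {"id": "1", "title": "Food policy 2008", "privacy": "public", "status": "active", "type": "presentation"},
    {"id": "2", "title": "Internal draft", "privacy": "private", "status": "active", "type": "presentation"},
    {"id": "3", "title": "Old slides", "privacy": "public", "status": "deleted", "type": "presentation"},
    {"id": "4", "title": "Annual report", "privacy": "public", "status": "active", "type": "document"},
    {"id": "5", "title": "food policy 2008 ", "privacy": "public", "status": "active", "type": "presentation"},
]


def is_eligible(record):
    """Apply the first three rules: not private, not deleted, presentations only."""
    return (
        record["privacy"] != "private"
        and record["status"] != "deleted"
        and record["type"] == "presentation"
    )


def filter_records(records):
    """Return eligible records, skipping later copies of an already-seen title."""
    seen_titles = set()
    kept = []
    for record in records:
        if not is_eligible(record):
            continue
        key = " ".join(record["title"].casefold().split())
        if key in seen_titles:
            continue  # duplicate within the collection: set aside for now
        seen_titles.add(key)
        kept.append(record)
    return kept


kept = filter_records(records)
print([record["id"] for record in kept])
```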

2023-11-04

  • Export CGSpace to check for missing Initiative collection mappings
  • I ran through the list of potential duplicates on the IFPRI Slideshare presentations
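How the similarity scores were computed is not stated in these notes. As an illustration only, a title-based check could be sketched with Python's standard `difflib`, where a score of 1.0 means the normalized titles are identical and anything between the threshold and 1.0 is a potential duplicate needing manual review. The function names and the 0.9 threshold are assumptions.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Case-fold and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.casefold().split())


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the normalized titles are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def potential_duplicates(candidates, existing, threshold=0.9):
    """Pair each candidate title with existing titles scoring at or above threshold."""
    matches = []
    for candidate in candidates:
        for other in existing:
            score = similarity(candidate, other)
            if score >= threshold:
                matches.append((candidate, other, score))
    return matches


# Hypothetical titles for illustration.
matches = potential_duplicates(
    ["Food policy report 2010"],
    ["Food policy report 2011", "Maize yields in East Africa"],
)
for candidate, other, score in matches:
    print(f"{score:.3f}  {candidate!r} ~ {other!r}")
```

Comparing every pair is quadratic, which is acceptable for a couple of thousand titles but would need an index for much larger sets.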

2023-11-05

  • Work with Salem to migrate AReS to the new version

2023-11-07

  • DSpace 7 Test went down and there is very high load on the server
    • I saw very high load from Java but didn’t have time to check exactly what was wrong so I just rebooted the host
    • A few hours after restarting, the system went down again, with very high load from Java again
    • I see lots of messages like this in the Tomcat log:
tomcat9[732]: [9955.662s][info   ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
tomcat9[732]: [9955.662s][info   ][gc] GC(6290) Concurrent Mark Cycle 677.558ms
tomcat9[732]: [9955.666s][info   ][gc] GC(6292) To-space exhausted
  • I see some messages in dspace.log about heap space:
Caused by: java.lang.OutOfMemoryError: Java heap space
  • I will increase Tomcat’s heap from 4096m to 5120m
    • A few hours later it happened again, so I increased the heap from 5120m to 6144m
    • Not sure what’s going on today…
  • I tested moving the CGIAR Fund Council community to the CGIAR historic archive on DSpace Test:
$ dspace community-filiator -r -p 10568/83389 -c 10947/2516
$ dspace community-filiator -s -p 10947/2515 -c 10947/2516
$ dspace index-discovery -r 10947/2516
$ dspace index-discovery -r 10947/2515
$ dspace index-discovery -r 10568/83389
$ dspace index-discovery
  • I think this is the minimum we can do to avoid a full Discovery reindex, which is very expensive
  • I helped Maria resize some massive PDFs for upload to CGSpace using Ghostscript prepress mode, as I had done before in September, 2023
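The G1 log lines earlier in this entry can be read more systematically. The sketch below parses a "Pause Full" line to show how little memory the collection reclaimed and how full the heap remained afterwards. The regular expression is written against the exact line format shown above and may not match other JVM logging configurations; the function and field names are assumptions.

```python
import re

# Matches lines like:
#   GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
FULL_GC = re.compile(
    r"GC\((?P<id>\d+)\) Pause Full \((?P<cause>[^)]*)\) "
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<total>\d+)M\) (?P<ms>[\d.]+)ms"
)


def parse_full_gc(line):
    """Return heap statistics for a Full GC pause line, or None if it is not one."""
    match = FULL_GC.search(line)
    if match is None:
        return None
    before = int(match["before"])
    after = int(match["after"])
    total = int(match["total"])
    return {
        "gc_id": int(match["id"]),
        "cause": match["cause"],
        "reclaimed_mb": before - after,
        "occupancy": after / total,  # fraction of heap still in use after GC
        "pause_ms": float(match["ms"]),
    }


line = (
    "tomcat9[732]: [9955.662s][info   ][gc] GC(6291) Pause Full "
    "(G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms"
)
stats = parse_full_gc(line)
print(f"reclaimed {stats['reclaimed_mb']} MB, heap {stats['occupancy']:.1%} full")
```

A full collection that frees only a few megabytes and leaves the heap almost completely occupied is consistent with the OutOfMemoryError messages in dspace.log, though these lines alone do not say whether the cause is a leak or a heap that is simply too small for the workload.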

2023-11-08

  • DSpace 7 Test has very high load again and I see more Java heap space errors in the log
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log-2023-11-07 
35
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log
7
  • I don’t know what is happening… I will increase the heap size from 6144m to 7168m again…