---
title: "November, 2023"
date: 2023-11-02T12:59:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2023-11-01
- Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
- I improved the filtering and wrote some Python using pandas to merge my sources more reliably
## 2023-11-02
- Export CGSpace to check missing Initiative collection mappings
- Start a harvest on AReS
<!--more-->
- IFPRI contacted us about importing their Slideshare presentations to CGSpace
- There are ~1,700 of them, dating back as early as 2008
- I did a quick cleanup of the metadata export from Slideshare (including tagging with some AGROVOC in OpenRefine) and uploaded to DSpace Test
## 2023-11-03
- A little bit of work on the CGIAR Climate Change Synthesis
- Discuss some CGSpace migration plans with Leigh from IFPRI
- For their Slideshare content we agreed:
  - Exclude private
  - Exclude deleted
  - Exclude non-presentation types
  - Exclude duplicates within the collection for now until we can sort them out
- That leaves about 1,500 items out of the 1,700
- I did a duplicate check against CGSpace and found 44 items with 1.0 similarity so I removed those
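The exclusion rules agreed with IFPRI can be sketched as a simple filter. This is only an illustration: the field names (`privacy`, `status`, `type`, `title`) and their values are hypothetical, since the real Slideshare export may use different columns, and keeping the first copy of each duplicated title is an assumption about how "exclude duplicates" was applied.

```python
def filter_slideshare(records):
    """Apply the agreed exclusion rules to a list of dict records.

    All field names and values here are hypothetical placeholders.
    """
    seen_titles = set()
    kept = []
    for rec in records:
        if rec.get("privacy") == "private":
            continue  # exclude private
        if rec.get("status") == "deleted":
            continue  # exclude deleted
        if rec.get("type") != "presentation":
            continue  # exclude non-presentation types
        # Assumption: two items are duplicates within the collection when
        # their titles match after lowercasing and collapsing whitespace.
        key = " ".join(rec.get("title", "").lower().split())
        if key in seen_titles:
            continue  # keep only the first copy for now
        seen_titles.add(key)
        kept.append(rec)
    return kept
```

Comparing `len(records)` with `len(filter_slideshare(records))` would give the kind of before and after counts noted above.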
## 2023-11-04
- Export CGSpace to check for missing Initiative collection mappings
- I ran through the list of potential duplicates on the IFPRI Slideshare presentations
## 2023-11-05
- Work with Salem to migrate AReS to the new version
## 2023-11-07
- DSpace 7 Test went down and there is very high load on the server
- I saw very high load from Java but didn't have time to check exactly what was wrong so I just rebooted the host
- A few hours after restarting, the system went down again, with very high load from Java again
- I see lots of messages like this in the Tomcat log:
```
tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
tomcat9[732]: [9955.662s][info ][gc] GC(6290) Concurrent Mark Cycle 677.558ms
tomcat9[732]: [9955.666s][info ][gc] GC(6292) To-space exhausted
```
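The full GC line is the telling one: the numbers in `4085M->4080M(4096M)` mean the pause reclaimed only 5 MB and left the heap almost completely full. A minimal sketch for pulling those numbers out of such lines, assuming the sizes are always reported in megabytes as they are here (the JVM can also print `K` or `G`):

```python
import re

# Matches the full GC lines seen in the Tomcat log above, e.g.
# "... Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms"
FULL_GC = re.compile(
    r"Pause Full \((?P<cause>[^)]*)\) "
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<total>\d+)M\) "
    r"(?P<ms>[\d.]+)ms"
)


def parse_full_gc(line):
    """Return heap statistics for a full GC log line, or None if it is not one."""
    m = FULL_GC.search(line)
    if m is None:
        return None
    before, after, total = (int(m[k]) for k in ("before", "after", "total"))
    return {
        "cause": m["cause"],
        "freed_mb": before - after,
        "occupancy_after": after / total,
        "pause_ms": float(m["ms"]),
    }


line = (
    "tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full "
    "(G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms"
)
stats = parse_full_gc(line)
print(f"{stats['freed_mb']} MB freed, {stats['occupancy_after']:.1%} full after GC")
# prints: 5 MB freed, 99.6% full after GC
```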
- I see some messages in `dspace.log` about heap space:
```
Caused by: java.lang.OutOfMemoryError: Java heap space
```
- I will increase Tomcat's heap from 4096m to 5120m
- A few hours later it happened again, so I increased the heap from 5120m to 6144m
- Not sure what's going on today...
- I tested moving the CGIAR Fund Council community to the CGIAR historic archive on DSpace Test:
```console
$ dspace community-filiator -r -p 10568/83389 -c 10947/2516
$ dspace community-filiator -s -p 10947/2515 -c 10947/2516
$ dspace index-discovery -r 10947/2516
$ dspace index-discovery -r 10947/2515
$ dspace index-discovery -r 10568/83389
$ dspace index-discovery
```
- I think this is the minimum we can do to avoid a full Discovery reindex, which is very expensive
- I helped Maria resize some massive PDFs for upload to CGSpace using Ghostscript's prepress mode, as I had done before in [September, 2023]({{< relref "2023-09.md" >}})
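For reference, a rough sketch of what a Ghostscript prepress invocation can look like when driven from Python. The exact options used for Maria's files are not recorded in these notes, so this flag set is an assumption built from standard `pdfwrite` options, and the file names are placeholders.

```python
import subprocess


def ghostscript_prepress_cmd(src, dst):
    """Build a Ghostscript command that rewrites src to dst with the prepress preset."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",        # write a new PDF
        "-dPDFSETTINGS=/prepress",  # the "prepress" quality preset
        "-dNOPAUSE",                # do not pause between pages
        "-dBATCH",                  # exit when the input is processed
        "-dQUIET",                  # suppress routine output
        f"-sOutputFile={dst}",
        src,
    ]


def resize_pdf(src, dst):
    """Run Ghostscript (requires `gs` on the PATH) and raise if it fails."""
    subprocess.run(ghostscript_prepress_cmd(src, dst), check=True)
```

Calling `resize_pdf("input.pdf", "output.pdf")` would write the re-encoded file next to the original, leaving the source PDF untouched.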
## 2023-11-08
- DSpace 7 Test has very high load again and I see more Java heap space errors in the log:
```console
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log-2023-11-07
35
# grep -c 'Caused by: java.lang.OutOfMemoryError: Java heap space' /home/dspace7/log/dspace.log
7
```
- I don't know what is happening... I will increase the heap size from 6144m to 7168m again...
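To watch whether the heap increases are helping, the same `grep -c` count could be collected for every log file at once. A small sketch, assuming the rotated logs all live in one directory and match `dspace.log*` as the two files above do:

```python
from pathlib import Path

NEEDLE = "Caused by: java.lang.OutOfMemoryError: Java heap space"


def count_matching_lines(path, needle=NEEDLE):
    """Count lines containing needle, like `grep -c` with a fixed string."""
    with open(path, errors="replace") as fh:
        return sum(1 for line in fh if needle in line)


def oom_counts(log_dir, pattern="dspace.log*"):
    """Map each matching log file name to its number of heap space errors."""
    return {
        p.name: count_matching_lines(p)
        for p in sorted(Path(log_dir).glob(pattern))
    }
```

Running `oom_counts("/home/dspace7/log")` would return one count per day, which makes it easier to see whether the errors are becoming rarer after each change.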
<!-- vim: set sw=2 ts=2: -->