cgspace-notes/content/post/2018-02.md

86 lines
3.6 KiB
Markdown
Raw Normal View History

2018-02-01 15:30:28 +01:00
---
title: "February, 2018"
date: 2018-02-01T16:28:54+02:00
author: "Alan Orth"
tags: ["Notes"]
---
## 2018-02-01
- Peter gave feedback on the `dc.rights` proof of concept that I had sent him last week
- We don't need to distinguish between internal and external works, so that makes it just a simple list
2018-02-01 18:04:07 +01:00
- Yesterday I figured out how to monitor DSpace sessions using JMX
- I copied the logic in the `jmx_tomcat_dbpools` provided by Ubuntu's `munin-plugins-java` package and used the stuff I discovered about JMX [in 2018-01]({{< relref "2018-01.md" >}})
2018-02-01 15:30:28 +01:00
<!--more-->
2018-02-01 18:04:07 +01:00
![DSpace Sessions](/cgspace-notes/2018/02/jmx_dspace_sessions-day.png)
- Run all system updates and reboot DSpace Test
- Wow, I packaged up the `jmx_dspace_sessions` stuff in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) and deployed it on CGSpace and it totally works:
```
# munin-run jmx_dspace_sessions
v_.value 223
v_jspui.value 1
v_oai.value 0
```
2018-02-03 08:44:31 +01:00
## 2018-02-03
- Bram from Atmire responded about the high load caused by the Solr updater script and said it will be fixed with the updates to DSpace 5.8 compatibility: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566
- We will close that ticket for now and wait for the 5.8 stuff: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560
2018-02-03 23:00:51 +01:00
- I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January
- After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:
```
$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
```
- Then I started a full Discovery reindex:
```
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
2018-02-04 10:26:07 +01:00
real 96m39.823s
user 14m10.975s
sys 2m29.088s
2018-02-03 23:00:51 +01:00
```
- Generate a new list of affiliations for Peter to sort through:
```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723
```
- Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in [December]({{< relref "2017-12.md" >}}):
```
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
3126109
real 0m23.839s
user 0m27.225s
sys 0m1.905s
```
2018-02-05 18:08:05 +01:00
## 2018-02-05
- Toying with correcting authors with trailing spaces via PostgreSQL:
```
dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
UPDATE 20
```
- I tried the `TRIM(TRAILING from text_value)` function and it said it changed 20 items but the spaces didn't go away
- This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.
- Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:
```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630
```