- A few users noticed that CGSpace wasn't loading items today; item pages seem blank
- I looked at the PostgreSQL locks but they don't seem unusual
- I guess this is the same "blank item page" issue that we had a few times in 2019 that we never solved
- I restarted Tomcat and PostgreSQL and the issue was gone
- Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the `5_x-prod` branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter's request
<!--more-->
- Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning
- They used to be "TurnitinBot"... hhmmmm, seems they use both: https://turnitin.com/robot/crawlerinfo.html
- I will add Turnitin to the DSpace bot user agent list, but I see they are requesting `robots.txt` and only requesting item pages, so that's impressive! I don't need to add them to the "bad bot" rate limit list in nginx
- While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x making a small number of requests each with this user agent:
```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
```
- It's only a few hundred requests each, but I am very suspicious so I will record it here and purge their IPs from Solr
- Then I see 185.187.30.14 and 185.187.30.13 making requests also, with several different "normal" user agents
- They are both apparently in France, belonging to Scalair FR hosting
- I will purge their requests from Solr too
- Now I see some other new bots I hadn't noticed before:
  - `Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com`
  - `Consilio (WebHare Platform 4.28.2-dev); LinkChecker)`, which appears to be a [university CMS](https://www.utwente.nl/en/websites/webhare/)
- I will add `LinkCheck`, `Consilio`, and `WebHare` to the list of DSpace bot agents and purge them from Solr stats
- COUNTER-Robots list already has `link.?check` but for some reason DSpace didn't match that and I see hits for some of these...
- Maybe I should add `[Ll]ink.?[Cc]heck.?` to a custom list for now?
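One possible explanation for the missed hits is case sensitivity. The sketch below assumes, without having confirmed it in the DSpace code, that the agent patterns are applied case-sensitively, and compares the COUNTER-Robots pattern with the proposed custom one against the two agents quoted above. The `matches` helper is a hypothetical name used only for illustration.

```python
import re

# User agents quoted in the notes above
agents = [
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com",
    "Consilio (WebHare Platform 4.28.2-dev); LinkChecker)",
]

counter_robots = "link.?check"   # pattern as listed in COUNTER-Robots
custom = "[Ll]ink.?[Cc]heck.?"   # proposed custom pattern


def matches(pattern, agent, ignore_case=False):
    """Return True if the pattern is found anywhere in the user agent."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, agent, flags) is not None


for agent in agents:
    print(
        matches(counter_robots, agent),                    # case-sensitive
        matches(counter_robots, agent, ignore_case=True),  # case-insensitive
        matches(custom, agent),                            # custom, case-sensitive
    )
```

If the assumption holds, the lowercase pattern only catches these agents when matching ignores case, while the custom pattern catches them either way.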
- For now I added `Turnitin` to the [new bots pull request on COUNTER-Robots](https://github.com/atmire/COUNTER-Robots/pull/34)
- I purged 20,000 hits from IPs and 45,000 hits from user agents
- I will revert the default "example" agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven't merged yet:
[java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
- Atmire says they are able to build fine, so I tried again and noticed that I had been building with `-Denv=dspacetest.cgiar.org`, which is not necessary for DSpace 6 of course
- Once I removed that it builds fine
- I quickly re-applied the Font Awesome 5 changes to use SVG+JS instead of web fonts (from 2020-04) and things are looking good!
- Run all system updates on DSpace Test (linode26), deploy latest `6_x-dev-atmire-modules` branch, and reboot it
## 2020-07-02
- I need to export some Solr statistics data from CGSpace to test Salem's modifications to the dspace-statistics-api
- He modified it to query Solr on the fly instead of indexing it, which will be heavier and slower, but allows us to get more granular stats and countries/cities
- Because we have so many records I want to use solr-import-export-json to get several months at a time with a date range, but it seems there are some issues with curl first (need to disable globbing with `-g` and URL encode the range)
- For reference, the [Solr 4.10.x DateField docs](https://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/schema/DateField.html)
- This range works in Solr UI: `[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]`
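As a rough illustration of the URL encoding step, the range can be percent-encoded with the Python standard library before being handed to curl. The field name `time` is an assumption about the statistics core's schema; the range itself is the one from the note above.

```python
from urllib.parse import quote

# Date range copied from the note above
date_range = "[2019-01-01T00:00:00Z TO 2019-06-30T23:59:59Z]"

# Assumption: the hit timestamp is stored in a field named "time"
query = f"time:{date_range}"

# safe="" percent-encodes every reserved character,
# including ":", "[", "]" and spaces
encoded = quote(query, safe="")
print(encoded)
```

The encoded string contains no literal brackets, so it can be used as a query parameter value without curl interpreting it as a glob range.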
- Import twelve items into the [CRP Livestock multimedia](https://hdl.handle.net/10568/97076) collection for Peter Ballantyne
- I ran the data through csv-metadata-quality first to validate and fix some common mistakes
- Interesting to check the data with `csvstat` to see if there are any duplicates
- Peter recently asked me to add Target audience (`cg.targetaudience`) to the CGSpace sidebar facets and AReS filters
- I added it on my local DSpace test instance, but I'm waiting for him to tell me what he wants the header to be: "Audiences", "Target audience", etc.
- Peter also asked me to increase the size of links in the CGSpace "Welcome" text
- I suggested using the CSS `font-size: larger` property to just bump it up one relative to what it already is
- He said it looks good, but that the links now seem OK (I told him to refresh, as I had made them bold a few days ago), so we don't need to adjust it after all
- Mohammed Salem modified my [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) to query Solr directly so I started writing a script to benchmark it today
- I will monitor the JVM memory and CPU usage in visualvm, just like I did in 2019-04
- I noticed an issue with his limit parameter so I sent him some feedback on that in the meantime
- I noticed that we have 20,000 distinct values for `dc.subject`, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:
```
dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
```
- DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:
```
dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
```
- Note the use of the POSIX character class :)
- I suggest that we generate a list of the top 5,000 values that don't match AGROVOC so that Sisay can correct them
- Start by getting the top 6,500 subjects (assuming that the top ~1,500 are valid from our previous work):
```
dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
COPY 19640
dspace=# \q
$ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 > 2020-07-05-cgspace-subjects.txt
```
- Then start looking them up using `agrovoc-lookup.py`:
- I made some optimizations to the suite of Python utility scripts in our DSpace directory as well as the [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) script
- Mostly to make more efficient usage of the requests cache and to use parameterized requests instead of building the request URL by concatenating the URL with query parameters
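To illustrate the difference between concatenating a URL and passing parameters, here is a minimal standard-library sketch. The endpoint and parameter names are hypothetical and are not taken from the scripts; with the `requests` library the same idea is expressed by passing a dict as the `params` argument.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only
base_url = "https://example.org/rest/v1/search"
params = {"query": "SOIL FERTILITY", "lang": "en"}

# Fragile: the space in the value is left unescaped
concatenated = base_url + "?query=" + params["query"] + "&lang=" + params["lang"]

# Safer: let the library encode each parameter
parameterized = f"{base_url}?{urlencode(params)}"

print(concatenated)
print(parameterized)
```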
- I modified the `agrovoc-lookup.py` script to save its results as a CSV, with the subject, language, type of match (preferred, alternate, and total number of matches) rather than save two separate files
- Note that I see `prefLabel`, `matchedPrefLabel`, and `altLabel` in the REST API responses and I'm not sure what the second one means
- They responded to say that `matchedPrefLabel` is not a property in SKOS/SKOSXL vocabulary, but their SKOSMOS system seems to use it to hint that the search terms matched a `prefLabel` in another language
- I will treat the `matchedPrefLabel` values as if they were `prefLabel` values for the indicated language then
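The rule above could be sketched as follows. This is not the code from `agrovoc-lookup.py`; it assumes each search result is a dict that always carries `prefLabel` and `lang`, plus `altLabel` or `matchedPrefLabel` only when the match came through one of those, which is why `altLabel` is checked first. The sample dicts are made up.

```python
def classify_match(result):
    """Return (match_type, language) for one search result dict.

    Assumed shape: optional "altLabel" / "matchedPrefLabel" keys,
    a "prefLabel" key, and a "lang" key for the matched language.
    """
    lang = result.get("lang")
    if "altLabel" in result:
        return ("alternate", lang)
    # matchedPrefLabel is treated like prefLabel for the indicated language
    if "matchedPrefLabel" in result or "prefLabel" in result:
        return ("preferred", lang)
    return (None, lang)


# Hypothetical results, for illustration only
samples = [
    {"prefLabel": "soil", "lang": "en"},
    {"prefLabel": "maize", "altLabel": "corn", "lang": "en"},
    {"prefLabel": "soil", "matchedPrefLabel": "suelo", "lang": "es"},
]

for sample in samples:
    print(classify_match(sample))
```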
- Peter asked me to send him a list of sponsors on CGSpace:
```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
COPY 707
```
- I ran it quickly through my `csv-metadata-quality` tool and found two issues that I will correct with `fix-metadata-values.py` on CGSpace immediately:
```
$ cat 2020-07-07-fix-sponsors.csv
dc.description.sponsorship,correct
"Ministe`re des Affaires Etrange`res et Européennes, France","Ministère des Affaires Étrangères et Européennes, France"
"Global Food Security Programme, United Kingdom","Global Food Security Programme, United Kingdom"
```
- Upload the Capacity Development July newsletter to CGSpace for Ben Hack because Abenet and Bizu usually do it, but they are currently offline due to the Internet being turned off in Ethiopia
- Here: https://hdl.handle.net/10568/108708
- I implemented the Dimensions.ai badge on DSpace Test for Peter to see, as he's been asking me for awhile:
- Actually this raised an issue that the Altmetric badges weren't enabled in our DSpace 6 branch yet because I had forgotten to copy the config
- Also, I noticed a big issue in both our DSpace 5 and DSpace 6 branches related to the `$identifier_doi` variable being defined incorrectly and thus never getting set (has to do with DRI)
- I fixed both and now the Altmetric badge and the Dimensions badge both appear... nice
![Altmetric and Dimensions.ai badge](/cgspace-notes/2020/07/dimensions-badge2.png)
- Yesterday Gabriela from CIP emailed to say that she was removing the accents from her authors' names because of "funny character" issues with reports generated from CGSpace
- I told her that it's probably her Windows / Excel that is messing up the data, and she figured out how to open them correctly!
- Now she says she doesn't want to remove the accents after all and she sent me a new list of corrections
- I used csvgrep and found a few where she is still removing accents:
- Select lines with column two (the correction) having a value
- Select lines with column one (the original author name) having an accent / diacritic
- Select lines with column two (the correction) NOT having an accent (i.e., she is removing an accent)
- Select columns one and two
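The same four filtering steps can be sketched in Python. This swaps the csvgrep regular expressions for Unicode decomposition to detect diacritics, so it is an alternative technique rather than the command that was run. The column headers and sample names are hypothetical.

```python
import csv
import io
import unicodedata


def has_diacritic(text):
    """True if any character decomposes into a base plus a combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


def accent_removals(rows):
    """Yield (original, correction) pairs where the correction drops an accent."""
    for row in rows:
        if len(row) < 2:
            continue
        original, correction = row[0], row[1]
        # Correction present, original accented, correction not accented
        if correction and has_diacritic(original) and not has_diacritic(correction):
            yield original, correction


# Hypothetical sample data, for illustration only
sample = io.StringIO(
    'author,correct\n'
    '"García, Juan","Garcia, Juan"\n'
    '"Pérez, Ana","Pérez, A."\n'
    '"Smith, John",""\n'
)

found = list(accent_removals(csv.reader(sample)))
print(found)
```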
- Peter said he liked the work I did on the badges yesterday so I put some finishing touches on it to detect more DOI URI styles and pushed it to the `5_x-prod` branch
- I will port it to DSpace 6 soon
![Altmetric and Dimensions badges](/cgspace-notes/2020/07/altmetrics-dimensions-badges.png)
- I wrote a quick script to lookup organizations (affiliations) in the Research Organization Repository (ROR) JSON data release v5
- I want to use this to evaluate ROR as a controlled vocabulary for CGSpace and MELSpace
- I exported a list of affiliations from CGSpace:
```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
```
- So, minus the CSV header, we have 1405 case-insensitive matches out of 5866 (about 24.0%)
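A case-insensitive lookup of this kind can be sketched as below. It is not the actual lookup script; it assumes each record in the ROR JSON dump is an object with a `name` key, and the sample records and affiliations are a tiny made-up subset.

```python
def load_ror_names(ror_records):
    """Build a set of case-folded organization names from ROR-style records."""
    # Assumption: every record has a "name" key
    return {record["name"].casefold() for record in ror_records}


def match_affiliations(affiliations, ror_names):
    """Return (matched affiliations, match rate between 0 and 1)."""
    matched = [a for a in affiliations if a.casefold() in ror_names]
    rate = len(matched) / len(affiliations) if affiliations else 0.0
    return matched, rate


# Hypothetical sample data, for illustration only
ror_records = [
    {"name": "International Livestock Research Institute"},
    {"name": "Wageningen University & Research"},
]
affiliations = [
    "international livestock research institute",
    "Wageningen University & Research",
    "Some Unlisted Institute",
]

matched, rate = match_affiliations(affiliations, load_ror_names(ror_records))
print(f"{len(matched)} of {len(affiliations)} matched ({rate:.1%})")
```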
## 2020-07-09
- Atmire responded to the ticket about DSpace 6 and Solr yesterday
- They said that the CUA issue is due to the "unmigrated" Solr records and that we should delete them
- I told them that [the "unmigrated" IDs are a known issue in DSpace 6](https://wiki.lyrasis.org/display/DSDOC6x/SOLR+Statistics+Maintenance) and we should rather figure out why they are unmigrated
- I didn't see any discussion on the dspace-tech mailing list or on DSpace Jira about unmigrated IDs, so I sent a mail to the mailing list to ask
- I updated `ror-lookup.py` to check aliases and acronyms as well and now the results are better for CGSpace's affiliation list:
- On 2020-07-10 Macaroni Bros emailed to ask if there are issues with CGSpace because they are getting HTTP 504 on the REST API
- First, I looked in Munin and I see a high number of DSpace sessions and threads on Friday evening around midnight, though that was much later than his email:
- CPU load and memory were not high then, but there was some load on the database and firewall...
- Looking in the nginx logs I see a few IPs we've seen recently, like those 199.47.x.x IPs from Turnitin (which I need to remember to purge from Solr again because I didn't update the spider agents on CGSpace yet) and a new one, 186.112.8.167
- Also, the Turnitin bot doesn't re-use its Tomcat JSESSIONID, I see this from today:
- So I need to add this alternative user-agent to the Tomcat Crawler Session Manager valve to force it to re-use a common bot session
- There are around 9,000 requests from `186.112.8.167` in Colombia with the user agent `Java/1.8.0_241`, but those were mostly to the REST API and I don't see any hits in Solr
- Earlier in the day Linode had alerted that there was high outgoing bandwidth
- I see some new bot from 134.155.96.78 made ~10,000 requests with the user agent... but it appears to already be in our DSpace user agent list via COUNTER-Robots:
- Generate a list of sponsors to update our controlled vocabulary:
```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
```
- Udana from WLE asked me about some items that didn't show Altmetric donuts
- I checked his list and at least three of them actually *did* show donuts, and for four others I tweeted them manually to see if they would get a donut in a few hours:
- https://hdl.handle.net/10568/108477
- https://hdl.handle.net/10568/108475
- https://hdl.handle.net/10568/108361
- https://hdl.handle.net/10568/108360
## 2020-07-15
- All four IWMI items that I tweeted yesterday have Altmetric donuts with a score of 1 now...
- Export CGSpace countries to check them against ISO 3166-1 and ISO 3166-3 (historic countries):
```
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-07-15-countries.csv;
COPY 194
```
- I wrote a script `iso3166-lookup.py` to check them:
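The general shape of such a check can be sketched as follows. This is not the contents of `iso3166-lookup.py`; the two name sets are tiny illustrative excerpts, and a real run would load the full ISO 3166-1 and ISO 3166-3 lists from a data source (for example a package such as pycountry, which is an assumption about tooling).

```python
def check_countries(names, current, historic):
    """Classify each name as 'current', 'historic', or 'unknown'."""
    current_cf = {n.casefold() for n in current}
    historic_cf = {n.casefold() for n in historic}
    results = {}
    for name in names:
        key = name.strip().casefold()
        if key in current_cf:
            results[name] = "current"
        elif key in historic_cf:
            results[name] = "historic"
        else:
            results[name] = "unknown"
    return results


# Tiny illustrative excerpts, not the complete ISO lists
iso3166_1 = {"Kenya", "Ethiopia", "Eswatini"}
iso3166_3 = {"Yugoslavia", "Czechoslovakia"}

# Hypothetical sample of exported values
exported = ["Kenya", "ETHIOPIA", "Yugoslavia", "Swaziland"]

for name, status in check_countries(exported, iso3166_1, iso3166_3).items():
    print(f"{name}: {status}")
```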
- Check the database for DOIs that are not in the preferred "https://doi.org/" format:
```
dspace=# \COPY (SELECT text_value as "cg.identifier.doi" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
COPY 186
```
- Then I imported them into OpenRefine and replaced them in a new "correct" column using this GREL transform:
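For comparison, the same kind of normalization outside OpenRefine might look like this in Python. It is a sketch, not a translation of the GREL expression: the set of prefixes handled (`http://`, `dx.doi.org`, `doi:`, or a bare DOI) is an assumption about what the export contains, and the sample DOIs are made up.

```python
import re

DOI_PREFIX = "https://doi.org/"

# Optional resolver or "doi:" prefix, followed by the DOI itself
_DOI_PATTERN = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def normalize_doi(value):
    """Return the DOI in https://doi.org/ form, or None if unrecognized."""
    match = _DOI_PATTERN.match(value.strip())
    if match is None:
        return None
    return DOI_PREFIX + match.group(1)


# Hypothetical sample values, for illustration only
for raw in [
    "http://dx.doi.org/10.1000/xyz123",
    "doi: 10.1000/xyz123",
    "10.1000/xyz123",
    "not a doi",
]:
    print(repr(raw), "->", normalize_doi(raw))
```

Values that return `None` are left for manual review rather than guessed at.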
- I filed [an issue on Debian's iso-codes](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/10) project to ask why "Swaziland" does not appear in the ISO 3166-3 list of historical country names despite it being changed to "Eswatini" in 2018.
- Atmire responded about the Solr issue
- They said that it seems like a DSpace issue so that it's not their responsibility, and nobody responded to my question on the dspace-tech mailing list...
- I said I would try to do a migration on DSpace Test with more of CGSpace's Solr data to try and approximate how much of our data will be affected
- I also asked them about the Tomcat 8.5 issue with CUA as well as the CUA group name issue that I had asked originally in April
- Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:
- I still see 12,000 records in Solr from this user agent, though.
- I wonder why the DSpace bot list didn't get those... because it has "bot" which should cause Solr to not log the hit
- I purged ~30,000 hits from Solr statistics based on the IPs above, but also for some agents like Drupal (which isn't in the list yet) and OgScrper (which is as of 2020-03)
- Some of my user agent patterns had been incorporated into COUNTER-Robots in 2020-07, but not all
- I closed the [old pull request](https://github.com/atmire/COUNTER-Robots/pull/34) and created a [new one](https://github.com/atmire/COUNTER-Robots/pull/36)
- Then I updated the lists in the `5_x-prod` and 6.x branches
- I re-ran the `check-spider-hits.sh` script with the new lists and purged another 14,000 stats hits for each of several years (2020, 2019, 2018, 2017, 2016), around 70,000 total
- I looked at the [CLARISA](https://clarisa.cgiar.org/) institutions list again, since I hadn't looked at it in over six months: