---
title: "April, 2020"
date: 2020-04-02T10:53:24+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2020-04-02
- Maria asked me to update Charles Staver's ORCID iD in the submission template and on CGSpace, as his name was lower case before, and now he has corrected it
- I updated the fifty-eight existing items on CGSpace
- Looking into the items Udana had asked about last week that were missing Altmetric donuts:
- [The first](https://hdl.handle.net/10568/103225) is still missing its DOI, so I added it and [tweeted its handle](https://twitter.com/mralanorth/status/1245632619661766657) (after a few hours there was a donut with score 222)
- [The second item](https://hdl.handle.net/10568/106899) now has a donut with score 2 since I [tweeted its handle](https://twitter.com/mralanorth/status/1243158045540134913) last week
- [The third item](https://hdl.handle.net/10568/107258) now has a donut with score 1 since I [tweeted it](https://twitter.com/mralanorth/status/1243158786392625153) last week
- On the same note, the [one item](https://hdl.handle.net/10568/106573) Abenet pointed out last week now has a donut with score of 104 after I [tweeted it](https://twitter.com/mralanorth/status/1243163710241345536) last week
<!--more-->
- Altmetric responded about [one item](https://hdl.handle.net/10568/101286) that had no donut since at least 2019-12 and said they fixed some problems with their bot's user agent
- I decided to [tweet the item](https://twitter.com/mralanorth/status/1245703049445851140), as I can't remember if I ever did it before
## 2020-04-05
- Update PostgreSQL JDBC driver to version 42.2.12
## 2020-04-07
- Yesterday Atmire sent me their [pull request for DSpace 6 modules](https://github.com/ilri/DSpace/pull/445)
- Peter pointed out that some items have his ORCID identifier (`cg.creator.id`) twice
- I think this is because my early `add-orcid-identifiers.py` script was adding identifiers to existing records without properly checking if there was already one present (at first it only checked if there was one with the exact `place` value)
- As a test I dropped all his ORCID identifiers and added them back with the `add-orcid-identifiers.py` script:
```
$ psql -h localhost -U postgres dspace -c "DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value LIKE '%Ballantyne%';"
DELETE 97
$ ./add-orcid-identifiers-csv.py -i 2020-04-07-peter-orcids.csv -db dspace -u dspace -p 'fuuu' -d
```
- I used this CSV with the script (all records with his name have the name standardized like this):
```
dc.contributor.author,cg.creator.id
"Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
```
- Then I tried another way: group all (resource ID, ORCID identifier) pairs and count them, so any pair with a count greater than 1 is a duplicate:
```
dspace=# \COPY (SELECT DISTINCT(resource_id, text_value) as distinct_orcid, COUNT(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 240 GROUP BY distinct_orcid ORDER BY count DESC) TO /tmp/2020-04-07-duplicate-orcids.csv WITH CSV HEADER;
COPY 15209
```
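- A rough way to pull only the duplicates out of that CSV is to filter on the count column (a sketch, assuming the count is always the last comma-separated field):
```
$ awk -F',' 'NR > 1 && $NF > 1' /tmp/2020-04-07-duplicate-orcids.csv
```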
- Of those, about nine authors had duplicate ORCID identifiers over about thirty records, so I created a CSV with all their name variations and ORCID identifiers:
```
dc.contributor.author,cg.creator.id
"Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
"Ramirez-Villegas, Julian","Julian Ramirez-Villegas: 0000-0002-8044-583X"
"Villegas-Ramirez, J","Julian Ramirez-Villegas: 0000-0002-8044-583X"
"Ishitani, Manabu","Manabu Ishitani: 0000-0002-6950-4018"
"Manabu, Ishitani","Manabu Ishitani: 0000-0002-6950-4018"
"Ishitani, M.","Manabu Ishitani: 0000-0002-6950-4018"
"Ishitani, M.","Manabu Ishitani: 0000-0002-6950-4018"
"Buruchara, Robin A.","Robin Buruchara: 0000-0003-0934-1218"
"Buruchara, Robin","Robin Buruchara: 0000-0003-0934-1218"
"Jarvis, Andy","Andy Jarvis: 0000-0001-6543-0798"
"Jarvis, Andrew","Andy Jarvis: 0000-0001-6543-0798"
"Jarvis, A.","Andy Jarvis: 0000-0001-6543-0798"
"Tohme, Joseph M.","Joe Tohme: 0000-0003-2765-7101"
"Hansen, James","James Hansen: 0000-0002-8599-7895"
"Hansen, James W.","James Hansen: 0000-0002-8599-7895"
"Asseng, Senthold","Senthold Asseng: 0000-0002-7583-3811"
```
- Then I deleted *all* their existing ORCID identifier records:
```
dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value SIMILAR TO '%(0000-0001-6543-0798|0000-0001-9346-2893|0000-0002-6950-4018|0000-0002-7583-3811|0000-0002-8044-583X|0000-0002-8599-7895|0000-0003-0934-1218|0000-0003-2765-7101)%';
DELETE 994
```
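- For a destructive DELETE like that, the same pattern can be used to count the matching rows beforehand (a sketch):
```
$ psql -d dspace -U dspace -c "SELECT COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value SIMILAR TO '%(0000-0001-6543-0798|0000-0001-9346-2893|0000-0002-6950-4018|0000-0002-7583-3811|0000-0002-8044-583X|0000-0002-8599-7895|0000-0003-0934-1218|0000-0003-2765-7101)%';"
```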
- And then I added them again using the `add-orcid-identifiers-csv.py` script:
```
$ ./add-orcid-identifiers-csv.py -i 2020-04-07-fix-duplicate-orcids.csv -db dspace -u dspace -p 'fuuu' -d
```
- I ran the fixes on DSpace Test and CGSpace as well
- I started testing the [pull request](https://github.com/ilri/DSpace/pull/445) sent by Atmire yesterday
- I notice that we now need `yarn` to build, and I need to bump the Node.js `engine` version in our Mirage 2 theme in order to get it to build on Node.js 10.x
- Font Awesome icons for GitHub etc weren't loading, and after a bit of troubleshooting I replaced version 4.5.0 with 5.13.0; to my surprise it now includes Mendeley and ORCID icons, so we can get rid of the Academicons dependency
## 2020-04-12
- Testing the Atmire DSpace 6.3 code with a clean CGSpace DSpace 5.8 database snapshot
- One Flyway migration failed so I had to manually remove it (and of course create the pgcrypto extension):
```
dspace63=# DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');
dspace63=# CREATE EXTENSION pgcrypto;
```
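- To see which migration failed in the first place you can query Flyway's own table (a sketch, assuming the local `dspace63` database):
```
$ psql -d dspace63 -c "SELECT version, description, success FROM schema_version WHERE success = false;"
```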
- Then DSpace 6.3 started up OK and I was able to see some statistics in the Content and Usage Analysis (CUA) module, but not on community, collection, or item pages
- I also noticed at least one of these errors in the DSpace log:
```
2020-04-12 16:34:33,363 ERROR com.atmire.dspace.app.xmlui.aspect.statistics.editorparts.DataTableTransformer @ java.lang.IllegalArgumentException: Invalid UUID string: 1
```
- And I remembered I actually need to run the DSpace 6.4 Solr UUID migrations:
```
$ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
```
- Run system updates on DSpace Test (linode26) and reboot it
- More work on the DSpace 6.3 stuff, improving the GDPR consent logic to use [haven](https://github.com/chiiya/haven) instead of cookieconsent
- It works better by injecting the Google Analytics script after the user clicks agree, and it also has a preferences section that gets automatically injected on the privacy page!
## 2020-04-13
- I realized that `solr-upgrade-statistics-6x` only processes 100,000 records by default so I think we actually need to finish running it for all legacy Solr records before asking Atmire why CUA statlets and detailed statistics aren't working
- For now I am just doing 250,000 records at a time on my local environment:
```
$ export JAVA_OPTS="-Xmx2000m -Dfile.encoding=UTF-8"
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x -n 250000
```
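- A simple way to chew through them all is to loop the command (a rough sketch, assuming each run continues where the previous one stopped, which is what repeated runs appear to do):
```
$ export JAVA_OPTS="-Xmx2000m -Dfile.encoding=UTF-8"
$ for run in {1..6}; do ~/dspace63/bin/dspace solr-upgrade-statistics-6x -n 250000; done
```
- Six runs of 250,000 would cover the roughly 1.5 million records in my local core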
- Despite running the migration for all of my local 1.5 million Solr records, I still see a few hundred thousand documents with IDs like `-1` and `0-unmigrated`
- I will purge them all and try to import only a subset...
- After importing again I see there are indeed tens of thousands of these documents with IDs "-1" and "0"
- They are all `type: 5`, which is "SITE" according to `Constants.java`:
```
/** DSpace site type */
public static final int SITE = 5;
```
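- For reference, purging those unmigrated documents can be done with a Solr delete-by-query (a sketch using the two ID forms mentioned above, and assuming Solr on port 8081 as elsewhere in these notes):
```
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>(id:"-1" OR id:"0-unmigrated") AND type:5</query></delete>'
```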
- Even after deleting those documents and re-running `solr-upgrade-statistics-6x` I still get the UUID errors when using CUA and the statlets
- I have sent some feedback and questions to Atmire (including about the issue with glyphicons in the header trail)
- In other news, my local Artifactory container stopped working for some reason so I re-created it and it seems some things have changed upstream (port 8082 for web UI?):
```
$ podman rm artifactory
$ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
$ podman start artifactory
```
## 2020-04-14
- A few days ago Peter asked me to update an author's name on CGSpace and in the controlled vocabularies:
```
dspace=# UPDATE metadatavalue SET text_value='Knight-Jones, Theodore J.D.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='Knight-Jones, T.J.D.';
```
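- A quick sanity check before running an UPDATE like that is to count the rows it will match (a sketch):
```
$ psql -d dspace -U dspace -c "SELECT COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='Knight-Jones, T.J.D.';"
```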
- I updated his existing records on CGSpace, changed the controlled lists, added his ORCID identifier to the controlled list, and tagged his thirty-nine items with the ORCID iD
- The new DSpace 6 stuff that Atmire sent modifies Mirage 2's `pom.xml` to copy each theme's resulting `node_modules` into the theme after building and installing with `ant update`, because they moved some packages from bower to npm and now reference them in `page-structure.xsl`
- This is a good idea, because bower is no longer supported, and npm has gotten a lot better, but it causes an extra 200,000 files to get copied!
- Most scripts are concatenated into `theme.js` during build, so we don't need the `node_modules` after that, but there are three scripts in `page-structure.xsl` that are not included there
- The scripts are a very old version of modernizr (which is not even available on npm), html5shiv, and respond.js
- For modernizr I can simply download a static copy and put it in `0_CGIAR/scripts` and concatenate it into `theme.js`
- For the others, I can revert to using them from bower's `vendor` directory, which is installed by the parent XMLUI Mirage 2 theme
- During this process I also realized that `mvn clean` doesn't actually clean everything: `dspace/modules/xmlui-mirage2/target` remains from previous builds and contains a bunch of leftover shit (including all the themes which I was trying to build without!)
- This must be a DSpace bug, but I should theoretically check on vanilla DSpace and then file a bug...
## 2020-04-17
- Atmire responded to some of the issues I raised earlier this week about the DSpace 6 pull request
- They said they don't think the glyphicon encoding issue is due to their changes, but I built a new clean version of the vanilla `6_x-dev` branch from before their pull request and it *does not* have the encoding issue in the Mirage 2 header trails
- Also, they said we need to use something called `AtomicStatisticsUpdateCLI` to do the Solr legacy integer ID to UUID conversion so I asked for more information about that workflow
## 2020-04-20
- Looking into a high rate of outgoing bandwidth from yesterday on CGSpace (linode18):
```
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Apr/2020:0[6789]" | goaccess --log-format=COMBINED -
```
- One host in Russia (91.241.19.70) downloaded 23 GiB over those few hours in the morning
- It looks like all the requests were for one single item's bitstreams:
```
# grep -c 91.241.19.70 /var/log/nginx/access.log.1
8900
# grep 91.241.19.70 /var/log/nginx/access.log.1 | grep -c '10568/35187'
8900
```
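- To get a rough total of the bytes sent to that IP I can sum the response sizes from the log (a sketch, assuming the default combined log format where the body size is the tenth field):
```
# grep 91.241.19.70 /var/log/nginx/access.log.1 | awk '{sum += $10} END {printf "%.1f GiB\n", sum / 1024^3}'
```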
- I thought the host might have been Yandex misbehaving, but its user agent is:
```
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_3; nl-nl) AppleWebKit/527 (KHTML, like Gecko) Version/3.1.1 Safari/525.20
```
- I will purge that IP from the Solr statistics using my `check-spider-ip-hits.sh` script:
```
$ ./check-spider-ip-hits.sh -d -f /tmp/ip -p
(DEBUG) Using spider IPs file: /tmp/ip
(DEBUG) Checking for hits from spider IP: 91.241.19.70
Purging 8909 hits from 91.241.19.70 in statistics
Total number of bot hits purged: 8909
```
- While investigating that I noticed ORCID identifiers missing from a few authors' names, so I added them with my `add-orcid-identifiers-csv.py` script:
```
$ ./add-orcid-identifiers-csv.py -i 2020-04-20-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
```
- The contents of `2020-04-20-add-orcids.csv` was:
```
dc.contributor.author,cg.creator.id
"Schut, Marc","Marc Schut: 0000-0002-3361-4581"
"Schut, M.","Marc Schut: 0000-0002-3361-4581"
"Kamau, G.","Geoffrey Kamau: 0000-0002-6995-4801"
"Kamau, G","Geoffrey Kamau: 0000-0002-6995-4801"
"Triomphe, Bernard","Bernard Triomphe: 0000-0001-6657-3002"
"Waters-Bayer, Ann","Ann Waters-Bayer: 0000-0003-1887-7903"
"Klerkx, Laurens","Laurens Klerkx: 0000-0002-1664-886X"
```
- I confirmed some of the authors' names from the report itself, then by looking at their profiles on ORCID.org
- Add new ILRI subject "COVID19" to the `5_x-prod` branch
- Add new CCAFS Phase II project tags to the `5_x-prod` branch
- I will deploy these to CGSpace in the next few days
## 2020-04-24
- Atmire responded to my ticket about the issue with glyphicons and said their test server does not show the same issue
- They asked if I am using the `JAVA_OPTS=-Dfile.encoding=UTF-8` when building DSpace and running Tomcat
- I set it explicitly for Maven and Ant just now (and cleared all XMLUI caches) but the issue is still there
- I asked them if they are building on macOS or Linux, and which Node.js version (I'm using 10.20.1, which is the current LTS branch).
## 2020-04-25
- I researched a bit more about the issue with glyphicons and realized it's due to [a bug in libsass](https://github.com/twbs/bootstrap-sass/issues/919)
- I have Ruby sass version 3.4.25 installed in my local environment, but DSpace Mirage 2 is supposed to build with 3.3.14
- Downgrading the version fixes it, though I wonder: why did I not have this issue on the `6_x-dev` branch before Atmire's pull request, and also on `5_x-prod` where I've been building for a few months now...
- I deployed the latest `5_x-prod` branch on CGSpace (linode18), ran all updates, and rebooted the server
- This includes the "COVID19" ILRI subject, the new CCAFS Phase II project tags, and the changes to an ILRI author name
- After restarting the server I had to restart Tomcat three times before all Solr statistics cores loaded properly.
- After that I started a full Discovery reindexing to pick up some author changes I made last week:
```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
```
- I ran the `dspace cleanup -v` process on CGSpace and got an error:
```
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(184980) is still referenced from table "bundle".
```
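- To see which bundle still references that bitstream you can check the `bundle` table first (a sketch, using the ID from the error message):
```
$ psql -d dspace -U dspace -c 'SELECT bundle_id, primary_bitstream_id FROM bundle WHERE primary_bitstream_id = 184980;'
```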
- The solution is, as always:
```
$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
UPDATE 1
```
- I spent some time working on the XMLUI themes in DSpace 6
- Atmire's pull request modifies `pom.xml` to *not* exclude `node_modules`, which means an extra ~260,000 files get copied to our installation folder because of all the themes
- I worked on the `Gruntfile.js` to copy Font Awesome and Bootstrap glyphicon fonts out of `node_modules` and into `fonts` at build time, but still `jquery-ui.min.css` was being referenced as a `url()` in CSS
- Sass can inline an imported CSS file into the compiled CSS (instead of leaving an `@import url(..)` reference) if you import it without the ".css" extension, but our version of Ruby Sass doesn't support that
- I hacked `Gruntfile.js` to use `dart-sass` instead of Ruby `compass` (including installing compass's mixins via npm!) but then [dart-sass converts all the glyphicon ASCII escape codes to Unicode literals](https://github.com/sass/dart-sass/issues/345) and they show up garbled in Firefox
- I tried to use `node-sass` instead of dart-sass and it doesn't replace the ASCII escapes with literals, but then I get the glyphicon issue in the header trail again! Back to square one!
- So that was a waste of five hours...
- I might just leave this [tiny hack](https://github.com/twbs/bootstrap-sass/issues/919) in `0_CGIAR/styles/_style.scss` to override this and be done with it:
```
.breadcrumb > li + li:before {
  content: "/\00a0";
}
```
## 2020-04-27
- File an issue on DSpace Jira about the [`mvn clean` task not removing the Mirage 2 target directory](https://jira.lyrasis.org/browse/DS-4492)
- My changes to DSpace XMLUI Mirage 2 build process mean that we don't need Ruby gems at all anymore! We can completely build without them!
- Trying to test the `com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI` script but there is an error:
```
Exception: org.apache.solr.search.SyntaxError: Cannot parse 'cua_version:${cua.version.number}': Encountered " "}" "} "" at line 1, column 32.
Was expecting one of:
"TO" ...
<RANGE_QUOTED> ...
<RANGE_GOOP> ...
```
- Seems something is wrong with the variable interpolation, and I see two configurations in the `atmire-cua.cfg` file:
```
atmire-cua.cua.version.number=${cua.version.number}
atmire-cua.version.number=${cua.version.number}
```
- I sent a message to Atmire to check
## 2020-04-28
- I did some work on DSpace 6 to modify our XMLUI theme to use Font Awesome icons via SVG + JavaScript instead of using web fonts
- The difference is about 105K less, plus two fewer network requests since we don't need the web fonts anymore
- Before:
- `scripts/theme.js`: 654K
- `styles/main.css`: 220K
- `fa-brands-400.woff2`: 75K
- `fa-solid-900.woff2`: 78K
- Total: 1027K
- After:
- `scripts/theme.js`: 704K
- `styles/main.css`: 218K
- Total: 922K
- I manually edited the CUA version variable and was then able to run the `com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI` script
- On the first run it took one hour to process 100,000 records on my local test instance...
- On the second run it took one hour to process 140,000 records
- On the third run it took one hour to process 150,000 records
## 2020-04-29
- I found out that running the Atmire CUA script with more memory and a larger number of records (-r 20000) makes it run faster
- Now the process finishes, but there are errors on some records:
```
Record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(AtomicStatisticsUpdater.java:161)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(AtomicStatisticsUpdater.java:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(AtomicStatisticsUpdateCLI.java:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: java.lang.NullPointerException
at com.atmire.dspace.cua.CUADSOServiceImpl.findByLegacyID(CUADSOServiceImpl.java:40)
at com.atmire.statistics.util.update.atomic.processor.AtomicUpdateProcessor.getDSpaceObject(AtomicUpdateProcessor.java:49)
at com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor.process(ContainerOwnerDBProcessor.java:45)
at com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor.visit(ContainerOwnerDBProcessor.java:38)
at com.atmire.statistics.util.update.atomic.record.UsageRecord.accept(UsageRecord.java:23)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:301)
... 10 more
```
- I've sent a message to Atmire to ask for advice
- Also, now I can actually see the CUA statlets and usage statistics
- Unfortunately it seems they are using Font Awesome 4 in their CUA module, which means some icons are broken because the names have changed; also, some of their code uses Unicode characters instead of classes in spans!
- I've reverted my sweet SVG work from yesterday and adjusted the Font Awesome 5 SCSS to add a few more icons that they are using
- Tezira said she was having issues submitting items on CGSpace today
- I looked at all the errors in the DSpace log and saw a few SQL pooling errors around midday:
```
$ grep ERROR dspace.log.2020-04-29 | cut -f 3- -d' ' | sort | uniq -c | sort -n
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL findByUnique Error -
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL find Error -
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
```
- I looked in Munin and I see that there is a strange spike in the database pool usage this afternoon:
![Tomcat Database Pool usage day](/cgspace-notes/2020/04/jmx_tomcat_dbpools-day.png)
- Looking at the past month, it does seem that something strange is going on:
![Tomcat Database Pool usage month](/cgspace-notes/2020/04/jmx_tomcat_dbpools-month.png)
- Database connections do seem high:
```
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
6 dspaceCli
88 dspaceWeb
```
- Most of those are idle in transaction:
```
$ psql -c 'select * from pg_stat_activity' | grep 'dspaceWeb' | grep -c "idle in transaction"
67
```
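- A more detailed look at those idle transactions, including how long they have been open, is possible with something like this (a sketch):
```
$ psql -c "SELECT pid, usename, state, age(clock_timestamp(), xact_start) AS xact_age FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_age DESC LIMIT 10;"
```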
- I don't see anything in the PostgreSQL or Tomcat logs suggesting anything is wrong... I think the solution to clear these idle connections is probably to just restart Tomcat
- I looked at the Solr stats for this month and see lots of suspicious IPs:
```
$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-04&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
"88.99.115.53",23621, # Hetzner, using XMLUI and REST API with no user agent
"104.154.216.0",11865,# Google cloud, scraping XMLUI with no user agent
"104.198.96.245",4925,# Google cloud, using REST API with no user agent
"52.34.238.26",2907, # EcoSearch on XMLUI, user agent: EcoSearch (+https://search.ecointernet.org/)
```
- And a bunch more... ugh...
- 70.32.90.172: scraping REST API for IWMI/WLE pages with no user agent
- 2a01:7e00::f03c:91ff:fe16:fcb: Linode, REST API, no user agent
- 2607:f298:5:101d:f816:3eff:fed9:a484: DreamHost, XMLUI and REST API, python-requests/2.18.4
- 2a00:1768:2001:7a::20: Netherlands, XMLUI, trying SQL injections
- I need to start blocking requests without a user agent...
- I purged the hits from these IPs using my `check-spider-ip-hits.sh` script:
```
$ for year in {2010..2019}; do ./check-spider-ip-hits.sh -f /tmp/ips -s statistics-$year -p; done
$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
```
- Then I added a few of them to the bot mapping in the nginx config because it appears they have been regular harvesters since 2018
- Looking through the Solr stats faceted by the `userAgent` field I see some interesting ones:
```
$ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=userAgent'
...
"Delphi 2009",50725,
"OgScrper/1.0.0",12421,
```
- Delphi is only used by IP addresses in Greece, so that's obviously the GARDIAN people harvesting us...
- I have no idea what OgScrper is, but it's not a user!
- Then there are 276,000 hits from `MEL-API` from Jordanian IPs in 2018, so that's obviously the CodeObia guys...
- Other user agents:
- GigablastOpenSource/1
- Owlin Domain Resolver V1
- API scraper
- MetaURI
- I don't know why, but my `check-spider-hits.sh` script doesn't seem to handle user agents with spaces properly, so I will delete those manually afterwards
- First I deleted the ones without spaces, using a temp file `/tmp/agents` containing the patterns:
```
$ for year in {2010..2019}; do ./check-spider-hits.sh -f /tmp/agents -s statistics-$year -p; done
$ ./check-spider-hits.sh -f /tmp/agents -s statistics -p
```
- That's about 300,000 hits purged...
- Then I removed the ones with spaces manually, checking the query syntax first, then deleting from the yearly cores and the current statistics core:
```
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Delphi 2009/&rows=0"
...
<lst name="responseHeader"><int name="status">0</int><int name="QTime">52</int><lst name="params"><str name="q">userAgent:/Delphi 2009/</str><str name="rows">0</str></lst></lst><result name="response" numFound="38760" start="0"></result>
$ for year in {2010..2019}; do curl -s "http://localhost:8081/solr/statistics-$year/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'; done
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'
```
- Quoting them works for now until I can look into it and handle it properly in the script
- This was about 400,000 hits in total purged from the Solr statistics
## 2020-04-30
- The TLS certificates on DSpace Test (linode26) have not been renewing correctly
- The log shows the message "No renewals were attempted"
- The `certbot-auto certificates` command doesn't list the certificate I have installed
- I guess this is because I copied it from the previous server...
- I moved the Let's Encrypt directory, got a new cert, then revoked the old one:
```
# mv /etc/letsencrypt /etc/letsencrypt.bak
# /opt/certbot-auto certonly --standalone --email fu@m.com -d dspacetest.cgiar.org --standalone --pre-hook "/bin/systemctl stop nginx" --post-hook "/bin/systemctl start nginx"
# /opt/certbot-auto revoke --cert-path /etc/letsencrypt.bak/live/dspacetest.cgiar.org/cert.pem
# rm -rf /etc/letsencrypt.bak
```
- Run all system updates on DSpace Test and reboot it
- Tezira is still having issues submitting to CGSpace and the database is definitely busy according to Munin:
![Tomcat postgres connections week](/cgspace-notes/2020/04/postgres_connections_cgspace-week.png)
- But I don't see a lot of connections in PostgreSQL itself:
```
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
5 dspaceApi
6 dspaceCli
14 dspaceWeb
$ psql -c 'select * from pg_stat_activity' | wc -l
30
```
- Tezira said she cleared her browser cache and then was able to submit again
- She said that once she logged back in she had many "Untitled" submissions pending
- I see that the database connections are indeed much lower now:
![Tomcat postgres connections week](/cgspace-notes/2020/04/postgres_connections_cgspace-day.png)
- The PostgreSQL log shows a lot of errors about deadlocks and queries waiting on other processes...
```
ERROR: deadlock detected
```
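- To see which queries are waiting on locks you can check `pg_stat_activity` (a sketch, assuming PostgreSQL 9.6 or newer):
```
$ psql -c "SELECT pid, usename, state, wait_event_type, wait_event, left(query, 60) AS query FROM pg_stat_activity WHERE wait_event_type = 'Lock';"
```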
<!-- vim: set sw=2 ts=2: -->