- Advise IWMI colleagues on best practices for thumbnails
- Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
- I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:
- WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific
- WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea
- Looking at the July Solr statistics to find the top IP and user agents, looking for anything strange
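- For reference, finding the top IPs and user agents is just a faceted query on the Solr statistics core, something like this sketch (assuming Solr is listening on localhost:8081):
```console
$ curl -s 'http://localhost:8081/solr/statistics/select' \
    --data-urlencode 'q=time:[2021-07-01T00:00:00Z TO 2021-08-01T00:00:00Z]' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=ip' \
    --data-urlencode 'facet.field=userAgent' \
    --data-urlencode 'facet.limit=10'
```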
- 35.174.144.154 made 11,000 requests last month with the following user agent:
```console
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
```
- That IP is on Amazon, and from looking at the DSpace logs I don't see them logging in at all, only scraping... so I will purge hits from that IP
- I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, I will purge their hits too
- Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so
- Same deal with 93.158.90.91, also in Sweden
- 3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart
- 61.143.40.50 is in China and uses this hilarious user agent, apparently from a Python script whose author forgot to interpolate the `random.randint` calls:
```console
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
```
- 47.252.80.214 is owned by Alibaba in the US and has the same user agent
- 159.138.131.15 is in Hong Kong and also seems to be a bot, because I never see it log in and it downloaded 4,300 PDFs over the course of a few hours
- 95.87.154.12 seems to be a new bot with the following user agent:
- They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
- I will purge the hits and add them to our list of bot overrides in the meantime, before I submit it to COUNTER-Robots
- I see a new bot using this user agent:
```console
nettle (+https://www.nettle.sk)
```
- 129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period
- 217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day
- 103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human
- There are probably more, but those are the main ones with over 1,000 hits last month, so I will purge them with a delete-by-query on the statistics core, one query per IP, roughly like this sketch:
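```console
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>ip:35.174.144.154</query></delete>'
```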
- Then in OpenRefine I merged all null, blank, and en fields into the `en_US` one for each, removed all spaces, fixed invalid multi-value separators, and removed everything other than the ISSNs/ISBNs themselves
- In total it was a few thousand metadata entries, so I had to split the CSV with `xsv split` in order to process it
- I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes...
- In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same, with a GREL expression along these lines (a sketch; the column names are hypothetical):
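```
if(cells['ISSN[en_US]'].value == cells['ISSN[en]'].value, 'same', 'different')
```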
- I wonder if I can try to use these [unofficial Java bindings](https://github.com/criteo/JVips/blob/master/src/test/java/com/criteo/vips/example/SimpleExample.java) in DSpace
- The authors of the JVips project wrote [a nice blog post about libvips performance](https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55)
- Ouch, JVips is Java 8 only as far as I can tell... that works now, but it's a non-starter going forward
- Peter got back to me about the journal title cleanup
- From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref's
- Anyways, I exported the originals that were the same in Sherpa Romeo and Crossref, as well as Peter's selections for where Sherpa Romeo and Crossref differed:
- Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
- I exported a list of all the journal titles we have in the `cg.journal` field:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
```
- I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don't match, so I'd have to go check many of them manually before selecting a match or fixing them...
- I think it's better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
- Or instead of doing it via SQL I could use CSV and parse the values there...
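- For example, the Crossref REST API can resolve a journal by ISSN, so the script mostly has to map ISSNs to titles; a quick sketch (assuming `jq` is installed):
```console
$ curl -s 'https://api.crossref.org/journals/0003-1305' | jq -r '.message.title'
The American Statistician
```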
- A few more issues:
- Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org's web search (their API is invite only)
- Some titles are different across all three datasets, for example ISSN 0003-1305:
- [According to ISSN.org](https://portal.issn.org/resource/ISSN/0003-1305) this is "The American statistician"
- [According to Sherpa Romeo](https://v2.sherpa.ac.uk/id/publication/20807) this is "American Statistician"
- [According to Crossref](https://search.crossref.org/?q=0003-1305&from_ui=yes&container-title=The+American+Statistician) this is "The American Statistician"
- I also realized that our previous controlled vocabulary came from CGSpace's top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
- Now I went back and merged the previous with the new, and manually removed duplicates (sigh)
- I requested access to the issn.org OAI-PMH API so I can use their registry...
- Sherpa Romeo pointed me to their [inclusion criteria](https://v2.sherpa.ac.uk/romeo/about.html) and said that missing journals should submit their open access policies to be included
- The contact from issn.org got back to me and said I would have to pay 1,000 EUR per year for 100,000 requests to their API... no thanks
- We specifically discussed formalizing the API and documenting its use so it can serve as an alternative to harvesting directly from CGSpace
- We also discussed allowing linking to search results to enable something like "Explore this collection" links on CGSpace collection pages
- Lowercase all AGROVOC metadata, as I had noticed a few terms in sentence case:
```console
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 484
```
- Also update some DOIs that were using the old `dx.doi.org` format, just to keep things uniform:
```console
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE 469
```
- Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago; on DSpace 6 that's the usual invocation (the path depends on the installation):
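```console
$ time ~/dspace/bin/dspace index-discovery -b
```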
- Attempt to re-do my tests with VisualVM from 2019-04
- I found that I can't connect to the Tomcat JMX port using SSH forwarding (VisualVM gives an error about localhost already being monitored)
- Instead, I had to create a SOCKS proxy with SSH (`ssh -D 8096`), then set that up as a proxy in the VisualVM network settings, and then add the JMX connection
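- For posterity, the workflow was roughly this (hostname hypothetical):
```console
# open a SOCKS proxy on local port 8096
$ ssh -D 8096 user@dspace-host
# then in VisualVM: set a manual SOCKS proxy of localhost:8096 in the
# network options, and add a JMX connection to the remote host and port
```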
- In the last few days I did a lot of work on OpenRXV
- I started exploring the Angular 9.0 to 9.1 update
- I tested some updates to dependencies for Angular 9 that we somehow missed, like `@tinymce/tinymce-angular`, `@nicky-lenaers/ngx-scroll-to`, and `@ng-select/ng-select`
- I changed the default target from ES5 to ES2015 because ES5 was released in 2009 and the only thing we lose by moving to ES2015 is IE11 support
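- That's a one-line change in the project's TypeScript configuration; a sketch of the relevant part of `tsconfig.json`:
```json
{
  "compilerOptions": {
    "target": "es2015"
  }
}
```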
- I fixed a handful of issues in the Docker build and deployment process
- I started exploring changing the Docker configuration from using volumes to `COPY` instructions in the `Dockerfile`, because we are having sporadic permissions issues in the containers, caused by mounting the host's frontend/backend directories and not being able to write to them
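- The idea is to build in one stage and `COPY` the artifacts into the final image so the container owns its files; a minimal sketch (image tags and paths are assumptions, not OpenRXV's actual Dockerfile):
```dockerfile
# build stage: install dependencies and compile the frontend
FROM node:12-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# runtime stage: copy only the build output, owned by the image itself
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
```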
- I tested moving from node-sass to sass, as it has been [supported since Angular 8 apparently](https://blog.ninja-squad.com/2019/05/29/angular-cli-8.0/) and will allow us to avoid stupid node-gyp issues
## 2021-08-25
- I did a bunch of tests of the OpenRXV Angular 9.1 update and merged it to master ([#115](https://github.com/ilri/OpenRXV/pull/115))
- Last week Maria Garruccio sent me a handful of new ORCID identifiers for Bioversity staff
- We currently have 1,320 unique identifiers, so this adds eleven new ones; combining and de-duplicating the lists is something like this sketch (file names hypothetical):
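```console
$ cat existing-orcid-ids.txt new-bioversity-orcid-ids.txt | \
    grep -oE '[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9X]{4}' | \
    sort -u > 2021-08-25-combined-orcid-ids.txt
```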
- Yesterday I finished the work to make OpenRXV use a new multi-stage Docker build system and use smarter `COPY` instructions instead of runtime volumes
- Today I merged the changes to the master branch and re-deployed AReS on linode20
- Because the `docker-compose.yml` moved to the root, the Docker volume prefix changed from `docker_` to `openrxv_`, so I had to stop the containers and rsync the data from the old volumes to the new ones in `/var/lib/docker`
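- The copy itself was straightforward, along these lines (volume names hypothetical; only the prefix actually changed):
```console
$ docker-compose down
$ sudo rsync -av /var/lib/docker/volumes/docker_esData/_data/ \
    /var/lib/docker/volumes/openrxv_esData/_data/
```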