diff --git a/content/posts/2021-11.md b/content/posts/2021-11.md
index bc14bd91c..434ff6a0e 100644
--- a/content/posts/2021-11.md
+++ b/content/posts/2021-11.md
@@ -71,6 +71,51 @@ $ docker-compose build
```
- Then restart the server and start a fresh harvest
-- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017 today)
+- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017, 2016, 2015, and 2014 today)
+- Several users wrote to me last week to say that workflow emails haven't been working since 2021-10-21 or so
+  - I did a test on CGSpace and it's indeed broken:
+
+```console
+$ dspace test-email
+
+About to send test email:
+ - To: fuuuu
+ - Subject: DSpace test email
+ - Server: smtp.office365.com
+
+Error sending email:
+ - Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
+)
+
+Please see the DSpace documentation for assistance.
+```
+
+- I sent a message to ILRI ICT to ask them to check the account/password
+- I want to do one last test of the Elasticsearch updates on OpenRXV so I got a snapshot of the latest Elasticsearch volume used on the production AReS instance:
+
+```console
+# tar czf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
+```
+
+- Then on my local server:
+
+```console
+$ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
+$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components=4
+$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod 660 {} \;
+$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod 770 {} \;
+# copy backend/data to /tmp for the repository setup/layout
+$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
+```
+
+- This seems to work: all items, stats, and repository setup/layout are OK
+- I merged my [Elasticsearch pull request](https://github.com/ilri/OpenRXV/pull/126) from last month into OpenRXV
+
+## 2021-11-08
+
+- File [an issue for the Angular flash of unstyled content](https://github.com/DSpace/dspace-angular/issues/1391) on DSpace 7
+- Help Udana from IWMI with a question about CGSpace statistics
+  - He found conflicting numbers when using the community and collection modes in Content and Usage Analysis
+  - I sent him more numbers directly from the DSpace Statistics API
diff --git a/docs/2015-11/index.html b/docs/2015-11/index.html
index 0488d1ac1..45a0f8697 100644
--- a/docs/2015-11/index.html
+++ b/docs/2015-11/index.html
@@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
78
"/>
-
+
diff --git a/docs/2015-12/index.html b/docs/2015-12/index.html
index 96f50eb2f..ef55764e0 100644
--- a/docs/2015-12/index.html
+++ b/docs/2015-12/index.html
@@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
-
+
diff --git a/docs/2016-01/index.html b/docs/2016-01/index.html
index 554de7f5d..4044ea3c4 100644
--- a/docs/2016-01/index.html
+++ b/docs/2016-01/index.html
@@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
-
+
diff --git a/docs/2016-02/index.html b/docs/2016-02/index.html
index fd2295778..cfcdb427f 100644
--- a/docs/2016-02/index.html
+++ b/docs/2016-02/index.html
@@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)…
Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
"/>
-
+
diff --git a/docs/2016-03/index.html b/docs/2016-03/index.html
index cd1e1290f..22e25fc61 100644
--- a/docs/2016-03/index.html
+++ b/docs/2016-03/index.html
@@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
-
+
diff --git a/docs/2016-04/index.html b/docs/2016-04/index.html
index b9e1f7d60..4fbf3b52f 100644
--- a/docs/2016-04/index.html
+++ b/docs/2016-04/index.html
@@ -32,7 +32,7 @@ After running DSpace for over five years I’ve never needed to look in any
This will save us a few gigs of backup space we’re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
-
+
diff --git a/docs/2016-05/index.html b/docs/2016-05/index.html
index 64fc158c3..cf3d62784 100644
--- a/docs/2016-05/index.html
+++ b/docs/2016-05/index.html
@@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
-
+
diff --git a/docs/2016-06/index.html b/docs/2016-06/index.html
index 0ee8a6a32..4174bf659 100644
--- a/docs/2016-06/index.html
+++ b/docs/2016-06/index.html
@@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
-
+
diff --git a/docs/2016-07/index.html b/docs/2016-07/index.html
index 5922910a0..ee3d51f92 100644
--- a/docs/2016-07/index.html
+++ b/docs/2016-07/index.html
@@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
-
+
diff --git a/docs/2016-08/index.html b/docs/2016-08/index.html
index 597537ec1..f0de27b6d 100644
--- a/docs/2016-08/index.html
+++ b/docs/2016-08/index.html
@@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
-
+
diff --git a/docs/2016-09/index.html b/docs/2016-09/index.html
index e76e8cdcf..4c5accfa8 100644
--- a/docs/2016-09/index.html
+++ b/docs/2016-09/index.html
@@ -34,7 +34,7 @@ It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
"/>
-
+
diff --git a/docs/2016-10/index.html b/docs/2016-10/index.html
index 70c82e421..cec9a1acb 100644
--- a/docs/2016-10/index.html
+++ b/docs/2016-10/index.html
@@ -42,7 +42,7 @@ I exported a random item’s metadata as CSV, deleted all columns except id
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
-
+
diff --git a/docs/2016-11/index.html b/docs/2016-11/index.html
index 392aa1ae4..d7c9f794e 100644
--- a/docs/2016-11/index.html
+++ b/docs/2016-11/index.html
@@ -26,7 +26,7 @@ Add dc.type to the output options for Atmire’s Listings and Reports module
Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
"/>
-
+
diff --git a/docs/2016-12/index.html b/docs/2016-12/index.html
index a272af1b9..233d0c30d 100644
--- a/docs/2016-12/index.html
+++ b/docs/2016-12/index.html
@@ -46,7 +46,7 @@ I see thousands of them in the logs for the last few months, so it’s not r
I’ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
"/>
-
+
diff --git a/docs/2017-01/index.html b/docs/2017-01/index.html
index 5f5f1c047..f32b29c56 100644
--- a/docs/2017-01/index.html
+++ b/docs/2017-01/index.html
@@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
I tested on DSpace Test as well and it doesn’t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
"/>
-
+
diff --git a/docs/2017-02/index.html b/docs/2017-02/index.html
index 212de38c2..f747e30cd 100644
--- a/docs/2017-02/index.html
+++ b/docs/2017-02/index.html
@@ -50,7 +50,7 @@ DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
"/>
-
+
diff --git a/docs/2017-03/index.html b/docs/2017-03/index.html
index d19b70794..ce8a0ac47 100644
--- a/docs/2017-03/index.html
+++ b/docs/2017-03/index.html
@@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing reg
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
"/>
-
+
diff --git a/docs/2017-04/index.html b/docs/2017-04/index.html
index 9363d21a2..f48e4db8b 100644
--- a/docs/2017-04/index.html
+++ b/docs/2017-04/index.html
@@ -40,7 +40,7 @@ Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
"/>
-
+
diff --git a/docs/2017-05/index.html b/docs/2017-05/index.html
index 2f51e2282..4d056aff6 100644
--- a/docs/2017-05/index.html
+++ b/docs/2017-05/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/docs/2017-06/index.html b/docs/2017-06/index.html
index 8a87f1271..6c3bf6fa6 100644
--- a/docs/2017-06/index.html
+++ b/docs/2017-06/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/docs/2017-07/index.html b/docs/2017-07/index.html
index 4a57bc4b9..ff564235c 100644
--- a/docs/2017-07/index.html
+++ b/docs/2017-07/index.html
@@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
"/>
-
+
diff --git a/docs/2017-08/index.html b/docs/2017-08/index.html
index 6babd9aba..7efa4a1b1 100644
--- a/docs/2017-08/index.html
+++ b/docs/2017-08/index.html
@@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
-
+
diff --git a/docs/2017-09/index.html b/docs/2017-09/index.html
index 7b4fbf66e..3f39459b2 100644
--- a/docs/2017-09/index.html
+++ b/docs/2017-09/index.html
@@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
"/>
-
+
diff --git a/docs/2017-10/index.html b/docs/2017-10/index.html
index 9831851bb..83ab15209 100644
--- a/docs/2017-10/index.html
+++ b/docs/2017-10/index.html
@@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
-
+
diff --git a/docs/2017-11/index.html b/docs/2017-11/index.html
index 063eb9d27..866630d18 100644
--- a/docs/2017-11/index.html
+++ b/docs/2017-11/index.html
@@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
-
+
diff --git a/docs/2017-12/index.html b/docs/2017-12/index.html
index 4906ba3da..834bb0a87 100644
--- a/docs/2017-12/index.html
+++ b/docs/2017-12/index.html
@@ -30,7 +30,7 @@ The logs say “Timeout waiting for idle object”
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
"/>
-
+
diff --git a/docs/2018-01/index.html b/docs/2018-01/index.html
index 7fc57a2d7..a54846799 100644
--- a/docs/2018-01/index.html
+++ b/docs/2018-01/index.html
@@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
"/>
-
+
diff --git a/docs/2018-02/index.html b/docs/2018-02/index.html
index 6f56ad9d4..c5d0f7ddc 100644
--- a/docs/2018-02/index.html
+++ b/docs/2018-02/index.html
@@ -30,7 +30,7 @@ We don’t need to distinguish between internal and external works, so that
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
"/>
-
+
diff --git a/docs/2018-03/index.html b/docs/2018-03/index.html
index 4aa7447ad..974fd7896 100644
--- a/docs/2018-03/index.html
+++ b/docs/2018-03/index.html
@@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
Export a CSV of the IITA community metadata for Martin Mueller
"/>
-
+
diff --git a/docs/2018-04/index.html b/docs/2018-04/index.html
index 288e9edbd..67aa9f818 100644
--- a/docs/2018-04/index.html
+++ b/docs/2018-04/index.html
@@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
I tried to test something on DSpace Test but noticed that it’s down since god knows when
Catalina logs at least show some memory errors yesterday:
"/>
-
+
diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html
index 4077eb19f..4ce624298 100644
--- a/docs/2018-05/index.html
+++ b/docs/2018-05/index.html
@@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
-
+
diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html
index 2fa52b741..fa669a915 100644
--- a/docs/2018-06/index.html
+++ b/docs/2018-06/index.html
@@ -58,7 +58,7 @@ real 74m42.646s
user 8m5.056s
sys 2m7.289s
"/>
-
+
diff --git a/docs/2018-07/index.html b/docs/2018-07/index.html
index 3e30ee642..a44cfa517 100644
--- a/docs/2018-07/index.html
+++ b/docs/2018-07/index.html
@@ -36,7 +36,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
There is insufficient memory for the Java Runtime Environment to continue.
"/>
-
+
diff --git a/docs/2018-08/index.html b/docs/2018-08/index.html
index e41da3e65..18c1f684e 100644
--- a/docs/2018-08/index.html
+++ b/docs/2018-08/index.html
@@ -46,7 +46,7 @@ Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did
The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it
"/>
-
+
diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html
index dbd19be32..c91dd050a 100644
--- a/docs/2018-09/index.html
+++ b/docs/2018-09/index.html
@@ -30,7 +30,7 @@ I’ll update the DSpace role in our Ansible infrastructure playbooks and ru
Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
"/>
-
+
diff --git a/docs/2018-10/index.html b/docs/2018-10/index.html
index 66232c9d6..772e1081f 100644
--- a/docs/2018-10/index.html
+++ b/docs/2018-10/index.html
@@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
"/>
-
+
diff --git a/docs/2018-11/index.html b/docs/2018-11/index.html
index 7af5f2ae2..9ddc66e59 100644
--- a/docs/2018-11/index.html
+++ b/docs/2018-11/index.html
@@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
Today these are the top 10 IPs:
"/>
-
+
@@ -458,7 +458,7 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
value.replace('�','')
$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
-
$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
+
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
-
-real 92m14.294s
+$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real 92m14.294s
user 7m59.840s
sys 2m22.327s
-
+
- I realized I had been using an incorrect Solr query to purge unmigrated items after processing with solr-upgrade-statistics-6x…
- Instead of this:
(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
@@ -1148,10 +1148,10 @@ $ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | u
-$ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
+$ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
-$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/new-elasticsearch-mappings.txt
-
+$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/new-elasticsearch-mappings.txt
+
- The latest indexing (second for today!) finally finished on AReS and the countries and affiliations/crps/journals all look MUCH better
- There are still a few acronyms present, some of which are in the value mappings and some which aren’t
diff --git a/docs/2020-11/index.html b/docs/2020-11/index.html
index c75f15f88..320f6dedf 100644
--- a/docs/2020-11/index.html
+++ b/docs/2020-11/index.html
@@ -32,7 +32,7 @@ So far we’ve spent at least fifty hours to process the statistics and stat
"/>
-
+
diff --git a/docs/2020-12/index.html b/docs/2020-12/index.html
index 9b2f3276e..2b5204214 100644
--- a/docs/2020-12/index.html
+++ b/docs/2020-12/index.html
@@ -36,7 +36,7 @@ I started processing those (about 411,000 records):
"/>
-
+
@@ -132,8 +132,8 @@ I started processing those (about 411,000 records):
-$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
-
+$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
+
- AReS went down when the renew-letsencrypt service stopped the angular_nginx container in the pre-update hook and failed to bring it back up
- I ran all system updates on the host and rebooted it and AReS came back up OK
@@ -179,18 +179,18 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
- First the 2010 core:
-
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
-$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
-$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
-
+$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
+$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
+$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
+
- Judging by the DSpace logs all these cores had a problem starting up in the last month:
-
# grep -rsI "Unable to create core" [dspace]/log/dspace.log.2020-* | grep -o -E "statistics-[0-9]+" | sort | uniq -c
+# grep -rsI "Unable to create core" [dspace]/log/dspace.log.2020-* | grep -o -E "statistics-[0-9]+" | sort | uniq -c
24 statistics-2010
24 statistics-2015
18 statistics-2016
6 statistics-2018
-
+
- The message is always this:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
@@ -223,9 +223,9 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
- There are apparently 1,700 locks right now:
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1739
-
2020-12-08
+2020-12-08
- Atmire sent some instructions for using the DeduplicateValuesProcessor
@@ -341,17 +341,17 @@ Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected s
- I can see it in the openrxv-items-final index:
-
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
+$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
{
- "_shards" : {
- "failed" : 0,
- "skipped" : 0,
- "successful" : 1,
- "total" : 1
+ "_shards" : {
+ "failed" : 0,
+ "skipped" : 0,
+ "successful" : 1,
+ "total" : 1
},
- "count" : 299922
+ "count" : 299922
}
-
+
- I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/64
- For now I will try to delete the index and start a re-harvest in the Admin UI:
@@ -371,8 +371,8 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
-
localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
-
2020-12-14
+localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
+
2020-12-14
- The re-harvesting finished last night on AReS but there are no records in the openrxv-items-final index
@@ -380,44 +380,44 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
{
- "count" : 99992,
- "_shards" : {
- "skipped" : 0,
- "total" : 1,
- "failed" : 0,
- "successful" : 1
+ "count" : 99992,
+ "_shards" : {
+ "skipped" : 0,
+ "total" : 1,
+ "failed" : 0,
+ "successful" : 1
}
}
-
+
- I’m going to try to clone the temp index to the final one…
- First, set the openrxv-items-temp index to block writes (read only) and then clone it to openrxv-items-final:
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
-{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-
+{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+
- Now I see that the openrxv-items-final index has items, but there are still none in AReS Explorer UI!
-$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
- "count" : 99992,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 99992,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- The api logs show this from last night after the harvesting:
-[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
+[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
[Nest] 92 - 12/13/2020, 10:50:20 PM [FetchConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [PluginsConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [HarvesterService] reindex function is called
@@ -426,16 +426,16 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
-
+
- But I’m not sure why the frontend doesn’t show any data despite there being documents in the index…
- I talked to Moayad and he reminded me that OpenRXV uses an alias to point to temp and final indexes, but the UI actually uses the openrxv-items index
- I cloned the openrxv-items-final index to openrxv-items index and now I see items in the explorer UI
- The PDF report was broken and I looked in the API logs and saw this:
-(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
+(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
-
+
- I installed unoconv in the backend api container and now it works… but I wonder why this changed…
- Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
@@ -487,10 +487,10 @@ $ query-json '.items | length' /tmp/policy2.json
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-
2020-12-15
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+2020-12-15
- After the re-harvest last night there were 200,000 items in the openrxv-items-temp index again
@@ -499,36 +499,36 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H
- I checked the 1,534 fixes in Open Refine (had to fix a few UTF-8 errors, as always from Peter’s CSVs) and then applied them using the fix-metadata-values.py script:
-$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
-$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
-
+$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
+$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
+
- Since I was re-indexing Discovery anyways I decided to check for any uppercase AGROVOC and lowercase them:
-
dspace=# BEGIN;
+dspace=# BEGIN;
BEGIN
-dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
+dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 406
dspace=# COMMIT;
COMMIT
-
+
- I also updated the Font Awesome icon classes for version 5 syntax:
-dspace=# BEGIN;
-dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
+dspace=# BEGIN;
+dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
UPDATE 74
-dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
+dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
UPDATE 74
dspace=# COMMIT;
-
+
- Then I started a full Discovery re-index:
-$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
-$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
-
-real 265m11.224s
+$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
+$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real 265m11.224s
user 171m29.141s
sys 2m41.097s
-
+
- Udana sent a report that the WLE approver is experiencing the same issue Peter highlighted a few weeks ago: they are unable to save metadata edits in the workflow
- Yesterday Atmire responded about the owningComm and owningColl duplicates in Solr saying they didn’t see any anymore…
@@ -544,31 +544,31 @@ sys 2m41.097s
- After the Discovery re-indexing finished on CGSpace I prepared to start re-harvesting AReS by making sure the openrxv-items-temp index was empty and that the backup index I made yesterday was still there:
-
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
{
- "acknowledged" : true
+ "acknowledged" : true
}
-$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
- "count" : 0,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 0,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&pretty'
{
- "count" : 99992,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 99992,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
2020-12-16
+2020-12-16
- The harvesting on AReS finished last night so this morning I manually cloned the openrxv-items-temp index to openrxv-items
@@ -576,32 +576,32 @@ $ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&pretty'
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100046,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100046,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
-$ curl -s -X POST "http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty"
-$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
+$ curl -s -X POST "http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty"
+$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
{
- "count" : 100046,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100046,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
-
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
+
- Interestingly the item that we noticed was duplicated now only appears once
- The missing item is still missing
- Jane Poole noticed that the “previous page” and “next page” buttons are not working on AReS
@@ -611,16 +611,16 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
- Generate a list of submitters and approvers active in the last months using the Provenance field on CGSpace:
-$ psql -h localhost -U postgres dspace -c "SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'" > /tmp/provenance.txt
-$ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E "( on |checksum)" | sed -e 's/by //' -e 's/ (/,/' -e 's/)//' | sort | uniq > /tmp/recent-submitters-approvers.csv
-
+$ psql -h localhost -U postgres dspace -c "SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'" > /tmp/provenance.txt
+$ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E "( on |checksum)" | sed -e 's/by //' -e 's/ (/,/' -e 's/)//' | sort | uniq > /tmp/recent-submitters-approvers.csv
+
- Peter wanted it to send some mail to the users…
2020-12-17
- I see some errors from CUA in our Tomcat logs:
-
Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
+Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
Error while updating
java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1155)
@@ -628,7 +628,7 @@ java.lang.UnsupportedOperationException: Multiple update components target the s
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1140)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1129)
...
-
+
- I sent the full stack to Atmire to investigate
- I know we’ve had this “Multiple update components target the same field” error in the past with DSpace 5.x and Atmire said it was harmless, but would nevertheless be fixed in a future update
@@ -636,10 +636,10 @@ java.lang.UnsupportedOperationException: Multiple update components target the s
- I was trying to export the ILRI community on CGSpace so I could update one of the ILRI author’s names, but it throws an error…
-$ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
+$ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
-Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
+Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
Exception: null
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
@@ -654,14 +654,14 @@ java.lang.NullPointerException
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
-
+
- I did it via CSV with fix-metadata-values.py instead:
-$ cat 2020-12-17-update-ILRI-author.csv
+$ cat 2020-12-17-update-ILRI-author.csv
dc.contributor.author,correct
-"Padmakumar, V.P.","Varijakshapanicker, Padmakumar"
-$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
-
+"Padmakumar, V.P.","Varijakshapanicker, Padmakumar"
+$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
+
- Abenet needed a list of all 2020 outputs from the Livestock CRP that were Limited Access
- I exported the community from CGSpace and used csvcut and csvgrep to get a list:
@@ -689,7 +689,7 @@ $ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u ds
- The DeduplicateValuesProcessor has been running on DSpace Test since two days ago and it almost completed its second twelve-hour run, but crashed near the end:
-
...
+...
Run 1 — 100% — 8,230,000/8,239,228 docs — 39s — 9h 8m 31s
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
@@ -725,7 +725,7 @@ java.lang.OutOfMemoryError: Java heap space
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
-
+
- That was with a JVM heap of 512m
- I looked in Solr and found dozens of duplicates of each field again…
@@ -744,30 +744,30 @@ java.lang.OutOfMemoryError: Java heap space
- The AReS harvest finished this morning and I moved the Elasticsearch index manually
- First, check the number of records in the temp index to make sure it seems complete and not with double data:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100135,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100135,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- Then delete the old backup and clone the current items index as a backup:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
-$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-21
-
+
- Then delete the current items index and clone it from temp:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-
2020-12-22
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+2020-12-22
- I finished getting the Swagger UI integrated into the dspace-statistics-api
@@ -810,10 +810,10 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H
- I exported the 2012 stats from the year core and imported them to the main statistics core with solr-import-export-json:
-$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
-$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
-$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
-
+$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
+$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
+$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
+
- I decided to do the same for the remaining 2011, 2014, 2017, and 2019 cores…
2020-12-29
@@ -824,31 +824,31 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
-$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
{
- "count" : 100135,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100135,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
-$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-29
-$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-
2020-12-30
+$ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+2020-12-30
- The indexing on AReS finished so I cloned the openrxv-items-temp index to openrxv-items and deleted the backup index:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-29?pretty'
-
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-29?pretty'
+
diff --git a/docs/2021-01/index.html b/docs/2021-01/index.html
index 9fb3a019d..be43ecb41 100644
--- a/docs/2021-01/index.html
+++ b/docs/2021-01/index.html
@@ -50,7 +50,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
"/>
-
+
@@ -160,29 +160,29 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
-
+
- Then, the next morning when it’s done, check the results of the harvesting, backup the current openrxv-items index, and clone the openrxv-items-temp index to openrxv-items:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100278,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100278,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-04
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
-
2021-01-04
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
+2021-01-04
- There is one item that appears twice in AReS: 10568/66839
@@ -214,8 +214,8 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
-$ ./doi-to-handle.py -db dspace -u dspace -p 'fuuu' -i /tmp/dois.txt -o /tmp/out.csv
-
+$ ./doi-to-handle.py -db dspace -u dspace -p 'fuuu' -i /tmp/dois.txt -o /tmp/out.csv
+
- Help Udana export IWMI records from AReS
- He wanted me to give him CSV export permissions on CGSpace, but I told him that this requires super admin so I’m not comfortable with it
@@ -261,12 +261,12 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
-
2021-01-10 10:03:27,692 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID="TX35636856957739531161091194485578658698")
-
+2021-01-10 10:03:27,692 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID="TX35636856957739531161091194485578658698")
+
- I filed a bug on Atmire’s issue tracker
- Peter asked me to move the CGIAR Gender Platform community to the top level of CGSpace, but I get an error when I use the community-filiator command:
-
$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
+$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
Loading @mire database changes for module MQM
Changes have been processed
Exception: null
@@ -282,7 +282,7 @@ java.lang.UnsupportedOperationException
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
-
+
- There is apparently a bug in DSpace 6.x that makes community-filiator not work
- There is a patch for the as-of-yet unreleased DSpace 6.4 so I will try that
@@ -301,24 +301,24 @@ java.lang.UnsupportedOperationException
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
... after ten hours
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100411,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100411,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+
- Looking over the last month of Solr stats I see a familiar bot that should have been marked as a bot months ago:
@@ -331,9 +331,9 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ cat log/dspace.log.2020-12-2* | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71' | sort | uniq | wc -l
+$ cat log/dspace.log.2020-12-2* | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71' | sort | uniq | wc -l
0
-
+
- So now I should really add it to the DSpace spider agent list so it doesn’t create Solr hits
- I added it to the “ilri” lists of spider agent patterns
@@ -341,8 +341,8 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
- I purged the existing hits using my check-spider-ip-hits.sh script:
-$ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
-
2021-01-11
+$ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
+
2021-01-11
- The AReS indexing finished this morning and I moved the openrxv-items-temp core to openrxv-items (see above)
@@ -351,8 +351,8 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
- I deployed the community-filiator fix on CGSpace and moved the Gender Platform community to the top level of CGSpace:
-$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
-
2021-01-12
+$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
+
2021-01-12
- IWMI is really pressuring us to have a periodic CSV export of their community
@@ -393,29 +393,29 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
-
+
- Then, the next morning when it’s done, check the results of the harvesting, backup the current openrxv-items index, and clone the openrxv-items-temp index to openrxv-items:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100540,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100540,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-18
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-18'
-
2021-01-18
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-18'
+2021-01-18
- Finish the indexing on AReS that I started yesterday
- Udana from IWMI emailed me to ask why the iwmi.csv doesn’t include items he approved to CGSpace this morning
@@ -462,9 +462,9 @@ localhost/dspace63= > COMMIT;
-$ docker exec -it api /bin/bash
-# apt update && apt install unoconv
-
+$ docker exec -it api /bin/bash
+# apt update && apt install unoconv
+
- Help Peter get a list of titles and DOIs for CGSpace items that Altmetric does not have an attention score for
- He generated a list from their dashboard and I extracted the DOIs in OpenRefine (because it was WINDOWS-1252 and csvcut couldn’t do it)
@@ -512,30 +512,30 @@ localhost/dspace63= > COMMIT;
-
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
-
+
- Then, the next morning when it’s done, check the results of the harvesting, backup the current openrxv-items index, and clone the openrxv-items-temp index to openrxv-items:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100699,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100699,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.b
-locks.write":true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.b
+locks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-25
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-25'
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-25'
+
- Resume working on CG Core v2, I realized a few things:
- We are trying to move from dc.identifier.issn (and ISBN) to cg.issn, but this is currently implemented as a “qualdrop” input in DSpace’s submission form, which only works to fill in the qualifier (ie dc.identifier.xxxx)
@@ -601,8 +601,8 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
- I filed a bug on DSpace’s issue tracker (though I accidentally hit Enter and submitted it before I finished, and there is no edit function)
- Looking into Linode report that the load outbound traffic rate was high this morning:
-# grep -E '26/Jan/2021:(08|09|10|11|12)' /var/log/nginx/rest.log | goaccess --log-format=COMBINED -
-
+# grep -E '26/Jan/2021:(08|09|10|11|12)' /var/log/nginx/rest.log | goaccess --log-format=COMBINED -
+
- The culprit seems to be the ILRI publications importer, so that’s OK
- But I also see an IP in Jordan hitting the REST API 1,100 times today:
@@ -615,8 +615,8 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
- I purged all ~3,000 statistics hits that have the “http://wp.local/" referrer:
-$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>referrer:http\:\/\/wp\.local\/</query></delete>"
-
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>referrer:http\:\/\/wp\.local\/</query></delete>"
+
- Tag version 0.4.3 of the csv-metadata-quality tool on GitHub: https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.3
- I just realized that I never submitted this to CGSpace as a Big Data Platform output
@@ -661,9 +661,9 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
-
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
-
+
- Sent out emails about CG Core v2 to Macaroni Bros, Fabio, Hector at CCAFS, Dani and Tariku
- A bit more minor work on testing the series/report/journal changes for CG Core v2
diff --git a/docs/2021-02/index.html b/docs/2021-02/index.html
index ce2fd2cb8..e9495a11d 100644
--- a/docs/2021-02/index.html
+++ b/docs/2021-02/index.html
@@ -20,12 +20,12 @@ Check the results of the AReS harvesting from last night:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100875,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100875,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
" />
@@ -51,16 +51,16 @@ Check the results of the AReS harvesting from last night:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100875,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100875,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
"/>
-
+
@@ -157,34 +157,34 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty
I had a call with CodeObia to discuss the work on OpenRXV
Check the results of the AReS harvesting from last night:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100875,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100875,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- Set the current items index to read only and make a backup:
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01
-
+
- Delete the current items index and clone the temp one to it:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-
+
- Then delete the temp and backup:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-{"acknowledged":true}%
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+{"acknowledged":true}%
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
+
- Meeting with Peter and Abenet about CGSpace goals and progress
- Test submission to DSpace via REST API to see if Abenet can fix / reject it (submit workflow?)
- Get Peter a list of users who have submitted or approved on DSpace everrrrrrr, so he can remove some
@@ -196,10 +196,10 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
- I tried to export the ILRI community from CGSpace but I got an error:
-
$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
+$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
-Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
+Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
Exception: null
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
@@ -214,7 +214,7 @@ java.lang.NullPointerException
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
-
+
- I imported the production database to my local development environment and I get the same error… WTF is this?
- I was able to export another smaller community
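- Regarding the production database import above, the local restore is roughly like this; the database name, role, and dump path are assumptions for illustration:

```console
$ dropdb -h localhost -U postgres dspace63
$ createdb -h localhost -U postgres -O dspace dspace63
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspace /tmp/cgspace_2021-02-03.backup
```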
@@ -234,16 +234,16 @@ java.lang.NullPointerException
- Maria Garruccio sent me some new ORCID iDs for Bioversity authors, as well as a correction for Stefan Burkart’s iD
- I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:
-$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt
+$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
-
+
- I sorted the names and added the XML formatting in vim, then ran it through tidy:
-$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
-
+$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
+
- Then I added all the changed names plus Stefan’s incorrect ones to a CSV and processed them with fix-metadata-values.py:
-
$ cat 2021-02-02-fix-orcid-ids.csv
+$ cat 2021-02-02-fix-orcid-ids.csv
cg.creator.id,correct
Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
@@ -254,8 +254,8 @@ Bedru: 0000-0002-7344-5743,Bedru B. Balana: 0000-0002-7344-5743
Leigh Winowiecki: 0000-0001-5572-1284,Leigh Ann Winowiecki: 0000-0001-5572-1284
Sander J. Zwart: 0000-0002-5091-1801,Sander Zwart: 0000-0002-5091-1801
saul lozano-fuentes: 0000-0003-1517-6853,Saul Lozano: 0000-0003-1517-6853
-$ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p 'fuuu' -f cg.creator.id -t 'correct' -m 240
-
+$ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p 'fuuu' -f cg.creator.id -t 'correct' -m 240
+
- I also looked up which of these new authors might have existing items that are missing ORCID iDs
- I had to port my add-orcid-identifiers-csv.py to DSpace 6 UUIDs and I think it’s working but I want to do a few more tests because it uses a sequence for the metadata_value_id
@@ -263,23 +263,23 @@ $ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u
- Tag forty-three items from Bioversity’s new authors with ORCID iDs using add-orcid-identifiers-csv.py:
-$ cat /tmp/2021-02-02-add-orcid-ids.csv
+$ cat /tmp/2021-02-02-add-orcid-ids.csv
dc.contributor.author,cg.creator.id
-"Nchanji, E.",Eileen Bogweh Nchanji: 0000-0002-6859-0962
-"Nchanji, Eileen",Eileen Bogweh Nchanji: 0000-0002-6859-0962
-"Nchanji, Eileen Bogweh",Eileen Bogweh Nchanji: 0000-0002-6859-0962
-"Machida, Lewis",Lewis Machida: 0000-0002-0012-3997
-"Mockshell, Jonathan",Jonathan Mockshell: 0000-0003-1990-6657"
-"Aubert, C.",Celine Aubert: 0000-0001-6284-4821
-"Aubert, Céline",Celine Aubert: 0000-0001-6284-4821
-"Devare, M.",Medha Devare: 0000-0003-0041-4812
-"Devare, Medha",Medha Devare: 0000-0003-0041-4812
-"Benites-Alfaro, O.E.",Omar E. Benites-Alfaro: 0000-0002-6852-9598
-"Benites-Alfaro, Omar Eduardo",Omar E. Benites-Alfaro: 0000-0002-6852-9598
-"Johnson, Vincent",VINCENT JOHNSON: 0000-0001-7874-178X
-"Lesueur, Didier",didier lesueur: 0000-0002-6694-0869
-$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -d
-
+"Nchanji, E.",Eileen Bogweh Nchanji: 0000-0002-6859-0962
+"Nchanji, Eileen",Eileen Bogweh Nchanji: 0000-0002-6859-0962
+"Nchanji, Eileen Bogweh",Eileen Bogweh Nchanji: 0000-0002-6859-0962
+"Machida, Lewis",Lewis Machida: 0000-0002-0012-3997
+"Mockshell, Jonathan",Jonathan Mockshell: 0000-0003-1990-6657"
+"Aubert, C.",Celine Aubert: 0000-0001-6284-4821
+"Aubert, Céline",Celine Aubert: 0000-0001-6284-4821
+"Devare, M.",Medha Devare: 0000-0003-0041-4812
+"Devare, Medha",Medha Devare: 0000-0003-0041-4812
+"Benites-Alfaro, O.E.",Omar E. Benites-Alfaro: 0000-0002-6852-9598
+"Benites-Alfaro, Omar Eduardo",Omar E. Benites-Alfaro: 0000-0002-6852-9598
+"Johnson, Vincent",VINCENT JOHNSON: 0000-0001-7874-178X
+"Lesueur, Didier",didier lesueur: 0000-0002-6694-0869
+$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -d
+
- I’m working on the CGSpace accession for Karl Rich’s Viet Nam Pig Model 2018 and I noticed his ORCID iD is missing from CGSpace
- I added it and tagged 141 items of his with the iD
@@ -300,9 +300,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db d
-$ time chrt -b 0 dspace index-discovery -b
+$ time chrt -b 0 dspace index-discovery -b
$ dspace oai import -c
-
+
- Attend Accenture meeting for repository managers
- Not clear what the SMO wants to get out of us
@@ -333,8 +333,8 @@ $ dspace oai import -c
-$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
-
+$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
+
- The corrected versions have a lot of encoding issues so I asked Peter to give me the correct ones so I can search/replace them:
- CIAT Publicaçao
@@ -358,8 +358,8 @@ $ dspace oai import -c
- I ended up using python-ftfy to fix those very easily, then replaced them in the CSV
- Then I trimmed whitespace at the beginning, end, and around the “;”, and applied the 1,600 fixes using fix-metadata-values.py:
-$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
-
+$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
+
- Help Peter debug an issue with one of Alan Duncan’s new FEAST Data reports on CGSpace
- For some reason the default policy for the item was “COLLECTION_492_DEFAULT_READ” group, which had zero members
@@ -372,12 +372,12 @@ $ dspace oai import -c
- Run system updates on CGSpace (linode18), deploy latest 6_x-prod branch, and reboot the server
- After the server came back up I started a full Discovery re-indexing:
-$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
-
-real 247m30.850s
+$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real 247m30.850s
user 160m36.657s
sys 2m26.050s
-
+
- Regarding the CG Core v2 migration, Fabio wrote to tell me that he is not using CGSpace directly, instead harvesting via GARDIAN
- He gave me the contact of Sotiris Konstantinidis, who is the CTO at SCIO Systems and works on the GARDIAN platform
@@ -385,30 +385,30 @@ sys 2m26.050s
- Delete the old Elasticsearch temp index to prepare for starting an AReS re-harvest:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
-
2021-02-08
+2021-02-08
- Finish rotating the AReS indexes after the harvesting last night:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100983,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100983,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write":true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-08
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
-
2021-02-10
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
+2021-02-10
- Talk to Abdullah from CodeObia about a few of the issues we filed on OpenRXV
@@ -429,11 +429,11 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
-$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
+$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
30354
-$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l
+$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l
18555
-$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h | tail
+$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h | tail
5 c21a79e5-e24e-4861-aa07-e06703d1deb7
5 c2460aa1-ae28-4003-9a99-2d7c5cd7fd38
5 d73fb3ae-9fac-4f7e-990f-e394f344246c
@@ -444,7 +444,7 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h |
6 fb76888c-03ae-4d53-b27d-87d7ca91371a
6 ff42d1e6-c489-492c-a40a-803cabd901ed
7 094e9e1d-09ff-40ca-a6b9-eca580936147
-
+
- I added a comment to that bug to ask if this is a side effect of the patch
- I started working on tagging pre-2010 ILRI items with license information, like we talked about with Peter and Abenet last week
@@ -452,23 +452,23 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h |
-$ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]'
-
+$ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]'
+
- I imported the CSV into OpenRefine and converted the date text values to date types so I could facet by dates before 2010:
-
if(diff(value,"01/01/2010".toDate(),"days")<0, true, false)
-
+if(diff(value,"01/01/2010".toDate(),"days")<0, true, false)
+
- Then I filtered by publisher to make sure they were only ours:
-
or(
- value.contains("International Livestock Research Institute"),
- value.contains("ILRI"),
- value.contains("International Livestock Centre for Africa"),
- value.contains("ILCA"),
- value.contains("ILRAD"),
- value.contains("International Laboratory for Research on Animal Diseases")
+or(
+ value.contains("International Livestock Research Institute"),
+ value.contains("ILRI"),
+ value.contains("International Livestock Centre for Africa"),
+ value.contains("ILCA"),
+ value.contains("ILRAD"),
+ value.contains("International Laboratory for Research on Animal Diseases")
)
-
+
- I tagged these pre-2010 items with “Other” if they didn’t already have a license
- I checked 2010 to 2015, and 2016 to date, but they were all tagged already!
- In the end I added the “Other” license to 1,523 items from before 2010
@@ -504,8 +504,8 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
- Clear the OpenRXV temp items index:
-
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+
- Then start a full harvesting of CGSpace in the AReS Explorer admin dashboard
- Peter asked me about a few other recently submitted FEAST items that are restricted
@@ -521,35 +521,35 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
-
$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
-
2021-02-15
+$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
+
2021-02-15
- Check the results of the AReS Harvesting from last night:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 101126,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 101126,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- Set the current items index to read only and make a backup:
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-15
-
+
- Delete the current items index and clone the temp one:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15'
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15'
+
- Call with Abdullah from CodeObia to discuss community and collection statistics reporting
2021-02-16
@@ -563,49 +563,49 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15'
- They are definitely bots posing as users, as I see they have created six thousand DSpace sessions today:
-
$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l
+$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l
4007
-$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231' | sort | uniq | wc -l
+$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231' | sort | uniq | wc -l
2128
-
+
- Ah, actually 45.146.165.203 is making requests like this:
-"http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO"
-
+"http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO"
+
- I purged the hits from these two using my check-spider-ip-hits.sh:
-
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 4005 hits from 45.146.165.203 in statistics
Purging 3493 hits from 130.255.161.231 in statistics
-
-Total number of bot hits purged: 7498
-
+
+Total number of bot hits purged: 7498
+
- Ugh, I looked in Solr for the top IPs in 2021-01 and found a few more of these Russian IPs so I purged them too:
-$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 27163 hits from 45.146.164.176 in statistics
Purging 19556 hits from 45.146.165.105 in statistics
Purging 15927 hits from 45.146.165.83 in statistics
Purging 8085 hits from 45.146.165.104 in statistics
-
-Total number of bot hits purged: 70731
-
+
+Total number of bot hits purged: 70731
+
- My god, and 64.39.99.15 is from Qualys, the domain scanning security people, who are making queries trying to see if we are vulnerable or something (wtf?)
- Looking in Solr I see a few different IPs with DNS like sn003.s02.iad01.qualys.com. so I will purge their requests too:
-$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 3 hits from 130.255.161.231 in statistics
Purging 16773 hits from 64.39.99.15 in statistics
Purging 6976 hits from 64.39.99.13 in statistics
Purging 13 hits from 64.39.99.63 in statistics
Purging 12 hits from 64.39.99.65 in statistics
Purging 12 hits from 64.39.99.94 in statistics
-
-Total number of bot hits purged: 23789
-
2021-02-17
+
+Total number of bot hits purged: 23789
+2021-02-17
- I tested Node.js 10 vs 12 on CGSpace (linode18) and DSpace Test (linode26) and the build times were surprising
@@ -627,11 +627,11 @@ Total number of bot hits purged: 23789
- Abenet asked me to add Tom Randolph’s ORCID identifier to CGSpace
- I also tagged all his 247 existing items on CGSpace:
-$ cat 2021-02-17-add-tom-orcid.csv
+$ cat 2021-02-17-add-tom-orcid.csv
dc.contributor.author,cg.creator.id
-"Randolph, Thomas F.","Thomas Fitz Randolph: 0000-0003-1849-9877"
-$ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p 'fuuu'
-
2021-02-20
+"Randolph, Thomas F.","Thomas Fitz Randolph: 0000-0003-1849-9877"
+$ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p 'fuuu'
+2021-02-20
- Test the CG Core v2 migration on DSpace Test (linode26) one last time
@@ -640,17 +640,17 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace
- Start the CG Core v2 migration on CGSpace (linode18)
- After deploying the latest 6_x-prod branch and running migrate-fields.sh I started a full Discovery reindex:
-$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
-
-real 311m12.617s
+$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real 311m12.617s
user 217m3.102s
sys 2m37.363s
-
+
- Then update OAI:
-$ dspace oai import -c
-$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
-
+$ dspace oai import -c
+$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
+
- Ben Hack was asking if there is a REST API query that will give him all ILRI outputs for their new Sharepoint intranet
- I told him he can try something like this if it’s just the ILRI articles in journals collection:
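- A sketch of that kind of DSpace 6 REST API query, with a placeholder collection UUID (the real ID would come from the /rest/collections listing):

```console
$ curl -s 'https://cgspace.cgiar.org/rest/collections/{uuid}/items?limit=100&offset=0' -H 'Accept: application/json'
```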
@@ -668,16 +668,16 @@ $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
-
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
+$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
-
+
- The process took an hour or so!
- I added colorized output to the csv-metadata-quality tool and tagged version 0.4.4 on GitHub
- I updated the fields in AReS Explorer and then removed the old temp index so I can start a fresh re-harvest of CGSpace:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
-
2021-02-22
+2021-02-22
- Start looking at splitting the series name and number in dcterms.isPartOf now that we have migrated to CG Core v2
@@ -687,43 +687,43 @@ $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
-localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
+localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
UPDATE 104
-
+
- As for splitting the other values, I think I can export the dspace_object_id and text_value and then upload it as a CSV rather than writing a Python script to create the new metadata values
2021-02-22
- Check the results of the AReS harvesting from last night:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 101380,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 101380,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- Set the current items index to read only and make a backup:
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
-
+
- Delete the current items index and clone the temp one to it:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-
+
- Then delete the temp and backup:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-{"acknowledged":true}%
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
-
2021-02-23
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+{"acknowledged":true}%
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
+
2021-02-23
- CodeObia sent a pull request for clickable countries on AReS
@@ -732,22 +732,22 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
- Remove semicolons from series names without numbers:
-dspace=# BEGIN;
-dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
+dspace=# BEGIN;
+dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
UPDATE 104
dspace=# COMMIT;
-
+
- Set all text_lang values on CGSpace to en_US to make the series replacements easier (this didn’t work, read below):
-dspace=# BEGIN;
-dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item);
+dspace=# BEGIN;
+dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 911
cgspace=# COMMIT;
-
+
- Then export all series with their IDs to CSV:
-dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
-
+dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
+
- In OpenRefine I trimmed and consolidated whitespace, then made some quick cleanups to normalize the fields based on a sanity check
- For example many Spore items are like “Spore, Spore 23”
@@ -761,23 +761,23 @@ cgspace=# COMMIT;
-
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
+dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
UPDATE 1
-
+
- This also seems to work, using the id for just that one item:
-dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
+dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
UPDATE 37
-
+
- This seems to work better for some reason:
-dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
+dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 18659
-
+
- I split the CSV file in batches of 5,000 using xsv (see the sketch after the import command below), then imported them one by one in CGSpace:
-$ dspace metadata-import -f /tmp/0.csv
-
+$ dspace metadata-import -f /tmp/0.csv
+
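- The splitting step itself would be something like this with xsv; the output directory and input path are assumptions, and xsv names each chunk after its starting record number, which is where files like 0.csv come from:

```console
$ mkdir -p /tmp/batches
$ xsv split -s 5000 /tmp/batches /tmp/2021-02-23-series.csv
$ ls /tmp/batches
0.csv  5000.csv  10000.csv  15000.csv
```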
- It took FOREVER to import each file… like several hours each. MY GOD DSpace 6 is slow.
- Help Dominique Perera debug some issues with the WordPress DSpace importer plugin from Macaroni Bros
@@ -785,40 +785,40 @@ UPDATE 18659
-
104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT"
-104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT"
-
+104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT"
+104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT"
+
- The first request is OK, but the second one is malformed for sure
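- The second request was probably meant to be the sub-communities endpoint for a specific community, something like this sketch (the UUID is a placeholder):

```console
$ curl -s 'https://cgspace.cgiar.org/rest/communities/{uuid}/communities' -H 'Accept: application/json'
```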
2021-02-24
- Export a list of journals for Peter to look through:
-
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.journal", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.journal", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 3345
-
+
- Start a fresh harvesting on AReS because Udana mapped some items today and wants to include them in his report:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
-
+
- Also, I want to include the new series name/number cleanups so it’s not a total waste of time
2021-02-25
- Hmm the AReS harvest last night seems to have finished successfully, but the number of items is less than I was expecting:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 99546,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 99546,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- The current items index has 101380 items… I wonder what happened
- I started a new indexing
@@ -843,9 +843,9 @@ COPY 3345
-value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,"")
-value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1")
-
+value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,"")
+value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1")
+
- This value.partition was new to me… and it took me a bit of time to figure out whether I needed to escape the parentheses in the issue number or not (no) and how to reference a capture group with value.replace
- I tried to check the 1095 CIFOR records from last week for duplicates on DSpace Test, but the page says “Processing” and never loads
@@ -857,7 +857,7 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1")
- Niroshini from IWMI is still having issues adding WLE subjects to items during the metadata review step in the workflow
- It seems the BatchEditConsumer log spam is gone since I applied Atmire’s patch
-$ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]*
+$ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]*
dspace.log.2021-02-10:5067
dspace.log.2021-02-11:2647
dspace.log.2021-02-12:4231
@@ -877,7 +877,7 @@ dspace.log.2021-02-25:0
dspace.log.2021-02-26:0
dspace.log.2021-02-27:0
dspace.log.2021-02-28:0
-
+
diff --git a/docs/2021-03/index.html b/docs/2021-03/index.html
index 198b1a6eb..fb89d7478 100644
--- a/docs/2021-03/index.html
+++ b/docs/2021-03/index.html
@@ -34,7 +34,7 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
"/>
-
+
@@ -163,19 +163,19 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
- I looked at the number of connections in PostgreSQL and it’s definitely high again:
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1020
-
+
- I reported it to Atmire to take a look, on the same issue we had been tracking this before
- Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
- I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my add-orcid-identifier.py script:
-$ cat 2021-03-04-add-zoe-campbell-orcid.csv
+$ cat 2021-03-04-add-zoe-campbell-orcid.csv
dc.contributor.author,cg.creator.identifier
-"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
-"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
-$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'
-
+"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
+"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
+$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'
+
- I still need to do cleanup on the journal articles metadata
- Peter sent me some cleanups but I can’t use them in the search/replace format he gave
@@ -183,9 +183,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -
-localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087
-
+
- I used OpenRefine to remove all journal values that didn’t have one of these values: ; ( )
- Then I cloned the cg.journal field to cg.volume and cg.issue
@@ -193,10 +193,10 @@ COPY 32087
-value.partition(';')[0].trim() # to get journal names
-value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
-value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
-
+value.partition(';')[0].trim() # to get journal names
+value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
+value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
+
- Then I uploaded the changes to CGSpace using
dspace metadata-import
- Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
@@ -233,14 +233,14 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") #
- I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:
-
$ docker-compose -f docker/docker-compose.yml down
+$ docker-compose -f docker/docker-compose.yml down
$ docker volume create docker_esData_7
$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
$ docker rm es_dummy
# edit docker/docker-compose.yml to switch from bind mount to volume
$ docker-compose -f docker/docker-compose.yml up -d
-
+
- The trick is that when you create a volume like “myvolume” from a docker-compose.yml file, Docker will create it with the name “docker_myvolume”
- If you create it manually on the command line with docker volume create myvolume then the name is literally “myvolume”
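- A minimal sketch of the naming difference (the volume names here are only for illustration):

```console
# a volume declared in docker/docker-compose.yml gets the Compose project prefix
$ docker volume ls
DRIVER    VOLUME NAME
local     docker_esData_7
# one created manually on the command line keeps its literal name
$ docker volume create esData_7
esData_7
$ docker volume ls
DRIVER    VOLUME NAME
local     docker_esData_7
local     esData_7
```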
@@ -249,39 +249,39 @@ $ docker-compose -f docker/docker-compose.yml up -d
- I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
- Delete the openrxv-items-temp index to test a fresh harvesting:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-
2021-03-05
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+
2021-03-05
- Check the results of the AReS harvesting from last night:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 101761,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 101761,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- Set the current items index to read only and make a backup:
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
-
+
- Delete the current items index and clone the temp one to it:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
-
+
- Then delete the temp and backup:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-{"acknowledged":true}%
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+{"acknowledged":true}%
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
+
- I made some pull requests to OpenRXV:
- docker/docker-compose.yml: Use docker volumes
@@ -298,57 +298,57 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
-
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
- "openrxv-items-final": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-final": {
+ "aliases": {
+ "openrxv-items": {}
}
},
-
+
- But on AReS production openrxv-items has somehow become a concrete index:
-$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
- "openrxv-items": {
- "aliases": {}
+ "openrxv-items": {
+ "aliases": {}
},
- "openrxv-items-final": {
- "aliases": {}
+ "openrxv-items-final": {
+ "aliases": {}
},
- "openrxv-items-temp": {
- "aliases": {}
+ "openrxv-items-temp": {
+ "aliases": {}
},
-
+
- I fixed the issue on production by cloning the openrxv-items index to openrxv-items-final, deleting openrxv-items, and then re-creating it as an alias:
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
-$ curl -XDELETE 'http://localhost:9200/openrxv-items'
-$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+
- Delete backups and remove read-only mode on openrxv-items:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+
- Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
- Looking in the logs I see a few IPs making heavy usage on the REST API and XMLUI:
-
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
-
+# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
+
- I see the usual IPs for CCAFS and ILRI importer bots, but also 143.233.242.132 which appears to be for GARDIAN:
-
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
+# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
6237
-# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
+# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
6418
-
+
- They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a “normal” user agent
- Looking in Solr I see they have been using this IP for awhile, as they have 100,000 hits going back into 2020
@@ -375,9 +375,9 @@ $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Typ
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
-
+
- On 2021-03-03 the PostgreSQL transactions started rising:
@@ -409,10 +409,10 @@ $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Typ
-
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS
-
+
- As I saw on my local test instance, even when you cancel a harvesting, it replaces the openrxv-items-final index with whatever is in openrxv-items-temp automatically, so I assume it will do the same now
2021-03-09
@@ -434,8 +434,8 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
-
$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
-
2021-03-10
+$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
+
2021-03-10
- Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses cg.isijournal and MELSpace uses mel.impact-factor
@@ -444,12 +444,12 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
- Peter said he doesn’t see “Source Code” or “Software” in the output type facet on the ILRI community, but I see it on the home page, so I will try to do a full Discovery re-index:
-$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
-
-real 318m20.485s
+$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real 318m20.485s
user 215m15.196s
sys 2m51.529s
-
+
- Now I see ten items for “Source Code” in the facets…
- Add GPL and MIT licenses to the list of licenses on CGSpace input form since we will start capturing more software and source code
- Added the ability to check dcterms.license values against the SPDX licenses in the csv-metadata-quality tool
@@ -467,34 +467,34 @@ sys 2m51.529s
- Switch to linux-kvm kernel on linode20 and linode18:
-# apt update && apt full-upgrade
+# apt update && apt full-upgrade
# apt install linux-kvm
# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
-# apt autoremove && apt autoclean
+# apt autoremove && apt autoclean
# reboot
-
+
- Deploy latest changes from 6_x-prod branch on CGSpace
- Deploy latest changes from OpenRXV master branch on AReS
- Last week Peter added OpenRXV to CGSpace: https://hdl.handle.net/10568/112982
- Back up the current openrxv-items-final index on AReS to start a new harvest:
-$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
-$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-
+$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+
- After the harvesting finished it seems the indexes got messed up again, as openrxv-items is an alias of openrxv-items-temp instead of openrxv-items-final:
-$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
- "openrxv-items-final": {
- "aliases": {}
+ "openrxv-items-final": {
+ "aliases": {}
},
- "openrxv-items-temp": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-temp": {
+ "aliases": {
+ "openrxv-items": {}
}
},
-
+
- Anyways, the number of items in openrxv-items seems OK and the AReS Explorer UI is working fine
- I will have to manually fix the indexes before the next harvesting
@@ -535,54 +535,54 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
- Back up the current openrxv-items-final index to start a fresh AReS harvest:
-$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
-$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-
+$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+
- Then start harvesting in the AReS Explorer admin UI
2021-03-22
- The harvesting on AReS yesterday completed, but somehow I have twice the number of items:
-$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
- "count" : 206204,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 206204,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- Hmmm and even my backup index has a strange number of items:
-$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
{
- "count" : 844,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 844,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- I deleted all indexes and re-created the openrxv-items alias:
-$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
-$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
+$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
- "openrxv-items-temp": {
- "aliases": {}
+ "openrxv-items-temp": {
+ "aliases": {}
},
- "openrxv-items-final": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-final": {
+ "aliases": {
+ "openrxv-items": {}
}
}
-
+
- Then I started a new harvesting
- I switched the Node.js in the Ansible infrastructure scripts to v12 since v10 will cease to be supported soon
@@ -591,26 +591,26 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
- The AReS harvest finally finished, with 1047 pages of items, but the openrxv-items-final index is empty and the openrxv-items-temp index has 103,000 items:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 103162,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 103162,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- I tried to clone the temp index to the final, but got an error:
-$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
-{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}%
-
+$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
+{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}%
+
- I looked in the Docker logs for Elasticsearch and saw a few memory errors:
-
java.lang.OutOfMemoryError: Java heap space
-
+java.lang.OutOfMemoryError: Java heap space
+
- According to /usr/share/elasticsearch/config/jvm.options in the Elasticsearch container the default JVM heap is 1g
- I see the running Java process has -Xms 1g -Xmx 1g in its process invocation so I guess it must indeed be using 1g
@@ -622,20 +622,20 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
-
"openrxv-items-final": {
- "aliases": {}
+ "openrxv-items-final": {
+ "aliases": {}
},
- "openrxv-items-temp": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-temp": {
+ "aliases": {
+ "openrxv-items": {}
}
},
-
2021-03-23
+2021-03-23
- For reference you can also get the Elasticsearch JVM stats from the API:
-$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
-
+$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
+
- I re-deployed AReS with 1.5GB of heap using the ES_JAVA_OPTS environment variable
- It turns out that this is the recommended way to set the heap: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html
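- A sketch of how that could look, assuming the Elasticsearch service in docker/docker-compose.yml picks up the environment variable (the values and the verification grep are illustrative):

```console
# in docker/docker-compose.yml, under the Elasticsearch service:
#   environment:
#     - ES_JAVA_OPTS=-Xms1536m -Xmx1536m
$ docker-compose -f docker/docker-compose.yml up -d
$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool | grep -i heap_max
```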
@@ -644,8 +644,8 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
- Then I fixed the aliases to make sure openrxv-items was an alias of openrxv-items-final, similar to how I did a few weeks ago
- I re-created the temp index:
-$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
-
2021-03-24
+$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
+
2021-03-24
- Atmire responded to the ticket about the Duplicate Checker
@@ -659,35 +659,35 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
-# du -s /home/dspacetest.cgiar.org/solr/statistics
+# du -s /home/dspacetest.cgiar.org/solr/statistics
57861236 /home/dspacetest.cgiar.org/solr/statistics
-
+
- I applied their changes to config/spring/api/atmire-cua-update.xml and started the duplicate processor:
-$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
-$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
-
+$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
+$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
+
- The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)
- Hah, I still got a memory error after only a few minutes:
-
...
+...
Run 1 — 80% — 5,000/6,263 docs — 25s — 6m 31s
Exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
-
+
- I guess we really do have to use -r 100
- Now the thing runs for a few minutes and “finishes”:
-$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
+$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
Loading @mire database changes for module MQM
Changes have been processed
-
-
-*************************
+
+
+*************************
* Update Script Started *
*************************
-
-Run 1
+
+Run 1
Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:17 CET 2021
@@ -752,12 +752,12 @@ Run 1 — 97% — 4,700/4,824 docs — 2s — 5m 49s
Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
Run 1 took 5m 53s
-
-
-**************************
+
+
+**************************
* Update Script Finished *
**************************
-
+
- If I run it again it finds the same 4,824 docs and processes them…
- I asked Atmire for feedback on this: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839
@@ -796,8 +796,8 @@ Run 1 took 5m 53s
-2021-03-29 08:55:40,073 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&wt=javabin&version=2} hits=143 status=0 QTime=0
-
+2021-03-29 08:55:40,073 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&wt=javabin&version=2} hits=143 status=0 QTime=0
+
- But the item mapper only displays ten items, with no pagination
- There is no way to search by handle or ID
@@ -845,9 +845,9 @@ r = requests.
- I exported a list of all our ISSNs from CGSpace:
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
COPY 3081
-
+
- I wrote a script to check the ISSNs against Crossref’s API: crossref-issn-lookup.py
- I suspect Crossref might have better data actually…
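- The script itself isn’t shown here, but the basic idea is to query Crossref’s journals endpoint once per ISSN and record whether it returns HTTP 200 or 404 (the ISSN below is only an example):

```console
# check a single ISSN against Crossref; 200 means registered, 404 means unknown
$ curl -s -o /dev/null -w '%{http_code}\n' 'https://api.crossref.org/journals/0378-4290'
```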
diff --git a/docs/2021-04/index.html b/docs/2021-04/index.html
index 8937baa56..7ab29c856 100644
--- a/docs/2021-04/index.html
+++ b/docs/2021-04/index.html
@@ -44,7 +44,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
"/>
-
+
@@ -54,7 +54,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
"@type": "BlogPosting",
"headline": "April, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-04/",
- "wordCount": "4669",
+ "wordCount": "4668",
"datePublished": "2021-04-01T09:50:54+03:00",
"dateModified": "2021-04-28T18:57:48+03:00",
"author": {
@@ -153,21 +153,21 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
-$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"
-
2021-04-04
+$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"
+
2021-04-04
- Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:
-$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
-
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
+
- Then set the openrxv-items-final index to read-only so we can make a backup:
-
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
-{"acknowledged":true}%
+$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+{"acknowledged":true}%
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
-{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
-$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-
+{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
+$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+
- Then start a harvesting on AReS Explorer
- Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
@@ -181,8 +181,8 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
-$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
-
+$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
+
- For now I only fixed obvious errors like “1234-5678.” and “e-ISSN: 1234-5678” etc, but there are still lots of invalid ones which need more manual work:
- Too few characters
@@ -196,19 +196,19 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
- The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again:
-
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
{
- "openrxv-items-final": {
- "aliases": {}
+ "openrxv-items-final": {
+ "aliases": {}
},
- "openrxv-items-temp": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-temp": {
+ "aliases": {
+ "openrxv-items": {}
}
},
...
}
-
+
- openrxv-items should be an alias of openrxv-items-final, not openrxv-items-temp… I will have to fix that manually
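- A cleaner way to fix it (a sketch of the idea, not what I actually ran at the time) is to swap the alias atomically in a single _aliases call, assuming both indexes exist:

```console
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions": [{"remove": {"index": "openrxv-items-temp", "alias": "openrxv-items"}}, {"add": {"index": "openrxv-items-final", "alias": "openrxv-items"}}]}'
```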
- Enrico asked for more information on the RTB stats I gave him yesterday
@@ -218,16 +218,16 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
-$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
-$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
- sed '1d' | \
- csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
+$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
+$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
+ sed '1d' | \
+ csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
csvgrep -c issued -m 2020 | \
csvcut -c id | \
- sed '1d' | \
+ sed '1d' | \
sort | \
uniq
-
+
- So I remember in the future, this basically does the following:
- Use csvcut to extract the id and all date issued columns from the CSV
@@ -257,17 +257,17 @@ $ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.
- Then I submitted the file three times (changing the page parameter):
-$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
+$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json
-
+
- Then I extracted the views and downloads in the most ridiculous way:
-$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
+$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
30364
-$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
+$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
9100
-
+
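- A tidier way to sum these would be jq (a sketch, assuming each page’s JSON has a statistics array whose objects carry views and downloads fields):

```console
# sum views and downloads across all pages of the Statistics API response
$ jq -s '[.[].statistics[].views] | add' /tmp/page*.json
$ jq -s '[.[].statistics[].downloads] | add' /tmp/page*.json
```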
- Out of curiosity I did the same exercise for items issued in 2019 and got the following:
- Views: 30721
@@ -290,17 +290,17 @@ $ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs |
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12413
-
+
- The system journal shows thousands of these messages; this is the first one:
-Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
-
+Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
+
- Around that time in the dspace log I see nothing unusual, but maybe these?
-
2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
-
+2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
+
- (BTW what is the deal with the “200/127”? I should send a comment to Atmire)
- I filed a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets
@@ -308,17 +308,17 @@ $ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs |
- I restarted the PostgreSQL and Tomcat services and now I see less connections, but still WAY high:
-
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3640
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2968
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
-
+
- After ten minutes or so it went back down…
- And now it’s back up in the thousands… I am seeing a lot of stuff in dspace log like this:
-2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
+2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
@@ -339,7 +339,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971
-
+
- I sent some notes and a log to Atmire on our existing issue about the database stuff
- Also I asked them about the possibility of doing a formal review of Hibernate
@@ -354,17 +354,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
- I had a meeting with Peter and Abenet about CGSpace TODOs
- CGSpace went down again and the PostgreSQL locks are through the roof:
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12154
-
+
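- To see what is actually holding those locks it might help to group them by backend state (a diagnostic sketch for PostgreSQL 9.6 or newer, not something I ran at the time):

```console
$ psql -c 'SELECT psa.state, psa.wait_event_type, COUNT(*) FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid GROUP BY 1, 2 ORDER BY 3 DESC;'
```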
- I don’t see any activity on the REST API, but in the last four hours there have been 3,500 DSpace sessions:
-# grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+# grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
3547
-
+
- I looked at the same time of day for the past few weeks and it seems to be a normal number of sessions:
-# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
+# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
...
3572
4085
@@ -387,10 +387,10 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
599
4463
3547
-
+
- What about total number of sessions per day?
-# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
+# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
...
/home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
11784
@@ -412,7 +412,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
16756
/home/cgspace.cgiar.org/log/dspace.log.2021-04-06:
12343
-
+
- So it’s not the number of sessions… it’s something with the workload…
- I had to step away for an hour or so and when I came back the site was still down and there were still 12,000 locks
@@ -421,13 +421,13 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
- The locks in PostgreSQL shot up again…
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3447
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3527
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
4582
-
+
- I don’t know what the hell is going on, but the PostgreSQL connections and locks are way higher than ever before:
@@ -440,9 +440,9 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
- While looking at the nginx logs I see that MEL is trying to log into CGSpace’s REST API and delete items:
-34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
-34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] "DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1" 401 704 "-" "-"
-
+34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
+34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] "DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1" 401 704 "-" "-"
+
- I see a few of these per day going back several months
- I sent a message to Salem and Enrico to ask if they know
@@ -450,13 +450,13 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
- Also annoying, I see tons of what look like penetration testing requests from Qualys:
-
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
-2021-04-04 06:35:17,889 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
-2021-04-04 06:35:17,890 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
+2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
+2021-04-04 06:35:17,889 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
+2021-04-04 06:35:17,890 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
2021-04-04 06:35:18,145 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
2021-04-04 06:35:18,519 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
2021-04-04 06:35:18,520 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com
-
+
- I deleted the ilri/AReS repository on GitHub since we haven’t updated it in two years
- All development is happening in https://github.com/ilri/openRXV now
@@ -464,27 +464,27 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
- 10PM and the server is down again, with locks through the roof:
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12198
-
+
- I see that there are tons of PostgreSQL connections getting abandoned today, compared to very few in the past few weeks:
-$ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
+$ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
1838
-$ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
+$ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
3
-
+
- I even restarted the server and connections were low for a few minutes until they shot back up:
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
8651
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
8940
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10504
-
+
- I had to go to bed and I bet it will crash and be down for hours until I wake up…
- What the hell is this user agent?
@@ -493,9 +493,9 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
- CGSpace was still down from last night of course, with tons of database locks:
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12168
-
+
- I restarted the server again and the locks came back
- Atmire responded to the message from yesterday
@@ -504,8 +504,8 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
-2021-04-01 12:45:11,414 WARN org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon; Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address
-
+2021-04-01 12:45:11,414 WARN org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon; Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address
+
- The issue is not the named user above, but a member of the group…
- And the group does have users with invalid email addresses (probably accounts created automatically after authenticating with LDAP):
@@ -513,7 +513,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
- I extracted all the group IDs from recent logs that had users with invalid email addresses:
-
$ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
+$ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6
1769137c-36d4-42b2-8fec-60585e110db7
203c8614-8a97-4ac8-9686-d9d62cb52acc
@@ -557,7 +557,7 @@ ede59734-adac-4c01-8691-b45f19088d37
f88bd6bb-f93f-41cb-872f-ff26f6237068
f985f5fb-be5c-430b-a8f1-cf86ae4fc49a
fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
-
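- To list the members of one of those groups and eyeball their email addresses, something like this should work (a sketch, assuming the standard DSpace 6 eperson and epersongroup2eperson tables; the group ID is the first one from the list above):

```console
localhost/dspace63= > SELECT email FROM eperson WHERE uuid IN (SELECT eperson_id FROM epersongroup2eperson WHERE eperson_group_id = '0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6');
```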
2021-04-08
+2021-04-08
- I can’t believe it but the server has been down for twelve hours or so
@@ -565,26 +565,26 @@ fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12070
-
+
- I restarted PostgreSQL and Tomcat and the locks go straight back up!
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
986
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1194
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1212
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1489
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2124
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
5934
-
2021-04-09
+2021-04-09
- Atmire managed to get CGSpace back up by killing all the PostgreSQL connections yesterday
@@ -608,46 +608,46 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup
-$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
-
+$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+
- Then I updated all Docker containers and rebooted the server (linode20) so that the correct indexes would be created again:
-$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
-
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+
- Then I realized I have to clone the backup index directly to openrxv-items-final, and re-create the openrxv-items alias:
-
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
-$ curl -X PUT "localhost:9200/openrxv-items-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+$ curl -X PUT "localhost:9200/openrxv-items-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-backup/_clone/openrxv-items-final
-$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
-
+$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+
- Now I see both openrxv-items-final and openrxv-items have the current number of items:
-$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
{
- "count" : 103373,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 103373,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
- "count" : 103373,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 103373,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- Then I started a fresh harvesting in the AReS Explorer admin dashboard
2021-04-12
@@ -672,24 +672,24 @@ $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
- 13,000 requests in the last two months from a user with user agent SomeRandomText, for example:
-84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] "GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"
-
+84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] "GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"
+
- I purged them:
-
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
+$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 13159 hits from SomeRandomText in statistics
-
-Total number of bot hits purged: 13159
-
+
+Total number of bot hits purged: 13159
+
- I noticed there were 78 items submitted in the hour before CGSpace crashed:
-# grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item
+# grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item
78
-
+
- Of those 78, 77 were from Udana
- Compared to other mornings (0 to 9 AM) this month that seems to be pretty high:
-# for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
+# for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
add_item; done
32
0
@@ -704,7 +704,7 @@ Total number of bot hits purged: 13159
1
1
2
-
2021-04-15
+2021-04-15
- Release v1.4.2 of the DSpace Statistics API on GitHub: https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.2
@@ -723,8 +723,8 @@ Total number of bot hits purged: 13159
- Create a test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:
-$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
-
+$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
+
- I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
- According to my notes from 2020-10 the account must be in the admin group in order to submit via the REST API
@@ -735,12 +735,12 @@ Total number of bot hits purged: 13159
- Update all containers on AReS (linode20):
-
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
-
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+
- Then run all system updates and reboot the server
- I learned a new command for Elasticsearch:
-
$ curl http://localhost:9200/_cat/indices
+$ curl http://localhost:9200/_cat/indices
yellow open openrxv-values ChyhGwMDQpevJtlNWO1vcw 1 1 1579 0 537.6kb 537.6kb
yellow open openrxv-items-temp PhV5ieuxQsyftByvCxzSIw 1 1 103585 104372 482.7mb 482.7mb
yellow open openrxv-shared J_8cxIz6QL6XTRZct7UBBQ 1 1 127 0 115.7kb 115.7kb
@@ -751,46 +751,46 @@ green open .apm-agent-configuration f3RAkSEBRGaxJZs3ePVxsA 1 0 0 0
yellow open openrxv-items-final sgk-s8O-RZKdcLRoWt3G8A 1 1 970 0 2.3mb 2.3mb
green open .kibana_1 HHPN7RD_T7qe0zDj4rauQw 1 0 25 7 36.8kb 36.8kb
yellow open users M0t2LaZhSm2NrF5xb64dnw 1 1 2 0 11.6kb 11.6kb
-
+
- Somehow the openrxv-items-final index only has a few items and the majority are in openrxv-items-temp, via the openrxv-items alias (which is in the temp index):
-$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
{
- "count" : 103585,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 103585,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- I found a cool tool to help with exporting and restoring Elasticsearch indexes:
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
...
Sun, 18 Apr 2021 06:27:07 GMT | Total Writes: 103585
Sun, 18 Apr 2021 06:27:07 GMT | dump complete
-
+
- It took only two or three minutes to export everything…
- I did a test to restore the index:
-$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
-$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-test --limit 1000 --type=data
-
+$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
+$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-test --limit 1000 --type=data
+
- So that’s pretty cool!
- I deleted the openrxv-items-final and openrxv-items-temp indexes, then restored the mappings to openrxv-items-final, added the openrxv-items alias, and started restoring the data to openrxv-items with elasticdump:
-
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
-$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
-$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
-$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
+$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
+
- AReS seems to be working fine after that, so I created the openrxv-items-temp index and then started a fresh harvest on AReS Explorer:
-
$ curl -X PUT "localhost:9200/openrxv-items-temp"
-
+$ curl -X PUT "localhost:9200/openrxv-items-temp"
+
- Run system updates on CGSpace (linode18) and run the latest Ansible infrastructure playbook to update the DSpace Statistics API, PostgreSQL JDBC driver, etc, and then reboot the system
- I wasted a bit of time trying to get TSLint and then ESLint running for OpenRXV on GitHub Actions
@@ -798,35 +798,35 @@ $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localh
- The AReS harvesting last night seems to have completed successfully, but the number of results is strange:
-
$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp kNUlupUyS_i7vlBGiuVxwg 1 1 103741 105553 483.6mb 483.6mb
yellow open openrxv-items-final HFc3uytTRq2GPpn13vkbmg 1 1 970 0 2.3mb 2.3mb
-
+
- The indices endpoint doesn’t include the openrxv-items alias, but it is currently in the openrxv-items-temp index so the number of items is the same:
-$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
{
- "count" : 103741,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 103741,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
- A user was having problems resetting their password on CGSpace, with some message about SMTP etc
- I checked and we are indeed locked out of our mailbox:
-$ dspace test-email
+$ dspace test-email
...
Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 550 5.2.1 Mailbox cannot be accessed [PR0P264CA0280.FRAP264.PROD.OUTLOOK.COM]
)
-
+
- I have to write to ICT…
- I decided to switch back to the G1GC garbage collector on DSpace Test
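- That is essentially a matter of swapping the collector flag in Tomcat’s JAVA_OPTS, roughly like this (a sketch; the heap sizes are illustrative):

```console
JAVA_OPTS="-Xms4096m -Xmx4096m -XX:+UseG1GC -Dfile.encoding=UTF-8"
```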
@@ -869,46 +869,46 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
-$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
-$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
-$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
-
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
+$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
+
- Then I started a fresh AReS harvest
2021-04-26
- The AReS harvest last night seems to have finished successfully and the number of items looks good:
-
$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 0 0 283b 283b
yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb 254mb
-
+
- And the aliases seem correct for once:
-$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
- "openrxv-items-final": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-final": {
+ "aliases": {
+ "openrxv-items": {}
}
},
- "openrxv-items-temp": {
- "aliases": {}
+ "openrxv-items-temp": {
+ "aliases": {}
},
...
-
+
- That’s 250 new items in the index since the last harvest!
- Re-create my local Artifactory container because I’m getting errors starting it and it has been a few months since it was updated:
-$ podman rm artifactory
+$ podman rm artifactory
$ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
-$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
+$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
$ podman start artifactory
-
+
- Start testing DSpace 7.0 Beta 5 so I can evaluate if it solves some of the problems we are having on DSpace 6, and if it’s missing things like multiple handle resolvers, etc
- I see it needs Java JDK 11, Tomcat 9, Solr 8, and PostgreSQL 11
@@ -925,13 +925,13 @@ $ podman start artifactory
- I tried to delete all the Atmire SQL migrations:
-localhost/dspace7b5= > DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
-
+localhost/dspace7b5= > DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
+
- But I got an error when running dspace database migrate:
-
$ ~/dspace7b5/bin/dspace database migrate
-
-Database URL: jdbc:postgresql://localhost:5432/dspace7b5
+$ ~/dspace7b5/bin/dspace database migrate
+
+Database URL: jdbc:postgresql://localhost:5432/dspace7b5
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
@@ -949,8 +949,8 @@ Caused by: org.flywaydb.core.api.FlywayException: Validate failed:
Detected applied migration not resolved locally: 5.0.2017.09.25
Detected applied migration not resolved locally: 6.0.2017.01.30
Detected applied migration not resolved locally: 6.0.2017.09.25
-
- at org.flywaydb.core.Flyway.doValidate(Flyway.java:292)
+
+ at org.flywaydb.core.Flyway.doValidate(Flyway.java:292)
at org.flywaydb.core.Flyway.access$100(Flyway.java:73)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:166)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:158)
@@ -958,14 +958,14 @@ Detected applied migration not resolved locally: 6.0.2017.09.25
at org.flywaydb.core.Flyway.migrate(Flyway.java:158)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:729)
... 9 more
-
+
- I deleted those migrations:
-localhost/dspace7b5= > DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
-
+localhost/dspace7b5= > DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
+
- Then when I ran the migration again it failed for a new reason, related to the configurable workflow:
-
Database URL: jdbc:postgresql://localhost:5432/dspace7b5
+Database URL: jdbc:postgresql://localhost:5432/dspace7b5
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
@@ -984,24 +984,24 @@ Migration V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql failed
--------------------------------------------------------------------
SQL State : 42P01
Error Code : 0
-Message : ERROR: relation "cwf_pooltask" does not exist
+Message : ERROR: relation "cwf_pooltask" does not exist
Position: 8
Location : org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql (/home/aorth/src/apache-tomcat-9.0.45/file:/home/aorth/dspace7b5/lib/dspace-api-7.0-beta5.jar!/org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql)
Line : 16
-Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflow_id='default'
+Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflow_id='default'
...
-
+
- The DSpace 7 upgrade docs say I need to apply these previously optional migrations:
-$ ~/dspace7b5/bin/dspace database migrate ignored
-
+$ ~/dspace7b5/bin/dspace database migrate ignored
+
- Now I see all migrations have completed and DSpace actually starts up fine!
- I will try to do a full re-index to see how long it takes:
-
$ time ~/dspace7b5/bin/dspace index-discovery -b
+$ time ~/dspace7b5/bin/dspace index-discovery -b
...
~/dspace7b5/bin/dspace index-discovery -b 25156.71s user 64.22s system 97% cpu 7:11:09.94 total
-
+
- Not good, that shit took almost seven hours!
2021-04-27
@@ -1012,9 +1012,9 @@ Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflo
-$ csvgrep -e 'windows-1252' -c 'Handle.net IDs' -i -m '10568/' ~/Downloads/Altmetric\ -\ Research\ Outputs\ -\ CGSpace\ -\ 2021-04-26.csv | csvcut -c DOI | sed '1d' > /tmp/dois.txt
-$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.csv -db dspace63 -u dspace -p 'fuuu' -d
-
+$ csvgrep -e 'windows-1252' -c 'Handle.net IDs' -i -m '10568/' ~/Downloads/Altmetric\ -\ Research\ Outputs\ -\ CGSpace\ -\ 2021-04-26.csv | csvcut -c DOI | sed '1d' > /tmp/dois.txt
+$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.csv -db dspace63 -u dspace -p 'fuuu' -d
+
- He will Tweet them…
2021-04-28
diff --git a/docs/2021-05/index.html b/docs/2021-05/index.html
index 4d5045a07..71b1a3aed 100644
--- a/docs/2021-05/index.html
+++ b/docs/2021-05/index.html
@@ -36,7 +36,7 @@ I looked at the top user agents and IPs in the Solr statistics for last month an
I will add the RI/1.0 pattern to our DSpace agent overrides and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one… as that’s an actual user…
"/>
-
+
@@ -147,17 +147,17 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
-
193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
-193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata-21%2B21*01 HTTP/1.1" 200 458201 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
-193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'||lower('')||' HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
-193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'%2Brtrim('')%2B' HTTP/1.1" 200 458209 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
-
+193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
+193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata-21%2B21*01 HTTP/1.1" 200 458201 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
+193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'||lower('')||' HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
+193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'%2Brtrim('')%2B' HTTP/1.1" 200 458209 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
+
- I will report the IP on abuseipdb.com and purge their hits from Solr
- The second IP is in Colombia and is making thousands of requests for what looks like some test site:
-
181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
-181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
-
+181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
+181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
+
- But this site does not exist (yet?)
- I will purge them from Solr
@@ -165,46 +165,46 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
- The third IP is in Russia apparently, and the user agent has the pl-PL locale with thousands of requests like this:
-
45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] "GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&isAllowed=y HTTP/1.1" 200 918998 "http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15"
-
+45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] "GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&isAllowed=y HTTP/1.1" 200 918998 "http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15"
+
- I will purge these all with my check-spider-ip-hits.sh script:
-
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 21648 hits from 193.169.254.178 in statistics
Purging 20323 hits from 181.62.166.177 in statistics
Purging 19376 hits from 45.146.166.180 in statistics
-
-Total number of bot hits purged: 61347
-
2021-05-02
+
+Total number of bot hits purged: 61347
+2021-05-02
- Check the AReS Harvester indexes:
-$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 0 0 283b 283b
yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb 254mb
-$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
- "openrxv-items-temp": {
- "aliases": {}
+ "openrxv-items-temp": {
+ "aliases": {}
},
- "openrxv-items-final": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-final": {
+ "aliases": {
+ "openrxv-items": {}
}
},
-
+
- I think they look OK (openrxv-items is an alias of openrxv-items-final), but I took a backup just in case:
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
-
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
+
- Then I started an indexing in the AReS Explorer admin dashboard
- The indexing finished, but it looks like the aliases are messed up again:
-
$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
-
2021-05-05
+2021-05-05
- Peter noticed that we no longer display cg.link.reference on the item view
@@ -229,9 +229,9 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
-$ time ~/dspace64/bin/dspace index-discovery -b
+$ time ~/dspace64/bin/dspace index-discovery -b
~/dspace64/bin/dspace index-discovery -b 4053.24s user 53.17s system 38% cpu 2:58:53.83 total
-
+
- Nope! Still slow, and still no mapped item…
- I even tried unmapping it from all collections, and adding it to a single new owning collection…
@@ -244,53 +244,53 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
- The indexes on AReS Explorer are messed up after last week’s harvesting:
-$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
-
-$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
+
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
- "openrxv-items-final": {
- "aliases": {}
+ "openrxv-items-final": {
+ "aliases": {}
},
- "openrxv-items-temp": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-temp": {
+ "aliases": {
+ "openrxv-items": {}
}
}
-
+
- openrxv-items should be an alias of openrxv-items-final…
- I made a backup of the temp index and then started indexing on the AReS Explorer admin dashboard:
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
-$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
-
2021-05-10
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
+2021-05-10
- Amazing, the harvesting on AReS finished but it messed up all the indexes and now there are no items in any index!
-$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp 8thRX0WVRUeAzmd2hkG6TA 1 1 0 0 283b 283b
yellow open openrxv-items-temp-backup _0tyvctBTg2pjOlcoVP1LA 1 1 104165 20134 305.5mb 305.5mb
yellow open openrxv-items-final BtvV9kwVQ3yBYCZvJS1QyQ 1 1 0 0 283b 283b
-
+
- I fixed the indexes manually by re-creating them and cloning from the backup:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
-$ curl -X PUT "localhost:9200/openrxv-items-temp-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+$ curl -X PUT "localhost:9200/openrxv-items-temp-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp-backup/_clone/openrxv-items-final
-$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp-backup'
-
+$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp-backup'
+
- Also I ran all updates on the server and updated all Docker images, then rebooted the server (linode20):
-$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
-
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+
- I backed up the AReS Elasticsearch data using elasticdump, then started a new harvest:
-
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
-
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
+
- Discuss CGSpace statistics with the CIP team
- They were wondering why their numbers for 2020 were so low
@@ -329,10 +329,10 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
- I checked the CLARISA list against ROR’s April, 2021 release (“Version 9”, on figshare, though it is version 8 in the dump):
-
$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
-$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
+$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
+$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
1770
-
+
- With 1770 out of 6230 matched, that’s 28.4%…
- I sent an email to Hector Tobon to point out the issues in CLARISA again and ask him to chat
- Meeting with GARDIAN developers about CG Core and how GARDIAN works
@@ -341,11 +341,11 @@ $ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
- Fix a few thousand IWMI URLs that are using HTTP instead of HTTPS on CGSpace:
-
localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
+localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
UPDATE 1132
-localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://publications.iwmi.org','https://publications.iwmi.org', 'g') WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;
+localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://publications.iwmi.org','https://publications.iwmi.org', 'g') WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;
UPDATE 1803
-
+
- In the case of the latter, the HTTP links don’t even work! The web server returns HTTP 404 unless the request is HTTPS
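- To double-check that nothing was missed, a count of any remaining plain-HTTP IWMI URLs should come back as zero (a hedged sketch rather than a query from my notes, using the same metadata_field_id as above):
localhost/dspace63= > SELECT COUNT(*) FROM metadatavalue WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
localhost/dspace63= > SELECT COUNT(*) FROM metadatavalue WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;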
- IWMI also says that their subjects are a subset of AGROVOC, so they no longer want to use cg.subject.iwmi for their subjects
@@ -367,67 +367,67 @@ UPDATE 1803
- I have to fix the Elasticsearch indexes on AReS after last week’s harvesting because, as always, the openrxv-items index should be an alias of openrxv-items-final instead of openrxv-items-temp (see the alias-swap sketch after the output below):
-
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
- "openrxv-items-final": {
- "aliases": {}
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
+ "openrxv-items-final": {
+ "aliases": {}
},
- "openrxv-items-temp": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-temp": {
+ "aliases": {
+ "openrxv-items": {}
}
},
...
-
+
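- For reference, a single _aliases call can remove the alias from openrxv-items-temp and add it to openrxv-items-final at the same time (a sketch of the standard Elasticsearch API, not necessarily the exact command I ran):
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"remove" : { "index" : "openrxv-items-temp", "alias" : "openrxv-items"}}, {"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'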
- I took a backup of the openrxv-items index with elasticdump so I can re-create them manually before starting a new harvest tomorrow:
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
-
2021-05-16
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
+
2021-05-16
- I deleted and re-created the Elasticsearch indexes on AReS:
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
-$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-$ curl -XPUT 'http://localhost:9200/openrxv-items-final'
-$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
-$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
-
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+$ curl -XPUT 'http://localhost:9200/openrxv-items-final'
+$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
+$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+
- Then I re-imported the backup that I created with elasticdump yesterday:
-
$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
-$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
-
+$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
+$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
+
- Then I started a new harvest on AReS
2021-05-17
- The AReS harvest finished and the Elasticsearch indexes seem OK so I shouldn’t have to fix them next time…
-
$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 0 0 283b 283b
yellow open openrxv-items-final TrJ1Ict3QZ-vFkj-4VcAzw 1 1 104317 0 259.4mb 259.4mb
-$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
- "openrxv-items-temp": {
- "aliases": {}
+$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
+ "openrxv-items-temp": {
+ "aliases": {}
},
- "openrxv-items-final": {
- "aliases": {
- "openrxv-items": {}
+ "openrxv-items-final": {
+ "aliases": {
+ "openrxv-items": {}
}
},
...
-
+
- Abenet said she and some others can’t log into CGSpace
- I tried to check the CGSpace LDAP account and it does seem to be not working:
-$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap@cgiarad.org" -W "(sAMAccountName=aorth)"
+$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap@cgiarad.org" -W "(sAMAccountName=aorth)"
Enter LDAP Password:
ldap_bind: Invalid credentials (49)
additional info: 80090308: LdapErr: DSID-0C090453, comment: AcceptSecurityContext error, data 532, v3839
-
+
- I sent a message to Biruk so he can check the LDAP account
- IWMI confirmed that they do indeed want to move all their subjects to AGROVOC, so I made the changes in the XMLUI and config (#467)
@@ -446,14 +446,14 @@ ldap_bind: Invalid credentials (49)
-$ xmllint --xpath '//value-pairs[@value-pairs-name="ccafsprojectpii"]/pair/stored-value/node()' dspace/config/input-forms.xml
-
+$ xmllint --xpath '//value-pairs[@value-pairs-name="ccafsprojectpii"]/pair/stored-value/node()' dspace/config/input-forms.xml
+
- I formatted the input file with tidy, especially because one of the new project tags has an ampersand character… grrr:
-
$ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml
-line 3658 column 26 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
-line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
-
+$ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml
+line 3658 column 26 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
+line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
+
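- For illustration, escaping the ampersand as &amp; in the value-pairs is what makes the XML valid (hypothetical snippet; the real pair has the full project name around it):
<pair>
  <displayed-value>...&amp;WA_EU-IFAD...</displayed-value>
  <stored-value>...&amp;WA_EU-IFAD...</stored-value>
</pair>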
- After testing whether this escaped value worked during submission, I created and merged a pull request to 6_x-prod (#468)
2021-05-18
@@ -461,34 +461,34 @@ line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_E
- Paola from the Alliance emailed me some new ORCID identifiers to add to CGSpace
- I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:
-$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
+$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
-
+
- I sorted the names and added the XML formatting in vim, then ran it through tidy:
-$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
-
+$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
+
- Tag fifty-five items from the Alliance’s new authors with ORCID iDs using add-orcid-identifiers-csv.py:
-
$ cat 2021-05-18-add-orcids.csv
+$ cat 2021-05-18-add-orcids.csv
dc.contributor.author,cg.creator.identifier
-"Urioste Daza, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
-"Urioste, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
-"Villegas, Daniel",Daniel M. Villegas: 0000-0001-6801-3332
-"Villegas, Daniel M.",Daniel M. Villegas: 0000-0001-6801-3332
-"Giles, James",James Giles: 0000-0003-1899-9206
-"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
-"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
-"Simbare, A.",Alice Simbare: 0000-0003-2389-0969
-"Dita Rodriguez, Miguel",Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
-"Templer, Noel",Noel Templer: 0000-0002-3201-9043
-"Jalonen, R.",Riina Jalonen: 0000-0003-1669-9138
-"Jalonen, Riina",Riina Jalonen: 0000-0003-1669-9138
-"Izquierdo, Paulo",Paulo Izquierdo: 0000-0002-2153-0655
-"Reyes, Byron",Byron Reyes: 0000-0003-2672-9636
-"Reyes, Byron A.",Byron Reyes: 0000-0003-2672-9636
-$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
-
+"Urioste Daza, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
+"Urioste, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
+"Villegas, Daniel",Daniel M. Villegas: 0000-0001-6801-3332
+"Villegas, Daniel M.",Daniel M. Villegas: 0000-0001-6801-3332
+"Giles, James",James Giles: 0000-0003-1899-9206
+"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
+"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
+"Simbare, A.",Alice Simbare: 0000-0003-2389-0969
+"Dita Rodriguez, Miguel",Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
+"Templer, Noel",Noel Templer: 0000-0002-3201-9043
+"Jalonen, R.",Riina Jalonen: 0000-0003-1669-9138
+"Jalonen, Riina",Riina Jalonen: 0000-0003-1669-9138
+"Izquierdo, Paulo",Paulo Izquierdo: 0000-0002-2153-0655
+"Reyes, Byron",Byron Reyes: 0000-0003-2672-9636
+"Reyes, Byron A.",Byron Reyes: 0000-0003-2672-9636
+$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
+
- I deployed the latest 6_x-prod branch on CGSpace, ran all system updates, and rebooted the server
- This included the IWMI changes, so I also migrated the cg.subject.iwmi metadata to dcterms.subject and deleted the subject term
@@ -504,9 +504,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspa
-dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
+dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 47405
-
+
- That’s interesting because we lowercased them all a few months ago, so these must all be new… wow
- We have 405,000 total AGROVOC terms, with 20,600 of them being unique
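- For reference, both of those numbers come from the same field, something like (a hedged sketch; 187 is dcterms.subject as above):
localhost/dspace63= > SELECT COUNT(text_value), COUNT(DISTINCT text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187;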
@@ -518,12 +518,12 @@ UPDATE 47405
- Export the top 5,000 AGROVOC terms to validate them:
-
localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
COPY 5000
-$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
+$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
-$ csvgrep -c "number of matches" -r '^0$' /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
-
+$ csvgrep -c "number of matches" -r '^0$' /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
+
- Meeting with Medha and Pythagoras about the FAIR Workflow tool
- Discussed the need for such a tool, other tools being developed, etc
@@ -545,54 +545,54 @@ $ csvgrep -c "number of matches" -r '^0$' /tmp/2021-05-20-agrovoc-resu
- Add ORCID identifiers for missing ILRI authors and tag 550 others based on a few authors I noticed that were missing them:
-
$ cat 2021-05-24-add-orcids.csv
+$ cat 2021-05-24-add-orcids.csv
dc.contributor.author,cg.creator.identifier
-"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
-"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
-"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
-"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
-"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
-"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
-"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
-"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
-"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
-"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
-"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
-$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
-
+"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
+"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
+"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
+"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
+"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
+"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
+"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
+"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
+"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
+"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
+"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
+$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
+
- A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
-$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
-
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
+$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
+
- The indexes look OK so I started a harvesting on AReS
2021-05-25
- The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvesting:
-
$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
+$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
yellow open openrxv-items-final soEzAnp3TDClIGZbmVyEIw 1 1 953 0 2.3mb 2.3mb
-
+
- Update all docker images on the AReS server (linode20):
-$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml down
$ docker-compose -f docker/docker-compose.yml build
-
+
- Then run all system updates on the server and reboot it
- Oh crap, I deleted everything on AReS and restored the backup, and the total number of items is now 104,317… so it was actually correct before!
- For reference, this is how I re-created everything:
-curl -XDELETE 'http://localhost:9200/openrxv-items-final'
-curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-curl -XPUT 'http://localhost:9200/openrxv-items-final'
-curl -XPUT 'http://localhost:9200/openrxv-items-temp'
-curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+curl -XPUT 'http://localhost:9200/openrxv-items-final'
+curl -XPUT 'http://localhost:9200/openrxv-items-temp'
+curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
-
+
- I will just start a new harvest… sigh
2021-05-26
@@ -638,18 +638,18 @@ May 26, 02:57 UTC
- And indeed the email seems to be broken:
-$ dspace test-email
-
-About to send test email:
+$ dspace test-email
+
+About to send test email:
- To: fuuuuuu
- Subject: DSpace test email
- Server: smtp.office365.com
-
-Error sending email:
+
+Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))
-
-Please see the DSpace documentation for assistance.
-
+
+Please see the DSpace documentation for assistance.
+
- I saw a recent thread on the dspace-tech mailing list about this that makes me wonder if Microsoft changed something on Office 365
- I added mail.smtp.ssl.protocols=TLSv1.2 to the mail.extraproperties in dspace.cfg and the test email sent successfully
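- For reference, the relevant line in dspace.cfg ends up looking like this (minimal sketch; any other extra mail properties would be comma-separated on the same line):
mail.extraproperties = mail.smtp.ssl.protocols=TLSv1.2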
diff --git a/docs/2021-06/index.html b/docs/2021-06/index.html
index fc47f96fd..e3b2ea641 100644
--- a/docs/2021-06/index.html
+++ b/docs/2021-06/index.html
@@ -36,7 +36,7 @@ I simply started it and AReS was running again:
"/>
-
+
@@ -132,8 +132,8 @@ I simply started it and AReS was running again:
-$ docker-compose -f docker/docker-compose.yml start angular_nginx
-
+$ docker-compose -f docker/docker-compose.yml start angular_nginx
+
- Margarita from CCAFS emailed me to say that workflow alerts haven’t been working lately
- I guess this is related to the SMTP issues last week
@@ -162,14 +162,14 @@ I simply started it and AReS was running again:
- The Elasticsearch indexes are messed up so I dumped and re-created them correctly:
-
curl -XDELETE 'http://localhost:9200/openrxv-items-final'
-curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-curl -XPUT 'http://localhost:9200/openrxv-items-final'
-curl -XPUT 'http://localhost:9200/openrxv-items-temp'
-curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
+curl -XDELETE 'http://localhost:9200/openrxv-items-final'
+curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+curl -XPUT 'http://localhost:9200/openrxv-items-final'
+curl -XPUT 'http://localhost:9200/openrxv-items-temp'
+curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
-
+
- Then I started a harvesting on AReS
2021-06-07
@@ -208,8 +208,8 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
-$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
-
+$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
+
- The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it’s much faster
- I harvested 90,000+ items from DSpace Test in ~3 hours
@@ -231,23 +231,23 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
-
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
+$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
90459
-$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
+$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
90380
-$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
+$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
...
- 2 "10568/99409"
- 2 "10568/99410"
- 2 "10568/99411"
- 2 "10568/99516"
- 3 "10568/102093"
- 3 "10568/103524"
- 3 "10568/106664"
- 3 "10568/106940"
- 3 "10568/107195"
- 3 "10568/96546"
-
2021-06-20
+ 2 "10568/99409"
+ 2 "10568/99410"
+ 2 "10568/99411"
+ 2 "10568/99516"
+ 3 "10568/102093"
+ 3 "10568/103524"
+ 3 "10568/106664"
+ 3 "10568/106940"
+ 3 "10568/107195"
+ 3 "10568/96546"
+2021-06-20
- Udana asked me to update their IWMI subjects from farmer managed irrigation systems to farmer-led irrigation
@@ -255,12 +255,12 @@ $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-it
-$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
-
+$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
+
- Then I used csvcut to extract just the columns I needed and do the replacement into a new CSV:
-
$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
-
+$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
+
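- Presumably the final step is to import the modified CSV back into DSpace with the metadata-import CLI, something like (hedged sketch, not the exact command from my notes):
$ dspace metadata-import -f /tmp/2021-06-20-IWMI-new-subjects.csv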
-$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
+$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
90937
-$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
+$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
85709
-
+
- So those could be duplicates from the way we harvest pages, but they could also be from mappings…
- Manually inspecting the duplicates where handles appear more than once:
-$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
-
+$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
+
- Unfortunately I found no pattern:
- Some appear twice in the Elasticsearch index, but appear in only one collection
@@ -312,23 +312,23 @@ $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep
-
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
+$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
5
-$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
-"10673/4"
-"10673/3"
-"10673/6"
-"10673/5"
-"10673/7"
-# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
-$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
+$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
+"10673/4"
+"10673/3"
+"10673/6"
+"10673/5"
+"10673/7"
+# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
+$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
4
-$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
-"10673/4"
-"10673/3"
-"10673/5"
-"10673/7"
-
+$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
+"10673/4"
+"10673/3"
+"10673/5"
+"10673/7"
+
- I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
- Last week I noticed that the Gender Platform website is using “cgspace.cgiar.org” links for CGSpace, instead of handles
@@ -355,11 +355,11 @@ $ curl -s -H "Accept: application/json" "https://demo.dspace.org/
-$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
+$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
90327
-$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
+$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
90317
-
2021-06-22
+2021-06-22
- Make a pull request to the COUNTER-Robots project to add two new user agents: crusty and newspaper
@@ -368,13 +368,13 @@ $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-it
-$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 1339 hits from RI\/1\.0 in statistics
Purging 447 hits from crusty in statistics
Purging 3736 hits from newspaper in statistics
-
-Total number of bot hits purged: 5522
-
+
+Total number of bot hits purged: 5522
+
- Surprised to see RI/1.0 in there because it’s been in the override file for a while
- Looking at the 2021 statistics in Solr I see a few more suspicious user agents:
@@ -397,11 +397,11 @@ Total number of bot hits purged: 5522
-# journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
+# journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
978
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10100
-
+
- I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now
- In the meantime, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
- I also applied the following patches from the 6.4 milestone to our 6_x-prod branch:
@@ -412,17 +412,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
- After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
-$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
63
-
+
- Looking in the DSpace log, the first “pool empty” message I saw this morning was at 4AM:
-2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
-
+2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
+
- Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
-
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
-
+Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
+
- We can purge them, as this is not user traffic: https://about.flipboard.com/browserproxy/
- I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots
@@ -448,17 +448,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
-
$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
+$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
104797
-$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
+$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
99186
-
+
- This number is probably unique for that particular harvest, but I don’t think it represents the true number of items…
- The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
-$ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
+$ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
90990
-
+
- So the harvest on the live site is missing items, then why didn’t the add missing items plugin find them?!
- I notice that we are missing the type in the metadata structure config for each repository on the production site, and we are using type for item type in the actual schema… so maybe there is a conflict there
@@ -469,8 +469,8 @@ $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspa
-172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
-
+172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
+
- I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins… now it’s checking 180,000+ handles to see if they are collections or items…
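- The nginx change was roughly a dedicated location block for the sitemap that skips the bot rate limiting (illustrative sketch only, not the exact config; the upstream name is a placeholder):
location /sitemap {
    # no limit_req here so harvesters can always fetch the sitemap
    proxy_pass http://tomcat_http;
}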
- I see it fetched the sitemap three times; we need to make sure it’s only doing it once for each repository
@@ -478,9 +478,9 @@ $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspa
- According to the API logs we will be adding 5,697 items:
-
$ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
+$ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
5697
-
+
- Spent a few hours with Moayad troubleshooting and improving OpenRXV
- We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs
@@ -496,35 +496,35 @@ $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspa
-$ redis-cli
+$ redis-cli
127.0.0.1:6379> SCAN 0 COUNT 5
-1) "49152"
-2) 1) "bull:plugins:476595"
- 2) "bull:plugins:367382"
- 3) "bull:plugins:369228"
- 4) "bull:plugins:438986"
- 5) "bull:plugins:366215"
-
+1) "49152"
+2) 1) "bull:plugins:476595"
+ 2) "bull:plugins:367382"
+ 3) "bull:plugins:369228"
+ 4) "bull:plugins:438986"
+ 5) "bull:plugins:366215"
+
- We can apparently get the names of the jobs in each hash using hget:
-127.0.0.1:6379> TYPE bull:plugins:401827
+127.0.0.1:6379> TYPE bull:plugins:401827
hash
127.0.0.1:6379> HGET bull:plugins:401827 name
-"dspace_add_missing_items"
-
+"dspace_add_missing_items"
+
- I whipped up a one-liner to get the keys for all plugin jobs, convert them to redis HGET commands to extract the value of the name field, and then sort them by their counts:
-$ redis-cli KEYS "bull:plugins:*" \
- | sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
+$ redis-cli KEYS "bull:plugins:*" \
+ | sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
| ncat -w 3 localhost 6379 \
- | grep -v -E '^\$' | sort | uniq -c | sort -h
+ | grep -v -E '^\$' | sort | uniq -c | sort -h
3 dspace_health_check
- 4 -ERR wrong number of arguments for 'hget' command
+ 4 -ERR wrong number of arguments for 'hget' command
12 mel_downloads_and_views
129 dspace_altmetrics
932 dspace_downloads_and_views
186428 dspace_add_missing_items
-
+
- Note that this uses ncat to send commands directly to redis all at once instead of one at a time (netcat didn’t work here, as it doesn’t know when our input is finished and never quits)
- I thought of using redis-cli --pipe but then you have to construct the commands in the redis protocol format with the number of args and length of each command
@@ -544,7 +544,7 @@ hash
- Looking at the DSpace log I see there was definitely a higher number of sessions that day, perhaps twice the normal:
-
$ for file in dspace.log.2021-06-[12]*; do echo "$file"; grep -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
+$ for file in dspace.log.2021-06-[12]*; do echo "$file"; grep -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
dspace.log.2021-06-10
19072
dspace.log.2021-06-11
@@ -581,12 +581,12 @@ dspace.log.2021-06-26
16163
dspace.log.2021-06-27
5886
-
+
- I see 15,000 unique IPs in the XMLUI logs alone on that day:
-# zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
+# zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
15835
-
+
- Annoyingly I found 37,000 more hits from Bing using dns:*msnbot* AND dns:*.msn.com. as a Solr filter
- WTF, they are using a normal user agent: Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko
@@ -628,8 +628,8 @@ dspace.log.2021-06-27
- The DSpace log shows:
-2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
-
+2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
+
- The first one of these I see is from last night at 2021-06-29 at 10:47 PM
- I restarted Tomcat 7 and CGSpace came back up…
- I didn’t see that Atmire had responded last week (on 2021-06-23) about the issues we had
@@ -641,14 +641,14 @@ dspace.log.2021-06-27
- Export a list of all CGSpace’s AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:
-
localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
COPY 20780
-
+
- Actually Enrico wanted NON AGROVOC, so I extracted all the center and CRP subjects (ignoring system office and themes):
-localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
COPY 1710
-
+
- Fix an issue in the Ansible infrastructure playbooks for the DSpace role
- It was causing the template module to fail when setting up the npm environment
@@ -657,13 +657,13 @@ COPY 1710
- I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26):
-Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
-
+Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
+
- What’s even crazier is that it is twice that on CGSpace (linode18)!
- Apparently OpenJDK defaults to using /dev/random (see /etc/java-8-openjdk/security/java.security):
-
securerandom.source=file:/dev/urandom
-
+securerandom.source=file:/dev/urandom
+
- /dev/random blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator
- Now Tomcat starts much faster and no warning is printed so I’m going to add this to our Ansible infrastructure playbooks
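- An alternative to editing java.security is passing the equivalent option in Tomcat's JAVA_OPTS (a sketch; either approach avoids the blocking read from /dev/random):
JAVA_OPTS="$JAVA_OPTS -Djava.security.egd=file:/dev/./urandom"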
diff --git a/docs/2021-07/index.html b/docs/2021-07/index.html
index ff0cc2fa6..fe712081b 100644
--- a/docs/2021-07/index.html
+++ b/docs/2021-07/index.html
@@ -30,7 +30,7 @@ Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVO
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
"/>
-
+
@@ -120,17 +120,17 @@ COPY 20994
- Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
-
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
-
2021-07-04
+2021-07-04
- Update all Docker containers on the AReS server (linode20) and rebuild OpenRXV:
-$ cd OpenRXV
+$ cd OpenRXV
$ docker-compose -f docker/docker-compose.yml down
-$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml build
-
+
- Then run all system updates and reboot the server
- After the server came back up I cloned the openrxv-items-final index to openrxv-items-temp and started the plugins
@@ -172,7 +172,7 @@ $ docker-compose -f docker/docker-compose.yml build
-$ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
+$ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
Purging 95 hits from Drupal in statistics
Purging 38 hits from DTS Agent in statistics
Purging 601 hits from Microsoft Office Existence Discovery in statistics
@@ -183,16 +183,16 @@ Purging 144 hits from FlipboardProxy in statistics
Purging 37 hits from LinkWalker in statistics
Purging 1 hits from [Ll]ink.?[Cc]heck.? in statistics
Purging 427 hits from WordPress in statistics
-
-Total number of bot hits purged: 15030
-
+
+Total number of bot hits purged: 15030
+
- Meet with the CGIAR–AGROVOC task group to discuss how we want to do the workflow for submitting new terms to AGROVOC
- I extracted another list of all subjects to check against AGROVOC:
-\COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
-$ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d > /tmp/2021-07-06-all-subjects.txt
+\COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
+$ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d > /tmp/2021-07-06-all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
-
+
- Test Hrafn Malmquist’s proposed DBCP2 changes for DSpace 6.4 (DS-4574)
- His changes reminded me that we can perhaps switch back to using this pooling instead of Tomcat 7’s JDBC pooling via JNDI
@@ -205,7 +205,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
-# for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
+# for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
2021-06-10
10693
2021-06-11
@@ -240,10 +240,10 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
9439
2021-06-26
7930
-
+
- Similarly, the number of connections to the REST API was around the average for the recent weeks before:
-# for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/rest.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
+# for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/rest.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
2021-06-10
1183
2021-06-11
@@ -278,11 +278,11 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
969
2021-06-26
904
-
+
- According to goaccess, the traffic spike started at 2AM (remember that the first “Pool empty” error in dspace.log was at 4:01AM):
-# zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
-
+# zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
+
- Moayad sent a fix for the add missing items plugins issue (#107)
- It works MUCH faster because it correctly identifies the missing handles in each repository
@@ -311,19 +311,19 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
-
postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2302
-postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2564
-postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2530
-
+
- The locks are held by XMLUI, not REST API or OAI:
-postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
+postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
57 dspaceApi
2671 dspaceWeb
-
+
- I ran all updates on the server (linode18) and restarted it, then DSpace came back up
- I sent a message to Atmire, as I never heard from them last week when we blocked access to the REST API for two days for them to investigate the server issues
- Clone the openrxv-items-temp index on AReS and re-run all the plugins, but most of the “dspace_add_missing_items” tasks failed so I will just run a full re-harvest
@@ -338,7 +338,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
-
# grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
+# grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
32 91.243.191.124
33 91.243.191.129
33 91.243.191.200
@@ -362,7 +362,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
45 91.243.191.151
46 91.243.191.103
56 91.243.191.172
-
+
- I found a few people complaining about these Russian attacks too:
- https://community.cloudflare.com/t/russian-ddos-completley-unmitigated-by-cloudflare/284578
@@ -392,13 +392,13 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
-$ ./asn -n 45.80.217.235
-
-╭──────────────────────────────╮
+$ ./asn -n 45.80.217.235
+
+╭──────────────────────────────╮
│ ASN lookup for 45.80.217.235 │
╰──────────────────────────────╯
-
- 45.80.217.235 ┌PTR -
+
+ 45.80.217.235 ┌PTR -
├ASN 46844 (ST-BGP, US)
├ORG Sharktech
├NET 45.80.217.0/24 (TrafficTransitSolutionNet)
@@ -407,7 +407,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
├TYP Proxy host Hosting/DC
├GEO Los Angeles, California (US)
└REP ✓ NONE
-
+
- Slowly slowly I manually built up a list of the IPs, ISP names, and network blocks, for example:
IP, Organization, Website, Network
@@ -496,17 +496,17 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
-# grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq > /tmp/ips-sorted.txt
+# grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq > /tmp/ips-sorted.txt
# wc -l /tmp/ips-sorted.txt
10776 /tmp/ips-sorted.txt
-
+
- Then resolve them all:
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
- Then get the top 10 organizations and top ten ASNs:
-$ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
+$ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
213 AMAZON-AES
218 ASN-QUADRANET-GLOBAL
246 Silverstar Invest Limited
@@ -517,7 +517,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
814 UGB Hosting OU
1010 ST-BGP
1757 Global Layer B.V.
-$ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
+$ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
213 14618
218 8100
246 35624
@@ -528,10 +528,10 @@ $ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
814 206485
1010 46844
1757 49453
-
+
- I will download blocklists for all these except Ethiopian Telecom, Quadranet, and Amazon, though I’m concerned about Global Layer because it’s a huge ASN that seems to have legit hosts too…?
-$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
+$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
$ wget https://asn.ipinfo.app/api/text/nginx/AS46844
$ wget https://asn.ipinfo.app/api/text/nginx/AS206485
$ wget https://asn.ipinfo.app/api/text/nginx/AS62282
@@ -540,12 +540,12 @@ $ wget https://asn.ipinfo.app/api/text/nginx/AS35624
$ cat AS* | sort | uniq > /tmp/abusive-networks.txt
$ wc -l /tmp/abusive-networks.txt
2276 /tmp/abusive-networks.txt
-
+
- Combining with my existing rules and filtering uniques:
-$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
+$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
2298
-
+
- According to Scamalytics all these are high risk ISPs (as recently as 2021-06) so I will just keep blocking them
- I deployed the block list on CGSpace (linode18) and the load is down to 1.0 but I see there are still some DDoS IPs getting through… sigh
- The next thing I need to do is purge all the IPs from Solr using grepcidr…
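- The idea is roughly this: select the IPs that fall inside the abusive network ranges with grepcidr, then purge those hits from Solr (hedged sketch; the file names are placeholders):
$ grepcidr -f /tmp/abusive-networks-cidrs.txt /tmp/all-ips.txt > /tmp/ips-in-abusive-networks.txt
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips-in-abusive-networks.txt -p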
@@ -558,12 +558,12 @@ $ wc -l /tmp/abusive-networks.txt
-
$ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E " (200|499) " | awk '{print $1}' | sort | uniq > /tmp/all-ips.txt
+$ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E " (200|499) " | awk '{print $1}' | sort | uniq > /tmp/all-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips.txt -o /tmp/all-ips-out.csv
-$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/all-ips-to-block.txt
+$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/all-ips-to-block.txt
$ wc -l /tmp/all-ips-to-block.txt
5095 /tmp/all-ips-to-block.txt
-
+
- Then I added them to the normal ipset we are already using with firewalld
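- Adding a batch of IPs to an existing firewalld ipset looks something like this (sketch; the ipset name here is a placeholder for whatever our Ansible role creates):
# firewall-cmd --permanent --ipset=abusers-ipv4 --add-entries-from-file=/tmp/all-ips-to-block.txt
# firewall-cmd --reload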
- I will check again in a few hours and ban more
@@ -571,10 +571,10 @@ $ wc -l /tmp/all-ips-to-block.txt
- I decided to extract the networks from the GeoIP database with resolve-addresses-geoip2.py so I can block them more efficiently than using the 5,000 IPs in an ipset:
-$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/all-networks-to-block.txt
+$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/all-networks-to-block.txt
$ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq | wc -l
2354
-
+
- Combined with the previous networks this brings about 200 more for a total of 2,354 networks
- I think I need to re-work the ipset stuff in my common Ansible role so that I can add such abusive networks as an iptables ipset / nftables set, and have a cron job to update them daily (from Spamhaus’s DROP and EDROP lists, for example)
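- Fetching those lists is simple enough to put in a cron job, something like (sketch; Spamhaus publishes them as plain text with ";" comments):
$ curl -s https://www.spamhaus.org/drop/drop.txt https://www.spamhaus.org/drop/edrop.txt | sed -e '/^;/d' -e 's/ ; .*//' > /tmp/spamhaus-drop-networks.txt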
@@ -582,25 +582,25 @@ $ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq
- Then I got a list of all the 5,095 IPs from above and used check-spider-ip-hits.sh to purge them from Solr:
-$ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
+$ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
...
Total number of bot hits purged: 197116
-
+
- I started a harvest on AReS and it finished in a few hours now that the load on CGSpace is back to a normal level
2021-07-20
- Looking again at the IPs making connections to CGSpace over the last few days from these seven ASNs, it’s much higher than I noticed yesterday:
-$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
+$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
5643
-
+
- I purged 27,000 more hits from the Solr stats using this new list of IPs with my check-spider-ip-hits.sh script
- Surprise surprise, I checked the nginx logs from 2021-06-23 when we last had issues with thousands of XMLUI sessions and PostgreSQL connections and I see IPs from the same ASNs!
-$ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/all-ips-june-23.txt
+$ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/all-ips-june-23.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips-june-23.txt -o /tmp/out.csv
-$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
+$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
265 GOOGLE,15169
277 Silverstar Invest Limited,35624
280 FACEBOOK,32934
@@ -616,17 +616,17 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
874 Ethiopian Telecommunication Corporation,24757
912 UGB Hosting OU,206485
1607 Global Layer B.V.,49453
-
+
- Again it was over 5,000 IPs:
-$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
+$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
5228
-
+
- Interestingly, it seems these are five thousand IP addresses different from those in the attack last weekend, as there are over 10,000 unique ones if I combine them!
-$ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
+$ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
10458
-
+
- I purged all the (26,000) hits from these new IP addresses from Solr as well
- Looking back at my notes for the 2019-05 attack I see that I had already identified most of these network providers (!)…
@@ -636,30 +636,30 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
- Adding QuadraNet brings the total networks seen during these two attacks to 262, and the number of unique IPs to 10900:
-# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/ddos-ips.txt
+# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/ddos-ips.txt
# wc -l /tmp/ddos-ips.txt
54002 /tmp/ddos-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ddos-ips.txt -o /tmp/ddos-ips.csv
-$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/ddos-ips-to-purge.txt
+$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/ddos-ips-to-purge.txt
$ wc -l /tmp/ddos-ips-to-purge.txt
10900 /tmp/ddos-ips-to-purge.txt
-$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/ddos-networks-to-block.txt
+$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/ddos-networks-to-block.txt
$ wc -l /tmp/ddos-networks-to-block.txt
262 /tmp/ddos-networks-to-block.txt
-
+
- The new total number of networks to block, including the network prefixes for these ASNs downloaded from asn.ipinfo.app, is 4,007:
-$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
-https://asn.ipinfo.app/api/text/nginx/AS46844 \
+$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
+https://asn.ipinfo.app/api/text/nginx/AS46844 \
https://asn.ipinfo.app/api/text/nginx/AS206485 \
https://asn.ipinfo.app/api/text/nginx/AS62282 \
https://asn.ipinfo.app/api/text/nginx/AS36352 \
https://asn.ipinfo.app/api/text/nginx/AS35913 \
https://asn.ipinfo.app/api/text/nginx/AS35624 \
https://asn.ipinfo.app/api/text/nginx/AS8100
-$ cat AS* /tmp/ddos-networks-to-block.txt | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | sort | uniq | wc -l
+$ cat AS* /tmp/ddos-networks-to-block.txt | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | sort | uniq | wc -l
4007
-
+
- I re-applied these networks to nginx on CGSpace (linode18) and DSpace Test (linode26), and purged 14,000 more Solr statistics hits from these IPs
2021-07-22
diff --git a/docs/2021-08/index.html b/docs/2021-08/index.html
index ccac3f0ca..c65ddcce2 100644
--- a/docs/2021-08/index.html
+++ b/docs/2021-08/index.html
@@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server:
I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
-
+
@@ -122,37 +122,37 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
- Update Docker images on AReS server (linode20) and reboot the server:
-# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
-
+# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
+
- I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
- First running all existing updates, taking some backups, checking for broken packages, and then rebooting:
-
# apt update && apt dist-upgrade
-# apt autoremove && apt autoclean
-# check for any packages with residual configs we can purge
-# dpkg -l | grep -E '^rc' | awk '{print $2}'
-# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
+# apt update && apt dist-upgrade
+# apt autoremove && apt autoclean
+# check for any packages with residual configs we can purge
+# dpkg -l | grep -E '^rc' | awk '{print $2}'
+# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# dpkg -C
# dpkg -l > 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
-# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
-# do-release-upgrade
-
+# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
+# do-release-upgrade
+
- … but of course it hit the libxcrypt bug
- I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually
-# apt install -f
+# apt install -f
# apt dist-upgrade
# reboot
-
+
- After rebooting I purged all packages with residual configs and cleaned up again:
-# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
-# apt autoremove && apt autoclean
-
+# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
+# apt autoremove && apt autoclean
+
- Then I cleared my local Ansible fact cache and re-ran the infrastructure playbooks
- Open an issue for the value mappings global replacement bug in OpenRXV
- Advise Peter and Abenet on expected CGSpace budget for 2022
@@ -190,21 +190,21 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
-# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
+# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
-
+
- Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
- Indeed, now I see that there are no IPs from those networks coming in now:
-$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
-$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
+$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
+$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
-
2021-08-08
+2021-08-08
- Advise IWMI colleagues on best practices for thumbnails
- Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
@@ -220,8 +220,8 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
-Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
-
+Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
+
- That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP
- I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, I will purge their hits too
@@ -232,14 +232,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
- 3.225.28.105 uses a normal-looking user agent but makes thousands of request to the REST API a few seconds apart
- 61.143.40.50 is in China and uses this hilarious user agent:
-Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
-
+Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
+
- 47.252.80.214 is owned by Alibaba in the US and has the same user agent
- 159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours
- 95.87.154.12 seems to be a new bot with the following user agent:
-
Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
-
+Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
+
- They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
- I will purge the hits and add them to our list of bot overrides in the meantime before I submit it to COUNTER-Robots
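- A minimal sketch of adding that agent to our local spider agent overrides (the file path matches the one used with check-spider-hits.sh below; treating it as a plain list of patterns is an assumption):

```console
$ echo 'MaCoCu' >> dspace/config/spiders/agents/ilri
```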
@@ -247,14 +247,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
- I see a new bot using this user agent:
-
nettle (+https://www.nettle.sk)
-
+nettle (+https://www.nettle.sk)
+
- 129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.
- 217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day
- 103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human
- There are probably more but that’s most of them over 1,000 hits last month, so I will purge them:
-
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
@@ -267,17 +267,17 @@ Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics
-
-Total number of bot hits purged: 90485
-
+
+Total number of bot hits purged: 90485
+
- Then I purged a few thousand more by user agent:
-$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
+$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics
-
-Total number of hits from bots: 4492
-
+
+Total number of hits from bots: 4492
+
- I found some CGSpace metadata in the wrong fields
- Seven metadata in dc.subject (57) should be in dcterms.subject (187)
@@ -289,8 +289,8 @@ Total number of hits from bots: 4492
- I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:
-$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
-
+$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
+
- Then in OpenRefine I merged all null, blank, and en fields into the en_US one for each, removed all spaces, fixed invalid multi-value separators, and removed everything other than the ISSN/ISBNs themselves
- In total it was a few thousand metadata entries or so, so I had to split the CSV with xsv split in order to process it
@@ -303,20 +303,20 @@ Total number of hits from bots: 4492
- Extract all unique ISSNs to look up on Sherpa Romeo and Crossref
-
$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
+$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
-
+
- Then I updated the CSV headers for each and joined the CSVs on the issn column:
-$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
-$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
+$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
+$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
-
+
- In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:
-if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
-
+if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
+
- Then I exported the list of journals that differ and sent it to Peter for comments and corrections
- I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it
@@ -332,15 +332,15 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
- I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
-
$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
+$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
39004:0.08
-$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg
+$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg
40932:0.53
-$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
+$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
41724:0.59
-$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
+$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
24736:0.04
-
+
- The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
- libvips does use less time and memory… I should do more tests!
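- A quick sketch (not from the notes) for repeating the comparison across several PDFs, assuming the same tools are installed and the PDFs are in the current directory:

```console
$ for pdf in *.pdf; do
    echo "== $pdf =="
    /usr/bin/time -f 'vips: %M KB, %e s' vipsthumbnail "$pdf" -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
    /usr/bin/time -f 'gm:   %M KB, %e s' gm convert "${pdf}[0]" -quality 85 -thumbnail x600 -flatten "${pdf%.pdf}-gm.jpg"
  done
```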
@@ -359,17 +359,17 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
-$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
-$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
+$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
+$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
-
+
- Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
- I exported a list of all the journal titles we have in the cg.journal field:
-localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
+localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
-
+
- I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them…
- I think it’s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
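- Such a script would presumably use the same public Crossref journals endpoint as crossref-issn-lookup.py above; for a single ISSN the lookup is roughly like this (the ISSN is only an example):

```console
$ curl -s 'https://api.crossref.org/journals/0378-4290' | jq -r '.message.title'
```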
@@ -421,10 +421,10 @@ COPY 3245
-$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
-$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
-$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
-
+$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
+$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
+$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
+
- I made a minor fix to OpenRXV to prefix all image names with docker.io so it works with fewer changes on podman
- Docker assumes the docker.io registry by default, but we should be explicit
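- Illustrative only: Docker resolves an unqualified image name to docker.io automatically, while podman wants the fully qualified reference (or extra registry configuration):

```console
$ docker pull nginx:1.21
$ podman pull docker.io/library/nginx:1.21
```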
@@ -446,40 +446,40 @@ $ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
- Lower case all AGROVOC metadata, as I had noticed a few in sentence case:
-
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
+dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 484
-
+
- Also update some DOIs using the dx.doi.org format, just to keep things uniform:
-dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
+dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE 469
-
+
- Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:
-$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
-
-real 322m16.917s
+$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real 322m16.917s
user 226m43.121s
sys 3m17.469s
-
+
- I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:
-$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
- -H 'Content-Type: application/json' \
- -d '{
- "size": 10,
- "query": {
- "bool": {
- "filter": {
- "term": {
- "repo.keyword": "CGSpace"
+$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
+ -H 'Content-Type: application/json' \
+ -d '{
+ "size": 10,
+ "query": {
+ "bool": {
+ "filter": {
+ "term": {
+ "repo.keyword": "CGSpace"
}
}
}
}
-}'
-$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
-
+}'
+$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
+
- This uses the Elasticsearch scroll ID to page through results
- The second query doesn’t need the request body because it is saved for 1 day as part of the first request
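- A hedged sketch of paging through the whole result set, assuming the first response exposes the Elasticsearch _scroll_id field and that jq is available:

```console
$ scroll_id=$(curl -s -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
    -H 'Content-Type: application/json' \
    -d '{"size": 100, "query": {"bool": {"filter": {"term": {"repo.keyword": "CGSpace"}}}}}' | jq -r '._scroll_id')
$ curl -s -X POST "https://cgspace.cgiar.org/explorer/api/search/scroll/$scroll_id" | jq '.hits.hits | length'
```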
@@ -525,46 +525,46 @@ $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5
-$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
+$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
$ wc -l /tmp/2021-08-25-combined-orcids.txt
1331
-
+
- After I combined them and removed duplicates, I resolved all the names using my resolve-orcids.py script:
-$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
-
+$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
+
- Tag existing items from the Alliance’s new authors with ORCID iDs using add-orcid-identifiers-csv.py (181 new metadata fields added):
-
$ cat 2021-08-25-add-orcids.csv
+$ cat 2021-08-25-add-orcids.csv
dc.contributor.author,cg.creator.identifier
-"Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
-"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
-"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279"
-"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
-"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
-"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
-"Jager M.","Matthias Jager: 0000-0003-1059-3949"
-"Jager, M.","Matthias Jager: 0000-0003-1059-3949"
-"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949"
-"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215"
-"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215"
-"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873"
-"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854"
-"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
-"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
-"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
-"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
-"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
-"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170"
-"Groot, J.C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
-"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
-"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170"
-"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223"
-"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
-"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
-"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
-$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
-
2021-08-29
+"Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
+"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
+"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279"
+"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
+"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
+"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
+"Jager M.","Matthias Jager: 0000-0003-1059-3949"
+"Jager, M.","Matthias Jager: 0000-0003-1059-3949"
+"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949"
+"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215"
+"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215"
+"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873"
+"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854"
+"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
+"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
+"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
+"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
+"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
+"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170"
+"Groot, J.C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
+"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
+"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170"
+"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223"
+"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
+"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
+"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
+$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
+2021-08-29
- Run a full harvest on AReS
- Also do more work the past few days on OpenRXV
diff --git a/docs/2021-09/index.html b/docs/2021-09/index.html
index b789e40c4..50c66660c 100644
--- a/docs/2021-09/index.html
+++ b/docs/2021-09/index.html
@@ -48,7 +48,7 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
"/>
-
+
@@ -154,9 +154,9 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
- Update Docker images on AReS server (linode20) and rebuild OpenRXV:
-$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
-
+
- Then run system updates and reboot the server
- After the system came back up I started a fresh re-harvesting
@@ -201,8 +201,8 @@ $ docker-compose build
-$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
-
+$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
+
- Looking at the PDF’s metadata I see:
- Producer: iLovePDF
@@ -236,11 +236,11 @@ $ docker-compose build
-
$ cat 2021-09-15-add-orcids.csv
+$ cat 2021-09-15-add-orcids.csv
dc.contributor.author,cg.creator.identifier
-"Kotchofa, Pacem","Pacem Kotchofa: 0000-0002-1640-8807"
-$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p 'fuuuu'
-
+"Kotchofa, Pacem","Pacem Kotchofa: 0000-0002-1640-8807"
+$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p 'fuuuu'
+
- Meeting with Leroy Mwanzia and some other Alliance people about depositing to CGSpace via API
- I gave them some technical information about the CGSpace API and links to the controlled vocabularies and metadata registries we are using
@@ -273,24 +273,24 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u
-$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
+$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
63
-
+
- Load on the server is under 1.0, and there are only about 1,000 XMLUI sessions, which seems to be normal for this time of day according to Munin
- But the DSpace log file shows tons of database issues:
-$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-17
+$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-17
14779
-
+
- The earliest one I see is around midnight (now is 2PM):
-2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
+2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
-
+
- But I was definitely logged into the site this morning so there were no issues then…
- It seems that a few errors are normal, but there’s obviously something wrong today:
-$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-*
+$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-*
dspace.log.2021-09-01:116
dspace.log.2021-09-02:163
dspace.log.2021-09-03:77
@@ -308,7 +308,7 @@ dspace.log.2021-09-14:102
dspace.log.2021-09-15:542
dspace.log.2021-09-16:368
dspace.log.2021-09-17:15235
-
+
- I restarted the server and DSpace came up fine… so it must have been some kind of fluke
- Continue working on cleaning up and annotating the metadata registry on CGSpace
@@ -338,9 +338,9 @@ dspace.log.2021-09-17:15235
-$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
-
2021-09-20
+2021-09-20
- I synchronized the production CGSpace PostreSQL, Solr, and Assetstore data with DSpace Test
- Over the weekend a few users reported that they could not log into CGSpace
@@ -349,10 +349,10 @@ $ docker-compose build
-$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap-account@cgiarad.org" -W "(sAMAccountName=someaccountnametocheck)"
+$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap-account@cgiarad.org" -W "(sAMAccountName=someaccountnametocheck)"
Enter LDAP Password:
-ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
-
+ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
+
- I sent a message to CGNET to ask about the server settings and see if our IP is still whitelisted
- It turns out that CGNET created a new Active Directory server (AZCGNEROOT3.cgiarad.org) and decommissioned the old one last week
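- Re-testing the bind against the new server should just be a matter of swapping the hostname in the earlier command (an assumed follow-up, not from the notes):

```console
$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap-account@cgiarad.org" -W "(sAMAccountName=someaccountnametocheck)"
```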
@@ -361,8 +361,8 @@ ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
- Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:
-$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
-
+$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
+
- I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
- According to my notes from 2020-10 the account must be in the admin group in order to submit via the REST API
@@ -371,13 +371,13 @@ ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
- Run the dspace cleanup -v process on CGSpace to clean up old bitstreams
- Export lists of authors, donors, and affiliations for Peter Ballantyne to clean up:
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
COPY 80901
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
COPY 1274
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
COPY 8091
-
2021-09-23
+2021-09-23
- Peter sent me back the corrections for the affiliations
@@ -386,24 +386,24 @@ COPY 8091
-$ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
-$ csvgrep -c 'correct' -m 'DELETE' /tmp/affiliations.csv > /tmp/affiliations-delete.csv
-$ csvgrep -c 'correct' -r '^.+$' /tmp/affiliations.csv | csvgrep -i -c 'correct' -m 'DELETE' > /tmp/affiliations-fix.csv
-$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
-$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
-
+$ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
+$ csvgrep -c 'correct' -m 'DELETE' /tmp/affiliations.csv > /tmp/affiliations-delete.csv
+$ csvgrep -c 'correct' -r '^.+$' /tmp/affiliations.csv | csvgrep -i -c 'correct' -m 'DELETE' > /tmp/affiliations-fix.csv
+$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
+$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
+
- Then I updated the controlled vocabulary for affiliations by exporting the top 1,000 used terms:
-
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
-$ csvcut -c 1 /tmp/2021-09-23-affiliations.csv | sed 1d > /tmp/affiliations.txt
-
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
+$ csvcut -c 1 /tmp/2021-09-23-affiliations.csv | sed 1d > /tmp/affiliations.txt
+
- Peter also sent me 310 corrections and 234 deletions for donors so I applied those and updated the controlled vocabularies too
- Move some One CGIAR-related collections around the CGSpace hierarchy for Peter Ballantyne
- Mohammed Salem asked me for an ID to UUID mapping for CGSpace collections, so I generated one similar to the ID one I sent him in 2020-11:
-
localhost/dspace63= > \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
COPY 1139
-
2021-09-24
+2021-09-24
- Peter and Abenet agreed that we should consider converting more of our UPPER CASE metadata values to Title Case
@@ -435,33 +435,33 @@ COPY 1139
-localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
+localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
UPDATE 2903
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.coverage.subregion" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.coverage.subregion" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
COPY 1200
-
+
- Then I process the list for matches with my subdivision-lookup.py script, and extract only the values that matched:
-$ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
-$ csvgrep -c matched -m 'true' /tmp/subregions.csv | csvcut -c 1 | sed 1d > /tmp/subregions-matched.txt
+$ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
+$ csvgrep -c matched -m 'true' /tmp/subregions.csv | csvcut -c 1 | sed 1d > /tmp/subregions-matched.txt
$ wc -l /tmp/subregions-matched.txt
81 /tmp/subregions-matched.txt
-
+
- Then I updated the controlled vocabulary in the submission forms
- I did the same for dcterms.audience, taking special care with a few all-caps values:
-localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != 'NGOS' AND text_value != 'CGIAR';
-localhost/dspace63= > UPDATE metadatavalue SET text_value='NGOs' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'NGOS';
-
+localhost/dspace63= > UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != 'NGOS' AND text_value != 'CGIAR';
+localhost/dspace63= > UPDATE metadatavalue SET text_value='NGOs' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'NGOS';
+
- Update submission form comment for DOIs because it was still recommending people use the “dx.doi.org” format even though I batch updated all DOIs to the “doi.org” format a few times in the last year
- Then I updated all existing metadata to the new format again:
-
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
+dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
UPDATE 49
-
2021-09-26
+2021-09-26
- Mohammed Salem told me last week that MELSpace and WorldFish have been upgraded to DSpace 6 so I updated the repository setup in AReS to use the UUID field instead of IDs
@@ -489,26 +489,26 @@ UPDATE 49
-$ csvcut -c 'id,collection,dc.title[en_US]' ~/Downloads/10568-106990.csv > /tmp/2021-09-28-alliance-reports.csv
-
+$ csvcut -c 'id,collection,dc.title[en_US]' ~/Downloads/10568-106990.csv > /tmp/2021-09-28-alliance-reports.csv
+
- She sent it back fairly quickly with a new column marked “Move” so I extracted those items that matched and set them to the new owning collection:
-
$ csvgrep -c Move -m 'Yes' ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed 's_10568/106990_10568/111506_' > /tmp/alliance-move.csv
-
+$ csvgrep -c Move -m 'Yes' ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed 's_10568/106990_10568/111506_' > /tmp/alliance-move.csv
+
- Maria from the Alliance emailed us to say that approving submissions was slow on CGSpace
- I looked at the PostgreSQL activity and it seems low:
-
postgres@linode18:~$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
+postgres@linode18:~$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
59
-
+
- Locks look high though:
-postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | sort | uniq -c | wc -l
+postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | sort | uniq -c | wc -l
1154
-
+
- Indeed it seems something started causing locks to increase yesterday:
@@ -520,9 +520,9 @@ UPDATE 49
- The number of DSpace sessions is normal, hovering around 1,000…
- Looking closer at the PostgreSQL activity log, I see the locks are all held by the dspaceCli user… which seems weird:
-
postgres@linode18:~$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | wc -l
+postgres@linode18:~$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | wc -l
1096
-
+
- Now I’m wondering why there are no connections from dspaceApi or dspaceWeb. Could it be that our Tomcat JDBC pooling via JNDI isn’t working? (a quick check is sketched below)
- I see the same thing on DSpace Test hmmmm
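- A quick check of which application_name values actually hold connections, using the same pg_stat_activity view queried elsewhere in these notes:

```console
postgres@linode18:~$ psql -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count DESC;"
```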
@@ -536,14 +536,14 @@ UPDATE 49
- Export a list of ILRI subjects from CGSpace to validate against AGROVOC for Peter and Abenet:
-
localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
COPY 149
-
+
- Then validate and format the matches:
-$ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
-$ csvcut -c subject,'match type' /tmp/2021-09-29-ilri-subjects.csv | sed -e 's/match type/matched/' -e 's/\(alt\|pref\)Label/yes/' > /tmp/2021-09-29-ilri-subjects2.csv
-
+$ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
+$ csvcut -c subject,'match type' /tmp/2021-09-29-ilri-subjects.csv | sed -e 's/match type/matched/' -e 's/\(alt\|pref\)Label/yes/' > /tmp/2021-09-29-ilri-subjects2.csv
+
- I talked to Salem about depositing from MEL to CGSpace
- He mentioned that the one issue is that when you deposit to a workflow you don’t get a Handle or any kind of identifier back!
diff --git a/docs/2021-10/index.html b/docs/2021-10/index.html
index 321295fc4..ae234137b 100644
--- a/docs/2021-10/index.html
+++ b/docs/2021-10/index.html
@@ -11,7 +11,7 @@
Export all affiliations on CGSpace and run them against the latest RoR data dump:
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
@@ -35,7 +35,7 @@ So we have 1879/7100 (26.46%) matching already
Export all affiliations on CGSpace and run them against the latest RoR data dump:
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
@@ -46,7 +46,7 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
So we have 1879/7100 (26.46%) matching already
"/>
-
+
@@ -136,15 +136,15 @@ So we have 1879/7100 (26.46%) matching already
- Export all affiliations on CGSpace and run them against the latest RoR data dump:
-
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
-$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
-
+
- So we have 1879/7100 (26.46%) matching already
2021-10-03
@@ -185,19 +185,19 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
-# zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
+# zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq > /tmp/mozilla-4.0-ips.txt
# wc -l /tmp/mozilla-4.0-ips.txt
543 /tmp/mozilla-4.0-ips.txt
-
+
- Then I resolved the IPs and extracted the ones belonging to Amazon:
-$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
-$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
-
+$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k "$ABUSEIPDB_API_KEY" -o /tmp/mozilla-4.0-ips.csv
+$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
+
- I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon
- Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:
-
1592 GET /handle/10947/2526
+ 1592 GET /handle/10947/2526
1592 GET /handle/10947/2527
1592 GET /handle/10947/34
1593 GET /handle/10947/6
@@ -215,7 +215,7 @@ $ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee
1600 GET /handle/10947/4467
1607 GET /handle/10568/103816
290382 GET /handle/10568/83389
-
+
- Before I purge all those I will ask Samuel Stacey from the System Office to hopefully get some insight…
- Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR
- Meeting with Michelle from Altmetric about their new CSV upload system
@@ -231,10 +231,10 @@ $ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee
- Extract the AGROVOC subjects from IWMI’s 292 publications to validate them against AGROVOC:
-$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc.txt
+$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/"//g' | sort -u > /tmp/agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc-sorted.txt -o /tmp/agrovoc-matches.csv
-$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
-
2021-10-05
+$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 > /tmp/invalid-agrovoc.csv
+2021-10-05
- Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
@@ -243,11 +243,11 @@ $ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 &
-$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
...
-
-Total number of bot hits purged: 465119
-
2021-10-06
+
+Total number of bot hits purged: 465119
+2021-10-06
- Thinking about how we could check for duplicates before importing
@@ -255,14 +255,14 @@ Total number of bot hits purged: 465119
-localhost/dspace63= > CREATE EXTENSION pg_trgm;
-localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
+localhost/dspace63= > CREATE EXTENSION pg_trgm;
+localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
metadata_value_id │ text_value │ dspace_object_id
───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
-
+
- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
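- A rough sketch of what such a check could look like in the shell rather than Python (connection details are assumptions, and titles containing single quotes would need escaping, so this is illustrative only):

```console
$ while read -r title; do
    psql -h localhost -U dspacetest dspace63 -c "SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value, '$title') > 0.5;"
  done < /tmp/titles.txt
```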
@@ -291,10 +291,10 @@ localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id
- Then I ran this new version of csv-metadata-quality on an export of IWMI’s community, minus some fields I don’t want to check:
-$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
+$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
-$ xsv split -s 2000 /tmp /tmp/iwmi.csv
-
+$ xsv split -s 2000 /tmp /tmp/iwmi.csv
+
- I noticed each CSV only had 10 or 20 corrections, and notably none of the duplicate metadata values were removed in the CSVs…
- I cut a subset of the fields from the main CSV and tried again, but DSpace said “no changes detected”
@@ -319,18 +319,18 @@ Try doing it in two imports. In first import, remove all authors. In second impo
-$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
+$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv > /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
-$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
-
+$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
+
- It takes a few hours per 2,000 items because DSpace processes them so slowly… sigh…
2021-10-08
- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:
-cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
+cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2603711
@@ -342,31 +342,31 @@ $ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
| 0
(7 rows)
cgspace=# BEGIN;
-cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
+cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
-
+
- So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:
-$ grep -c 'Removing duplicate value' /tmp/out.log
+$ grep -c 'Removing duplicate value' /tmp/out.log
391
-
+
- I tried to export ILRI’s community, but ran into the export bug (DS-4211)
- After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
-$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
+$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
32070
-$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
+$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
19315
-
+
- It seems there are only about 200 duplicate values in this subset of fields in ILRI’s community:
-$ grep -c 'Removing duplicate value' /tmp/out.log
+$ grep -c 'Removing duplicate value' /tmp/out.log
220
-
+
- I found a cool way to select only the items with corrections
- First, extract a handful of fields from the CSV with csvcut
@@ -376,11 +376,11 @@ $ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
-$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
+$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq > /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
-$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
+$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv > /tmp/ilri-deduplicated-items-cleaned-joined.csv
-
+
- Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:
if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,"same","different")
@@ -392,9 +392,9 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
- I did the same for CIAT but there were over 7,000 duplicate metadata values! Hard to believe:
-
$ grep -c 'Removing duplicate value' /tmp/out.log
+$ grep -c 'Removing duplicate value' /tmp/out.log
7720
-
+
- I applied these to the CIAT community, so in total that’s over 8,000 duplicate metadata values removed in a handful of fields…
2021-10-09
@@ -402,14 +402,14 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
- I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there
- Also of note, there are some other fixes too, for example in IITA’s community:
-
$ grep -c -E '(Fixing|Removing) (duplicate|excessive|invalid)' /tmp/out.log
+$ grep -c -E '(Fixing|Removing) (duplicate|excessive|invalid)' /tmp/out.log
249
-
+
- I ran a full Discovery re-indexing on CGSpace
- Then I exported all of CGSpace and extracted the ISSNs and ISBNs:
-$ csvcut -c 'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]' /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv
-
+$ csvcut -c 'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]' /tmp/cgspace.csv > /tmp/cgspace-issn-isbn.csv
+
- I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs
2021-10-10
@@ -417,42 +417,42 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
- Start testing DSpace 7.1-SNAPSHOT to see if it has the duplicate item bug on metadata-export (DS-4211)
- First create a new PostgreSQL 13 container:
-$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5433:5432 -d postgres:13-alpine
-$ createuser -h localhost -p 5433 -U postgres --pwprompt dspacetest
-$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
-$ psql -h localhost -p 5433 -U postgres dspace7 -c 'CREATE EXTENSION pgcrypto;'
-
+$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5433:5432 -d postgres:13-alpine
+$ createuser -h localhost -p 5433 -U postgres --pwprompt dspacetest
+$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
+$ psql -h localhost -p 5433 -U postgres dspace7 -c 'CREATE EXTENSION pgcrypto;'
+
- Then edit settings in dspace/config/local.cfg and build the backend server with Java 11:
-
$ mvn package
+$ mvn package
$ cd dspace/target/dspace-installer
$ ant fresh_install
# fix database not being fully ready, causing Tomcat to fail to start the server application
$ ~/dspace7/bin/dspace database migrate
-
+
- Copy Solr configs and start Solr:
-$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
+$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
$ ~/src/solr-8.8.2/bin/solr start
-
+
- Start my local Tomcat 9 instance:
-$ systemctl --user start tomcat9@dspace7
-
+$ systemctl --user start tomcat9@dspace7
+
- This works, so now I will drop the default database and import a dump from CGSpace
-
$ systemctl --user stop tomcat9@dspace7
-$ dropdb -h localhost -p 5433 -U postgres dspace7
-$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
-$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest superuser;'
-$ pg_restore -h localhost -p 5433 -U postgres -d dspace7 -O --role=dspacetest -h localhost dspace-2021-10-09.backup
-$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest nosuperuser;'
-
+$ systemctl --user stop tomcat9@dspace7
+$ dropdb -h localhost -p 5433 -U postgres dspace7
+$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
+$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest superuser;'
+$ pg_restore -h localhost -p 5433 -U postgres -d dspace7 -O --role=dspacetest -h localhost dspace-2021-10-09.backup
+$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest nosuperuser;'
+
- Delete Atmire migrations and some others that were “unresolved”:
-
$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"
-$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"
-
+$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"
+$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"
+
- Now DSpace 7 starts with my CGSpace data… nice
- The Discovery indexing still takes seven hours… fuck
@@ -469,11 +469,11 @@ $ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_vers
- Start a full Discovery reindex on my local DSpace 6.3 instance:
-
$ /usr/bin/time -f %M:%e chrt -b 0 ~/dspace63/bin/dspace index-discovery -b
+$ /usr/bin/time -f %M:%e chrt -b 0 ~/dspace63/bin/dspace index-discovery -b
Loading @mire database changes for module MQM
Changes have been processed
836140:6543.6
-
+
- So that’s 1.8 hours versus 7 on DSpace 7, with the same database!
- Several users wrote to me that CGSpace was slow recently
@@ -481,13 +481,13 @@ Changes have been processed
-$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
+$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
53
-$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
+$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
1697
-$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'" | wc -l
+$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'" | wc -l
1681
-
+
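- To see at a glance which applications are holding those locks, the same join can be grouped by application name, something like:

```console
$ psql -c "SELECT psa.application_name, COUNT(*) FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid GROUP BY psa.application_name ORDER BY COUNT(*) DESC"
```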
- Looking at Munin, I see there are indeed a higher number of locks starting on the morning of 2021-10-07:
@@ -516,71 +516,71 @@ $ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.p
-localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
-
+localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
+
- Next I experimented with using GIN or GiST indexes on metadatavalue, but they were slower than the existing DSpace indexes (a sketch of the index definition is included after the query timings below)
- I tested a few variations of the query I had been using and found it’s much faster if I use the similarity operator and keep the condition that object IDs are in the item table…
-
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
-
-Time: 739.948 ms
-
+
+Time: 739.948 ms
+
- Now this script runs in four minutes (versus twenty-four!) and it still finds the same seven duplicates! Amazing!
- I still don’t understand the differences in the query plan well enough, but I see it is using the DSpace default indexes and the results are accurate
- So to summarize, the best to the worst query, all returning the same result:
-localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
-localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
+localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
-
-Time: 683.165 ms
+
+Time: 683.165 ms
Time: 635.364 ms
Time: 674.666 ms
-
-localhost/dspace= > DISCARD ALL;
+
+localhost/dspace= > DISCARD ALL;
localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
-localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
-
-Time: 1584.765 ms (00:01.585)
+
+Time: 1584.765 ms (00:01.585)
Time: 1665.594 ms (00:01.666)
Time: 1623.726 ms (00:01.624)
-
-localhost/dspace= > DISCARD ALL;
-localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
+
+localhost/dspace= > DISCARD ALL;
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
-
-Time: 4028.939 ms (00:04.029)
+
+Time: 4028.939 ms (00:04.029)
Time: 4022.239 ms (00:04.022)
Time: 4061.820 ms (00:04.062)
-
-localhost/dspace= > DISCARD ALL;
-localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
+
+localhost/dspace= > DISCARD ALL;
+localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
-
-Time: 4358.713 ms (00:04.359)
+
+Time: 4358.713 ms (00:04.359)
Time: 4301.248 ms (00:04.301)
Time: 4417.909 ms (00:04.418)
-
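- For reference, the trigram indexes I was experimenting with look roughly like this, and EXPLAIN (ANALYZE, BUFFERS) is how to compare the query plans (a sketch with a made-up index name):

```console
localhost/dspace= > CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING GIN (text_value gin_trgm_ops);
localhost/dspace= > EXPLAIN (ANALYZE, BUFFERS) SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
```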
2021-10-13
+2021-10-13
- I looked into the REST API issue where fields without qualifiers throw an HTTP 500
@@ -640,11 +640,11 @@ Time: 4417.909 ms (00:04.418)
-$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "booo" -W "(sAMAccountName=fuuu)"
+$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "booo" -W "(sAMAccountName=fuuu)"
Enter LDAP Password:
ldap_bind: Invalid credentials (49)
additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
-
+
- I sent a message to ILRI ICT to ask them to check the account
- They reset the password so I ran all system updates and rebooted the server since users weren’t able to log in anyways
@@ -664,17 +664,17 @@ ldap_bind: Invalid credentials (49)
-$ http 'localhost:8081/solr/statistics/select?q=time%3A2021-04*&fl=ip&wt=json&indent=true&facet=true&facet.field=ip&facet.limit=200000&facet.mincount=1' > /tmp/2021-04-ips.json
-# Ghetto way to extract the IPs using jq, but I can't figure out how to only print them and not the facet counts, so I just use sed
-$ jq '.facet_counts.facet_fields.ip[]' /tmp/2021-04-ips.json | grep -E '^"' | sed -e 's/"//g' > /tmp/ips.txt
+$ http 'localhost:8081/solr/statistics/select?q=time%3A2021-04*&fl=ip&wt=json&indent=true&facet=true&facet.field=ip&facet.limit=200000&facet.mincount=1' > /tmp/2021-04-ips.json
+# Ghetto way to extract the IPs using jq, but I can't figure out how to only print them and not the facet counts, so I just use sed
+$ jq '.facet_counts.facet_fields.ip[]' /tmp/2021-04-ips.json | grep -E '^"' | sed -e 's/"//g' > /tmp/ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
-$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u > /tmp/networks-to-block.txt
+$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u > /tmp/networks-to-block.txt
$ wc -l /tmp/networks-to-block.txt
125 /tmp/networks-to-block.txt
$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt > /tmp/ips-to-purge.txt
$ wc -l /tmp/ips-to-purge.txt
202
-
+
- Attempting to purge those only shows about 3,500 hits, but I will do it anyways
- Adding 64.39.108.48 from Qualys I get a total of 22631 hits purged
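- For the record, jq can skip the interleaved facet counts by indexing the facet array in strides of two, which would avoid the grep/sed step entirely (an untested sketch):

```console
$ jq -r '.facet_counts.facet_fields.ip as $a | $a[range(0; $a|length; 2)]' /tmp/2021-04-ips.json > /tmp/ips.txt
```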
@@ -715,9 +715,9 @@ $ wc -l /tmp/ips-to-purge.txt
- Even more annoying, they are not re-using their session ID:
-$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
+$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
4888
-
+
- This IP has made 36,000 requests to CGSpace…
- The IP is owned by Internet Vikings in Sweden
- I purged their statistics and set up a temporary HTTP 403 telling them to use a real user agent
@@ -733,13 +733,13 @@ $ wc -l /tmp/ips-to-purge.txt
- I reported them to AbuseIPDB.com and purged their hits:
-$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
Purging 6364 hits from 45.9.20.71 in statistics
Purging 8039 hits from 45.146.166.157 in statistics
Purging 3383 hits from 45.155.204.82 in statistics
-
-Total number of bot hits purged: 17786
-
2021-10-31
+
+Total number of bot hits purged: 17786
+2021-10-31
- Update Docker containers for AReS on linode20 and run a fresh harvest
- Found some strange IP (94.71.3.44) making 51,000 requests today with the user agent “Microsoft Internet Explorer”
@@ -757,13 +757,13 @@ Total number of bot hits purged: 17786
- That’s from ASN 12552 (IPO-EU, SE), which is operated by Internet Vikings, though AbuseIPDB.com says it’s Availo Networks AB
- There’s another IP (3.225.28.105) that made a few thousand requests to the REST API from Amazon, though it’s using a normal user agent
-# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
+# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
3991
-~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE 'GET /rest/(collections|handle|items)' | sort | uniq -c
+~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE 'GET /rest/(collections|handle|items)' | sort | uniq -c
3154 GET /rest/collections
427 GET /rest/handle
410 GET /rest/items
-
+
- It requested the CIAT Story Maps collection over 3,000 times last month…
- I will purge those hits
diff --git a/docs/2021-11/index.html b/docs/2021-11/index.html
index 0213996ac..bff2e20ad 100644
--- a/docs/2021-11/index.html
+++ b/docs/2021-11/index.html
@@ -18,7 +18,7 @@ $ zstd statistics-2019.json
-
+
@@ -32,7 +32,7 @@ First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
"/>
-
+
@@ -42,9 +42,9 @@ $ zstd statistics-2019.json
"@type": "BlogPosting",
"headline": "November, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-11/",
- "wordCount": "468",
+ "wordCount": "722",
"datePublished": "2021-11-02T22:27:07+02:00",
- "dateModified": "2021-11-03T15:56:15+02:00",
+ "dateModified": "2021-11-07T11:26:32+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -123,16 +123,16 @@ $ zstd statistics-2019.json
- I experimented with manually sharding the Solr statistics on DSpace Test
- First I exported all the 2019 stats from CGSpace:
-$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
+$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
-
+
- Then on DSpace Test I created a statistics-2019 core with the same instance dir as the main statistics core (as illustrated in the DSpace docs)
-$ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
+$ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
# create core in Solr admin
-$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2019-*</query></delete>"
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2019-*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
-
+
- The key thing above is that you create the core in the Solr admin UI, but the data directory must already exist so you have to do that first in the file system
- I restarted the server after the import was done to see if the cores would come back up OK
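- For reference, the core can also be created from the command line with Solr's CoreAdmin API instead of the admin UI, as long as the data directory already exists (a sketch re-using the paths above):

```console
$ curl -s "http://localhost:8081/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=statistics&dataDir=/home/dspacetest.cgiar.org/solr/statistics-2019/data"
```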
@@ -165,13 +165,13 @@ $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics
-91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] "HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&isAllowed=y HTTP/1.1" 200 0 "https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf" "Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10"
-
+91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] "HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&isAllowed=y HTTP/1.1" 200 0 "https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf" "Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10"
+
- Another is in China, and they grabbed 1,200 PDFs from the REST API in under an hour:
-
# zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
+# zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
1178
-
+
- I will continue to split the Solr statistics back into year-shards on DSpace Test (linode26)
- Today I did all 2018 stats…
@@ -183,11 +183,56 @@ $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics
- Update all Docker containers on AReS and rebuild OpenRXV:
-
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
-
+
- Then restart the server and start a fresh harvest
-- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017 today)
+- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017, 2016, 2015, and 2014 today)
+- Several users wrote to me last week to say that workflow emails haven’t been working since 2021-10-21 or so
+
+- I did a test on CGSpace and it’s indeed broken:
+
+
+
+$ dspace test-email
+
+About to send test email:
+ - To: fuuuu
+ - Subject: DSpace test email
+ - Server: smtp.office365.com
+
+Error sending email:
+ - Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
+)
+
+Please see the DSpace documentation for assistance.
+
+- I sent a message to ILRI ICT to ask them to check the account/password
+- I want to do one last test of the Elasticsearch updates on OpenRXV so I got a snapshot of the latest Elasticsearch volume used on the production AReS instance:
+
+# tar czf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
+
+- Then on my local server:
+
+$ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
+$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components=4
+$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod 660 {} \;
+$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod 770 {} \;
+# copy backend/data to /tmp for the repository setup/layout
+$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
+
+- This seems to work: all items, stats, and repository setup/layout are OK
+- I merged my Elasticsearch pull request from last month into OpenRXV
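- A quick way to sanity-check the restored Elasticsearch data above is to list the indices and their document counts over the HTTP API, for example:

```console
$ curl -s 'http://localhost:9200/_cat/indices?v'
```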
+
+2021-11-08
+
+- File an issue for the Angular flash of unstyled content on DSpace 7
+- Help Udana from IWMI with a question about CGSpace statistics
+
+- He found conflicting numbers when using the community and collection modes in Content and Usage Analysis
+- I sent him more numbers directly from the DSpace Statistics API
+
+
diff --git a/docs/404.html b/docs/404.html
index b756feb93..1891b017a 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -17,7 +17,7 @@
-
+
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 3c509cb31..9564e17d3 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index c2db6fed6..cba5a16dd 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -95,9 +95,9 @@
- I experimented with manually sharding the Solr statistics on DSpace Test
- First I exported all the 2019 stats from CGSpace:
-$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
+$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
-
+
Read more →
@@ -119,15 +119,15 @@ $ zstd statistics-2019.json
- Export all affiliations on CGSpace and run them against the latest RoR data dump:
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
-$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
-
+
- So we have 1879/7100 (26.46%) matching already
Read more →
@@ -184,8 +184,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
- Update Docker images on AReS server (linode20) and reboot the server:
-# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
-
+# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
+
- I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
Read more →
@@ -209,9 +209,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
- Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
-
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
-
+
Read more →
diff --git a/docs/categories/notes/index.xml b/docs/categories/notes/index.xml
index eba7dec68..e9bcbbaaf 100644
--- a/docs/categories/notes/index.xml
+++ b/docs/categories/notes/index.xml
@@ -18,9 +18,9 @@
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">'time:2019-*'</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
-</code></pre>
+</code></pre></div>
-
@@ -33,15 +33,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
-$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
-</code></pre><ul>
+</code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
@@ -80,8 +80,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
-</code></pre><ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">'s/ \+/:/g'</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
+</code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
@@ -96,9 +96,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
-</code></pre>
+</code></pre></div>
-
@@ -203,17 +203,17 @@ COPY 20994
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'</span>
{
- "count" : 100875,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100875,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-</code></pre>
+</code></pre></div>
-
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index cc704f181..f3046ca99 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -101,17 +101,17 @@
- I had a call with CodeObia to discuss the work on OpenRXV
- Check the results of the AReS harvesting from last night:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100875,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100875,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
Read more →
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 0b1528376..37b25f8eb 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 7d43fabc4..8b74a7403 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index 020c2d635..d10f5a335 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index 2a516bea5..da9cc7edf 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/cgiar-library-migration/index.html b/docs/cgiar-library-migration/index.html
index 7df4e338c..313b81dde 100644
--- a/docs/cgiar-library-migration/index.html
+++ b/docs/cgiar-library-migration/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/docs/cgspace-cgcorev2-migration/index.html b/docs/cgspace-cgcorev2-migration/index.html
index 4f257314d..0dcacaa5a 100644
--- a/docs/cgspace-cgcorev2-migration/index.html
+++ b/docs/cgspace-cgcorev2-migration/index.html
@@ -18,7 +18,7 @@
-
+
diff --git a/docs/cgspace-dspace6-upgrade/index.html b/docs/cgspace-dspace6-upgrade/index.html
index 14931dd2f..86820e5d1 100644
--- a/docs/cgspace-dspace6-upgrade/index.html
+++ b/docs/cgspace-dspace6-upgrade/index.html
@@ -18,7 +18,7 @@
-
+
@@ -129,20 +129,20 @@
Re-import OAI with clean index
After the upgrade is complete, re-index all items into OAI with a clean index:
-$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
+$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
$ dspace oai -c import
-
The process ran out of memory several times so I had to keep trying again with more JVM heap memory.
+The process ran out of memory several times so I had to keep trying again with more JVM heap memory.
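In practice that just means raising -Xmx and re-running the import, for example:

```console
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx4096m"
$ dspace oai -c import
```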
Processing Solr Statistics With solr-upgrade-statistics-6x
After the main upgrade process was finished and DSpace was running I started processing the Solr statistics with solr-upgrade-statistics-6x to migrate all IDs to UUIDs.
statistics
First process the current year’s statistics core:
-$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
+$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 3,817,407 Bistream View
+
+ 3,817,407 Bistream View
1,693,443 Item View
105,974 Collection View
62,383 Community View
@@ -152,22 +152,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
--------------------------------------
6,475,268 TOTAL
=================================================================
-
After several rounds of processing it finished. Here are some statistics about unmigrated documents:
+After several rounds of processing it finished. Here are some statistics about unmigrated documents:
- 227,000: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 471,000: id:/.+-unmigrated/
- 698,000: *:* NOT id:/.{36}/
- Majority are type: 5 (aka SITE, according to Constants.java) so we can purge them:
-$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2019
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
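The unmigrated document counts above can be reproduced with plain Solr queries using rows=0 and reading numFound from the response (a sketch):

```console
$ curl -s -G 'http://localhost:8081/solr/statistics/select' --data-urlencode 'q=(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)' --data-urlencode 'rows=0'
```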
statistics-2019
Processing the statistics-2019 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 5,569,344 Bistream View
+
+ 5,569,344 Bistream View
2,179,105 Item View
117,194 Community View
104,091 Collection View
@@ -177,22 +177,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
--------------------------------------
10,794,839 TOTAL
=================================================================
-
After several rounds of processing it finished. Here are some statistics about unmigrated documents:
+After several rounds of processing it finished. Here are some statistics about unmigrated documents:
- 2,690,309: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 1,494,587: id:/.+-unmigrated/
- 4,184,896: *:* NOT id:/.{36}/
- 4,172,929 are type: 5 (aka SITE) so we can purge them:
-$ curl -s "http://localhost:8081/solr/statistics-2019/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2018
+$ curl -s "http://localhost:8081/solr/statistics-2019/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
statistics-2018
Processing the statistics-2018 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 3,561,532 Bistream View
+
+ 3,561,532 Bistream View
1,129,326 Item View
97,401 Community View
63,508 Collection View
@@ -202,25 +202,25 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
--------------------------------------
5,561,166 TOTAL
=================================================================
-
After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:
-$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
-
Eventually the processing finished. Here are some statistics about unmigrated documents:
+After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:
+$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
+
Eventually the processing finished. Here are some statistics about unmigrated documents:
- 365,473: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 546,955: id:/.+-unmigrated/
- 923,158: *:* NOT id:/.{36}/
- 823,293 are type: 5 so we can purge them:
-$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2017
+$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
statistics-2017
Processing the statistics-2017 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 2,529,208 Bistream View
+
+ 2,529,208 Bistream View
1,618,717 Item View
144,945 Community View
74,249 Collection View
@@ -230,22 +230,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
5,813,639 TOTAL
=================================================================
-
Eventually the processing finished. Here are some statistics about unmigrated documents:
+Eventually the processing finished. Here are some statistics about unmigrated documents:
- 808,309: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 893,868: id:/.+-unmigrated/
- 1,702,177: *:* NOT id:/.{36}/
- 1,660,524 are type: 5 (SITE) so we can purge them:
-$ curl -s "http://localhost:8081/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2016
+$ curl -s "http://localhost:8081/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
statistics-2016
Processing the statistics-2016 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 1,765,924 Bistream View
+
+ 1,765,924 Bistream View
1,151,575 Item View
187,110 Community View
51,204 Collection View
@@ -255,21 +255,21 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
4,190,098 TOTAL
=================================================================
-
+
- 849,408: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 627,747: id:/.+-unmigrated/
- 1,477,155: *:* NOT id:/.{36}/
- 1,469,706 are type: 5 (SITE) so we can purge them:
-$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2015
+$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
statistics-2015
Processing the statistics-2015 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 990,916 Bistream View
+
+ 990,916 Bistream View
506,070 Item View
116,153 Community View
33,282 Collection View
@@ -279,22 +279,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
1,730,378 TOTAL
=================================================================
-
Summary of stats after processing:
+Summary of stats after processing:
- 195,293: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 67,146: id:/.+-unmigrated/
- 262,439: *:* NOT id:/.{36}/
- 247,400 are type: 5 (SITE) so we can purge them:
-$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2014
+$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
statistics-2014
Processing the statistics-2014 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 2,381,603 Item View
+
+ 2,381,603 Item View
1,323,357 Bistream View
501,545 Community View
247,805 Collection View
@@ -305,22 +305,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
4,465,716 TOTAL
=================================================================
-
Summary of unmigrated documents after processing:
+Summary of unmigrated documents after processing:
- 182,131: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 39,947: id:/.+-unmigrated/
- 222,078: *:* NOT id:/.{36}/
- 188,791 are type: 5 (SITE) so we can purge them:
-$ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2013
+$ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
statistics-2013
Processing the statistics-2013 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 2,352,124 Item View
+
+ 2,352,124 Item View
1,117,676 Bistream View
575,711 Community View
171,639 Collection View
@@ -331,81 +331,81 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
4,218,862 TOTAL
=================================================================
-
Summary of unmigrated docs after processing:
+Summary of unmigrated docs after processing:
- 2,548: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 29,772: id:/.+-unmigrated/
- 32,320: *:* NOT id:/.{36}/
- 15,691 are type: 5 (SITE) so we can purge them:
-$ curl -s "http://localhost:8081/solr/statistics-2013/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2012
+$ curl -s "http://localhost:8081/solr/statistics-2013/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
statistics-2012
Processing the statistics-2012 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 2,229,332 Item View
+
+ 2,229,332 Item View
913,577 Bistream View
215,577 Collection View
104,734 Community View
--------------------------------------
3,463,220 TOTAL
=================================================================
-
Summary of unmigrated docs after processing:
+Summary of unmigrated docs after processing:
- 0: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 33,161: id:/.+-unmigrated/
- 33,161: *:* NOT id:/.{36}/
- 33,161 are type: 3 (COLLECTION), which is different than I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
-$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2011
+$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
statistics-2011
Processing the statistics-2011 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 904,896 Item View
+
+ 904,896 Item View
385,789 Bistream View
154,356 Collection View
62,978 Community View
--------------------------------------
1,508,019 TOTAL
=================================================================
-
Summary of unmigrated docs after processing:
+Summary of unmigrated docs after processing:
- 0: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 17,551: id:/.+-unmigrated/
- 17,551: *:* NOT id:/.{36}/
- 12,116 are type: 3 (COLLECTION), which is different than I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
-$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
statistics-2010
+$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
statistics-2010
Processing the statistics-2010 core:
-$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2010
+$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2010
...
=================================================================
*** Statistics Records with Legacy Id ***
-
- 26,067 Item View
+
+ 26,067 Item View
15,615 Bistream View
4,116 Collection View
1,094 Community View
--------------------------------------
46,892 TOTAL
=================================================================
-
Summary of unmigrated docs after processing:
+Summary of unmigrated docs after processing:
- 0: (*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)
- 1,012: id:/.+-unmigrated/
- 1,012: *:* NOT id:/.{36}/
- 654 are type: 3 (COLLECTION), which is different than I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
-$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
-
Processing Solr statistics with AtomicStatisticsUpdateCLI
+$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
+
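Since the same purge has to be repeated for every yearly core, a small shell loop saves some typing (a sketch; adjust the years to the cores that actually exist):

```console
$ for year in $(seq 2010 2019); do curl -s "http://localhost:8081/solr/statistics-${year}/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"; done
```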
Processing Solr statistics with AtomicStatisticsUpdateCLI
On 2020-11-18 I finished processing the Solr statistics with solr-upgrade-statistics-6x and I started processing them with AtomicStatisticsUpdateCLI.
statistics
First the current year’s statistics core, in 12-hour batches:
diff --git a/docs/index.html b/docs/index.html
index 620e5f1a9..a566182f4 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -110,9 +110,9 @@
I experimented with manually sharding the Solr statistics on DSpace Test
First I exported all the 2019 stats from CGSpace:
-$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
+$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
-
+
Read more →
@@ -134,15 +134,15 @@ $ zstd statistics-2019.json
- Export all affiliations on CGSpace and run them against the latest RoR data dump:
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
-$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
-
+
- So we have 1879/7100 (26.46%) matching already
Read more →
@@ -199,8 +199,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
- Update Docker images on AReS server (linode20) and reboot the server:
-# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
-
+# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
+
- I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
Read more →
@@ -224,9 +224,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
- Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
-
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
-
+
Read more →
diff --git a/docs/index.xml b/docs/index.xml
index fb8ab83ff..bd7c07749 100644
--- a/docs/index.xml
+++ b/docs/index.xml
@@ -18,9 +18,9 @@
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">'time:2019-*'</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
-</code></pre>
+</code></pre></div>
-
@@ -33,15 +33,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
-$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
-</code></pre><ul>
+</code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
@@ -80,8 +80,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
-</code></pre><ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">'s/ \+/:/g'</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
+</code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
@@ -96,9 +96,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
-</code></pre>
+</code></pre></div>
-
@@ -203,17 +203,17 @@ COPY 20994
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'</span>
{
- "count" : 100875,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100875,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-</code></pre>
+</code></pre></div>
-
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 30fc08c50..b3fb927fa 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -116,17 +116,17 @@
I had a call with CodeObia to discuss the work on OpenRXV
Check the results of the AReS harvesting from last night:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100875,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100875,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
Read more →
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 6d1aab47d..75290458f 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index d4eb6b8cf..c89f93fb4 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 7041dd162..111cef00b 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 52fcf4caa..cf89d17d5 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 5937bff4f..1dfa0ae6a 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index 7a20079a0..8c74463d8 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index dc16cd239..49b496948 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -110,9 +110,9 @@
I experimented with manually sharding the Solr statistics on DSpace Test
First I exported all the 2019 stats from CGSpace:
-$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
+$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
-
+
Read more →
@@ -134,15 +134,15 @@ $ zstd statistics-2019.json
- Export all affiliations on CGSpace and run them against the latest RoR data dump:
-localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
-$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
+localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
-
+
- So we have 1879/7100 (26.46%) matching already
Read more →
@@ -199,8 +199,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
- Update Docker images on AReS server (linode20) and reboot the server:
-# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
-
+# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
+
- I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
Read more →
@@ -224,9 +224,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
- Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
-
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
-
+
Read more →
diff --git a/docs/posts/index.xml b/docs/posts/index.xml
index dcbd0cf6b..157659247 100644
--- a/docs/posts/index.xml
+++ b/docs/posts/index.xml
@@ -18,9 +18,9 @@
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">'time:2019-*'</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
-</code></pre>
+</code></pre></div>
-
@@ -33,15 +33,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
-$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
+$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
-</code></pre><ul>
+</code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
@@ -80,8 +80,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
-</code></pre><ul>
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">'s/ \+/:/g'</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
+</code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
@@ -96,9 +96,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
-</code></pre>
+</code></pre></div>
-
@@ -203,17 +203,17 @@ COPY 20994
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
-<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'</span>
{
- "count" : 100875,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100875,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-</code></pre>
+</code></pre></div>
-
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 583d72272..1a27ce6fa 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -10,14 +10,14 @@
-
+
-
+
@@ -116,17 +116,17 @@
I had a call with CodeObia to discuss the work on OpenRXV
Check the results of the AReS harvesting from last night:
-$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
- "count" : 100875,
- "_shards" : {
- "total" : 1,
- "successful" : 1,
- "skipped" : 0,
- "failed" : 0
+ "count" : 100875,
+ "_shards" : {
+ "total" : 1,
+ "successful" : 1,
+ "skipped" : 0,
+ "failed" : 0
}
}
-
+
Read more →
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index bc6677106..fd6a0fae6 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 628881e6b..656485917 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 0e94f3bb3..3041d0d37 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index a65b5ca7a..7059037d3 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 59b81f955..9bccce178 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html
index 21b9089d4..fda113a7c 100644
--- a/docs/posts/page/8/index.html
+++ b/docs/posts/page/8/index.html
@@ -10,14 +10,14 @@
-
+
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index b23dce96d..d5c5c7b88 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
https://alanorth.github.io/cgspace-notes/categories/
- 2021-11-03T15:56:15+02:00
+ 2021-11-07T11:26:32+02:00
https://alanorth.github.io/cgspace-notes/
- 2021-11-03T15:56:15+02:00
+ 2021-11-07T11:26:32+02:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2021-11-03T15:56:15+02:00
+ 2021-11-07T11:26:32+02:00
https://alanorth.github.io/cgspace-notes/2021-11/
- 2021-11-03T15:56:15+02:00
+ 2021-11-07T11:26:32+02:00
https://alanorth.github.io/cgspace-notes/posts/
- 2021-11-03T15:56:15+02:00
+ 2021-11-07T11:26:32+02:00
https://alanorth.github.io/cgspace-notes/2021-10/
2021-11-01T10:48:13+02:00
diff --git a/docs/tags/index.html b/docs/tags/index.html
index fc3d36a85..64445eea0 100644
--- a/docs/tags/index.html
+++ b/docs/tags/index.html
@@ -17,7 +17,7 @@
-
+
diff --git a/docs/tags/migration/index.html b/docs/tags/migration/index.html
index c42d31b09..f5cb37b29 100644
--- a/docs/tags/migration/index.html
+++ b/docs/tags/migration/index.html
@@ -17,7 +17,7 @@
-
+
diff --git a/docs/tags/notes/index.html b/docs/tags/notes/index.html
index 0eb4c1095..e56555b3c 100644
--- a/docs/tags/notes/index.html
+++ b/docs/tags/notes/index.html
@@ -17,7 +17,7 @@
-
+
diff --git a/docs/tags/notes/page/2/index.html b/docs/tags/notes/page/2/index.html
index cb6f23e24..c5dd2835c 100644
--- a/docs/tags/notes/page/2/index.html
+++ b/docs/tags/notes/page/2/index.html
@@ -17,7 +17,7 @@
-
+
diff --git a/docs/tags/notes/page/3/index.html b/docs/tags/notes/page/3/index.html
index 24c405c55..1d52b4768 100644
--- a/docs/tags/notes/page/3/index.html
+++ b/docs/tags/notes/page/3/index.html
@@ -17,7 +17,7 @@
-
+