Add notes for 2021-11-08

Alan Orth 2021-11-09 06:29:52 +02:00
parent b3df4ff58f
commit 9afe5c13f9
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
110 changed files with 1827 additions and 1737 deletions

View File

@ -71,6 +71,51 @@ $ docker-compose build
```
- Then restart the server and start a fresh harvest
- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017 today)
- Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017, 2016, 2015, and 2014 today)
- Several users wrote to me last week to say that workflow emails haven't been working since 2021-10-21 or so
- I did a test on CGSpace and it's indeed broken:
```console
$ dspace test-email
About to send test email:
- To: fuuuu
- Subject: DSpace test email
- Server: smtp.office365.com
Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
)
Please see the DSpace documentation for assistance.
```
- I sent a message to ILRI ICT to ask them to check the account/password
- I want to do one last test of the Elasticsearch updates on OpenRXV so I got a snapshot of the latest Elasticsearch volume used on the production AReS instance:
```console
# tar czf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
```
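- Note that `czf` writes a gzip stream even though the file name ends in `.tar.xz`; extraction still works because tar detects the compression when reading. A minimal sketch that checks this via the magic bytes, assuming GNU tar and a throwaway directory (the `demo_vol` name is hypothetical, not the real volume):
```shell
# Throwaway working directory; demo_vol is a made-up stand-in for the real volume
workdir=$(mktemp -d)
mkdir -p "$workdir/demo_vol/_data"
echo "sample" > "$workdir/demo_vol/_data/file.txt"

# Same flags as above: -z forces gzip, whatever the output file is called
tar czf "$workdir/demo.tar.xz" -C "$workdir" demo_vol

# The first two bytes identify the real format: 1f8b is gzip (xz would start with fd37)
magic=$(head -c 2 "$workdir/demo.tar.xz" | od -An -tx1 | tr -d ' \n')
echo "magic bytes: $magic"

# tar autodetects the compression on read, so listing the archive still works
listing=$(tar tf "$workdir/demo.tar.xz")
echo "$listing"
```
- Using `tar cJf` (or `tar caf`, which picks the compressor from the file suffix) would make the name and the contents agree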
- Then on my local server:
```console
$ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components=4
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod 660 {} \;
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod 770 {} \;
# copy backend/data to /tmp for the repository setup/layout
$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
```
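- The `--strip-components=4` value matches the four leading path components (`var/lib/docker/volumes`) that tar stores in the archive after dropping the leading `/`. A minimal sketch of that behaviour, assuming GNU tar; the `demo_vol` volume and the temporary directories are hypothetical:
```shell
# Build a fake tree that mimics the archived path; all names here are illustrative
src=$(mktemp -d)
dest=$(mktemp -d)
mkdir -p "$src/var/lib/docker/volumes/demo_vol/_data"
echo "sample" > "$src/var/lib/docker/volumes/demo_vol/_data/file.txt"

# Archive members are stored as var/lib/docker/volumes/demo_vol/...
tar czf "$src/demo.tar.gz" -C "$src" var

# Dropping var, lib, docker, and volumes leaves demo_vol/ at the top of the destination
tar xf "$src/demo.tar.gz" -C "$dest" --strip-components=4

# List what actually landed in the destination
extracted=$(cd "$dest" && find . -type f | sort)
echo "$extracted"
```
- If the archive had been created with a different base path, the number of components to strip would need to change accordingly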
- This seems to work: all items, stats, and repository setup/layout are OK
- I merged my [Elasticsearch pull request](https://github.com/ilri/OpenRXV/pull/126) from last month into OpenRXV
## 2021-11-08
- File [an issue for the Angular flash of unstyled content](https://github.com/DSpace/dspace-angular/issues/1391) on DSpace 7
- Help Udana from IWMI with a question about CGSpace statistics
- He found conflicting numbers when using the community and collection modes in Content and Usage Analysis
- I sent him more numbers directly from the DSpace Statistics API
<!-- vim: set sw=2 ts=2: -->

View File

@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
-rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
-rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
Update GitHub wiki for documentation of maintenance tasks.
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
Not only are there 49,000 countries, we have some blanks (25)&hellip;
Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&rdquo;
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace
For some reason we still have the index-lucene-update cron job active on CGSpace, but I&rsquo;m pretty sure we don&rsquo;t need it as of the latest few versions of Atmire&rsquo;s Listings and Reports module
Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -32,7 +32,7 @@ After running DSpace for over five years I&rsquo;ve never needed to look in any
This will save us a few gigs of backup space we&rsquo;re paying for on S3
Also, I noticed the checker log has some errors we should pay attention to:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
In this case the select query was showing 95 results before the update
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -42,7 +42,7 @@ $ git checkout -b 55new 5_x-prod
$ git reset --hard ilri/5_x-prod
$ git rebase -i dspace-5.5
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -34,7 +34,7 @@ It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -42,7 +42,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -26,7 +26,7 @@ Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module
Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module (#286)
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -46,7 +46,7 @@ I see thousands of them in the logs for the last few months, so it&rsquo;s not r
I&rsquo;ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
I tested on DSpace Test as well and it doesn&rsquo;t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I&rsquo;m not sure if we&rsquo;ve ever had the sharding task run successfully over all these years
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -50,7 +50,7 @@ DELETE 1
Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301)
Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing reg
$ identify ~/Desktop/alc_contrastes_desafios.jpg
/Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600&#43;0&#43;0 8-bit CMYK 168KB 0.000u 0:00.000
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -40,7 +40,7 @@ Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="May, 2017"/>
<meta name="twitter:description" content="2017-05-01 ICARDA apparently started working on CG Core on their MEL repository They have done a few cg.* fields, but not very consistent and even copy some of CGSpace items: https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full https://cgspace.cgiar.org/handle/10568/73683 2017-05-02 Atmire got back about the Workflow Statistics issue, and apparently it&rsquo;s a bug in the CUA module so they will send us a pull request 2017-05-04 Sync DSpace Test with database and assetstore from CGSpace Re-deploy DSpace Test with Atmire&rsquo;s CUA patch for workflow statistics, run system updates, and restart the server Now I can see the workflow statistics and am able to select users, but everything returns 0 items Megan says there are still some mapped items that have not been appearing since last week, so I forced a full index-discovery -b Need to remember to check if the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSpace Test) tomorrow: https://cgspace."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="June, 2017"/>
<meta name="twitter:description" content="2017-06-01 After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes Then we&rsquo;ll create a new sub-community for Phase II and create collections for the research themes there The current &ldquo;Research Themes&rdquo; community will be renamed to &ldquo;WLE Phase I Research Themes&rdquo; Tagged all items in the current Phase I collections with their appropriate themes Create pull request to add Phase II research themes to the submission form: #328 Add cg."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA&rsquo;s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the output into quasi XML:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
Ask Sisay to clean up the WLE approvers a bit, as Marianne&rsquo;s user account is both in the approvers step as well as the group
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -30,7 +30,7 @@ The logs say &ldquo;Timeout waiting for idle object&rdquo;
PostgreSQL activity says there are 115 connections currently
The list of connections to XMLUI and REST API for today:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let&rsquo;s Encrypt if it&rsquo;s just a handful of domains
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -30,7 +30,7 @@ We don&rsquo;t need to distinguish between internal and external works, so that
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
Export a CSV of the IITA community metadata for Martin Mueller
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
I tried to test something on DSpace Test but noticed that it&rsquo;s down since god knows when
Catalina logs at least show some memory errors yesterday:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144 back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -58,7 +58,7 @@ real 74m42.646s
user 8m5.056s
sys 2m7.289s
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -36,7 +36,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
There is insufficient memory for the Java Runtime Environment to continue.
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -46,7 +46,7 @@ Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did
The server only has 8GB of RAM so we&rsquo;ll eventually need to upgrade to a larger one because we&rsquo;ll start starving the OS, PostgreSQL, and command line batch processes
I ran all system updates on DSpace Test and rebooted it
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -30,7 +30,7 @@ I&rsquo;ll update the DSpace role in our Ansible infrastructure playbooks and ru
Also, I&rsquo;ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system&rsquo;s RAM, and we never re-ran them after migrating to larger Linodes last month
I&rsquo;m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I&rsquo;m getting those autowire errors in Tomcat 8.5.30 again:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nai
Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
I created a GitHub issue to track this #389, because I&rsquo;m super busy in Nairobi right now
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
Today these are the top 10 IPs:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -458,7 +458,7 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
<li>fix column with invalid spaces in metadata field name (cg. subject. wle)</li>
<li>remove columns with no metadata (place, target audience, isbn, uri, publisher, ispartofseries, subject)</li>
<li>remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using Open Refine: <code>value.replace('<27>','')</code></li>
<li>I notice a few items using DOIs pointing at ICARDA&rsquo;s DSpace like: <a href="https://doi.org/20.500.11766/8178,">https://doi.org/20.500.11766/8178,</a> which then points at the &ldquo;real&rdquo; DOI on the publisher&rsquo;s site&hellip; these should be using the real DOI instead of ICARDA&rsquo;s &ldquo;fake&rdquo; Handle DOI</li>
<li>I notice a few items using DOIs pointing at ICARDA&rsquo;s DSpace like: <a href="https://doi.org/20.500.11766/8178">https://doi.org/20.500.11766/8178</a>, which then points at the &ldquo;real&rdquo; DOI on the publisher&rsquo;s site&hellip; these should be using the real DOI instead of ICARDA&rsquo;s &ldquo;fake&rdquo; Handle DOI</li>
<li>Some items missing DOIs, but they clearly have them if you look at the publisher&rsquo;s site</li>
</ul>
</li>

View File

@ -36,7 +36,7 @@ Then I ran all system updates and restarted the server
I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -50,7 +50,7 @@ I don&rsquo;t see anything interesting in the web server logs around that time t
357 207.46.13.1
903 54.70.40.11
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -72,7 +72,7 @@ real 0m19.873s
user 0m22.203s
sys 0m1.979s
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -46,7 +46,7 @@ Most worryingly, there are encoding errors in the abstracts for eleven items, fo
I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -64,7 +64,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -48,7 +48,7 @@ DELETE 1
But after this I tried to delete the item from the XMLUI and it is still present&hellip;
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -34,7 +34,7 @@ Run system updates on CGSpace (linode18) and reboot it
Skype with Marie-Angélique and Abenet about CG Core v2
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -38,7 +38,7 @@ CGSpace
Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -46,7 +46,7 @@ After rebooting, all statistics cores were loaded&hellip; wow, that&rsquo;s luck
Run system updates on DSpace Test (linode19) and reboot it
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -72,7 +72,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
7249 2a01:7e00::f03c:91ff:fe18:7396
9124 45.5.186.2
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2019"/>
<meta name="twitter:description" content="2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U&#43;00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unneccesary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -58,7 +58,7 @@ Let&rsquo;s see how many of the REST API requests were for bitstreams (because t
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
106781
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -46,7 +46,7 @@ Make sure all packages are up to date and the package manager is up to date, the
# dpkg -C
# reboot
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -56,7 +56,7 @@ I tweeted the CGSpace repository link
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -38,7 +38,7 @@ The code finally builds and runs with a fresh install
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -42,7 +42,7 @@ You need to download this into the DSpace 6.x source and compile it
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -48,7 +48,7 @@ The third item now has a donut with score 1 since I tweeted it last week
On the same note, the one item Abenet pointed out last week now has a donut with score of 104 after I tweeted it last week
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -34,7 +34,7 @@ I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -36,7 +36,7 @@ I sent Atmire the dspace.log from today and told them to log into the server to
In other news, I checked the statistics API on DSpace 6 and it&rsquo;s working
I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -38,7 +38,7 @@ I restarted Tomcat and PostgreSQL and the issue was gone
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter&rsquo;s request
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -36,7 +36,7 @@ It is class based so I can easily add support for other vocabularies, and the te
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -48,7 +48,7 @@ I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -967,8 +967,8 @@ java.lang.OutOfMemoryError: Java heap space
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8083/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8083/solr/statistics-2018/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;id:/.+-unmigrated/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><ul>
<li>I restarted the process and it crashed again a few minutes later
<ul>
<li>I increased the memory to 4096m and tried again</li>
@ -1092,12 +1092,12 @@ $ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace
</code></pre><ul>
<li>Then I started a full Discovery re-indexing:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 92m14.294s
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 92m14.294s
user 7m59.840s
sys 2m22.327s
</code></pre><ul>
</code></pre></div><ul>
<li>I realized I had been using an incorrect Solr query to purge unmigrated items after processing with <code>solr-upgrade-statistics-6x</code>&hellip;
<ul>
<li>Instead of this: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
@ -1148,10 +1148,10 @@ $ cat /tmp/elasticsearch-mappings* | grep -v '{&quot;index&quot;:{}}' | sort | u
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat /tmp/elasticsearch-mappings* &gt; /tmp/new-elasticsearch-mappings.txt
$ curl -XDELETE http://localhost:9200/openrxv-values
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H &quot;Content-Type: application/json&quot; --data-binary @/tmp/new-elasticsearch-mappings.txt
</code></pre><ul>
$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> --data-binary @/tmp/new-elasticsearch-mappings.txt
</code></pre></div><ul>
<li>The latest indexing (second for today!) finally finished on AReS and the countries and affiliations/crps/journals all look MUCH better
<ul>
<li>There are still a few acronyms present, some of which are in the value mappings and some which aren&rsquo;t</li>

View File

@ -32,7 +32,7 @@ So far we&rsquo;ve spent at least fifty hours to process the statistics and stat
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

View File

@ -36,7 +36,7 @@ I started processing those (about 411,000 records):
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -132,8 +132,8 @@ I started processing those (about 411,000 records):
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t <span style="color:#ae81ff">12</span> -c statistics-2015
</code></pre></div><ul>
<li>AReS went down when the <code>renew-letsencrypt</code> service stopped the <code>angular_nginx</code> container in the pre-update hook and failed to bring it back up
<ul>
<li>I ran all system updates on the host and rebooted it and AReS came back up OK</li>
@ -179,18 +179,18 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
<ul>
<li>First the 2010 core:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2010/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><ul>
<li>Judging by the DSpace logs all these cores had a problem starting up in the last month:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -rsI &quot;Unable to create core&quot; [dspace]/log/dspace.log.2020-* | grep -o -E &quot;statistics-[0-9]+&quot; | sort | uniq -c
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -rsI <span style="color:#e6db74">&#34;Unable to create core&#34;</span> <span style="color:#f92672">[</span>dspace<span style="color:#f92672">]</span>/log/dspace.log.2020-* | grep -o -E <span style="color:#e6db74">&#34;statistics-[0-9]+&#34;</span> | sort | uniq -c
24 statistics-2010
24 statistics-2015
18 statistics-2016
6 statistics-2018
</code></pre><ul>
</code></pre></div><ul>
<li>The message is always this:</li>
</ul>
<pre tabindex="0"><code>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
@ -223,9 +223,9 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
<ul>
<li>There are apparently 1,700 locks right now:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
1739
</code></pre><h2 id="2020-12-08">2020-12-08</h2>
</code></pre></div><h2 id="2020-12-08">2020-12-08</h2>
<ul>
<li>Atmire sent some instructions for using the DeduplicateValuesProcessor
<ul>
@ -341,17 +341,17 @@ Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected s
<ul>
<li>I can see it in the <code>openrxv-items-final</code> index:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&#39;</span> | json_pp
{
&quot;_shards&quot; : {
&quot;failed&quot; : 0,
&quot;skipped&quot; : 0,
&quot;successful&quot; : 1,
&quot;total&quot; : 1
&#34;_shards&#34; : {
&#34;failed&#34; : 0,
&#34;skipped&#34; : 0,
&#34;successful&#34; : 1,
&#34;total&#34; : 1
},
&quot;count&quot; : 299922
&#34;count&#34; : 299922
}
</code></pre><ul>
</code></pre></div><ul>
<li>I filed a bug on OpenRXV: <a href="https://github.com/ilri/OpenRXV/issues/64">https://github.com/ilri/OpenRXV/issues/64</a></li>
<li>For now I will try to delete the index and start a re-harvest in the Admin UI:</li>
</ul>
@ -371,8 +371,8 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
</code></pre><h2 id="2020-12-14">2020-12-14</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ &#39;^.*on 2020-[0-9]{2}-*&#39;;
</code></pre></div><h2 id="2020-12-14">2020-12-14</h2>
<ul>
<li>The re-harvesting finished last night on AReS but there are no records in the <code>openrxv-items-final</code> index
<ul>
@ -380,44 +380,44 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&#39;</span> | json_pp
{
&quot;count&quot; : 99992,
&quot;_shards&quot; : {
&quot;skipped&quot; : 0,
&quot;total&quot; : 1,
&quot;failed&quot; : 0,
&quot;successful&quot; : 1
&#34;count&#34; : 99992,
&#34;_shards&#34; : {
&#34;skipped&#34; : 0,
&#34;total&#34; : 1,
&#34;failed&#34; : 0,
&#34;successful&#34; : 1
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>I&rsquo;m going to try to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-clone-index.html">clone</a> the temp index to the final one&hellip;
<ul>
<li>First, set the <code>openrxv-items-temp</code> index to block writes (read only) and then clone it to <code>openrxv-items-final</code>:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{&quot;acknowledged&quot;:true,&quot;shards_acknowledged&quot;:true,&quot;index&quot;:&quot;openrxv-items-final&quot;}
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><ul>
{&#34;acknowledged&#34;:true,&#34;shards_acknowledged&#34;:true,&#34;index&#34;:&#34;openrxv-items-final&#34;}
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><ul>
<li>Now I see that the <code>openrxv-items-final</code> index has items, but there are still none in the AReS Explorer UI!</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 99992,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 99992,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>The api logs show this from last night after the harvesting:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">[Nest] 92 - 12/13/2020, 1:58:52 PM [HarvesterService] Starting Harvest
[Nest] 92 - 12/13/2020, 10:50:20 PM [FetchConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [PluginsConsumer] OnGlobalQueueDrained
[Nest] 92 - 12/13/2020, 11:00:20 PM [HarvesterService] reindex function is called
@ -426,16 +426,16 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H
at IncomingMessage.emit (events.js:326:22)
at endReadableNT (_stream_readable.js:1223:12)
at processTicksAndRejections (internal/process/task_queues.js:84:21)
</code></pre><ul>
</code></pre></div><ul>
<li>But I&rsquo;m not sure why the frontend doesn&rsquo;t show any data despite there being documents in the index&hellip;</li>
<li>I talked to Moayad and he reminded me that OpenRXV uses an alias to point to temp and final indexes, but the UI actually uses the <code>openrxv-items</code> index</li>
<li>I cloned the <code>openrxv-items-final</code> index to the <code>openrxv-items</code> index and now I see items in the explorer UI</li>
<li>The PDF report was broken and I looked in the API logs and saw this:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">(node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
</code></pre><ul>
</code></pre></div><ul>
<li>I installed <code>unoconv</code> in the backend api container and now it works&hellip; but I wonder why this changed&hellip;</li>
<li>Skype with Abenet and Peter to discuss AReS that will be shown to ILRI scientists this week
<ul>
@ -487,10 +487,10 @@ $ query-json '.items | length' /tmp/policy2.json
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><h2 id="2020-12-15">2020-12-15</h2>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><h2 id="2020-12-15">2020-12-15</h2>
<ul>
<li>After the re-harvest last night there were 200,000 items in the <code>openrxv-items-temp</code> index again
<ul>
@ -499,36 +499,36 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H
</li>
<li>I checked the 1,534 fixes in Open Refine (had to fix a few UTF-8 errors, as always from Peter&rsquo;s CSVs) and then applied them using the <code>fix-metadata-values.py</code> script:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">3</span>
$ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -m <span style="color:#ae81ff">3</span>
</code></pre></div><ul>
<li>Since I was re-indexing Discovery anyway, I decided to check for any uppercase AGROVOC terms and lowercase them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# BEGIN;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# BEGIN;
BEGIN
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 406
dspace=# COMMIT;
COMMIT
</code></pre><ul>
</code></pre></div><ul>
<li>I also updated the Font Awesome icon classes for version 5 syntax:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;fa fa-rss&#39;,&#39;fas fa-rss&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;%fa fa-rss%&#39;;
UPDATE 74
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;fa fa-at&#39;,&#39;fas fa-at&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;%fa fa-at%&#39;;
UPDATE 74
dspace=# COMMIT;
</code></pre><ul>
</code></pre></div><ul>
<li>Then I started a full Discovery re-index:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx512m&quot;
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 265m11.224s
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Dfile.encoding=UTF-8 -Xmx512m&#34;</span>
$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 265m11.224s
user 171m29.141s
sys 2m41.097s
</code></pre><ul>
</code></pre></div><ul>
<li>Udana sent a report that the WLE approver is experiencing the same issue Peter highlighted a few weeks ago: they are unable to save metadata edits in the workflow</li>
<li>Yesterday Atmire responded about the owningComm and owningColl duplicates in Solr saying they didn&rsquo;t see any anymore&hellip;
<ul>
@ -544,31 +544,31 @@ sys 2m41.097s
<ul>
<li>After the Discovery re-indexing finished on CGSpace I prepared to start re-harvesting AReS by making sure the <code>openrxv-items-temp</code> index was empty and that the backup index I made yesterday was still there:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
{
&quot;acknowledged&quot; : true
&#34;acknowledged&#34; : true
}
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 0,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 0,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&amp;pretty'
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 99992,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 99992,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><h2 id="2020-12-16">2020-12-16</h2>
</code></pre></div><h2 id="2020-12-16">2020-12-16</h2>
<ul>
<li>The harvesting on AReS finished last night so this morning I manually cloned the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>
<ul>
@ -576,32 +576,32 @@ $ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&amp;pretty'
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100046,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100046,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
$ curl -s -X POST &quot;http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty&quot;
$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items?pretty&#39;</span>
$ curl -s -X POST <span style="color:#e6db74">&#34;http://localhost:9200/openrxv-items-temp/_clone/openrxv-items?pretty&#34;</span>
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100046,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100046,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
</code></pre><ul>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
</code></pre></div><ul>
<li>Interestingly, <a href="https://hdl.handle.net/10568/110447">the item</a> that we noticed was duplicated now appears only once</li>
<li>The <a href="https://hdl.handle.net/10568/110133">missing item</a> is still missing</li>
<li>Jane Poole noticed that the &ldquo;previous page&rdquo; and &ldquo;next page&rdquo; buttons are not working on AReS
@ -611,16 +611,16 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
</li>
<li>Generate a list of submitters and approvers active in the last few months using the Provenance field on CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -h localhost -U postgres dspace -c &quot;SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'&quot; &gt; /tmp/provenance.txt
$ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E &quot;( on |checksum)&quot; | sed -e 's/by //' -e 's/ (/,/' -e 's/)//' | sort | uniq &gt; /tmp/recent-submitters-approvers.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -h localhost -U postgres dspace -c <span style="color:#e6db74">&#34;SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ &#39;^.*on 2020-(06|07|08|09|10|11|12)-*&#39;&#34;</span> &gt; /tmp/provenance.txt
$ grep -o -E <span style="color:#e6db74">&#39;by .*)&#39;</span> /tmp/provenance.txt | grep -v -E <span style="color:#e6db74">&#34;( on |checksum)&#34;</span> | sed -e <span style="color:#e6db74">&#39;s/by //&#39;</span> -e <span style="color:#e6db74">&#39;s/ (/,/&#39;</span> -e <span style="color:#e6db74">&#39;s/)//&#39;</span> | sort | uniq &gt; /tmp/recent-submitters-approvers.csv
</code></pre></div><ul>
<li>Peter wanted the list so he could send some mail to the users&hellip;</li>
</ul>
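<p>A rough sketch of the same extraction as a single <code>sed</code> expression, assuming the provenance values follow the usual &ldquo;&hellip; by Name (email) on DATE&rdquo; pattern. The function name <code>extract_people</code> is made up for illustration and is not part of DSpace or the scripts used above:</p>

```shell
#!/bin/sh
# Sketch only: read provenance text on stdin and print unique "name,email" pairs.
# extract_people is a hypothetical helper name, not an existing tool.
extract_people() {
  # Capture the name before " (" and the address inside the parentheses,
  # but only when they are followed by a date from June to December 2020.
  sed -n -E 's/.* by ([^(]+) \(([^)]+)\) on 2020-(0[6-9]|1[0-2])-.*/\1,\2/p' \
    | LC_ALL=C sort -u
}
```

<p>It could be fed from the same <code>psql</code> output, for example <code>extract_people &lt; /tmp/provenance.txt</code>. Anchoring on the date right after the closing parenthesis avoids the greedy match that the second <code>grep</code> in the pipeline above has to filter out.</p>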
<h2 id="2020-12-17">2020-12-17</h2>
<ul>
<li>I see some errors from CUA in our Tomcat logs:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
Error while updating
java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1155)
@ -628,7 +628,7 @@ java.lang.UnsupportedOperationException: Multiple update components target the s
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1140)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1129)
...
</code></pre><ul>
</code></pre></div><ul>
<li>I sent the full stack to Atmire to investigate
<ul>
<li>I know we&rsquo;ve had this &ldquo;Multiple update components target the same field&rdquo; error in the past with DSpace 5.x and Atmire said it was harmless, but would nevertheless be fixed in a future update</li>
@ -636,10 +636,10 @@ java.lang.UnsupportedOperationException: Multiple update components target the s
</li>
<li>I was trying to export the ILRI community on CGSpace so I could update one of the ILRI author&rsquo;s names, but it throws an error&hellip;</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
Exporting community &#39;International Livestock Research Institute (ILRI)&#39; (10568/1)
Exception: null
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
@ -654,14 +654,14 @@ java.lang.NullPointerException
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
</code></pre></div><ul>
<li>I did it via CSV with <code>fix-metadata-values.py</code> instead:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2020-12-17-update-ILRI-author.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2020-12-17-update-ILRI-author.csv
dc.contributor.author,correct
&quot;Padmakumar, V.P.&quot;,&quot;Varijakshapanicker, Padmakumar&quot;
$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
</code></pre><ul>
&#34;Padmakumar, V.P.&#34;,&#34;Varijakshapanicker, Padmakumar&#34;
$ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">3</span>
</code></pre></div><ul>
<li>Abenet needed a list of all 2020 outputs from the Livestock CRP that were Limited Access
<ul>
<li>I exported the community from CGSpace and used <code>csvcut</code> and <code>csvgrep</code> to get a list:</li>
@ -689,7 +689,7 @@ $ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u ds
<ul>
<li>The DeduplicateValuesProcessor has been running on DSpace Test for two days and almost completed its second twelve-hour run, but it crashed near the end:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">...
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">...
Run 1 — 100% — 8,230,000/8,239,228 docs — 39s — 9h 8m 31s
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
@ -725,7 +725,7 @@ java.lang.OutOfMemoryError: Java heap space
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
</code></pre></div><ul>
<li>That was with a JVM heap of 512m</li>
<li>I looked in Solr and found dozens of duplicates of each field again&hellip;
<ul>
@ -744,30 +744,30 @@ java.lang.OutOfMemoryError: Java heap space
<li>The AReS harvest finished this morning and I moved the Elasticsearch index manually</li>
<li>First, check the number of records in the temp index to make sure it looks complete and does not contain duplicated data:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100135,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100135,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>Then delete the old backup and clone the current items index as a backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2020-12-14?pretty&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-21
</code></pre><ul>
</code></pre></div><ul>
<li>Then delete the current items index and clone it from temp:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items?pretty&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><h2 id="2020-12-22">2020-12-22</h2>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><h2 id="2020-12-22">2020-12-22</h2>
<ul>
<li>I finished getting the Swagger UI integrated into the dspace-statistics-api
<ul>
@ -810,10 +810,10 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H
</code></pre><ul>
<li>I exported the 2012 stats from the year core and imported them to the main statistics core with solr-import-export-json:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
$ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
$ chrt -b <span style="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2012.json -k uid
$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2012/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><ul>
<li>I decided to do the same for the remaining 2011, 2014, 2017, and 2019 cores&hellip;</li>
</ul>
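The same export, import, and delete steps for the remaining yearly cores could be wrapped in a small function. This is only a sketch: `SOLR_URL`, `DRY_RUN`, `run`, and `merge_year_core` are names made up for illustration, and `./run.sh` is assumed to be the solr-import-export-json wrapper in the current directory, called with the same flags as above.

```shell
# Hypothetical helper: names below are illustrative, not part of DSpace or Solr.
SOLR_URL="${SOLR_URL:-http://localhost:8081/solr}"

# Run a command, or only print it (shell quoting is lost) when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Move one year's statistics ($1) from its yearly core into the main core,
# then empty the yearly core. Stops at the first failing step.
merge_year_core() {
    year="$1"
    run chrt -b 0 ./run.sh -s "$SOLR_URL/statistics-$year" -a export -o "statistics-$year.json" -k uid &&
    run chrt -b 0 ./run.sh -s "$SOLR_URL/statistics" -a import -o "statistics-$year.json" -k uid &&
    run curl -s "$SOLR_URL/statistics-$year/update?softCommit=true" \
        -H "Content-Type: text/xml" \
        --data-binary "<delete><query>*:*</query></delete>"
}
```

Running `DRY_RUN=1 merge_year_core 2011` first prints the three commands without touching Solr; a loop such as `for y in 2011 2014 2017 2019; do merge_year_core "$y"; done` would then cover the cores listed above.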
<h2 id="2020-12-29">2020-12-29</h2>
@ -824,31 +824,31 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=tru
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100135,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100135,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-29
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><h2 id="2020-12-30">2020-12-30</h2>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><h2 id="2020-12-30">2020-12-30</h2>
<ul>
<li>The indexing on AReS finished so I cloned the <code>openrxv-items-temp</code> index to <code>openrxv-items</code> and deleted the backup index:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items?pretty&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-29?pretty'
</code></pre><!-- raw HTML omitted -->
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp?pretty&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2020-12-29?pretty&#39;</span>
</code></pre></div><!-- raw HTML omitted -->
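The block-writes, clone, unblock sequence repeated in these notes could be collected into one function. A minimal sketch, assuming Elasticsearch is reachable at `localhost:9200` as in the commands above; `ES_URL`, `DRY_RUN`, `run`, `set_write_block`, and `clone_index` are hypothetical names and not part of OpenRXV.

```shell
# Hypothetical helper: names below are illustrative, not part of OpenRXV.
ES_URL="${ES_URL:-http://localhost:9200}"

# Run a command, or only print it (shell quoting is lost) when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Set index.blocks.write on index $1 to $2 (true or false).
set_write_block() {
    run curl -s -X PUT "$ES_URL/$1/_settings" \
        -H 'Content-Type: application/json' \
        -d "{\"settings\": {\"index.blocks.write\": $2}}"
}

# Clone index $1 to $2: block writes on the source, clone it, unblock the source.
clone_index() {
    set_write_block "$1" true &&
    run curl -s -X POST "$ES_URL/$1/_clone/$2" &&
    set_write_block "$1" false
}
```

For example, `DRY_RUN=1 clone_index openrxv-items-temp openrxv-items` prints the three curl commands for review. The deletes are deliberately left out of the helper so that removing an index stays a manual step.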

@ -50,7 +50,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -160,29 +160,29 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre><ul>
</code></pre></div><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, backup the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100278,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100278,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-04
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
</code></pre><h2 id="2021-01-04">2021-01-04</h2>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-04&#39;</span>
</code></pre></div><h2 id="2021-01-04">2021-01-04</h2>
<ul>
<li>There is one item that appears twice in AReS: <a href="https://cgspace.cgiar.org/handle/10568/66839">10568/66839</a>
<ul>
@ -214,8 +214,8 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./doi-to-handle.py -db dspace -u dspace -p 'fuuu' -i /tmp/dois.txt -o /tmp/out.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./doi-to-handle.py -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -i /tmp/dois.txt -o /tmp/out.csv
</code></pre></div><ul>
<li>Help Udana export IWMI records from AReS
<ul>
<li>He wanted me to give him CSV export permissions on CGSpace, but I told him that this requires super admin so I&rsquo;m not comfortable with it</li>
@ -261,12 +261,12 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-01-10 10:03:27,692 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID=&quot;TX35636856957739531161091194485578658698&quot;)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-01-10 10:03:27,692 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID=&#34;TX35636856957739531161091194485578658698&#34;)
</code></pre></div><ul>
<li>I filed <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=907">a bug on Atmire&rsquo;s issue tracker</a></li>
<li>Peter asked me to move the CGIAR Gender Platform community to the top level of CGSpace, but I get an error when I use the community-filiator command:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/66598 --child<span style="color:#f92672">=</span>10568/106605
Loading @mire database changes for module MQM
Changes have been processed
Exception: null
@ -282,7 +282,7 @@ java.lang.UnsupportedOperationException
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
</code></pre></div><ul>
<li>There is apparently <a href="https://jira.lyrasis.org/browse/DS-3914">a bug</a> in DSpace 6.x that makes community-filiator not work
<ul>
<li>There is <a href="https://github.com/DSpace/DSpace/pull/2178">a patch</a> for the as-of-yet unreleased DSpace 6.4 so I will try that</li>
@ -301,24 +301,24 @@ java.lang.UnsupportedOperationException
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
... after ten hours
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100411,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100411,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings?pretty&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings?pretty&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</code></pre><ul>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</code></pre></div><ul>
<li>Looking over the last month of Solr stats I see a familiar bot that <em>should</em> have been marked as a bot months ago:</li>
</ul>
<blockquote>
@ -331,9 +331,9 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat log/dspace.log.2020-12-2* | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71' | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat log/dspace.log.2020-12-2* | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71&#39;</span> | sort | uniq | wc -l
0
</code></pre><ul>
</code></pre></div><ul>
<li>So now I should really add it to the DSpace spider agent list so it doesn&rsquo;t create Solr hits
<ul>
<li>I added it to the &ldquo;ilri&rdquo; lists of spider agent patterns</li>
@ -341,8 +341,8 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</li>
<li>I purged the existing hits using my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
</code></pre><h2 id="2021-01-11">2021-01-11</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
</code></pre></div><h2 id="2021-01-11">2021-01-11</h2>
<ul>
<li>The AReS indexing finished this morning and I moved the <code>openrxv-items-temp</code> index to <code>openrxv-items</code> (see above)
<ul>
@ -351,8 +351,8 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</li>
<li>I deployed the community-filiator fix on CGSpace and moved the Gender Platform community to the top level of CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
</code></pre><h2 id="2021-01-12">2021-01-12</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace community-filiator --remove --parent<span style="color:#f92672">=</span>10568/66598 --child<span style="color:#f92672">=</span>10568/106605
</code></pre></div><h2 id="2021-01-12">2021-01-12</h2>
<ul>
<li>IWMI is really pressuring us to have a periodic CSV export of their community
<ul>
@ -393,29 +393,29 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre><ul>
</code></pre></div><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, backup the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100540,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100540,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-18
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-18'
</code></pre><h2 id="2021-01-18">2021-01-18</h2>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-18&#39;</span>
</code></pre></div><h2 id="2021-01-18">2021-01-18</h2>
<ul>
<li>Finish the indexing on AReS that I started yesterday</li>
<li>Udana from IWMI emailed me to ask why the iwmi.csv doesn&rsquo;t include items he approved to CGSpace this morning
@ -462,9 +462,9 @@ localhost/dspace63= &gt; COMMIT;
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker exec -it api /bin/bash
# apt update &amp;&amp; apt install unoconv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker exec -it api /bin/bash
# apt update <span style="color:#f92672">&amp;&amp;</span> apt install unoconv
</code></pre></div><ul>
<li>Help Peter get a list of titles and DOIs for CGSpace items that Altmetric does not have an attention score for
<ul>
<li>He generated a list from their dashboard and I extracted the DOIs in OpenRefine (because it was WINDOWS-1252 and csvcut couldn&rsquo;t do it)</li>
@ -512,30 +512,30 @@ localhost/dspace63= &gt; COMMIT;
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre><ul>
</code></pre></div><ul>
<li>Then, the next morning when it&rsquo;s done, check the results of the harvesting, backup the current <code>openrxv-items</code> index, and clone the <code>openrxv-items-temp</code> index to <code>openrxv-items</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100699,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100699,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-01-25
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-25'
</code></pre><ul>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-01-25&#39;</span>
</code></pre></div><ul>
<li>Resume working on CG Core v2, I realized a few things:
<ul>
<li>We are trying to move from <code>dc.identifier.issn</code> (and ISBN) to <code>cg.issn</code>, but this is currently implemented as a &ldquo;qualdrop&rdquo; input in DSpace&rsquo;s submission form, which only works to fill in the qualifier (ie <code>dc.identifier.xxxx</code>)
@ -601,8 +601,8 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
<li>I <a href="https://jira.lyrasis.org/browse/DS-4566">filed a bug</a> on DSpace&rsquo;s issue tracker (though I accidentally hit Enter and submitted it before I finished, and there is no edit function)</li>
<li>Looking into a Linode report that the outbound traffic rate was high this morning:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -E '26/Jan/2021:(08|09|10|11|12)' /var/log/nginx/rest.log | goaccess --log-format=COMBINED -
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -E <span style="color:#e6db74">&#39;26/Jan/2021:(08|09|10|11|12)&#39;</span> /var/log/nginx/rest.log | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</code></pre></div><ul>
<li>The culprit seems to be the ILRI publications importer, so that&rsquo;s OK</li>
<li>But I also see an IP in Jordan hitting the REST API 1,100 times today:</li>
</ul>
@ -615,8 +615,8 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
</li>
<li>I purged all ~3,000 statistics hits that have the &ldquo;<a href="http://wp.local/">http://wp.local/</a>&rdquo; referrer:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;referrer:http\:\/\/wp\.local\/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;referrer:http\:\/\/wp\.local\/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><ul>
<li>Tag version 0.4.3 of the csv-metadata-quality tool on GitHub: <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.3">https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.3</a>
<ul>
<li>I just realized that I never submitted this to CGSpace as a Big Data Platform output</li>
@ -661,9 +661,9 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre><ul>
</code></pre></div><ul>
<li>Sent out emails about CG Core v2 to Macaroni Bros, Fabio, Hector at CCAFS, Dani and Tariku</li>
<li>A bit more minor work on testing the series/report/journal changes for CG Core v2</li>
</ul>
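Each harvest in these notes is checked by looking at the `_count` of `openrxv-items-temp` before it replaces `openrxv-items`. That check could be scripted roughly as follows. This is a sketch under assumptions: `ES_URL`, `parse_count`, and `temp_index_has_at_least` are hypothetical names, and the minimum document count is something the operator would pick by hand.

```shell
# Hypothetical helper: names below are illustrative, not part of OpenRXV.
ES_URL="${ES_URL:-http://localhost:9200}"

# Read an Elasticsearch _count response on stdin and print the "count" value.
parse_count() {
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed only if openrxv-items-temp holds at least $1 documents.
temp_index_has_at_least() {
    count="$(curl -s "$ES_URL/openrxv-items-temp/_count?q=*" | parse_count)"
    [ -n "$count" ] && [ "$count" -ge "$1" ]
}
```

A call such as `temp_index_has_at_least 100000 || echo "harvest looks incomplete"` could then gate the delete and clone steps.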

@ -20,12 +20,12 @@ Check the results of the AReS harvesting from last night:
$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
" />
@ -51,16 +51,16 @@ Check the results of the AReS harvesting from last night:
$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -157,34 +157,34 @@ $ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#3
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01
</code></pre><ul>
</code></pre></div><ul>
<li>Delete the current items index and clone the temp one to it:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</code></pre><ul>
</code></pre></div><ul>
<li>Then delete the temp and backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{&quot;acknowledged&quot;:true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
{&#34;acknowledged&#34;:true}%
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-01&#39;</span>
</code></pre></div><ul>
<li>Meeting with Peter and Abenet about CGSpace goals and progress</li>
<li>Test submission to DSpace via REST API to see if Abenet can fix / reject it (submit workflow?)</li>
<li>Get Peter a list of all users who have ever submitted or approved on DSpace, so he can remove some</li>
@ -196,10 +196,10 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
</li>
<li>I tried to export the ILRI community from CGSpace but I got an error:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
Exporting community &#39;International Livestock Research Institute (ILRI)&#39; (10568/1)
Exception: null
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
@ -214,7 +214,7 @@ java.lang.NullPointerException
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
</code></pre></div><ul>
<li>I imported the production database to my local development environment and I get the same error&hellip; WTF is this?
<ul>
<li>I was able to export another smaller community</li>
@ -234,16 +234,16 @@ java.lang.NullPointerException
<li>Maria Garruccio sent me some new ORCID iDs for Bioversity authors, as well as a correction for Stefan Burkart&rsquo;s iD</li>
<li>I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using <code>resolve-orcids.py</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-02-02-combined-orcids.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-02-02-combined-orcids.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
</code></pre><ul>
</code></pre></div><ul>
<li>I sorted the names and added the XML formatting in vim, then ran it through tidy:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w <span style="color:#ae81ff">0</span> dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre></div><ul>
<li>Then I added all the changed names plus Stefan&rsquo;s incorrect ones to a CSV and processed them with <code>fix-metadata-values.py</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-02-02-fix-orcid-ids.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-02-02-fix-orcid-ids.csv
cg.creator.id,correct
Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
@ -254,8 +254,8 @@ Bedru: 0000-0002-7344-5743,Bedru B. Balana: 0000-0002-7344-5743
Leigh Winowiecki: 0000-0001-5572-1284,Leigh Ann Winowiecki: 0000-0001-5572-1284
Sander J. Zwart: 0000-0002-5091-1801,Sander Zwart: 0000-0002-5091-1801
saul lozano-fuentes: 0000-0003-1517-6853,Saul Lozano: 0000-0003-1517-6853
$ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p 'fuuu' -f cg.creator.id -t 'correct' -m 240
</code></pre><ul>
$ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.creator.id -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">240</span>
</code></pre></div><ul>
<li>I also looked up which of these new authors might have existing items that are missing ORCID iDs</li>
<li>I had to port my <code>add-orcid-identifiers-csv.py</code> to DSpace 6 UUIDs, and I think it&rsquo;s working, but I want to run a few more tests because it uses a sequence for the <code>metadata_value_id</code></li>
</ul>
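The ORCID iD extraction done with the `grep -oE ... | sort | uniq` pipeline above can also be sketched in Python. This is a minimal illustration and not part of the ilri scripts: it reuses the pattern from the grep expression, and the function name `extract_orcids` and the sample text are assumptions made for the example.

```python
import re
from typing import List

# Same pattern as the grep -oE expression above: four hyphen-separated
# groups of four uppercase letters or digits.
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text: str) -> List[str]:
    """Return the ORCID iDs found in text, de-duplicated and sorted."""
    return sorted(set(ORCID_PATTERN.findall(text)))


if __name__ == "__main__":
    # Hypothetical input, modelled on the name/iD pairs in these notes.
    sample = (
        "Stefan Burkart: 0000-0001-5297-2184\n"
        "Burkart Stefan: 0000-0001-5297-2184\n"
        "VINCENT JOHNSON: 0000-0001-7874-178X\n"
    )
    for orcid in extract_orcids(sample):
        print(orcid)
```

Like the grep expression, the pattern does not validate the ORCID checksum. It only picks out strings that have the right shape.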
@ -263,23 +263,23 @@ $ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u
<ul>
<li>Tag forty-three items from Bioversity&rsquo;s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat /tmp/2021-02-02-add-orcid-ids.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat /tmp/2021-02-02-add-orcid-ids.csv
dc.contributor.author,cg.creator.id
&quot;Nchanji, E.&quot;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&quot;Nchanji, Eileen&quot;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&quot;Nchanji, Eileen Bogweh&quot;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&quot;Machida, Lewis&quot;,Lewis Machida: 0000-0002-0012-3997
&quot;Mockshell, Jonathan&quot;,Jonathan Mockshell: 0000-0003-1990-6657&quot;
&quot;Aubert, C.&quot;,Celine Aubert: 0000-0001-6284-4821
&quot;Aubert, Céline&quot;,Celine Aubert: 0000-0001-6284-4821
&quot;Devare, M.&quot;,Medha Devare: 0000-0003-0041-4812
&quot;Devare, Medha&quot;,Medha Devare: 0000-0003-0041-4812
&quot;Benites-Alfaro, O.E.&quot;,Omar E. Benites-Alfaro: 0000-0002-6852-9598
&quot;Benites-Alfaro, Omar Eduardo&quot;,Omar E. Benites-Alfaro: 0000-0002-6852-9598
&quot;Johnson, Vincent&quot;,VINCENT JOHNSON: 0000-0001-7874-178X
&quot;Lesueur, Didier&quot;,didier lesueur: 0000-0002-6694-0869
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
&#34;Nchanji, E.&#34;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&#34;Nchanji, Eileen&#34;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&#34;Nchanji, Eileen Bogweh&#34;,Eileen Bogweh Nchanji: 0000-0002-6859-0962
&#34;Machida, Lewis&#34;,Lewis Machida: 0000-0002-0012-3997
&#34;Mockshell, Jonathan&#34;,Jonathan Mockshell: 0000-0003-1990-6657&#34;
&#34;Aubert, C.&#34;,Celine Aubert: 0000-0001-6284-4821
&#34;Aubert, Céline&#34;,Celine Aubert: 0000-0001-6284-4821
&#34;Devare, M.&#34;,Medha Devare: 0000-0003-0041-4812
&#34;Devare, Medha&#34;,Medha Devare: 0000-0003-0041-4812
&#34;Benites-Alfaro, O.E.&#34;,Omar E. Benites-Alfaro: 0000-0002-6852-9598
&#34;Benites-Alfaro, Omar Eduardo&#34;,Omar E. Benites-Alfaro: 0000-0002-6852-9598
&#34;Johnson, Vincent&#34;,VINCENT JOHNSON: 0000-0001-7874-178X
&#34;Lesueur, Didier&#34;,didier lesueur: 0000-0002-6694-0869
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -d
</code></pre></div><ul>
<li>I&rsquo;m working on the CGSpace accession for Karl Rich&rsquo;s <a href="https://github.com/ilri/vietnam-pig-model-2018">Viet Nam Pig Model 2018</a> and I noticed his ORCID iD is missing from CGSpace
<ul>
<li>I added it and tagged 141 items of his with the iD</li>
@ -300,9 +300,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db d
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 dspace index-discovery -b
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> dspace index-discovery -b
$ dspace oai import -c
</code></pre><ul>
</code></pre></div><ul>
<li>Attend Accenture meeting for repository managers
<ul>
<li>Not clear what the SMO wants to get out of us</li>
@ -333,8 +333,8 @@ $ dspace oai import -c
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.relation.ispartofseries -m <span style="color:#ae81ff">43</span>
</code></pre></div><ul>
<li>The corrected versions have a lot of encoding issues so I asked Peter to give me the correct ones so I can search/replace them:
<ul>
<li>CIAT Publicaçao</li>
@ -358,8 +358,8 @@ $ dspace oai import -c
<li>I ended up using <a href="https://github.com/LuminosoInsight/python-ftfy">python-ftfy</a> to fix those very easily, then replaced them in the CSV</li>
<li>Then I trimmed whitespace at the beginning, end, and around the &ldquo;;&rdquo;, and applied the 1,600 fixes using <code>fix-metadata-values.py</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.relation.ispartofseries -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">43</span>
</code></pre></div><ul>
<li>Help Peter debug an issue with one of Alan Duncan&rsquo;s new FEAST Data reports on CGSpace
<ul>
<li>For some reason the item&rsquo;s default policy was set to the &ldquo;COLLECTION_492_DEFAULT_READ&rdquo; group, which had zero members</li>
@ -372,12 +372,12 @@ $ dspace oai import -c
<li>Run system updates on CGSpace (linode18), deploy latest 6_x-prod branch, and reboot the server</li>
<li>After the server came back up I started a full Discovery re-indexing:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 247m30.850s
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 247m30.850s
user 160m36.657s
sys 2m26.050s
</code></pre><ul>
</code></pre></div><ul>
<li>Regarding the CG Core v2 migration, Fabio wrote to tell me that he is not using CGSpace directly, instead harvesting via GARDIAN
<ul>
<li>He gave me the contact of Sotiris Konstantinidis, who is the CTO at SCIO Systems and works on the GARDIAN platform</li>
@ -385,30 +385,30 @@ sys 2m26.050s
</li>
<li>Delete the old Elasticsearch temp index to prepare for starting an AReS re-harvest:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre><h2 id="2021-02-08">2021-02-08</h2>
</code></pre></div><h2 id="2021-02-08">2021-02-08</h2>
<ul>
<li>Finish rotating the AReS indexes after the harvesting last night:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100983,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100983,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-08
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
</code></pre><h2 id="2021-02-10">2021-02-10</h2>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-08&#39;</span>
</code></pre></div><h2 id="2021-02-10">2021-02-10</h2>
<ul>
<li>Talk to Abdullah from CodeObia about a few of the issues we filed on OpenRXV
<ul>
@ -429,11 +429,11 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
30354
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | sort -u | wc -l
18555
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h | tail
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | sort | uniq -c | sort -h | tail
5 c21a79e5-e24e-4861-aa07-e06703d1deb7
5 c2460aa1-ae28-4003-9a99-2d7c5cd7fd38
5 d73fb3ae-9fac-4f7e-990f-e394f344246c
@ -444,7 +444,7 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h |
6 fb76888c-03ae-4d53-b27d-87d7ca91371a
6 ff42d1e6-c489-492c-a40a-803cabd901ed
7 094e9e1d-09ff-40ca-a6b9-eca580936147
</code></pre><ul>
</code></pre></div><ul>
<li>I added a comment to that bug to ask if this is a side effect of the patch</li>
<li>I started tagging pre-2010 ILRI items with license information, as I discussed with Peter and Abenet last week
<ul>
@ -452,23 +452,23 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h |
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]&#39;</span> /tmp/2021-02-10-ILRI.csv | csvgrep -c <span style="color:#e6db74">&#39;dc.type[en_US]&#39;</span> -r <span style="color:#e6db74">&#39;^.+[^(Journal Item|Journal Article|Book|Book Chapter)]&#39;</span>
</code></pre></div><ul>
<li>I imported the CSV into OpenRefine and converted the date text values to date types so I could facet by dates before 2010:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">if(diff(value,&quot;01/01/2010&quot;.toDate(),&quot;days&quot;)&lt;0, true, false)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">if(diff(value,&#34;01/01/2010&#34;.toDate(),&#34;days&#34;)&lt;0, true, false)
</code></pre></div><ul>
<li>Then I filtered by publisher to make sure they were only ours:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">or(
value.contains(&quot;International Livestock Research Institute&quot;),
value.contains(&quot;ILRI&quot;),
value.contains(&quot;International Livestock Centre for Africa&quot;),
value.contains(&quot;ILCA&quot;),
value.contains(&quot;ILRAD&quot;),
value.contains(&quot;International Laboratory for Research on Animal Diseases&quot;)
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">or(
value.contains(&#34;International Livestock Research Institute&#34;),
value.contains(&#34;ILRI&#34;),
value.contains(&#34;International Livestock Centre for Africa&#34;),
value.contains(&#34;ILCA&#34;),
value.contains(&#34;ILRAD&#34;),
value.contains(&#34;International Laboratory for Research on Animal Diseases&#34;)
)
</code></pre><ul>
</code></pre></div><ul>
<li>I tagged these pre-2010 items with &ldquo;Other&rdquo; if they didn&rsquo;t already have a license</li>
<li>I checked 2010 to 2015, and 2016 to date, but they were all tagged already!</li>
<li>In the end I added the &ldquo;Other&rdquo; license to 1,523 items from before 2010</li>
@ -504,8 +504,8 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
<ul>
<li>Clear the OpenRXV temp items index:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</code></pre></div><ul>
<li>Then start a full harvest of CGSpace from the AReS Explorer admin dashboard</li>
<li>Peter asked me about a few other recently submitted FEAST items that are restricted
<ul>
@ -521,35 +521,35 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
</code></pre><h2 id="2021-02-15">2021-02-15</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f <span style="color:#ae81ff">43</span> -t <span style="color:#ae81ff">55</span>
</code></pre></div><h2 id="2021-02-15">2021-02-15</h2>
<ul>
<li>Check the results of the AReS Harvesting from last night:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 101126,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 101126,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-15
</code></pre><ul>
</code></pre></div><ul>
<li>Delete the current items index and clone the temp one:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15'
</code></pre><ul>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-15&#39;</span>
</code></pre></div><ul>
<li>Call with Abdullah from CodeObia to discuss community and collection statistics reporting</li>
</ul>
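The index rotation above always follows the same order: block writes on the live index, clone it to a dated backup, delete it, clone the temp index into its place, then delete the temp index and the backup. A minimal sketch of that sequence as data, assuming the same index names and endpoints as the curl commands above. The helper name `rotation_plan` is hypothetical, and the script only prints the requests and does not send anything to Elasticsearch.

```python
from typing import List, Optional, Tuple

# (HTTP method, URL, JSON body or None)
Request = Tuple[str, str, Optional[str]]

# Body used above to make an index read-only before cloning it
BLOCK_WRITES = '{"settings": {"index.blocks.write": true}}'


def rotation_plan(base_url: str, date: str) -> List[Request]:
    """Build the ordered list of requests for one index rotation."""
    items = f"{base_url}/openrxv-items"
    temp = f"{base_url}/openrxv-items-temp"
    backup_name = f"openrxv-items-{date}"
    return [
        ("PUT", f"{items}/_settings", BLOCK_WRITES),    # make live index read-only
        ("POST", f"{items}/_clone/{backup_name}", None),  # dated backup
        ("DELETE", items, None),                        # drop the live index
        ("PUT", f"{temp}/_settings", BLOCK_WRITES),     # make temp index read-only
        ("POST", f"{temp}/_clone/openrxv-items", None),  # temp becomes live
        ("DELETE", temp, None),                         # drop the temp index
        ("DELETE", f"{base_url}/{backup_name}", None),  # drop the backup
    ]


if __name__ == "__main__":
    for method, url, body in rotation_plan("http://localhost:9200", "2021-02-15"):
        print(method, url, body or "")
```

Keeping the plan as plain data makes it easy to review the order before running anything. The backup is deleted last, so it still exists if the clone from the temp index fails.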
<h2 id="2021-02-16">2021-02-16</h2>
@ -563,49 +563,49 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15'
</li>
<li>They are definitely bots posing as users, as I see they have created six thousand DSpace sessions today:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat dspace.log.2021-02-16 | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203&#39;</span> | sort | uniq | wc -l
4007
$ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231' | sort | uniq | wc -l
$ cat dspace.log.2021-02-16 | grep -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231&#39;</span> | sort | uniq | wc -l
2128
</code></pre><ul>
</code></pre></div><ul>
<li>Ah, actually 45.146.165.203 is making requests like this:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">&quot;http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">&#34;http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%&#39; AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND &#39;XzQO%&#39;=&#39;XzQO&#34;
</code></pre></div><ul>
<li>I purged the hits from these two using my <code>check-spider-ip-hits.sh</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 4005 hits from 45.146.165.203 in statistics
Purging 3493 hits from 130.255.161.231 in statistics
Total number of bot hits purged: 7498
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 7498
</code></pre></div><ul>
<li>Ugh, I looked in Solr for the top IPs in 2021-01 and found a few more of these Russian IPs so I purged them too:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 27163 hits from 45.146.164.176 in statistics
Purging 19556 hits from 45.146.165.105 in statistics
Purging 15927 hits from 45.146.165.83 in statistics
Purging 8085 hits from 45.146.165.104 in statistics
Total number of bot hits purged: 70731
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 70731
</code></pre></div><ul>
<li>My god, and 64.39.99.15 is from Qualys, the domain scanning security people, who are making queries trying to see if we are vulnerable or something (wtf?)
<ul>
<li>Looking in Solr I see a few different IPs with DNS like <code>sn003.s02.iad01.qualys.com.</code> so I will purge their requests too:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
Purging 3 hits from 130.255.161.231 in statistics
Purging 16773 hits from 64.39.99.15 in statistics
Purging 6976 hits from 64.39.99.13 in statistics
Purging 13 hits from 64.39.99.63 in statistics
Purging 12 hits from 64.39.99.65 in statistics
Purging 12 hits from 64.39.99.94 in statistics
Total number of bot hits purged: 23789
</code></pre><h2 id="2021-02-17">2021-02-17</h2>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 23789
</code></pre></div><h2 id="2021-02-17">2021-02-17</h2>
<ul>
<li>I tested Node.js 10 vs 12 on CGSpace (linode18) and DSpace Test (linode26) and the build times were surprising
<ul>
@ -627,11 +627,11 @@ Total number of bot hits purged: 23789
<li>Abenet asked me to add Tom Randolph&rsquo;s ORCID identifier to CGSpace</li>
<li>I also tagged all his 247 existing items on CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-02-17-add-tom-orcid.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-02-17-add-tom-orcid.csv
dc.contributor.author,cg.creator.id
&quot;Randolph, Thomas F.&quot;,&quot;Thomas Fitz Randolph: 0000-0003-1849-9877&quot;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p 'fuuu'
</code></pre><h2 id="2021-02-20">2021-02-20</h2>
&#34;Randolph, Thomas F.&#34;,&#34;Thomas Fitz Randolph: 0000-0003-1849-9877&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><h2 id="2021-02-20">2021-02-20</h2>
<ul>
<li>Test the CG Core v2 migration on DSpace Test (linode26) one last time</li>
</ul>
@ -640,17 +640,17 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace
<li>Start the CG Core v2 migration on CGSpace (linode18)</li>
<li>After deploying the latest <code>6_x-prod</code> branch and running <code>migrate-fields.sh</code> I started a full Discovery reindex:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 311m12.617s
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 311m12.617s
user 217m3.102s
sys 2m37.363s
</code></pre><ul>
</code></pre></div><ul>
<li>Then update OAI:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace oai import -c
$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace oai import -c
$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Dfile.encoding=UTF-8 -Xmx2048m&#34;</span>
</code></pre></div><ul>
<li>Ben Hack was asking if there is a REST API query that will give him all ILRI outputs for their new SharePoint intranet
<ul>
<li>I told him he can try something like this if he only needs a single collection, such as the ILRI articles in journals collection:</li>
@ -668,16 +668,16 @@ $ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx1024m&#39;</span>
$ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
</code></pre><ul>
</code></pre></div><ul>
<li>The process took an hour or so!</li>
<li>I added colorized output to the csv-metadata-quality tool and tagged <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.4">version 0.4.4 on GitHub</a></li>
<li>I updated the fields in AReS Explorer and then removed the old temp index so I can start a fresh re-harvest of CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre><h2 id="2021-02-22">2021-02-22</h2>
</code></pre></div><h2 id="2021-02-22">2021-02-22</h2>
<ul>
<li>Start looking at splitting the series name and number in <code>dcterms.isPartOf</code> now that we have migrated to CG Core v2
<ul>
@ -687,43 +687,43 @@ $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;^(.+?);$&#39;,&#39;\1&#39;, &#39;g&#39;) WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ &#39;;$&#39;;
UPDATE 104
</code></pre><ul>
</code></pre></div><ul>
<li>As for splitting the other values, I think I can export the <code>dspace_object_id</code> and <code>text_value</code> and then upload it as a CSV rather than writing a Python script to create the new metadata values</li>
</ul>
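<p>The trailing-semicolon cleanup in the <code>UPDATE</code> above can be tried on exported values before touching the database. This is a minimal sketch that assumes single-line series names; the function name and the sample values are hypothetical, and Python's <code>re</code> module differs slightly from PostgreSQL's regex engine in how it treats newlines.</p>

```python
import re

# Mirrors the pattern in the UPDATE statement: capture everything before a
# single trailing semicolon and keep only the captured part.
TRAILING_SEMICOLON = re.compile(r"^(.+?);$")


def strip_trailing_semicolon(series):
    """Return the series name without one trailing semicolon, if present."""
    return TRAILING_SEMICOLON.sub(r"\1", series)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real export
    for value in ["Spore;", "ILRI Research Report; 12", "Spore"]:
        print(repr(value), "->", repr(strip_trailing_semicolon(value)))
```

<p>Values with a semicolon in the middle are left alone, so series that still carry a number after the semicolon would remain for the CSV-based split.</p>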
<h2 id="2021-02-22-1">2021-02-22</h2>
<ul>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 101380,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 101380,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
</code></pre><ul>
</code></pre></div><ul>
<li>Delete the current items index and clone the temp one to it:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</code></pre><ul>
</code></pre></div><ul>
<li>Then delete the temp and backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{&quot;acknowledged&quot;:true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
</code></pre><h2 id="2021-02-23">2021-02-23</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
{&#34;acknowledged&#34;:true}%
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-02-22&#39;</span>
</code></pre></div><h2 id="2021-02-23">2021-02-23</h2>
<ul>
<li>CodeObia sent a <a href="https://github.com/ilri/OpenRXV/pull/75">pull request for clickable countries on AReS</a>
<ul>
@ -732,22 +732,22 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
</li>
<li>Remove semicolons from series names without numbers:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;^(.+?);$&#39;,&#39;\1&#39;, &#39;g&#39;) WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ &#39;;$&#39;;
UPDATE 104
dspace=# COMMIT;
</code></pre><ul>
</code></pre></div><ul>
<li>Set all <code>text_lang</code> values on CGSpace to <code>en_US</code> to make the series replacements easier (this didn&rsquo;t work, read below):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item);
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE text_lang !=&#39;en_US&#39; AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 911
cgspace=# COMMIT;
</code></pre><ul>
</code></pre></div><ul>
<li>Then export all series with their IDs to CSV:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# \COPY (SELECT dspace_object_id, text_value as &quot;dcterms.isPartOf[en_US]&quot; FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# \COPY (SELECT dspace_object_id, text_value as &#34;dcterms.isPartOf[en_US]&#34; FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
</code></pre></div><ul>
<li>In OpenRefine I trimmed and consolidated whitespace, then made some quick cleanups to normalize the fields based on a sanity check
<ul>
<li>For example many Spore items are like &ldquo;Spore, Spore 23&rdquo;</li>
@ -761,23 +761,23 @@ cgspace=# COMMIT;
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE metadata_value_id=5355845;
UPDATE 1
</code></pre><ul>
</code></pre></div><ul>
<li>This also seems to work, using the id for just that one item:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id=&#39;9840d19b-a6ae-4352-a087-6d74d2629322&#39;;
UPDATE 37
</code></pre><ul>
</code></pre></div><ul>
<li>This seems to work better for some reason:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspacetest=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 18659
</code></pre><ul>
</code></pre></div><ul>
<li>I split the CSV file in batches of 5,000 using xsv, then imported them one by one in CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace metadata-import -f /tmp/0.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace metadata-import -f /tmp/0.csv
</code></pre></div><ul>
<li>It took FOREVER to import each file&hellip; like several hours <em>each</em>. MY GOD DSpace 6 is slow.</li>
<li>Help Dominique Perera debug some issues with the WordPress DSpace importer plugin from Macaroni Bros
<ul>
@ -785,40 +785,40 @@ UPDATE 18659
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] &quot;GET /rest/communities?limit=1000 HTTP/1.1&quot; 200 188779 &quot;https://cgspace.cgiar.org/rest /communities?limit=1000&quot; &quot;RTB website BOT&quot;
104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] &quot;GET /rest/communities//communities HTTP/1.1&quot; 404 714 &quot;https://cgspace.cgiar.org/rest/communities//communities&quot; &quot;RTB website BOT&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] &#34;GET /rest/communities?limit=1000 HTTP/1.1&#34; 200 188779 &#34;https://cgspace.cgiar.org/rest /communities?limit=1000&#34; &#34;RTB website BOT&#34;
104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] &#34;GET /rest/communities//communities HTTP/1.1&#34; 404 714 &#34;https://cgspace.cgiar.org/rest/communities//communities&#34; &#34;RTB website BOT&#34;
</code></pre></div><ul>
<li>The first request is OK, but the second one is malformed for sure</li>
</ul>
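<p>A quick way to spot requests like the second one in a larger log is to pull out the request path and look for an empty path segment. The sketch below assumes the nginx combined log format shown above; the helper names are hypothetical.</p>

```python
import re

# Matches the quoted request line, e.g. "GET /rest/communities HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[0-9.]+"')


def request_path(log_line):
    """Return the requested path (with query string) or None if not found."""
    match = REQUEST_RE.search(log_line)
    return match.group("path") if match else None


def has_empty_segment(path):
    """True when the path, ignoring the query string, contains '//'."""
    path_only = path.split("?", 1)[0]
    return "//" in path_only


if __name__ == "__main__":
    import sys

    # Read log lines from stdin and print only the suspicious request paths
    for line in sys.stdin:
        path = request_path(line)
        if path and has_empty_segment(path):
            print(path)
```

<p>In this case the empty segment suggests the plugin is building <code>/rest/communities/{id}/communities</code> with a missing community ID, though that is only a guess from the URL shape.</p>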
<h2 id="2021-02-24">2021-02-24</h2>
<ul>
<li>Export a list of journals for Peter to look through:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.journal&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.journal&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 3345
</code></pre><ul>
</code></pre></div><ul>
<li>Start a fresh harvesting on AReS because Udana mapped some items today and wants to include them in his report:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
# start indexing in AReS
</code></pre><ul>
</code></pre></div><ul>
<li>Also, I want to include the new series name/number cleanups so it&rsquo;s not a total waste of time</li>
</ul>
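<p>Once a harvest like this finishes, the index swap from 2021-02-22 (block writes, clone to a backup, delete, clone the temp index into place, clean up) is always the same sequence of requests. The following sketch only builds that sequence as data so it can be reviewed before anything is sent; the function name and its parameters are hypothetical, and the endpoints are the same <code>_settings</code> and <code>_clone</code> calls used with curl above.</p>

```python
import json

# Same settings body that the curl commands send to block writes
WRITE_BLOCK = json.dumps({"settings": {"index.blocks.write": True}})


def swap_plan(base_url, live, temp, backup):
    """Return the ordered (method, url, body) requests for the index swap."""
    def url(path):
        return f"{base_url}/{path}"

    return [
        ("PUT", url(f"{live}/_settings"), WRITE_BLOCK),   # make live read-only
        ("POST", url(f"{live}/_clone/{backup}"), None),   # back up live
        ("DELETE", url(live), None),                      # drop live
        ("PUT", url(f"{temp}/_settings"), WRITE_BLOCK),   # make temp read-only
        ("POST", url(f"{temp}/_clone/{live}"), None),     # temp becomes live
        ("DELETE", url(temp), None),                      # drop temp
        ("DELETE", url(backup), None),                    # drop backup
    ]


if __name__ == "__main__":
    plan = swap_plan(
        "http://localhost:9200",
        "openrxv-items",
        "openrxv-items-temp",
        "openrxv-items-2021-02-22",
    )
    for method, target, body in plan:
        print(method, target, body or "")
```

<p>Keeping the plan separate from execution makes it easy to check the item count in the temp index first and to stop before the deletes if the count looks wrong.</p>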
<h2 id="2021-02-25">2021-02-25</h2>
<ul>
<li>Hmm, the AReS harvest last night seems to have finished successfully, but the number of items is lower than I was expecting:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 99546,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 99546,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>The current items index has 101380 items&hellip; I wonder what happened
<ul>
<li>I started a new indexing</li>
@ -843,9 +843,9 @@ COPY 3345
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,&quot;&quot;)
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&quot;$1&quot;)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,&#34;&#34;)
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&#34;$1&#34;)
</code></pre></div><ul>
<li>This <code>value.partition</code> was new to me&hellip; and it took me a bit of time to figure out whether I needed to escape the parentheses in the issue number or not (no) and how to reference a capture group with <code>value.replace</code></li>
<li>I tried to check the 1095 CIFOR records from last week for duplicates on DSpace Test, but the page says &ldquo;Processing&rdquo; and never loads
<ul>
@ -857,7 +857,7 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&quot;$1&quot;)
<li>Niroshini from IWMI is still having issues adding WLE subjects to items during the metadata review step in the workflow</li>
<li>It seems the BatchEditConsumer log spam is gone since I applied <a href="https://github.com/ilri/DSpace/pull/462">Atmire&rsquo;s patch</a></li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]*
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#39;BatchEditConsumer should not have been given&#39;</span> dspace.log.2021-02-<span style="color:#f92672">[</span>12<span style="color:#f92672">]</span>*
dspace.log.2021-02-10:5067
dspace.log.2021-02-11:2647
dspace.log.2021-02-12:4231
@ -877,7 +877,7 @@ dspace.log.2021-02-25:0
dspace.log.2021-02-26:0
dspace.log.2021-02-27:0
dspace.log.2021-02-28:0
</code></pre><!-- raw HTML omitted -->
</code></pre></div><!-- raw HTML omitted -->


@ -34,7 +34,7 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -163,19 +163,19 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
<ul>
<li>I looked at the number of connections in PostgreSQL and it&rsquo;s definitely high again:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
1020
</code></pre><ul>
</code></pre></div><ul>
<li>I reported it to Atmire so they can take a look, on the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=851">same issue</a> where we had been tracking this before</li>
<li>Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell</li>
<li>I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my <code>add-orcid-identifiers-csv.py</code> script:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-03-04-add-zoe-campbell-orcid.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-03-04-add-zoe-campbell-orcid.csv
dc.contributor.author,cg.creator.identifier
&quot;Campbell, Zoë&quot;,&quot;Zoe Campbell: 0000-0002-4759-9976&quot;
&quot;Campbell, Zoe A.&quot;,&quot;Zoe Campbell: 0000-0002-4759-9976&quot;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
&#34;Campbell, Zoë&#34;,&#34;Zoe Campbell: 0000-0002-4759-9976&#34;
&#34;Campbell, Zoe A.&#34;,&#34;Zoe Campbell: 0000-0002-4759-9976&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><ul>
<li>I still need to do cleanup on the journal articles metadata
<ul>
<li>Peter sent me some cleanups but I can&rsquo;t use them in the search/replace format he gave</li>
@ -183,9 +183,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT dspace_object_id AS id, text_value as &quot;cg.journal&quot; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT dspace_object_id AS id, text_value as &#34;cg.journal&#34; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087
</code></pre><ul>
</code></pre></div><ul>
<li>I used OpenRefine to remove all journal values that didn&rsquo;t contain one of these characters: ; ( )
<ul>
<li>Then I cloned the <code>cg.journal</code> field to <code>cg.volume</code> and <code>cg.issue</code></li>
@ -193,10 +193,10 @@ COPY 32087
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">value.partition(';')[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,&quot;$1&quot;) # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&quot;$1&quot;) # to get journal issues
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">value.partition(&#39;;&#39;)[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,&#34;$1&#34;) # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&#34;$1&#34;) # to get journal issues
</code></pre></div><ul>
<li>Then I uploaded the changes to CGSpace using <code>dspace metadata-import</code></li>
<li>Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
<ul>
@ -233,14 +233,14 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,&quot;$1&quot;) #
<ul>
<li>I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml down
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml down
$ docker volume create docker_esData_7
$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
$ docker rm es_dummy
# edit docker/docker-compose.yml to switch from bind mount to volume
$ docker-compose -f docker/docker-compose.yml up -d
</code></pre><ul>
</code></pre></div><ul>
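<li>The trick is that when you create a volume like &ldquo;myvolume&rdquo; from a <code>docker-compose.yml</code> file, Docker Compose prefixes it with the project name (by default the name of the directory containing the file, here <code>docker</code>), so it is created as &ldquo;docker_myvolume&rdquo;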
<ul>
<li>If you create it manually on the command line with <code>docker volume create myvolume</code> then the name is literally &ldquo;myvolume&rdquo;</li>
@ -249,39 +249,39 @@ $ docker-compose -f docker/docker-compose.yml up -d
<li>I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit</li>
<li>Delete the <code>openrxv-items-temp</code> index to test a fresh harvesting:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
</code></pre><h2 id="2021-03-05-1">2021-03-05</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</code></pre></div><h2 id="2021-03-05-1">2021-03-05</h2>
<ul>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 101761,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 101761,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>Set the current items index to read only and make a backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d' {&quot;settings&quot;: {&quot;index.blocks.write&quot;:true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39; {&#34;settings&#34;: {&#34;index.blocks.write&#34;:true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
</code></pre><ul>
</code></pre></div><ul>
<li>Delete the current items index and clone the temp one to it:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
</code></pre><ul>
</code></pre></div><ul>
<li>Then delete the temp and backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{&quot;acknowledged&quot;:true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
{&#34;acknowledged&#34;:true}%
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-03-05&#39;</span>
</code></pre></div><ul>
<li>I made some pull requests to OpenRXV:
<ul>
<li><a href="https://github.com/ilri/OpenRXV/pull/86">docker/docker-compose.yml: Use docker volumes</a></li>
@ -298,57 +298,57 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
...
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
</code></pre><ul>
</code></pre></div><ul>
<li>But on AReS production <code>openrxv-items</code> has somehow become a concrete index:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
...
&quot;openrxv-items&quot;: {
&quot;aliases&quot;: {}
&#34;openrxv-items&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {}
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {}
},
</code></pre><ul>
</code></pre></div><ul>
<li>I fixed the issue on production by cloning the <code>openrxv-items</code> index to <code>openrxv-items-final</code>, deleting <code>openrxv-items</code>, and then re-creating it as an alias:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
</code></pre><ul>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items&#39;</span>
$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</code></pre></div><ul>
<li>Delete backups and remove read-only mode on <code>openrxv-items</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-2021-03-07&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><ul>
<li>Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
<ul>
<li>Looking in the logs I see a few IPs making heavy use of the REST API and XMLUI:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E <span style="color:#e6db74">&#39;0[56]/Mar/2021&#39;</span> | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</code></pre></div><ul>
<li>I see the usual IPs for CCAFS and ILRI importer bots, but also <code>143.233.242.132</code> which appears to be for GARDIAN:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zgrep <span style="color:#e6db74">&#39;143.233.242.132&#39;</span> /var/log/nginx/access.log.1 | grep -c Delphi
6237
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
# zgrep <span style="color:#e6db74">&#39;143.233.242.132&#39;</span> /var/log/nginx/access.log.1 | grep -c -v Delphi
6418
</code></pre><ul>
</code></pre></div><ul>
<li>They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a &ldquo;normal&rdquo; user agent
<ul>
<li>Looking in Solr I see they have been using this IP for a while, as they have 100,000 hits going back into 2020</li>
@ -375,9 +375,9 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Typ
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
13
</code></pre><ul>
</code></pre></div><ul>
<li>On 2021-03-03 the PostgreSQL transactions started rising:</li>
</ul>
<p><img src="/cgspace-notes/2021/03/postgres_querylength_ALL-week.png" alt="PostgreSQL query length week"></p>
@ -409,10 +409,10 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Typ
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS
</code></pre><ul>
</code></pre></div><ul>
<li>As I saw on my local test instance, even when you cancel a harvest, it automatically replaces the <code>openrxv-items-final</code> index with whatever is in <code>openrxv-items-temp</code>, so I assume it will do the same now</li>
</ul>
<h2 id="2021-03-09">2021-03-09</h2>
@ -434,8 +434,8 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
</code></pre><h2 id="2021-03-10">2021-03-10</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><h2 id="2021-03-10">2021-03-10</h2>
<ul>
<li>Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses <code>cg.isijournal</code> and MELSpace uses <code>mel.impact-factor</code>
<ul>
@ -444,12 +444,12 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
</li>
<li>Peter said he doesn&rsquo;t see &ldquo;Source Code&rdquo; or &ldquo;Software&rdquo; in the <a href="https://cgspace.cgiar.org/handle/10568/1/search-filter?field=type">output type facet on the ILRI community</a>, but I see it on the home page, so I will try to do a full Discovery re-index:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 318m20.485s
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 318m20.485s
user 215m15.196s
sys 2m51.529s
</code></pre><ul>
</code></pre></div><ul>
<li>Now I see ten items for &ldquo;Source Code&rdquo; in the facets&hellip;</li>
<li>Add GPL and MIT licenses to the list of licenses on the CGSpace input form since we will start capturing more software and source code</li>
<li>Added the ability to check <code>dcterms.license</code> values against the SPDX licenses in the csv-metadata-quality tool
@ -467,34 +467,34 @@ sys 2m51.529s
<ul>
<li>Switch to linux-kvm kernel on linode20 and linode18:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># apt update &amp;&amp; apt full-upgrade
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt update <span style="color:#f92672">&amp;&amp;</span> apt full-upgrade
# apt install linux-kvm
# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
# apt autoremove &amp;&amp; apt autoclean
# apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
# reboot
</code></pre><ul>
</code></pre></div><ul>
<li>Deploy latest changes from <code>6_x-prod</code> branch on CGSpace</li>
<li>Deploy latest changes from OpenRXV <code>master</code> branch on AReS</li>
<li>Last week Peter added OpenRXV to CGSpace: <a href="https://hdl.handle.net/10568/112982">https://hdl.handle.net/10568/112982</a></li>
<li>Back up the current <code>openrxv-items-final</code> index on AReS to start a new harvest:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><ul>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><ul>
<li>After the harvesting finished it seems the indexes got messed up again, as <code>openrxv-items</code> is an alias of <code>openrxv-items-temp</code> instead of <code>openrxv-items-final</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
...
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
</code></pre><ul>
</code></pre></div><ul>
<li>Anyway, the number of items in <code>openrxv-items</code> seems OK and the AReS Explorer UI is working fine
<ul>
<li>I will have to manually fix the indexes before the next harvest</li>
@ -535,54 +535,54 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
</li>
<li>Back up the current <code>openrxv-items-final</code> index to start a fresh AReS Harvest:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><ul>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><ul>
<li>Then start harvesting in the AReS Explorer admin UI</li>
</ul>
<h2 id="2021-03-22">2021-03-22</h2>
<ul>
<li>The harvesting on AReS yesterday completed, but somehow I have twice the number of items:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 206204,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 206204,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>Hmmm and even my backup index has a strange number of items:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 844,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 844,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>I deleted all indexes and re-created the <code>openrxv-items</code> alias:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
...
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {}
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>Then I started a new harvest</li>
<li>I switched the Node.js in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a> to v12 since v10 will cease to be supported soon
<ul>
@ -591,26 +591,26 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</li>
<li>The AReS harvest finally finished, with 1047 pages of items, but the <code>openrxv-items-final</code> index is empty and the <code>openrxv-items-temp</code> index has about 103,000 items:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 103162,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 103162,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>I tried to clone the temp index to the final, but got an error:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{&quot;error&quot;:{&quot;root_cause&quot;:[{&quot;type&quot;:&quot;resource_already_exists_exception&quot;,&quot;reason&quot;:&quot;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&quot;,&quot;index_uuid&quot;:&quot;LmxH-rQsTRmTyWex2d8jxw&quot;,&quot;index&quot;:&quot;openrxv-items-final&quot;}],&quot;type&quot;:&quot;resource_already_exists_exception&quot;,&quot;reason&quot;:&quot;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&quot;,&quot;index_uuid&quot;:&quot;LmxH-rQsTRmTyWex2d8jxw&quot;,&quot;index&quot;:&quot;openrxv-items-final&quot;},&quot;status&quot;:400}%
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{&#34;error&#34;:{&#34;root_cause&#34;:[{&#34;type&#34;:&#34;resource_already_exists_exception&#34;,&#34;reason&#34;:&#34;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&#34;,&#34;index_uuid&#34;:&#34;LmxH-rQsTRmTyWex2d8jxw&#34;,&#34;index&#34;:&#34;openrxv-items-final&#34;}],&#34;type&#34;:&#34;resource_already_exists_exception&#34;,&#34;reason&#34;:&#34;index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists&#34;,&#34;index_uuid&#34;:&#34;LmxH-rQsTRmTyWex2d8jxw&#34;,&#34;index&#34;:&#34;openrxv-items-final&#34;},&#34;status&#34;:400}%
</code></pre></div><ul>
<li>I looked in the Docker logs for Elasticsearch and saw a few memory errors:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">java.lang.OutOfMemoryError: Java heap space
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">java.lang.OutOfMemoryError: Java heap space
</code></pre></div><ul>
<li>According to <code>/usr/share/elasticsearch/config/jvm.options</code> in the Elasticsearch container the default JVM heap is 1g
<ul>
<li>I see the running Java process has <code>-Xms 1g -Xmx 1g</code> in its process invocation, so I guess it must indeed be using 1g</li>
@ -622,20 +622,20 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"> &quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"> &#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
</code></pre><h2 id="2021-03-23">2021-03-23</h2>
</code></pre></div><h2 id="2021-03-23">2021-03-23</h2>
<ul>
<li>For reference you can also get the Elasticsearch JVM stats from the API:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_nodes/jvm?human&#39;</span> | python -m json.tool
</code></pre></div><ul>
<li>I re-deployed AReS with 1.5GB of heap using the <code>ES_JAVA_OPTS</code> environment variable
<ul>
<li>It turns out that this <em>is</em> the recommended way to set the heap: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html">https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html</a></li>
@ -644,8 +644,8 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
<li>Then I fixed the aliases to make sure <code>openrxv-items</code> was an alias of <code>openrxv-items-final</code>, similar to what I did a few weeks ago</li>
<li>I re-created the temp index:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
</code></pre><h2 id="2021-03-24">2021-03-24</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
</code></pre></div><h2 id="2021-03-24">2021-03-24</h2>
<ul>
<li>Atmire responded to the <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934">ticket about the Duplicate Checker</a>
<ul>
@ -659,35 +659,35 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># du -s /home/dspacetest.cgiar.org/solr/statistics
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># du -s /home/dspacetest.cgiar.org/solr/statistics
57861236 /home/dspacetest.cgiar.org/solr/statistics
</code></pre><ul>
</code></pre></div><ul>
<li>I applied their changes to <code>config/spring/api/atmire-cua-update.xml</code> and started the duplicate processor:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx4096m&#39;</span>
$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r <span style="color:#ae81ff">1000</span> -c statistics -t <span style="color:#ae81ff">12</span>
</code></pre></div><ul>
<li>The default number of records per query is 10,000, which caused memory issues, so I will try with 1,000 (Atmire used 100, but that seems too low!)</li>
<li>Hah, I still got a memory error after only a few minutes:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">...
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">...
Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s
Exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
</code></pre><ul>
</code></pre></div><ul>
<li>I guess we really do have to use <code>-r 100</code></li>
<li>Now the thing runs for a few minutes and &ldquo;finishes&rdquo;:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r <span style="color:#ae81ff">100</span> -c statistics -t <span style="color:#ae81ff">12</span>
Loading @mire database changes for module MQM
Changes have been processed
*************************
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>*************************
* Update Script Started *
*************************
Run 1
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Run 1
Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:17 CET 2021
@ -752,12 +752,12 @@ Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
Run 1 took 5m 53s
**************************
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>**************************
* Update Script Finished *
**************************
</code></pre><ul>
</code></pre></div><ul>
<li>If I run it again it finds the same 4,824 docs and processes them&hellip;
<ul>
<li>I asked Atmire for feedback on this: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839">https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839</a></li>
@ -796,8 +796,8 @@ Run 1 took 5m 53s
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-03-29 08:55:40,073 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&amp;wt=javabin&amp;version=2} hits=143 status=0 QTime=0
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-03-29 08:55:40,073 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&amp;fl=handle,search.resourcetype,search.resourceid,search.uniqueid&amp;start=0&amp;fq=NOT(withdrawn:true)&amp;fq=NOT(discoverable:false)&amp;fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&amp;wt=javabin&amp;version=2} hits=143 status=0 QTime=0
</code></pre></div><ul>
<li>But the item mapper only displays ten items, with no pagination
<ul>
<li>There is no way to search by handle or ID</li>
@ -845,9 +845,9 @@ r <span style="color:#f92672">=</span> requests<span style="color:#f92672">.</sp
</code></pre></div><ul>
<li>I exported a list of all our ISSNs from CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
COPY 3081
</code></pre><ul>
</code></pre></div><ul>
<li>I wrote a script to check the ISSNs against Crossref&rsquo;s API: <code>crossref-issn-lookup.py</code>
<ul>
<li>I suspect Crossref might have better data actually&hellip;</li>

@ -44,7 +44,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -54,7 +54,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
"@type": "BlogPosting",
"headline": "April, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-04/",
"wordCount": "4669",
"wordCount": "4668",
"datePublished": "2021-04-01T09:50:54+03:00",
"dateModified": "2021-04-28T18:57:48+03:00",
"author": {
@ -153,21 +153,21 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;cgspace-account&quot; -W &quot;(sAMAccountName=otheraccounttoquery)&quot;
</code></pre><h2 id="2021-04-04">2021-04-04</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;cgspace-account&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=otheraccounttoquery)&#34;</span>
</code></pre></div><h2 id="2021-04-04">2021-04-04</h2>
<ul>
<li>Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool | less
</code></pre></div><ul>
<li>Then set the <code>openrxv-items-final</code> index to read-only so we can make a backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
{&quot;acknowledged&quot;:true}%
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
{&#34;acknowledged&#34;:true}%
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
{&quot;acknowledged&quot;:true,&quot;shards_acknowledged&quot;:true,&quot;index&quot;:&quot;openrxv-items-final-backup&quot;}%
$ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><ul>
{&#34;acknowledged&#34;:true,&#34;shards_acknowledged&#34;:true,&#34;index&#34;:&#34;openrxv-items-final-backup&#34;}%
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-final/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><ul>
<li>Then start a harvest on AReS Explorer</li>
<li>Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
<ul>
@ -181,8 +181,8 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.issn -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">253</span>
</code></pre></div><ul>
<li>For now I only fixed obvious errors like &ldquo;1234-5678.&rdquo; and &ldquo;e-ISSN: 1234-5678&rdquo; etc, but there are still lots of invalid ones which need more manual work:
<ul>
<li>Too few characters</li>
@ -196,19 +196,19 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
<ul>
<li>The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
{
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
...
}
</code></pre><ul>
</code></pre></div><ul>
<li><code>openrxv-items</code> should be an alias of <code>openrxv-items-final</code>, not <code>openrxv-items-temp</code>&hellip; I will have to fix that manually</li>
<li>Enrico asked for more information on the RTB stats I gave him yesterday
<ul>
@ -218,16 +218,16 @@ $ curl -X PUT &quot;localhost:9200/openrxv-items-final/_settings&quot; -H 'Conte
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
sed '1d' | \
csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, &quot;&quot;)||COALESCE(c, &quot;&quot;)||COALESCE(d, &quot;&quot;) AS issued FROM stdin' | \
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
$ csvcut -c <span style="color:#e6db74">&#39;id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]&#39;</span> /tmp/rtb.csv | <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span> sed &#39;1d&#39; | \
csvsql --no-header --no-inference --query &#39;SELECT a AS id,COALESCE(b, &#34;&#34;)||COALESCE(c, &#34;&#34;)||COALESCE(d, &#34;&#34;) AS issued FROM stdin&#39; | \
csvgrep -c issued -m 2020 | \
csvcut -c id | \
sed '1d' | \
sed &#39;1d&#39; | \
sort | \
uniq
</code></pre><ul>
</code></pre></div><ul>
<li>So I remember in the future, this basically does the following:
<ul>
<li>Use csvcut to extract the id and all date issued columns from the CSV</li>
@ -257,17 +257,17 @@ $ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.
</code></pre></div><ul>
<li>Then I submitted the file three times (changing the page parameter):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp &gt; /tmp/page1.json
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp &gt; /tmp/page1.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp &gt; /tmp/page2.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp &gt; /tmp/page3.json
</code></pre><ul>
</code></pre></div><ul>
<li>Then I extracted the views and downloads in the most ridiculous way:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep views /tmp/page*.json | grep -o -E <span style="color:#e6db74">&#39;[0-9]+$&#39;</span> | sed <span style="color:#e6db74">&#39;s/,//&#39;</span> | xargs | sed -e <span style="color:#e6db74">&#39;s/ /+/g&#39;</span> | bc
30364
$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
$ grep downloads /tmp/page*.json | grep -o -E <span style="color:#e6db74">&#39;[0-9]+,&#39;</span> | sed <span style="color:#e6db74">&#39;s/,//&#39;</span> | xargs | sed -e <span style="color:#e6db74">&#39;s/ /+/g&#39;</span> | bc
9100
</code></pre><ul>
</code></pre></div><ul>
<li>Out of curiosity I did the same exercise for items issued in 2019 and got the following:
<ul>
<li>Views: 30721</li>
@ -290,17 +290,17 @@ $ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs |
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
12413
</code></pre><ul>
</code></pre></div><ul>
<li>The system journal shows thousands of these messages; this is the first one:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
</code></pre></div><ul>
<li>Around that time in the dspace log I see nothing unusual, but maybe these?</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
</code></pre></div><ul>
<li>(BTW what is the deal with the &ldquo;200/127&rdquo;? I should send a comment to Atmire)
<ul>
<li>I filed a ticket with Atmire: <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets">https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets</a></li>
@ -308,17 +308,17 @@ $ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs |
</li>
<li>I restarted the PostgreSQL and Tomcat services and now I see fewer connections, but still WAY high:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
3640
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
2968
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
13
</code></pre><ul>
</code></pre></div><ul>
<li>After ten minutes or so it went back down&hellip;</li>
<li>And now it&rsquo;s back up in the thousands&hellip; I am seeing a lot of stuff in dspace log like this:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
@ -339,7 +339,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971
</code></pre><ul>
</code></pre></div><ul>
<li>I sent some notes and a log to Atmire on our existing issue about the database stuff
<ul>
<li>Also I asked them about the possibility of doing a formal review of Hibernate</li>
@ -354,17 +354,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
<li>I had a meeting with Peter and Abenet about CGSpace TODOs</li>
<li>CGSpace went down again and the PostgreSQL locks are through the roof:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
12154
</code></pre><ul>
</code></pre></div><ul>
<li>I don&rsquo;t see any activity on the REST API, but in the last four hours there have been 3,500 DSpace sessions:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -a -E <span style="color:#e6db74">&#39;2021-04-06 (13|14|15|16|17):&#39;</span> /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}&#39;</span> | sort | uniq | wc -l
3547
</code></pre><ul>
</code></pre></div><ul>
<li>I looked at the same time of day for the past few weeks and it seems to be a normal number of sessions:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E &quot;2021-0(3|4)-[0-9]{2} (13|14|15|16|17):&quot; &quot;$file&quot; | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># <span style="color:#66d9ef">for</span> file in /home/cgspace.cgiar.org/log/dspace.log.2021-0<span style="color:#f92672">{</span>3,4<span style="color:#f92672">}</span>-*; <span style="color:#66d9ef">do</span> grep -a -E <span style="color:#e6db74">&#34;2021-0(3|4)-[0-9]{2} (13|14|15|16|17):&#34;</span> <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span> | grep -o -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}&#39;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
...
3572
4085
@ -387,10 +387,10 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
599
4463
3547
</code></pre><ul>
</code></pre></div><ul>
<li>What about the total number of sessions per day?</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo &quot;$file:&quot;; grep -a -o -E 'session_id=[A-Z0-9]{32}' &quot;$file&quot; | sort | uniq | wc -l; done
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># <span style="color:#66d9ef">for</span> file in /home/cgspace.cgiar.org/log/dspace.log.2021-0<span style="color:#f92672">{</span>3,4<span style="color:#f92672">}</span>-*; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">:&#34;</span>; grep -a -o -E <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}&#39;</span> <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
...
/home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
11784
@ -412,7 +412,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
16756
/home/cgspace.cgiar.org/log/dspace.log.2021-04-06:
12343
</code></pre><ul>
</code></pre></div><ul>
<li>So it&rsquo;s not the number of sessions&hellip; it&rsquo;s something with the workload&hellip;</li>
<li>I had to step away for an hour or so and when I came back the site was still down and there were still 12,000 locks
<ul>
@ -421,13 +421,13 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</li>
<li>The locks in PostgreSQL shot up again&hellip;</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
3447
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
3527
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
4582
</code></pre><ul>
</code></pre></div><ul>
<li>I don&rsquo;t know what the hell is going on, but the PostgreSQL connections and locks are way higher than ever before:</li>
</ul>
<p><img src="/cgspace-notes/2021/04/postgres_connections_cgspace-week.png" alt="PostgreSQL connections week">
@ -440,9 +440,9 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
<ul>
<li>While looking at the nginx logs I see that MEL is trying to log into CGSpace&rsquo;s REST API and delete items:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] &quot;POST /rest/login HTTP/1.1&quot; 401 727 &quot;-&quot; &quot;MEL&quot;
34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] &quot;DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1&quot; 401 704 &quot;-&quot; &quot;-&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] &#34;POST /rest/login HTTP/1.1&#34; 401 727 &#34;-&#34; &#34;MEL&#34;
34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] &#34;DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1&#34; 401 704 &#34;-&#34; &#34;-&#34;
</code></pre></div><ul>
<li>I see a few of these per day going back several months
<ul>
<li>I sent a message to Salem and Enrico to ask if they know</li>
@ -450,13 +450,13 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</li>
<li>Also annoying, I see tons of what look like penetration testing requests from Qualys:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user &quot;'&gt;&lt;qss a=X158062356Y1_2Z&gt;
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=&quot;'&gt;&lt;qss a=X158062356Y1_2Z&gt;
2021-04-04 06:35:17,890 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email=&quot;'&gt;&lt;qss a=X158062356Y1_2Z&gt;, realm=null, result=2
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user &#34;&#39;&gt;&lt;qss a=X158062356Y1_2Z&gt;
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=&#34;&#39;&gt;&lt;qss a=X158062356Y1_2Z&gt;
2021-04-04 06:35:17,890 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email=&#34;&#39;&gt;&lt;qss a=X158062356Y1_2Z&gt;, realm=null, result=2
2021-04-04 06:35:18,145 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
2021-04-04 06:35:18,519 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
2021-04-04 06:35:18,520 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com
</code></pre><ul>
</code></pre></div><ul>
<li>I deleted the ilri/AReS repository on GitHub since we haven&rsquo;t updated it in two years
<ul>
<li>All development is happening in <a href="https://github.com/ilri/openRXV">https://github.com/ilri/openRXV</a> now</li>
@ -464,27 +464,27 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</li>
<li>10PM and the server is down again, with locks through the roof:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
12198
</code></pre><ul>
</code></pre></div><ul>
<li>I see that there are tons of PostgreSQL connections getting abandoned today, compared to very few in the past few weeks:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ journalctl -u tomcat7 --since<span style="color:#f92672">=</span>today | grep -c <span style="color:#e6db74">&#39;ConnectionPool abandon&#39;</span>
1838
$ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
$ journalctl -u tomcat7 --since<span style="color:#f92672">=</span>2021-03-20 --until<span style="color:#f92672">=</span>2021-04-05 | grep -c <span style="color:#e6db74">&#39;ConnectionPool abandon&#39;</span>
3
</code></pre><ul>
</code></pre></div><ul>
<li>I even restarted the server and connections were low for a few minutes until they shot back up:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
8651
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
8940
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
10504
</code></pre><ul>
</code></pre></div><ul>
<li>I had to go to bed and I bet it will crash and be down for hours until I wake up&hellip;</li>
<li>What the hell is this user agent?</li>
</ul>
@ -493,9 +493,9 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
<ul>
<li>CGSpace was still down from last night of course, with tons of database locks:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
12168
</code></pre><ul>
</code></pre></div><ul>
<li>I restarted the server again and the locks came back</li>
<li>Atmire responded to the message from yesterday
<ul>
@ -504,8 +504,8 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-04-01 12:45:11,414 WARN org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon; Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-04-01 12:45:11,414 WARN org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon; Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address
</code></pre></div><ul>
<li>The issue is not the named user above, but a member of the group&hellip;</li>
<li>And the group does have users with invalid email addresses (probably accounts created automatically after authenticating with LDAP):</li>
</ul>
@ -513,7 +513,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
<ul>
<li>I extracted all the group IDs from recent logs that had users with invalid email addresses:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -a -E <span style="color:#e6db74">&#39;email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b&#39;</span> /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E <span style="color:#e6db74">&#39;\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b&#39;</span> | sort | uniq
0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6
1769137c-36d4-42b2-8fec-60585e110db7
203c8614-8a97-4ac8-9686-d9d62cb52acc
@ -557,7 +557,7 @@ ede59734-adac-4c01-8691-b45f19088d37
f88bd6bb-f93f-41cb-872f-ff26f6237068
f985f5fb-be5c-430b-a8f1-cf86ae4fc49a
fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
</code></pre><h2 id="2021-04-08">2021-04-08</h2>
</code></pre></div><h2 id="2021-04-08">2021-04-08</h2>
<ul>
<li>I can&rsquo;t believe it but the server has been down for twelve hours or so
<ul>
@ -565,26 +565,26 @@ fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
12070
</code></pre><ul>
</code></pre></div><ul>
<li>I restarted PostgreSQL and Tomcat and the locks go straight back up!</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
986
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
1194
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
1212
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
1489
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
2124
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
5934
</code></pre><h2 id="2021-04-09">2021-04-09</h2>
</code></pre></div><h2 id="2021-04-09">2021-04-09</h2>
<ul>
<li>Atmire managed to get CGSpace back up by killing all the PostgreSQL connections yesterday
<ul>
@ -608,46 +608,46 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup
$ curl -X PUT &quot;localhost:9200/openrxv-items/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
</code></pre><ul>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
</code></pre></div><ul>
<li>Then I updated all Docker containers and rebooted the server (linode20) so that the correct indexes would be created again:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre></div><ul>
<li>Then I realized I have to clone the backup index directly to <code>openrxv-items-final</code>, and re-create the <code>openrxv-items</code> alias:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -X PUT &quot;localhost:9200/openrxv-items-backup/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-backup/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-backup/_clone/openrxv-items-final
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
</code></pre><ul>
$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</code></pre></div><ul>
<li>Now I see both <code>openrxv-items-final</code> and <code>openrxv-items</code> have the current number of items:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 103373,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 103373,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 103373,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 103373,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>Then I started a fresh harvest in the AReS Explorer admin dashboard</li>
</ul>
<h2 id="2021-04-12">2021-04-12</h2>
@ -672,24 +672,24 @@ $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&amp;pretty'
<ul>
<li>13,000 requests in the last two months from a user with user agent <code>SomeRandomText</code>, for example:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] &quot;GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1&quot; 404 10890 &quot;-&quot; &quot;SomeRandomText&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] &#34;GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1&#34; 404 10890 &#34;-&#34; &#34;SomeRandomText&#34;
</code></pre></div><ul>
<li>I purged them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 13159 hits from SomeRandomText in statistics
Total number of bot hits purged: 13159
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 13159
</code></pre></div><ul>
<li>I noticed there were 78 items submitted in the hour before CGSpace crashed:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -a -E <span style="color:#e6db74">&#39;2021-04-06 0(6|7):&#39;</span> /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item
78
</code></pre><ul>
</code></pre></div><ul>
<li>Of those 78, 77 of them were from Udana</li>
<li>Compared to other mornings (0 to 9 AM) this month that seems to be pretty high:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># for num in {01..13}; do grep -a -E &quot;2021-04-$num 0&quot; /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># <span style="color:#66d9ef">for</span> num in <span style="color:#f92672">{</span>01..13<span style="color:#f92672">}</span>; <span style="color:#66d9ef">do</span> grep -a -E <span style="color:#e6db74">&#34;2021-04-</span>$num<span style="color:#e6db74"> 0&#34;</span> /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
add_item; done
32
0
@ -704,7 +704,7 @@ Total number of bot hits purged: 13159
1
1
2
</code></pre><h2 id="2021-04-15">2021-04-15</h2>
</code></pre></div><h2 id="2021-04-15">2021-04-15</h2>
<ul>
<li>Release v1.4.2 of the DSpace Statistics API on GitHub: <a href="https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.2">https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.2</a>
<ul>
@ -723,8 +723,8 @@ Total number of bot hits purged: 13159
</li>
<li>Create a test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p <span style="color:#e6db74">&#39;fuuuuuuuu&#39;</span>
</code></pre></div><ul>
<li>I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
<ul>
<li>According to my notes from <a href="/cgspace-notes/2020-10/">2020-10</a> the account must be in the admin group in order to submit via the REST API</li>
@ -735,12 +735,12 @@ Total number of bot hits purged: 13159
<ul>
<li>Update all containers on AReS (linode20):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre></div><ul>
<li>Then run all system updates and reboot the server</li>
<li>I learned a new command for Elasticsearch:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl http://localhost:9200/_cat/indices
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl http://localhost:9200/_cat/indices
yellow open openrxv-values ChyhGwMDQpevJtlNWO1vcw 1 1 1579 0 537.6kb 537.6kb
yellow open openrxv-items-temp PhV5ieuxQsyftByvCxzSIw 1 1 103585 104372 482.7mb 482.7mb
yellow open openrxv-shared J_8cxIz6QL6XTRZct7UBBQ 1 1 127 0 115.7kb 115.7kb
@ -751,46 +751,46 @@ green open .apm-agent-configuration f3RAkSEBRGaxJZs3ePVxsA 1 0 0 0
yellow open openrxv-items-final sgk-s8O-RZKdcLRoWt3G8A 1 1 970 0 2.3mb 2.3mb
green open .kibana_1 HHPN7RD_T7qe0zDj4rauQw 1 0 25 7 36.8kb 36.8kb
yellow open users M0t2LaZhSm2NrF5xb64dnw 1 1 2 0 11.6kb 11.6kb
</code></pre><ul>
</code></pre></div><ul>
<li>Somehow the <code>openrxv-items-final</code> index only has a few items and the majority are in <code>openrxv-items-temp</code>, via the <code>openrxv-items</code> alias (which currently points to the temp index):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 103585,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 103585,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>I found a cool tool to help with exporting and restoring Elasticsearch indexes:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span> --type<span style="color:#f92672">=</span>data
...
Sun, 18 Apr 2021 06:27:07 GMT | Total Writes: 103585
Sun, 18 Apr 2021 06:27:07 GMT | dump complete
</code></pre><ul>
</code></pre></div><ul>
<li>It took only two or three minutes to export everything&hellip;</li>
<li>I did a test to restore the index:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-test --limit 1000 --type=data
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-test --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-test --limit <span style="color:#ae81ff">1000</span> --type<span style="color:#f92672">=</span>data
</code></pre></div><ul>
<li>So that&rsquo;s pretty cool!</li>
<li>I deleted the <code>openrxv-items-final</code> and <code>openrxv-items-temp</code> indexes and then restored the mappings to <code>openrxv-items-final</code>, added the <code>openrxv-items</code> alias, and started restoring the data to <code>openrxv-items</code> with elasticdump:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-final --type<span style="color:#f92672">=</span>mapping
$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --limit <span style="color:#ae81ff">1000</span> --type<span style="color:#f92672">=</span>data
</code></pre></div><ul>
<li>AReS seems to be working fine after that, so I created the <code>openrxv-items-temp</code> index and then started a fresh harvest on AReS Explorer:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp&#34;</span>
</code></pre></div><ul>
<li>Run system updates on CGSpace (linode18) and run the latest Ansible infrastructure playbook to update the DSpace Statistics API, PostgreSQL JDBC driver, etc, and then reboot the system</li>
<li>I wasted a bit of time trying to get TSLint and then ESLint running for OpenRXV on GitHub Actions</li>
</ul>
@ -798,35 +798,35 @@ $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localh
<ul>
<li>The AReS harvesting last night seems to have completed successfully, but the number of results is strange:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp kNUlupUyS_i7vlBGiuVxwg 1 1 103741 105553 483.6mb 483.6mb
yellow open openrxv-items-final HFc3uytTRq2GPpn13vkbmg 1 1 970 0 2.3mb 2.3mb
</code></pre><ul>
</code></pre></div><ul>
<li>The indices endpoint doesn&rsquo;t include the <code>openrxv-items</code> alias, but it is currently in the <code>openrxv-items-temp</code> index so the number of items is the same:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 103741,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 103741,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre><ul>
</code></pre></div><ul>
<li>A user was having problems resetting their password on CGSpace, with some message about SMTP etc
<ul>
<li>I checked and we are indeed locked out of our mailbox:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace test-email
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace test-email
...
Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 550 5.2.1 Mailbox cannot be accessed [PR0P264CA0280.FRAP264.PROD.OUTLOOK.COM]
)
</code></pre><ul>
</code></pre></div><ul>
<li>I have to write to ICT&hellip;</li>
<li>I decided to switch back to the G1GC garbage collector on DSpace Test
<ul>
@ -869,46 +869,46 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span> --type<span style="color:#f92672">=</span>data
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-final --type<span style="color:#f92672">=</span>mapping
$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --limit <span style="color:#ae81ff">1000</span> --type<span style="color:#f92672">=</span>data
</code></pre></div><ul>
<li>Then I started a fresh AReS harvest</li>
</ul>
<h2 id="2021-04-26">2021-04-26</h2>
<ul>
<li>The AReS harvest last night seems to have finished successfully and the number of items looks good:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 0 0 283b 283b
yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb 254mb
</code></pre><ul>
</code></pre></div><ul>
<li>And the aliases seem correct for once:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
...
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {}
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {}
},
...
</code></pre><ul>
</code></pre></div><ul>
<li>That&rsquo;s 250 new items in the index since the last harvest!</li>
<li>Re-create my local Artifactory container because I&rsquo;m getting errors starting it and it has been a few months since it was updated:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ podman rm artifactory
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ podman rm artifactory
$ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
$ podman create --ulimit nofile<span style="color:#f92672">=</span>32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
$ podman start artifactory
</code></pre><ul>
</code></pre></div><ul>
<li>Start testing DSpace 7.0 Beta 5 so I can evaluate if it solves some of the problems we are having on DSpace 6, and if it&rsquo;s missing things like multiple handle resolvers, etc
<ul>
<li>I see it needs Java JDK 11, Tomcat 9, Solr 8, and PostgreSQL 11</li>
@ -925,13 +925,13 @@ $ podman start artifactory
</li>
<li>I tried to delete all the Atmire SQL migrations:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace7b5= &gt; DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace7b5= &gt; DELETE FROM schema_version WHERE description LIKE &#39;%Atmire%&#39; OR description LIKE &#39;%CUA%&#39; OR description LIKE &#39;%cua%&#39;;
</code></pre></div><ul>
<li>But I got an error when running <code>dspace database migrate</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ~/dspace7b5/bin/dspace database migrate
Database URL: jdbc:postgresql://localhost:5432/dspace7b5
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ~/dspace7b5/bin/dspace database migrate
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Database URL: jdbc:postgresql://localhost:5432/dspace7b5
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
@ -949,8 +949,8 @@ Caused by: org.flywaydb.core.api.FlywayException: Validate failed:
Detected applied migration not resolved locally: 5.0.2017.09.25
Detected applied migration not resolved locally: 6.0.2017.01.30
Detected applied migration not resolved locally: 6.0.2017.09.25
at org.flywaydb.core.Flyway.doValidate(Flyway.java:292)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> at org.flywaydb.core.Flyway.doValidate(Flyway.java:292)
at org.flywaydb.core.Flyway.access$100(Flyway.java:73)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:166)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:158)
@ -958,14 +958,14 @@ Detected applied migration not resolved locally: 6.0.2017.09.25
at org.flywaydb.core.Flyway.migrate(Flyway.java:158)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:729)
... 9 more
</code></pre><ul>
</code></pre></div><ul>
<li>I deleted those migrations:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace7b5= &gt; DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace7b5= &gt; DELETE FROM schema_version WHERE version IN (&#39;5.0.2017.09.25&#39;, &#39;6.0.2017.01.30&#39;, &#39;6.0.2017.09.25&#39;);
</code></pre></div><ul>
<li>Then when I ran the migration again it failed for a new reason, related to the configurable workflow:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Database URL: jdbc:postgresql://localhost:5432/dspace7b5
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Database URL: jdbc:postgresql://localhost:5432/dspace7b5
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
@ -984,24 +984,24 @@ Migration V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql failed
--------------------------------------------------------------------
SQL State : 42P01
Error Code : 0
Message : ERROR: relation &quot;cwf_pooltask&quot; does not exist
Message : ERROR: relation &#34;cwf_pooltask&#34; does not exist
Position: 8
Location : org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql (/home/aorth/src/apache-tomcat-9.0.45/file:/home/aorth/dspace7b5/lib/dspace-api-7.0-beta5.jar!/org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql)
Line : 16
Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflow_id='default'
Statement : UPDATE cwf_pooltask SET workflow_id=&#39;defaultWorkflow&#39; WHERE workflow_id=&#39;default&#39;
...
</code></pre><ul>
</code></pre></div><ul>
<li>The <a href="https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace">DSpace 7 upgrade docs</a> say I need to apply these previously optional migrations:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ~/dspace7b5/bin/dspace database migrate ignored
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ~/dspace7b5/bin/dspace database migrate ignored
</code></pre></div><ul>
<li>Now I see all migrations have completed and DSpace actually starts up fine!</li>
<li>I will try to do a full re-index to see how long it takes:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time ~/dspace7b5/bin/dspace index-discovery -b
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time ~/dspace7b5/bin/dspace index-discovery -b
...
~/dspace7b5/bin/dspace index-discovery -b 25156.71s user 64.22s system 97% cpu 7:11:09.94 total
</code></pre><ul>
</code></pre></div><ul>
<li>Not good, that shit took over seven hours!</li>
</ul>
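<ul>
<li>A minimal sketch, assuming the <code>h:mm:ss</code> wall-clock format printed by <code>time</code> above, of converting the elapsed time to hours (the helper name <code>elapsed_to_hours</code> is hypothetical, not part of DSpace):</li>
</ul>

```python
def elapsed_to_hours(elapsed: str) -> float:
    """Convert an elapsed time such as '7:11:09.94' (h:mm:ss) to hours."""
    parts = [float(p) for p in elapsed.split(":")]
    # Pad shorter inputs such as '58:53.83' (mm:ss) with leading zeros
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return hours + minutes / 60 + seconds / 3600


# Wall-clock total reported for the DSpace 7 beta 5 re-index above
print(f"{elapsed_to_hours('7:11:09.94'):.2f} hours")
```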
<h2 id="2021-04-27">2021-04-27</h2>
@ -1012,9 +1012,9 @@ Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflo
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -e 'windows-1252' -c 'Handle.net IDs' -i -m '10568/' ~/Downloads/Altmetric\ -\ Research\ Outputs\ -\ CGSpace\ -\ 2021-04-26.csv | csvcut -c DOI | sed '1d' &gt; /tmp/dois.txt
$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.csv -db dspace63 -u dspace -p 'fuuu' -d
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvgrep -e <span style="color:#e6db74">&#39;windows-1252&#39;</span> -c <span style="color:#e6db74">&#39;Handle.net IDs&#39;</span> -i -m <span style="color:#e6db74">&#39;10568/&#39;</span> ~/Downloads/Altmetric<span style="color:#ae81ff">\ </span>-<span style="color:#ae81ff">\ </span>Research<span style="color:#ae81ff">\ </span>Outputs<span style="color:#ae81ff">\ </span>-<span style="color:#ae81ff">\ </span>CGSpace<span style="color:#ae81ff">\ </span>-<span style="color:#ae81ff">\ </span>2021-04-26.csv | csvcut -c DOI | sed <span style="color:#e6db74">&#39;1d&#39;</span> &gt; /tmp/dois.txt
$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.csv -db dspace63 -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -d
</code></pre></div><ul>
<li>He will Tweet them&hellip;</li>
</ul>
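<ul>
<li>The <code>csvgrep</code> / <code>csvcut</code> pipeline above keeps the DOIs of rows whose <code>Handle.net IDs</code> column does not contain <code>10568/</code>. A rough Python equivalent, as a sketch only: the column names come from the command above, while the function name and the sample rows are made up for illustration:</li>
</ul>

```python
import csv
import io


def dois_without_handle(csv_file, handle_prefix="10568/"):
    """Return the DOI of every row whose 'Handle.net IDs' lacks the prefix."""
    reader = csv.DictReader(csv_file)
    return [
        row["DOI"]
        for row in reader
        if handle_prefix not in row["Handle.net IDs"]
    ]


# Hypothetical sample data; a real Altmetric export could be opened with
# open(path, encoding="windows-1252", newline="") to mirror csvgrep's -e flag
sample = io.StringIO(
    "DOI,Handle.net IDs\n"
    "10.1234/example-a,10568/11111\n"
    "10.1234/example-b,\n"
)
print(dois_without_handle(sample))
```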
<h2 id="2021-04-28">2021-04-28</h2>

@ -36,7 +36,7 @@ I looked at the top user agents and IPs in the Solr statistics for last month an
I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one&hellip; as that&rsquo;s an actual user&hellip;
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -147,17 +147,17 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1&quot; 400 5 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata-21%2B21*01 HTTP/1.1&quot; 200 458201 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata'||lower('')||' HTTP/1.1&quot; 400 5 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] &quot;GET /rest/collections/1179/items?limit=812&amp;expand=metadata'%2Brtrim('')%2B' HTTP/1.1&quot; 200 458209 &quot;-&quot; &quot;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1&#34; 400 5 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata-21%2B21*01 HTTP/1.1&#34; 200 458201 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata&#39;||lower(&#39;&#39;)||&#39; HTTP/1.1&#34; 400 5 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] &#34;GET /rest/collections/1179/items?limit=812&amp;expand=metadata&#39;%2Brtrim(&#39;&#39;)%2B&#39; HTTP/1.1&#34; 200 458209 &#34;-&#34; &#34;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)&#34;
</code></pre></div><ul>
<li>I will report the IP on abuseipdb.com and purge their hits from Solr</li>
<li>The second IP is in Colombia and is making thousands of requests for what looks like some test site:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] &quot;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&quot; 200 123613 &quot;http://cassavalighthousetest.org/&quot; &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&quot;
181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] &quot;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&quot; 200 123613 &quot;http://cassavalighthousetest.org/&quot; &quot;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] &#34;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&#34; 200 123613 &#34;http://cassavalighthousetest.org/&#34; &#34;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&#34;
181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] &#34;GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0&#34; 200 123613 &#34;http://cassavalighthousetest.org/&#34; &#34;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36&#34;
</code></pre></div><ul>
<li>But this site does not exist (yet?)
<ul>
<li>I will purge them from Solr</li>
@ -165,46 +165,46 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
</li>
<li>The third IP is in Russia apparently, and the user agent has the <code>pl-PL</code> locale with thousands of requests like this:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] &quot;GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&amp;isAllowed=y HTTP/1.1&quot; 200 918998 &quot;http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf&quot; &quot;Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] &#34;GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&amp;isAllowed=y HTTP/1.1&#34; 200 918998 &#34;http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf&#34; &#34;Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15&#34;
</code></pre></div><ul>
<li>I will purge these all with my <code>check-spider-ip-hits.sh</code> script:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 21648 hits from 193.169.254.178 in statistics
Purging 20323 hits from 181.62.166.177 in statistics
Purging 19376 hits from 45.146.166.180 in statistics
Total number of bot hits purged: 61347
</code></pre><h2 id="2021-05-02">2021-05-02</h2>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 61347
</code></pre></div><h2 id="2021-05-02">2021-05-02</h2>
<ul>
<li>Check the AReS Harvester indexes:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 0 0 283b 283b
yellow open openrxv-items-final ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0 254mb 254mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
...
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {}
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
</code></pre><ul>
</code></pre></div><ul>
<li>I think they look OK (<code>openrxv-items</code> is an alias of <code>openrxv-items-final</code>), but I took a backup just in case:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</code></pre></div><ul>
<li>Then I started an indexing in the AReS Explorer admin dashboard</li>
<li>The indexing finished, but it looks like the aliases are messed up again:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
</code></pre><h2 id="2021-05-05">2021-05-05</h2>
</code></pre></div><h2 id="2021-05-05">2021-05-05</h2>
<ul>
<li>Peter noticed that we no longer display <code>cg.link.reference</code> on the item view
<ul>
@ -229,9 +229,9 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time ~/dspace64/bin/dspace index-discovery -b
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time ~/dspace64/bin/dspace index-discovery -b
~/dspace64/bin/dspace index-discovery -b 4053.24s user 53.17s system 38% cpu 2:58:53.83 total
</code></pre><ul>
</code></pre></div><ul>
<li>Nope! Still slow, and still no mapped item&hellip;
<ul>
<li>I even tried unmapping it from all collections, and adding it to a single new owning collection&hellip;</li>
@ -244,53 +244,53 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
</li>
<li>The indexes on AReS Explorer are messed up after last week&rsquo;s harvesting:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0 2.2mb 2.2mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
...
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
}
</code></pre><ul>
</code></pre></div><ul>
<li><code>openrxv-items</code> should be an alias of <code>openrxv-items-final</code>&hellip;</li>
<li>I made a backup of the temp index and then started indexing on the AReS Explorer admin dashboard:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: false}}'
</code></pre><h2 id="2021-05-10">2021-05-10</h2>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: false}}&#39;</span>
</code></pre></div><h2 id="2021-05-10">2021-05-10</h2>
<ul>
<li>Amazing, the harvesting on AReS finished but it messed up all the indexes and now there are no items in any index!</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp 8thRX0WVRUeAzmd2hkG6TA 1 1 0 0 283b 283b
yellow open openrxv-items-temp-backup _0tyvctBTg2pjOlcoVP1LA 1 1 104165 20134 305.5mb 305.5mb
yellow open openrxv-items-final BtvV9kwVQ3yBYCZvJS1QyQ 1 1 0 0 283b 283b
</code></pre><ul>
</code></pre></div><ul>
<li>I fixed the indexes manually by re-creating them and cloning from the backup:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -X PUT &quot;localhost:9200/openrxv-items-temp-backup/_settings&quot; -H 'Content-Type: application/json' -d'{&quot;settings&quot;: {&quot;index.blocks.write&quot;: true}}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ curl -X PUT <span style="color:#e6db74">&#34;localhost:9200/openrxv-items-temp-backup/_settings&#34;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;settings&#34;: {&#34;index.blocks.write&#34;: true}}&#39;</span>
$ curl -s -X POST http://localhost:9200/openrxv-items-temp-backup/_clone/openrxv-items-final
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp-backup'
</code></pre><ul>
$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp-backup&#39;</span>
</code></pre></div><ul>
<li>I also ran all updates on the server and updated all Docker images, then rebooted the server (linode20):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
</code></pre></div><ul>
<li>I backed up the AReS Elasticsearch data using elasticdump, then started a new harvest:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</code></pre></div><ul>
<li>Discuss CGSpace statistics with the CIP team
<ul>
<li>They were wondering why their numbers for 2020 were so low</li>
@ -329,10 +329,10 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
</li>
<li>I checked the CLARISA list against ROR&rsquo;s April 2021 release (&ldquo;Version 9&rdquo; on figshare, though it is version 8 in the dump):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
$ csvgrep -c matched -m <span style="color:#e6db74">&#39;true&#39;</span> /tmp/clarisa-ror-matches.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
1770
</code></pre><ul>
</code></pre></div><ul>
<li>With 1770 out of 6230 matched, that&rsquo;s 28.4%&hellip;</li>
<li>I sent an email to Hector Tobon to point out the issues in CLARISA again and ask him to chat</li>
<li>Meeting with GARDIAN developers about CG Core and how GARDIAN works</li>
@ -341,11 +341,11 @@ $ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
<ul>
<li>Fix a few thousand IWMI URLs that are using HTTP instead of HTTPS on CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://www.iwmi.cgiar.org&#39;,&#39;https://www.iwmi.cgiar.org&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;http://www.iwmi.cgiar.org%&#39; AND metadata_field_id=219;
UPDATE 1132
localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://publications.iwmi.org','https://publications.iwmi.org', 'g') WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;
localhost/dspace63= &gt; UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, &#39;http://publications.iwmi.org&#39;,&#39;https://publications.iwmi.org&#39;, &#39;g&#39;) WHERE text_value LIKE &#39;http://publications.iwmi.org%&#39; AND metadata_field_id=219;
UPDATE 1803
</code></pre><ul>
</code></pre></div><ul>
<li>In the case of the latter, the HTTP links don&rsquo;t even work! The web server returns HTTP 404 unless the request is HTTPS</li>
<li>IWMI also says that their subjects are a subset of AGROVOC so they no longer want to use <code>cg.subject.iwmi</code> for their subjects
<ul>
@ -367,67 +367,67 @@ UPDATE 1803
<ul>
<li>I have to fix the Elasticsearch indexes on AReS after last week&rsquo;s harvesting because, as always, the <code>openrxv-items</code> index should be an alias of <code>openrxv-items-final</code> instead of <code>openrxv-items-temp</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {}
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
...
</code></pre><ul>
</code></pre></div><ul>
<li>I took a backup of the <code>openrxv-items</code> index with elasticdump so I can re-create the indexes manually before starting a new harvest tomorrow:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
</code></pre><h2 id="2021-05-16">2021-05-16</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</code></pre></div><h2 id="2021-05-16">2021-05-16</h2>
<ul>
<li>I deleted and re-created the Elasticsearch indexes on AReS:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XPUT 'http://localhost:9200/openrxv-items-final'
$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ curl -XDELETE <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-final&#39;</span>
$ curl -XPUT <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp&#39;</span>
$ curl -s -X POST <span style="color:#e6db74">&#39;http://localhost:9200/_aliases&#39;</span> -H <span style="color:#e6db74">&#39;Content-Type: application/json&#39;</span> -d<span style="color:#e6db74">&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;</span>
</code></pre></div><ul>
<li>Then I re-imported the backup that I created with elasticdump yesterday:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-final --type<span style="color:#f92672">=</span>mapping
$ elasticdump --input<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --output<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items-final --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</code></pre></div><ul>
<li>Then I started a new harvest on AReS</li>
</ul>
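<ul>
<li>The recurring check in these notes is whether the <code>openrxv-items</code> alias points at <code>openrxv-items-final</code> rather than <code>openrxv-items-temp</code>. A minimal sketch of automating that check, assuming JSON shaped like the <code>_alias</code> output shown in these notes (the function name <code>indexes_for_alias</code> is hypothetical and the response is hard-coded for illustration):</li>
</ul>

```python
import json


def indexes_for_alias(alias_response: dict, alias: str) -> list:
    """Return the sorted names of the indexes that carry the given alias."""
    return sorted(
        index
        for index, info in alias_response.items()
        if alias in info.get("aliases", {})
    )


# Hard-coded sample in the shape returned by GET /_alias/
response = json.loads("""
{
  "openrxv-items-temp": {"aliases": {}},
  "openrxv-items-final": {"aliases": {"openrxv-items": {}}}
}
""")

owners = indexes_for_alias(response, "openrxv-items")
if owners == ["openrxv-items-final"]:
    print("alias OK")
else:
    print(f"alias needs fixing, currently on: {owners}")
```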
<h2 id="2021-05-17">2021-05-17</h2>
<ul>
<li>The AReS harvest finished and the Elasticsearch indexes seem OK so I shouldn&rsquo;t have to fix them next time&hellip;</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 0 0 283b 283b
yellow open openrxv-items-final TrJ1Ict3QZ-vFkj-4VcAzw 1 1 104317 0 259.4mb 259.4mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
&quot;openrxv-items-temp&quot;: {
&quot;aliases&quot;: {}
$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/_alias/&#39;</span> | python -m json.tool
&#34;openrxv-items-temp&#34;: {
&#34;aliases&#34;: {}
},
&quot;openrxv-items-final&quot;: {
&quot;aliases&quot;: {
&quot;openrxv-items&quot;: {}
&#34;openrxv-items-final&#34;: {
&#34;aliases&#34;: {
&#34;openrxv-items&#34;: {}
}
},
...
</code></pre><ul>
</code></pre></div><ul>
<li>Abenet said she and some others can&rsquo;t log into CGSpace
<ul>
<li>I checked the CGSpace LDAP account and it does indeed seem to be broken:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;cgspace-ldap@cgiarad.org&quot; -W &quot;(sAMAccountName=aorth)&quot;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;cgspace-ldap@cgiarad.org&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=aorth)&#34;</span>
Enter LDAP Password:
ldap_bind: Invalid credentials (49)
additional info: 80090308: LdapErr: DSID-0C090453, comment: AcceptSecurityContext error, data 532, v3839
</code></pre><ul>
</code></pre></div><ul>
<li>I sent a message to Biruk so he can check the LDAP account</li>
<li>IWMI confirmed that they do indeed want to move all their subjects to AGROVOC, so I made the changes in the XMLUI and config (<a href="https://github.com/ilri/DSpace/pull/467">#467</a>)
<ul>
@ -446,14 +446,14 @@ ldap_bind: Invalid credentials (49)
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ xmllint --xpath '//value-pairs[@value-pairs-name=&quot;ccafsprojectpii&quot;]/pair/stored-value/node()' dspace/config/input-forms.xml
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ xmllint --xpath <span style="color:#e6db74">&#39;//value-pairs[@value-pairs-name=&#34;ccafsprojectpii&#34;]/pair/stored-value/node()&#39;</span> dspace/config/input-forms.xml
</code></pre></div><ul>
<li>I formatted the input file with tidy, especially because one of the new project tags has an ampersand character&hellip; grrr:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml
line 3658 column 26 - Warning: unescaped &amp; or unknown entity &quot;&amp;WA_EU-IFAD&quot;
line 3659 column 23 - Warning: unescaped &amp; or unknown entity &quot;&amp;WA_EU-IFAD&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w <span style="color:#ae81ff">0</span> dspace/config/input-forms.xml
line 3658 column 26 - Warning: unescaped &amp; or unknown entity &#34;&amp;WA_EU-IFAD&#34;
line 3659 column 23 - Warning: unescaped &amp; or unknown entity &#34;&amp;WA_EU-IFAD&#34;
</code></pre></div><ul>
<li>After testing whether this escaped value worked during submission, I created and merged a pull request to <code>6_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/468">#468</a>)</li>
</ul>
<h2 id="2021-05-18">2021-05-18</h2>
@ -461,34 +461,34 @@ line 3659 column 23 - Warning: unescaped &amp; or unknown entity &quot;&amp;WA_E
<li>Paola from the Alliance emailed me some new ORCID identifiers to add to CGSpace</li>
<li>I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using <code>resolve-orcids.py</code>:</li>
</ul>
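<p>The extraction step described above can be sketched on its own as a small shell function. This is a minimal sketch rather than the exact command used here: the function name <code>extract_orcids</code> is hypothetical, and the pattern is a stricter assumption that allows only digits plus an optional trailing <code>X</code> check character, whereas the command that follows accepts any uppercase letter or digit in every position.</p>

```shell
#!/bin/sh
# extract_orcids (hypothetical helper): read text on stdin and print each
# distinct ORCID iD found, one per line, sorted.
# Assumption: an iD is four hyphen-separated groups of four digits, where
# only the very last character may be the check character X.
extract_orcids() {
    grep -oE '[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]' | sort -u
}
```

<p>It would be used in place of the <code>grep | sort | uniq</code> part of the pipeline, for example <code>cat cg-creator-identifier.xml /tmp/new | extract_orcids</code>.</p>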
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-05-18-combined.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-05-18-combined.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
</code></pre><ul>
</code></pre></div><ul>
<li>I sorted the names and added the XML formatting in vim, then ran it through tidy:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ tidy -xml -utf8 -m -iq -w <span style="color:#ae81ff">0</span> dspace/config/controlled-vocabularies/cg-creator-identifier.xml
</code></pre></div><ul>
<li>Tag fifty-five items from the Alliance&rsquo;s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-05-18-add-orcids.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-05-18-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&quot;Urioste Daza, Sergio&quot;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
&quot;Urioste, Sergio&quot;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
&quot;Villegas, Daniel&quot;,Daniel M. Villegas: 0000-0001-6801-3332
&quot;Villegas, Daniel M.&quot;,Daniel M. Villegas: 0000-0001-6801-3332
&quot;Giles, James&quot;,James Giles: 0000-0003-1899-9206
&quot;Simbare, Alice&quot;,Alice Simbare: 0000-0003-2389-0969
&quot;Simbare, Alice&quot;,Alice Simbare: 0000-0003-2389-0969
&quot;Simbare, A.&quot;,Alice Simbare: 0000-0003-2389-0969
&quot;Dita Rodriguez, Miguel&quot;,Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
&quot;Templer, Noel&quot;,Noel Templer: 0000-0002-3201-9043
&quot;Jalonen, R.&quot;,Riina Jalonen: 0000-0003-1669-9138
&quot;Jalonen, Riina&quot;,Riina Jalonen: 0000-0003-1669-9138
&quot;Izquierdo, Paulo&quot;,Paulo Izquierdo: 0000-0002-2153-0655
&quot;Reyes, Byron&quot;,Byron Reyes: 0000-0003-2672-9636
&quot;Reyes, Byron A.&quot;,Byron Reyes: 0000-0003-2672-9636
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
</code></pre><ul>
&#34;Urioste Daza, Sergio&#34;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
&#34;Urioste, Sergio&#34;,Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
&#34;Villegas, Daniel&#34;,Daniel M. Villegas: 0000-0001-6801-3332
&#34;Villegas, Daniel M.&#34;,Daniel M. Villegas: 0000-0001-6801-3332
&#34;Giles, James&#34;,James Giles: 0000-0003-1899-9206
&#34;Simbare, Alice&#34;,Alice Simbare: 0000-0003-2389-0969
&#34;Simbare, Alice&#34;,Alice Simbare: 0000-0003-2389-0969
&#34;Simbare, A.&#34;,Alice Simbare: 0000-0003-2389-0969
&#34;Dita Rodriguez, Miguel&#34;,Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
&#34;Templer, Noel&#34;,Noel Templer: 0000-0002-3201-9043
&#34;Jalonen, R.&#34;,Riina Jalonen: 0000-0003-1669-9138
&#34;Jalonen, Riina&#34;,Riina Jalonen: 0000-0003-1669-9138
&#34;Izquierdo, Paulo&#34;,Paulo Izquierdo: 0000-0002-2153-0655
&#34;Reyes, Byron&#34;,Byron Reyes: 0000-0003-2672-9636
&#34;Reyes, Byron A.&#34;,Byron Reyes: 0000-0003-2672-9636
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -d
</code></pre></div><ul>
<li>I deployed the latest <code>6_x-prod</code> branch on CGSpace, ran all system updates, and rebooted the server
<ul>
<li>This included the IWMI changes, so I also migrated the <code>cg.subject.iwmi</code> metadata to <code>dcterms.subject</code> and deleted the subject term</li>
@ -504,9 +504,9 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspa
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 47405
</code></pre><ul>
</code></pre></div><ul>
<li>That&rsquo;s interesting because we lowercased them all a few months ago, so these must all be new&hellip; wow
<ul>
<li>We have 405,000 total AGROVOC terms, with 20,600 of them being unique</li>
@ -518,12 +518,12 @@ UPDATE 47405
<ul>
<li>Export the top 5,000 AGROVOC terms to validate them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
COPY 5000
$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d &gt; /tmp/2021-05-20-agrovoc.txt
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-05-20-agrovoc.csv| sed 1d &gt; /tmp/2021-05-20-agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
$ csvgrep -c &quot;number of matches&quot; -r '^0$' /tmp/2021-05-20-agrovoc-results.csv &gt; /tmp/2021-05-20-agrovoc-rejected.csv
</code></pre><ul>
$ csvgrep -c <span style="color:#e6db74">&#34;number of matches&#34;</span> -r <span style="color:#e6db74">&#39;^0$&#39;</span> /tmp/2021-05-20-agrovoc-results.csv &gt; /tmp/2021-05-20-agrovoc-rejected.csv
</code></pre></div><ul>
<li>Meeting with Medha and Pythagoras about the FAIR Workflow tool
<ul>
<li>Discussed the need for such a tool, other tools being developed, etc</li>
@ -545,54 +545,54 @@ $ csvgrep -c &quot;number of matches&quot; -r '^0$' /tmp/2021-05-20-agrovoc-resu
<ul>
<li>Add ORCID identifiers for missing ILRI authors and tag 550 others, based on a few authors I noticed were missing them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-05-24-add-orcids.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-05-24-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&quot;Patel, Ekta&quot;,&quot;Ekta Patel: 0000-0001-9400-6988&quot;
&quot;Dessie, Tadelle&quot;,&quot;Tadelle Dessie: 0000-0002-1630-0417&quot;
&quot;Tadelle, D.&quot;,&quot;Tadelle Dessie: 0000-0002-1630-0417&quot;
&quot;Dione, Michel M.&quot;,&quot;Michel Dione: 0000-0001-7812-5776&quot;
&quot;Kiara, Henry K.&quot;,&quot;Henry Kiara: 0000-0001-9578-1636&quot;
&quot;Naessens, Jan&quot;,&quot;Jan Naessens: 0000-0002-7075-9915&quot;
&quot;Steinaa, Lucilla&quot;,&quot;Lucilla Steinaa: 0000-0003-3691-3971&quot;
&quot;Wieland, Barbara&quot;,&quot;Barbara Wieland: 0000-0003-4020-9186&quot;
&quot;Grace, Delia&quot;,&quot;Delia Grace: 0000-0002-0195-9489&quot;
&quot;Rao, Idupulapati M.&quot;,&quot;Idupulapati M. Rao: 0000-0002-8381-9358&quot;
&quot;Cardoso Arango, Juan Andrés&quot;,&quot;Juan Andrés Cardoso Arango: 0000-0002-0252-4655&quot;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
&#34;Patel, Ekta&#34;,&#34;Ekta Patel: 0000-0001-9400-6988&#34;
&#34;Dessie, Tadelle&#34;,&#34;Tadelle Dessie: 0000-0002-1630-0417&#34;
&#34;Tadelle, D.&#34;,&#34;Tadelle Dessie: 0000-0002-1630-0417&#34;
&#34;Dione, Michel M.&#34;,&#34;Michel Dione: 0000-0001-7812-5776&#34;
&#34;Kiara, Henry K.&#34;,&#34;Henry Kiara: 0000-0001-9578-1636&#34;
&#34;Naessens, Jan&#34;,&#34;Jan Naessens: 0000-0002-7075-9915&#34;
&#34;Steinaa, Lucilla&#34;,&#34;Lucilla Steinaa: 0000-0003-3691-3971&#34;
&#34;Wieland, Barbara&#34;,&#34;Barbara Wieland: 0000-0003-4020-9186&#34;
&#34;Grace, Delia&#34;,&#34;Delia Grace: 0000-0002-0195-9489&#34;
&#34;Rao, Idupulapati M.&#34;,&#34;Idupulapati M. Rao: 0000-0002-8381-9358&#34;
&#34;Cardoso Arango, Juan Andrés&#34;,&#34;Juan Andrés Cardoso Arango: 0000-0002-0252-4655&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><ul>
<li>A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_data.json --type<span style="color:#f92672">=</span>data --limit<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
$ elasticdump --input<span style="color:#f92672">=</span>http://localhost:9200/openrxv-items --output<span style="color:#f92672">=</span>/home/aorth/openrxv-items_mapping.json --type<span style="color:#f92672">=</span>mapping
</code></pre></div><ul>
<li>The indexes look OK so I started a harvest on AReS</li>
</ul>
<h2 id="2021-05-25">2021-05-25</h2>
<ul>
<li>The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvest:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
yellow open openrxv-items-final soEzAnp3TDClIGZbmVyEIw 1 1 953 0 2.3mb 2.3mb
</code></pre><ul>
</code></pre></div><ul>
<li>Update all docker images on the AReS server (linode20):</li>
</ul>
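<p>The image list for the update step can also be built with a small filter instead of <code>sed</code> and <code>cut</code>. This is only a sketch under assumptions: the function name <code>image_refs</code> is hypothetical, and it assumes the default column layout of <code>docker images</code> (a header row, then repository in the first column and tag in the second). It skips rows whose repository or tag is <code>&lt;none&gt;</code>, since those cannot be pulled by name.</p>

```shell
#!/bin/sh
# image_refs (hypothetical helper): read `docker images` output on stdin
# and print one repository:tag reference per line.
# Assumption: default table layout with a header on the first line.
image_refs() {
    awk 'NR > 1 && $1 != "<none>" && $2 != "<none>" {print $1 ":" $2}'
}
```

<p>Usage would look like <code>docker images | image_refs | xargs -L1 docker pull</code>.</p>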
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml down
$ docker-compose -f docker/docker-compose.yml build
</code></pre><ul>
</code></pre></div><ul>
<li>Then run all system updates on the server and reboot it</li>
<li>Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317&hellip; so it was actually correct before!</li>
<li>For reference, this is how I re-created everything:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">curl -XDELETE &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XDELETE &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -s -X POST &#39;http://localhost:9200/_aliases&#39; -H &#39;Content-Type: application/json&#39; -d&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre><ul>
</code></pre></div><ul>
<li>I will just start a new harvest&hellip; sigh</li>
</ul>
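<p>Since these index re-creation steps come up repeatedly, they could be collected in a small script. The following is a minimal sketch, not something that was run here: <code>ES_URL</code>, <code>alias_payload</code>, and <code>recreate_commands</code> are hypothetical names, and the script only prints the <code>curl</code> commands so they can be reviewed before anything is deleted. It handles each index in turn (delete, then create), which differs in order from the commands above but should be equivalent.</p>

```shell
#!/bin/sh
# Assumption: Elasticsearch listens on localhost:9200 unless ES_URL is set.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the JSON body for the _aliases API that points an alias at an index.
# $1: index name, $2: alias name
alias_payload() {
    printf '{"actions":[{"add":{"index":"%s","alias":"%s"}}]}' "$1" "$2"
}

# Print the curl commands that would recreate both indexes and the alias.
# Nothing is sent to Elasticsearch by this function.
recreate_commands() {
    for index in openrxv-items-final openrxv-items-temp; do
        printf "curl -XDELETE '%s/%s'\n" "$ES_URL" "$index"
        printf "curl -XPUT '%s/%s'\n" "$ES_URL" "$index"
    done
    printf "curl -s -X POST '%s/_aliases' -H 'Content-Type: application/json' -d '%s'\n" \
        "$ES_URL" "$(alias_payload openrxv-items-final openrxv-items)"
}
```

<p>After reviewing the printed commands, they could be run with <code>recreate_commands | sh</code>, followed by the two <code>elasticdump</code> imports.</p>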
<h2 id="2021-05-26">2021-05-26</h2>
@ -638,18 +638,18 @@ May 26, 02:57 UTC
</code></pre><ul>
<li>And indeed the email seems to be broken:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace test-email
About to send test email:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace test-email
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>About to send test email:
- To: fuuuuuu
- Subject: DSpace test email
- Server: smtp.office365.com
Error sending email:
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))
Please see the DSpace documentation for assistance.
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Please see the DSpace documentation for assistance.
</code></pre></div><ul>
<li>I saw a recent thread on the dspace-tech mailing list about this that makes me wonder if Microsoft changed something on Office 365
<ul>
<li>I added <code>mail.smtp.ssl.protocols=TLSv1.2</code> to the <code>mail.extraproperties</code> in dspace.cfg and the test email sent successfully</li>
@ -36,7 +36,7 @@ I simply started it and AReS was running again:
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -132,8 +132,8 @@ I simply started it and AReS was running again:
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml start angular_nginx
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker-compose -f docker/docker-compose.yml start angular_nginx
</code></pre></div><ul>
<li>Margarita from CCAFS emailed me to say that workflow alerts haven&rsquo;t been working lately
<ul>
<li>I guess this is related to the SMTP issues last week</li>
@ -162,14 +162,14 @@ I simply started it and AReS was running again:
<ul>
<li>The Elasticsearch indexes are messed up so I dumped and re-created them correctly:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{&quot;actions&quot; : [{&quot;add&quot; : { &quot;index&quot; : &quot;openrxv-items-final&quot;, &quot;alias&quot; : &quot;openrxv-items&quot;}}]}'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">curl -XDELETE &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XDELETE &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-final&#39;
curl -XPUT &#39;http://localhost:9200/openrxv-items-temp&#39;
curl -s -X POST &#39;http://localhost:9200/_aliases&#39; -H &#39;Content-Type: application/json&#39; -d&#39;{&#34;actions&#34; : [{&#34;add&#34; : { &#34;index&#34; : &#34;openrxv-items-final&#34;, &#34;alias&#34; : &#34;openrxv-items&#34;}}]}&#39;
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
</code></pre><ul>
</code></pre></div><ul>
<li>Then I started a harvest on AReS</li>
</ul>
<h2 id="2021-06-07">2021-06-07</h2>
@ -208,8 +208,8 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
</code></pre></div><ul>
<li>The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it&rsquo;s much faster
<ul>
<li>I harvested 90,000+ items from DSpace Test in ~3 hours</li>
@ -231,23 +231,23 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | wc -l
90459
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | sort | uniq | wc -l
90380
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data.json | awk -F: <span style="color:#e6db74">&#39;{print $2}&#39;</span> | sort | uniq -c | sort -h
...
2 &quot;10568/99409&quot;
2 &quot;10568/99410&quot;
2 &quot;10568/99411&quot;
2 &quot;10568/99516&quot;
3 &quot;10568/102093&quot;
3 &quot;10568/103524&quot;
3 &quot;10568/106664&quot;
3 &quot;10568/106940&quot;
3 &quot;10568/107195&quot;
3 &quot;10568/96546&quot;
</code></pre><h2 id="2021-06-20">2021-06-20</h2>
2 &#34;10568/99409&#34;
2 &#34;10568/99410&#34;
2 &#34;10568/99411&#34;
2 &#34;10568/99516&#34;
3 &#34;10568/102093&#34;
3 &#34;10568/103524&#34;
3 &#34;10568/106664&#34;
3 &#34;10568/106940&#34;
3 &#34;10568/107195&#34;
3 &#34;10568/96546&#34;
</code></pre></div><h2 id="2021-06-20">2021-06-20</h2>
<ul>
<li>Udana asked me to update their IWMI subjects from <code>farmer managed irrigation systems</code> to <code>farmer-led irrigation</code>
<ul>
@ -255,12 +255,12 @@ $ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-it
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
</code></pre></div><ul>
<li>Then I used <code>csvcut</code> to extract just the columns I needed and do the replacement into a new CSV:</li>
</ul>
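<p>One detail worth keeping in mind for the replacement step: a <code>sed</code> substitution without the <code>g</code> flag only changes the first match on each line. I have not checked whether any row in this export carries the old term more than once (for example in both subject columns), so the following is only a sketch of a variant that replaces every occurrence. The function name <code>replace_subject</code> is hypothetical, and it assumes that neither term contains a slash or other characters that are special to <code>sed</code>.</p>

```shell
#!/bin/sh
# replace_subject (hypothetical helper): replace every occurrence of one
# subject term with another in CSV text read from stdin.
# $1: old term, $2: new term
# Assumption: the terms contain no "/" and no sed metacharacters.
# Note: this is a plain substring replacement, so a longer term that
# contains the old term would be changed as well.
replace_subject() {
    sed "s/$1/$2/g"
}
```

<p>Usage would look like <code>csvcut -c ... input.csv | replace_subject 'farmer managed irrigation systems' 'farmer-led irrigation'</code>.</p>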
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' &gt; /tmp/2021-06-20-IWMI-new-subjects.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,dcterms.subject[],dcterms.subject[en_US]&#39;</span> /tmp/2021-06-20-IWMI.csv | sed <span style="color:#e6db74">&#39;s/farmer managed irrigation systems/farmer-led irrigation/&#39;</span> &gt; /tmp/2021-06-20-IWMI-new-subjects.csv
</code></pre></div><ul>
<li>Then I uploaded the resulting CSV to CGSpace, updating 161 items</li>
<li>Start a harvest on AReS</li>
<li>I found <a href="https://jira.lyrasis.org/browse/DS-1977">a bug report</a> and <a href="https://github.com/DSpace/DSpace/pull/2584">a patch</a> for private items showing up in the DSpace sitemap
@ -278,19 +278,19 @@ $ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-it
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | wc -l
90937
$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | sort -u | wc -l
$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | sort -u | wc -l
85709
</code></pre><ul>
</code></pre></div><ul>
<li>So those could be duplicates from the way we harvest pages, but they could also be from mappings&hellip;
<ul>
<li>Manually inspecting the duplicates where handles appear more than once:</li>
</ul>
</li>
</ul>
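<p>To make the manual inspection easier, the counting can be narrowed to only those handles that occur more than once. This is a minimal sketch with a hypothetical function name, <code>list_duplicate_handles</code>. It assumes the dump contains <code>"handle":"prefix/suffix"</code> pairs exactly as matched by the commands in these notes, and it leaves out the repository filter for brevity.</p>

```shell
#!/bin/sh
# list_duplicate_handles (hypothetical helper): read an elasticdump data
# file on stdin and print "count handle" for every handle that appears
# more than once, sorted by count.
list_duplicate_handles() {
    grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' \
        | cut -d: -f2 \
        | tr -d '"' \
        | sort \
        | uniq -c \
        | awk '$1 > 1 {print $1, $2}' \
        | sort -n
}
```

<p>Usage would look like <code>grep -E '"repo":"CGSpace"' openrxv-items_data.json | list_duplicate_handles</code>.</p>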
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:alnum:]]+&quot;' | sort | uniq -c | sort -h
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;CGSpace&#34;&#39;</span> openrxv-items_data.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:alnum:]]+&#34;&#39;</span> | sort | uniq -c | sort -h
</code></pre></div><ul>
<li>Unfortunately I found no pattern:
<ul>
<li>Some appear twice in the Elasticsearch index, but appear in only one collection</li>
@ -312,23 +312,23 @@ $ grep -E '&quot;repo&quot;:&quot;CGSpace&quot;' openrxv-items_data.json | grep
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq length
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq length
5
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq '.[].handle'
&quot;10673/4&quot;
&quot;10673/3&quot;
&quot;10673/6&quot;
&quot;10673/5&quot;
&quot;10673/7&quot;
# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq length
$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq <span style="color:#e6db74">&#39;.[].handle&#39;</span>
&#34;10673/4&#34;
&#34;10673/3&#34;
&#34;10673/6&#34;
&#34;10673/5&#34;
&#34;10673/7&#34;
# log into DSpace Demo XMLUI as admin and make one item private <span style="color:#f92672">(</span><span style="color:#66d9ef">for</span> example 10673/6<span style="color:#f92672">)</span>
$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq length
4
$ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&quot; | jq '.[].handle'
&quot;10673/4&quot;
&quot;10673/3&quot;
&quot;10673/5&quot;
&quot;10673/7&quot;
</code></pre><ul>
$ curl -s -H <span style="color:#e6db74">&#34;Accept: application/json&#34;</span> <span style="color:#e6db74">&#34;https://demo.dspace.org/rest/items?offset=0&amp;limit=5&#34;</span> | jq <span style="color:#e6db74">&#39;.[].handle&#39;</span>
&#34;10673/4&#34;
&#34;10673/3&#34;
&#34;10673/5&#34;
&#34;10673/7&#34;
</code></pre></div><ul>
<li>I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira</li>
<li>Last week I noticed that the Gender Platform website is using &ldquo;cgspace.cgiar.org&rdquo; links for CGSpace, instead of handles
<ul>
@ -355,11 +355,11 @@ $ curl -s -H &quot;Accept: application/json&quot; &quot;https://demo.dspace.org/
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data-local-ds-4065.json | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data-local-ds-4065.json | wc -l
90327
$ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;[[:digit:]]+/[[:digit:]]+&#34;&#39;</span> openrxv-items_data-local-ds-4065.json | sort -u | wc -l
90317
</code></pre><h2 id="2021-06-22">2021-06-22</h2>
</code></pre></div><h2 id="2021-06-22">2021-06-22</h2>
<ul>
<li>Make a <a href="https://github.com/atmire/COUNTER-Robots/pull/43">pull request</a> to the COUNTER-Robots project to add two new user agents: crusty and newspaper
<ul>
@ -368,13 +368,13 @@ $ grep -oE '&quot;handle&quot;:&quot;[[:digit:]]+/[[:digit:]]+&quot;' openrxv-it
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 1339 hits from RI\/1\.0 in statistics
Purging 447 hits from crusty in statistics
Purging 3736 hits from newspaper in statistics
Total number of bot hits purged: 5522
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 5522
</code></pre></div><ul>
<li>Surprised to see RI/1.0 in there because it&rsquo;s been in the override file for a while</li>
<li>Looking at the 2021 statistics in Solr I see a few more suspicious user agents:
<ul>
@ -397,11 +397,11 @@ Total number of bot hits purged: 5522
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># journalctl --since<span style="color:#f92672">=</span>today -u tomcat7 | grep -c <span style="color:#e6db74">&#39;Connection has been abandoned&#39;</span>
978
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
10100
</code></pre><ul>
</code></pre></div><ul>
<li>I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now</li>
<li>In the meantime, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)</li>
<li>I also applied the following patches from the 6.4 milestone to our <code>6_x-prod</code> branch:
@ -412,17 +412,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</li>
<li>After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39;</span> | wc -l
63
</code></pre><ul>
</code></pre></div><ul>
<li>Looking in the DSpace log, the first &ldquo;pool empty&rdquo; message I saw this morning was at 4AM:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre></div><ul>
<li>Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
</code></pre></div><ul>
<li>We can purge them, as this is not user traffic: <a href="https://about.flipboard.com/browserproxy/">https://about.flipboard.com/browserproxy/</a>
<ul>
<li>I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots</li>
@ -448,17 +448,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspace-openrxv-items-temp-backup.json | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> cgspace-openrxv-items-temp-backup.json | wc -l
104797
$ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
$ grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
99186
</code></pre><ul>
</code></pre></div><ul>
<li>This number is probably unique for that particular harvest, but I don&rsquo;t think it represents the true number of items&hellip;</li>
<li>The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;DSpace Test&quot;' 2021-06-23-openrxv-items-final-local.json | grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -E <span style="color:#e6db74">&#39;&#34;repo&#34;:&#34;DSpace Test&#34;&#39;</span> 2021-06-23-openrxv-items-final-local.json | grep -oE <span style="color:#e6db74">&#39;&#34;handle&#34;:&#34;([[:digit:]]|\.)+/[[:digit:]]+&#34;&#39;</span> | sort | uniq | wc -l
90990
</code></pre><ul>
</code></pre></div><ul>
<li>So the harvest on the live site is missing items, but then why didn&rsquo;t the add missing items plugin find them?!
<ul>
<li>I notice that we are missing the <code>type</code> in the metadata structure config for each repository on the production site, and we are using <code>type</code> for item type in the actual schema&hellip; so maybe there is a conflict there</li>
@ -469,8 +469,8 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] &quot;GET /sitemap HTTP/1.1&quot; 503 190 &quot;-&quot; &quot;OpenRXV harvesting bot; https://github.com/ilri/OpenRXV&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] &#34;GET /sitemap HTTP/1.1&#34; 503 190 &#34;-&#34; &#34;OpenRXV harvesting bot; https://github.com/ilri/OpenRXV&#34;
</code></pre></div><ul>
<li>I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins&hellip; now it&rsquo;s checking 180,000+ handles to see if they are collections or items&hellip;
<ul>
<li>I see it fetched the sitemap three times; we need to make sure it&rsquo;s only doing it once for each repository</li>
@ -478,9 +478,9 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</li>
<li>According to the api logs we will be adding 5,697 items:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker logs api 2&gt;/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker logs api 2&gt;/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
5697
</code></pre><ul>
</code></pre></div><ul>
<li>Spent a few hours with Moayad troubleshooting and improving OpenRXV
<ul>
<li>We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs</li>
@ -496,35 +496,35 @@ $ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspa
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ redis-cli
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ redis-cli
127.0.0.1:6379&gt; SCAN 0 COUNT 5
1) &quot;49152&quot;
2) 1) &quot;bull:plugins:476595&quot;
2) &quot;bull:plugins:367382&quot;
3) &quot;bull:plugins:369228&quot;
4) &quot;bull:plugins:438986&quot;
5) &quot;bull:plugins:366215&quot;
</code></pre><ul>
1) &#34;49152&#34;
2) 1) &#34;bull:plugins:476595&#34;
2) &#34;bull:plugins:367382&#34;
3) &#34;bull:plugins:369228&#34;
4) &#34;bull:plugins:438986&#34;
5) &#34;bull:plugins:366215&#34;
</code></pre></div><ul>
<li>We can apparently get the names of the jobs in each hash using <code>hget</code>:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">127.0.0.1:6379&gt; TYPE bull:plugins:401827
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">127.0.0.1:6379&gt; TYPE bull:plugins:401827
hash
127.0.0.1:6379&gt; HGET bull:plugins:401827 name
&quot;dspace_add_missing_items&quot;
</code></pre><ul>
&#34;dspace_add_missing_items&#34;
</code></pre></div><ul>
<li>I whipped up a one-liner to get the keys for all plugin jobs, convert them to Redis <code>HGET</code> commands to extract the value of the name field, and then sort the names by their counts:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ redis-cli KEYS &quot;bull:plugins:*&quot; \
| sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ redis-cli KEYS <span style="color:#e6db74">&#34;bull:plugins:*&#34;</span> <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span> | sed -e &#39;s/^bull/HGET bull/&#39; -e &#39;s/\([[:digit:]]\)$/\1 name/&#39; \
| ncat -w 3 localhost 6379 \
| grep -v -E '^\$' | sort | uniq -c | sort -h
| grep -v -E &#39;^\$&#39; | sort | uniq -c | sort -h
3 dspace_health_check
4 -ERR wrong number of arguments for 'hget' command
4 -ERR wrong number of arguments for &#39;hget&#39; command
12 mel_downloads_and_views
129 dspace_altmetrics
932 dspace_downloads_and_views
186428 dspace_add_missing_items
</code></pre><ul>
</code></pre></div><ul>
<li>Note that this uses <code>ncat</code> to send commands directly to redis all at once instead of one at a time (<code>netcat</code> didn&rsquo;t work here, as it doesn&rsquo;t know when our input is finished and never quits)
<ul>
<li>I thought of using <code>redis-cli --pipe</code> but then you have to construct the commands in the redis protocol format with the number of args and length of each command</li>
@ -544,7 +544,7 @@ hash
<ul>
<li>Looking at the DSpace log I see there was definitely a higher number of sessions that day, perhaps twice the normal number:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ for file in dspace.log.2021-06-[12]*; do echo &quot;$file&quot;; grep -oE 'session_id=[A-Z0-9]{32}' &quot;$file&quot; | sort | uniq | wc -l; done
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ <span style="color:#66d9ef">for</span> file in dspace.log.2021-06-<span style="color:#f92672">[</span>12<span style="color:#f92672">]</span>*; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span>; grep -oE <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}&#39;</span> <span style="color:#e6db74">&#34;</span>$file<span style="color:#e6db74">&#34;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
dspace.log.2021-06-10
19072
dspace.log.2021-06-11
@ -581,12 +581,12 @@ dspace.log.2021-06-26
16163
dspace.log.2021-06-27
5886
</code></pre><ul>
</code></pre></div><ul>
<li>I see 15,000 unique IPs in the XMLUI logs alone on that day:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep <span style="color:#e6db74">&#39;23/Jun/2021&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l
15835
</code></pre><ul>
</code></pre></div><ul>
<li>Annoyingly I found 37,000 more hits from Bing using <code>dns:*msnbot* AND dns:*.msn.com.</code> as a Solr filter
<ul>
<li>WTF, they are using a normal user agent: <code>Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko</code></li>
@ -628,8 +628,8 @@ dspace.log.2021-06-27
</li>
<li>The DSpace log shows:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</code></pre></div><ul>
<li>The first one of these I see is from last night (2021-06-29) at 10:47 PM</li>
<li>I restarted Tomcat 7 and CGSpace came back up&hellip;</li>
<li>I didn&rsquo;t see that Atmire had responded last week (on 2021-06-23) about the issues we had
@ -641,14 +641,14 @@ dspace.log.2021-06-27
</li>
<li>Export a list of all CGSpace&rsquo;s AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value AS &quot;dcterms.subject&quot;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &quot;dcterms.subject&quot; ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value AS &#34;dcterms.subject&#34;, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY &#34;dcterms.subject&#34; ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
COPY 20780
</code></pre><ul>
</code></pre></div><ul>
<li>Actually Enrico wanted NON AGROVOC, so I extracted all the center and CRP subjects (ignoring system office and themes):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
COPY 1710
</code></pre><ul>
</code></pre></div><ul>
<li>Fix an issue in the Ansible infrastructure playbooks for the DSpace role
<ul>
<li>It was causing the template module to fail when setting up the npm environment</li>
@ -657,13 +657,13 @@ COPY 1710
</li>
<li>I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
</code></pre></div><ul>
<li>What&rsquo;s even crazier is that it is twice that on CGSpace (linode18)!</li>
<li>Apparently OpenJDK defaults to using <code>/dev/random</code> (see <code>/etc/java-8-openjdk/security/java.security</code>):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">securerandom.source=file:/dev/urandom
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">securerandom.source=file:/dev/urandom
</code></pre></div><ul>
<li><code>/dev/random</code> blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator
<ul>
<li>Now Tomcat starts much faster and no warning is printed so I&rsquo;m going to add this to our Ansible infrastructure playbooks</li>

View File

@ -30,7 +30,7 @@ Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVO
localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -120,17 +120,17 @@ COPY 20994
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre><h2 id="2021-07-04">2021-07-04</h2>
</code></pre></div><h2 id="2021-07-04">2021-07-04</h2>
<ul>
<li>Update all Docker containers on the AReS server (linode20) and rebuild OpenRXV:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cd OpenRXV
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cd OpenRXV
$ docker-compose -f docker/docker-compose.yml down
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml build
</code></pre><ul>
</code></pre></div><ul>
<li>Then run all system updates and reboot the server</li>
<li>After the server came back up I cloned the <code>openrxv-items-final</code> index to <code>openrxv-items-temp</code> and started the plugins
<ul>
@ -172,7 +172,7 @@ $ docker-compose -f docker/docker-compose.yml build
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
Purging 95 hits from Drupal in statistics
Purging 38 hits from DTS Agent in statistics
Purging 601 hits from Microsoft Office Existence Discovery in statistics
@ -183,16 +183,16 @@ Purging 144 hits from FlipboardProxy in statistics
Purging 37 hits from LinkWalker in statistics
Purging 1 hits from [Ll]ink.?[Cc]heck.? in statistics
Purging 427 hits from WordPress in statistics
Total number of bot hits purged: 15030
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 15030
</code></pre></div><ul>
<li>Meet with the CGIAR-AGROVOC task group to discuss how we want to do the workflow for submitting new terms to AGROVOC</li>
<li>I extracted another list of all subjects to check against AGROVOC:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">\COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d &gt; /tmp/2021-07-06-all-subjects.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">\COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-07-06-all-subjects.csv | sed 1d &gt; /tmp/2021-07-06-all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
</code></pre><ul>
</code></pre></div><ul>
<li>Test <a href="https://github.com/DSpace/DSpace/pull/3162">Hrafn Malmquist&rsquo;s proposed DBCP2 changes</a> for DSpace 6.4 (DS-4574)
<ul>
<li>His changes reminded me that we can perhaps switch back to using this pooling instead of Tomcat 7&rsquo;s JDBC pooling via JNDI</li>
@ -205,7 +205,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># for num in {10..26}; do echo &quot;2021-06-$num&quot;; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep &quot;$num/Jun/2021&quot; | awk '{print $1}' | sort | uniq | wc -l; done
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># <span style="color:#66d9ef">for</span> num in <span style="color:#f92672">{</span>10..26<span style="color:#f92672">}</span>; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;2021-06-</span>$num<span style="color:#e6db74">&#34;</span>; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep <span style="color:#e6db74">&#34;</span>$num<span style="color:#e6db74">/Jun/2021&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
2021-06-10
10693
2021-06-11
@ -240,10 +240,10 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
9439
2021-06-26
7930
</code></pre><ul>
</code></pre></div><ul>
<li>Similarly, the number of connections to the REST API was around the average for the preceding weeks:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># for num in {10..26}; do echo &quot;2021-06-$num&quot;; zcat /var/log/nginx/rest.*.gz | grep &quot;$num/Jun/2021&quot; | awk '{print $1}' | sort | uniq | wc -l; done
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># <span style="color:#66d9ef">for</span> num in <span style="color:#f92672">{</span>10..26<span style="color:#f92672">}</span>; <span style="color:#66d9ef">do</span> echo <span style="color:#e6db74">&#34;2021-06-</span>$num<span style="color:#e6db74">&#34;</span>; zcat /var/log/nginx/rest.*.gz | grep <span style="color:#e6db74">&#34;</span>$num<span style="color:#e6db74">/Jun/2021&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq | wc -l; <span style="color:#66d9ef">done</span>
2021-06-10
1183
2021-06-11
@ -278,11 +278,11 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
969
2021-06-26
904
</code></pre><ul>
</code></pre></div><ul>
<li>According to goaccess, the traffic spike started at 2AM (remember that the first &ldquo;Pool empty&rdquo; error in dspace.log was at 4:01AM):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat /var/log/nginx/access.log.1<span style="color:#f92672">[</span>45<span style="color:#f92672">]</span>.gz /var/log/nginx/library-access.log.1<span style="color:#f92672">[</span>45<span style="color:#f92672">]</span>.gz | grep -E <span style="color:#e6db74">&#39;23/Jun/2021&#39;</span> | goaccess --log-format<span style="color:#f92672">=</span>COMBINED -
</code></pre></div><ul>
<li>Moayad sent a fix for the add missing items plugins issue (<a href="https://github.com/ilri/OpenRXV/pull/107">#107</a>)
<ul>
<li>It works MUCH faster because it correctly identifies the missing handles in each repository</li>
@ -311,19 +311,19 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
2302
postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
2564
postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | wc -l
2530
</code></pre><ul>
</code></pre></div><ul>
<li>The locks are held by XMLUI, not REST API or OAI:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | grep -o -E &#39;(dspaceWeb|dspaceApi)&#39; | sort | uniq -c | sort -n
57 dspaceApi
2671 dspaceWeb
</code></pre><ul>
</code></pre></div><ul>
<li>I ran all updates on the server (linode18) and restarted it, then DSpace came back up</li>
<li>I sent a message to Atmire, as I never heard from them last week when we blocked access to the REST API for two days for them to investigate the server issues</li>
<li>Clone the <code>openrxv-items-temp</code> index on AReS and re-run all the plugins, but most of the &ldquo;dspace_add_missing_items&rdquo; tasks failed so I will just run a full re-harvest</li>
@ -338,7 +338,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq -c | sort -n
32 91.243.191.124
33 91.243.191.129
33 91.243.191.200
@ -362,7 +362,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
45 91.243.191.151
46 91.243.191.103
56 91.243.191.172
</code></pre><ul>
</code></pre></div><ul>
<li>I found a few people complaining about these Russian attacks too:
<ul>
<li><a href="https://community.cloudflare.com/t/russian-ddos-completley-unmitigated-by-cloudflare/284578">https://community.cloudflare.com/t/russian-ddos-completley-unmitigated-by-cloudflare/284578</a></li>
@ -392,13 +392,13 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./asn -n 45.80.217.235
╭──────────────────────────────╮
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./asn -n 45.80.217.235
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>╭──────────────────────────────╮
│ ASN lookup for 45.80.217.235 │
╰──────────────────────────────╯
45.80.217.235 ┌PTR -
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 45.80.217.235 ┌PTR -
├ASN 46844 (ST-BGP, US)
├ORG Sharktech
├NET 45.80.217.0/24 (TrafficTransitSolutionNet)
@ -407,7 +407,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
├TYP Proxy host Hosting/DC
├GEO Los Angeles, California (US)
└REP ✓ NONE
</code></pre><ul>
</code></pre></div><ul>
<li>Slowly slowly I manually built up a list of the IPs, ISP names, and network blocks, for example:</li>
</ul>
<pre tabindex="0"><code class="language-csv" data-lang="csv">IP, Organization, Website, Network
@ -496,17 +496,17 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; /var/log/nginx/access.log | awk '{print $1}' | sort | uniq &gt; /tmp/ips-sorted.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> /var/log/nginx/access.log | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/ips-sorted.txt
# wc -l /tmp/ips-sorted.txt
10776 /tmp/ips-sorted.txt
</code></pre><ul>
</code></pre></div><ul>
<li>Then resolve them all:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
</code></pre><ul>
<li>Then get the top ten organizations and top ten ASNs:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#ae81ff">2</span> /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span style="color:#ae81ff">10</span>
213 AMAZON-AES
218 ASN-QUADRANET-GLOBAL
246 Silverstar Invest Limited
@ -517,7 +517,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
814 UGB Hosting OU
1010 ST-BGP
1757 Global Layer B.V.
$ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
$ csvcut -c <span style="color:#ae81ff">3</span> /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span style="color:#ae81ff">10</span>
213 14618
218 8100
246 35624
@ -528,10 +528,10 @@ $ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
814 206485
1010 46844
1757 49453
</code></pre><ul>
</code></pre></div><ul>
<li>I will download blocklists for all these except Ethiopian Telecom, Quadranet, and Amazon, though I&rsquo;m concerned about Global Layer because it&rsquo;s a huge ASN that seems to have legit hosts too&hellip;?</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
$ wget https://asn.ipinfo.app/api/text/nginx/AS46844
$ wget https://asn.ipinfo.app/api/text/nginx/AS206485
$ wget https://asn.ipinfo.app/api/text/nginx/AS62282
@ -540,12 +540,12 @@ $ wget https://asn.ipinfo.app/api/text/nginx/AS35624
$ cat AS* | sort | uniq &gt; /tmp/abusive-networks.txt
$ wc -l /tmp/abusive-networks.txt
2276 /tmp/abusive-networks.txt
</code></pre><ul>
</code></pre></div><ul>
<li>Combining with my existing rules and filtering uniques:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
2298
</code></pre><ul>
</code></pre></div><ul>
<li><a href="https://scamalytics.com/ip/isp/2021-06">According to Scamalytics all these are high risk ISPs</a> (as recently as 2021-06) so I will just keep blocking them</li>
<li>I deployed the block list on CGSpace (linode18) and the load is down to 1.0 but I see there are still some DDoS IPs getting through&hellip; sigh</li>
<li>The next thing I need to do is purge all the IPs from Solr using grepcidr&hellip;</li>
@ -558,12 +558,12 @@ $ wc -l /tmp/abusive-networks.txt
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E &quot; (200|499) &quot; | awk '{print $1}' | sort | uniq &gt; /tmp/all-ips.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/all-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips.txt -o /tmp/all-ips-out.csv
$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/all-ips-to-block.txt
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(206485|35624|36352|46844|49453|62282)$&#39;</span> /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/all-ips-to-block.txt
$ wc -l /tmp/all-ips-to-block.txt
5095 /tmp/all-ips-to-block.txt
</code></pre><ul>
</code></pre></div><ul>
<li>Then I added them to the normal ipset we are already using with firewalld
<ul>
<li>I will check again in a few hours and ban more</li>
@ -571,10 +571,10 @@ $ wc -l /tmp/all-ips-to-block.txt
</li>
<li>I decided to extract the networks from the GeoIP database with <code>resolve-addresses-geoip2.py</code> so I can block them more efficiently than using the 5,000 IPs in an ipset:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/all-networks-to-block.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(206485|35624|36352|46844|49453|62282)$&#39;</span> /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/all-networks-to-block.txt
$ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq | wc -l
2354
</code></pre><ul>
</code></pre></div><ul>
<li>Combined with the previous networks this brings about 200 more for a total of 2,354 networks
<ul>
<li>I think I need to re-work the ipset stuff in my common Ansible role so that I can add such abusive networks as an iptables ipset / nftables set, and have a cron job to update them daily (from <a href="https://www.spamhaus.org/drop/">Spamhaus&rsquo;s DROP and EDROP lists</a>, for example)</li>
@ -582,25 +582,25 @@ $ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq
</li>
<li>Then I got a list of all the 5,095 IPs from above and used <code>check-spider-ip-hits.sh</code> to purge them from Solr:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
...
Total number of bot hits purged: 197116
</code></pre><ul>
</code></pre></div><ul>
<li>I started a harvest on AReS and it finished in a few hours now that the load on CGSpace is back to a normal level</li>
</ul>
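<p>As a rough illustration of why blocking networks is more compact than blocking thousands of individual IPs, here is a minimal Python sketch using only the standard library's <code>ipaddress</code> module. It is not the logic of <code>resolve-addresses-geoip2.py</code>: the function names are hypothetical and the sample addresses are documentation ranges, not the real attackers. It merges overlapping or adjacent networks and then checks which IPs fall inside them, roughly what <code>grepcidr</code> does.</p>

```python
import ipaddress


def collapse_networks(cidrs):
    """Merge overlapping or adjacent CIDR blocks into the smallest equivalent list.

    Assumes all blocks are the same IP version (collapse_addresses raises
    TypeError on a mix of IPv4 and IPv6).
    """
    networks = [ipaddress.ip_network(c, strict=False) for c in cidrs]
    return list(ipaddress.collapse_addresses(networks))


def ips_in_networks(ips, networks):
    """Return the IPs that fall inside any of the given networks."""
    matched = []
    for ip in ips:
        address = ipaddress.ip_address(ip)
        if any(address in network for network in networks):
            matched.append(ip)
    return matched


# Hypothetical sample data (RFC 5737 documentation ranges)
sample_networks = ["192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24"]
sample_ips = ["192.0.2.10", "198.51.100.77", "203.0.113.5"]

blocked_networks = collapse_networks(sample_networks)
blocked_ips = ips_in_networks(sample_ips, blocked_networks)

print([str(n) for n in blocked_networks])
print(blocked_ips)
```

<p>In this sample the two adjacent /25 blocks collapse into a single /24, so three input networks become two rules while still matching the same addresses.</p>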
<h2 id="2021-07-20">2021-07-20</h2>
<ul>
<li>Looking again at the IPs making connections to CGSpace over the last few days from these seven ASNs, it&rsquo;s much higher than I noticed yesterday:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624)$&#39;</span> /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
5643
</code></pre><ul>
</code></pre></div><ul>
<li>I purged 27,000 more hits from the Solr stats using this new list of IPs with my <code>check-spider-ip-hits.sh</code> script</li>
<li>Surprise surprise, I checked the nginx logs from 2021-06-23 when we last had issues with thousands of XMLUI sessions and PostgreSQL connections and I see IPs from the same ASNs!</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/all-ips-june-23.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/all-ips-june-23.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips-june-23.txt -o /tmp/out.csv
$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n <span style="color:#ae81ff">15</span>
265 GOOGLE,15169
277 Silverstar Invest Limited,35624
280 FACEBOOK,32934
@ -616,17 +616,17 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
874 Ethiopian Telecommunication Corporation,24757
912 UGB Hosting OU,206485
1607 Global Layer B.V.,49453
</code></pre><ul>
</code></pre></div><ul>
<li>Again it was over 5,000 IPs:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624)$&#39;</span> /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
5228
</code></pre><ul>
</code></pre></div><ul>
<li>Interestingly, it seems these are five thousand <em>different</em> IP addresses from those in the attack last weekend, as there are over 10,000 unique ones if I combine them!</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
10458
</code></pre><ul>
</code></pre></div><ul>
<li>I purged all the (26,000) hits from these new IP addresses from Solr as well</li>
<li>Looking back at my notes for the 2019-05 attack I see that I had already identified most of these network providers (!)&hellip;
<ul>
@ -636,30 +636,30 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
</li>
<li>Adding QuadraNet brings the total networks seen during these two attacks to 262, and the number of unique IPs to 10,900:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/ddos-ips.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/ddos-ips.txt
# wc -l /tmp/ddos-ips.txt
54002 /tmp/ddos-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ddos-ips.txt -o /tmp/ddos-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/ddos-ips-to-purge.txt
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/ddos-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/ddos-ips-to-purge.txt
$ wc -l /tmp/ddos-ips-to-purge.txt
10900 /tmp/ddos-ips-to-purge.txt
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/ddos-networks-to-block.txt
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/ddos-ips.csv | csvcut -c network | sed 1d | sort | uniq &gt; /tmp/ddos-networks-to-block.txt
$ wc -l /tmp/ddos-networks-to-block.txt
262 /tmp/ddos-networks-to-block.txt
</code></pre><ul>
</code></pre></div><ul>
<li>The new total number of networks to block, including the network prefixes for these ASNs downloaded from asn.ipinfo.app, is 4,007:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
https://asn.ipinfo.app/api/text/nginx/AS46844 \
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span>https://asn.ipinfo.app/api/text/nginx/AS46844 \
https://asn.ipinfo.app/api/text/nginx/AS206485 \
https://asn.ipinfo.app/api/text/nginx/AS62282 \
https://asn.ipinfo.app/api/text/nginx/AS36352 \
https://asn.ipinfo.app/api/text/nginx/AS35913 \
https://asn.ipinfo.app/api/text/nginx/AS35624 \
https://asn.ipinfo.app/api/text/nginx/AS8100
$ cat AS* /tmp/ddos-networks-to-block.txt | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | sort | uniq | wc -l
$ cat AS* /tmp/ddos-networks-to-block.txt | sed -e <span style="color:#e6db74">&#39;/^$/d&#39;</span> -e <span style="color:#e6db74">&#39;/^#/d&#39;</span> -e <span style="color:#e6db74">&#39;/^{/d&#39;</span> -e <span style="color:#e6db74">&#39;s/deny //&#39;</span> -e <span style="color:#e6db74">&#39;s/;//&#39;</span> | sort | uniq | wc -l
4007
</code></pre><ul>
</code></pre></div><ul>
<li>I re-applied these networks to nginx on CGSpace (linode18) and DSpace Test (linode26), and purged 14,000 more Solr statistics hits from these IPs</li>
</ul>
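<p>The <code>sed</code> cleanup used above to count the networks (drop blank lines, comments, and lines starting with a brace, then strip <code>deny </code> and the semicolon) could also be sketched in Python. This is only an approximation of that pipeline, assuming the downloaded files contain nginx-style <code>deny &lt;network&gt;;</code> lines; the function name and the sample input are hypothetical.</p>

```python
def extract_networks(lines):
    """Return sorted unique networks from nginx 'deny' lines and bare CIDRs."""
    networks = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and lines starting with a brace
        if not line or line.startswith("#") or line.startswith("{"):
            continue
        # Strip the nginx directive and the trailing semicolon, if present
        if line.startswith("deny "):
            line = line[len("deny "):]
        line = line.rstrip(";").strip()
        if line:
            networks.add(line)
    # Lexical sort, like plain `sort` in the shell pipeline
    return sorted(networks)


# Hypothetical sample input mixing nginx deny lines and bare networks
sample_lines = [
    "# AS64496 (hypothetical)",
    "",
    "deny 192.0.2.0/24;",
    "deny 198.51.100.0/24;",
    "192.0.2.0/24",
    "203.0.113.0/24",
]

unique_networks = extract_networks(sample_lines)
print(len(unique_networks))
print(unique_networks)
```

<p>Using a set gives the same de-duplication as <code>sort | uniq</code>, so a network that appears both as a <code>deny</code> rule and as a bare prefix is only counted once.</p>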
<h2 id="2021-07-22">2021-07-22</h2>


@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server:
I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -122,37 +122,37 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<ul>
<li>First running all existing updates, taking some backups, checking for broken packages, and then rebooting:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># apt update &amp;&amp; apt dist-upgrade
# apt autoremove &amp;&amp; apt autoclean
# check for any packages with residual configs we can purge
# dpkg -l | grep -E '^rc' | awk '{print $2}'
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt update <span style="color:#f92672">&amp;&amp;</span> apt dist-upgrade
# apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
# check <span style="color:#66d9ef">for</span> any packages with residual configs we can purge
# dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span>
# dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -P
# dpkg -C
# dpkg -l &gt; 2021-08-01-linode20-dpkg.txt
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
# reboot
# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
# do-release-upgrade
</code></pre><ul>
# sed -i <span style="color:#e6db74">&#39;s/bionic/focal/&#39;</span> /etc/apt/sources.list.d/*.list
# <span style="color:#66d9ef">do</span>-release-upgrade
</code></pre></div><ul>
<li>&hellip; but of course it hit <a href="https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838">the libxcrypt bug</a></li>
<li>I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># apt install -f
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># apt install -f
# apt dist-upgrade
# reboot
</code></pre><ul>
</code></pre></div><ul>
<li>After rebooting I purged all packages with residual configs and cleaned up again:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
# apt autoremove &amp;&amp; apt autoclean
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># dpkg -l | grep -E <span style="color:#e6db74">&#39;^rc&#39;</span> | awk <span style="color:#e6db74">&#39;{print $2}&#39;</span> | xargs dpkg -P
# apt autoremove <span style="color:#f92672">&amp;&amp;</span> apt autoclean
</code></pre></div><ul>
<li>Then I cleared my local Ansible fact cache and re-ran the <a href="https://github.com/ilri/rmg-ansible-public">infrastructure playbooks</a></li>
<li>Open <a href="https://github.com/ilri/OpenRXV/issues/111">an issue for the value mappings global replacement bug in OpenRXV</a></li>
<li>Advise Peter and Abenet on expected CGSpace budget for 2022</li>
@ -190,21 +190,21 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E &quot; (200|499) &quot; | grep -v -E &quot;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&quot; | awk '{print $1}' | sort | uniq &gt; /tmp/2021-08-05-all-ips.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E <span style="color:#e6db74">&#34; (200|499) &#34;</span> | grep -v -E <span style="color:#e6db74">&#34;(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)&#34;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/2021-08-05-all-ips.txt
# wc -l /tmp/2021-08-05-all-ips.txt
43428 /tmp/2021-08-05-all-ips.txt
</code></pre><ul>
</code></pre></div><ul>
<li>Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
<ul>
<li>Indeed, I see that there are no IPs from those networks coming in now:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/2021-08-05-all-ips-to-purge.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq &gt; /tmp/2021-08-05-all-ips-to-purge.csv
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
0 /tmp/2021-08-05-all-ips-to-purge.csv
</code></pre><h2 id="2021-08-08">2021-08-08</h2>
</code></pre></div><h2 id="2021-08-08">2021-08-08</h2>
<ul>
<li>Advise IWMI colleagues on best practices for thumbnails</li>
<li>Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
@ -220,8 +220,8 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre></div><ul>
<li>That IP is on Amazon, and from looking at the DSpace logs I don&rsquo;t see them logging in at all, only scraping&hellip; so I will purge hits from that IP</li>
<li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but it never logs in and requests thousands of XMLUI pages, so I will purge its hits too
<ul>
@ -232,14 +232,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
<li>3.225.28.105 uses a normal-looking user agent but makes thousands of requests to the REST API a few seconds apart</li>
<li>61.143.40.50 is in China and uses this hilarious user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}&#34;
</code></pre></div><ul>
<li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li>
<li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours</li>
<li>95.87.154.12 seems to be a new bot with the following user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
</code></pre></div><ul>
<li>They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
<ul>
<li>I will purge the hits and add them to our list of bot overrides in the mean time before I submit it to COUNTER-Robots</li>
@ -247,14 +247,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
</li>
<li>I see a new bot using this user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">nettle (+https://www.nettle.sk)
</code></pre></div><ul>
<li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.</li>
<li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li>
<li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li>
<li>There are probably more but that&rsquo;s most of them over 1,000 hits last month, so I will purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 10796 hits from 35.174.144.154 in statistics
Purging 9993 hits from 93.158.90.30 in statistics
Purging 6092 hits from 130.255.162.173 in statistics
@ -267,17 +267,17 @@ Purging 2786 hits from 47.252.80.214 in statistics
Purging 1485 hits from 129.0.211.251 in statistics
Purging 8952 hits from 217.182.21.193 in statistics
Purging 3446 hits from 103.135.104.139 in statistics
Total number of bot hits purged: 90485
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 90485
</code></pre></div><ul>
<li>Then I purged a few thousand more by user agent:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
Found 2707 hits from MaCoCu in statistics
Found 1785 hits from nettle in statistics
Total number of hits from bots: 4492
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 4492
</code></pre></div><ul>
<li>I found some CGSpace metadata in the wrong fields
<ul>
<li>Seven metadata values in dc.subject (57) should be in dcterms.subject (187)</li>
@ -289,8 +289,8 @@ Total number of hits from bots: 4492
</li>
<li>I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv &gt; /tmp/2021-08-08-issn-isbn.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]&#39;</span> /tmp/2021-08-08-cgspace.csv &gt; /tmp/2021-08-08-issn-isbn.csv
</code></pre></div><ul>
<li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
<ul>
<li>In total it was a few thousand metadata entries or so, so I had to split the CSV with <code>xsv split</code> in order to process it</li>
@ -303,20 +303,20 @@ Total number of hits from bots: 4492
<ul>
<li>Extract all unique ISSNs to look up on Sherpa Romeo and Crossref</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq &gt; /tmp/2021-08-09-issns.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;cg.issn[en_US]&#39;</span> ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c <span style="color:#ae81ff">1</span> -r <span style="color:#e6db74">&#39;^[0-9]{4}&#39;</span> | sed 1d | sort | uniq &gt; /tmp/2021-08-09-issns.txt
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
</code></pre><ul>
</code></pre></div><ul>
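The <code>csvgrep</code> pattern above only checks that a value starts with four digits. A stricter filter could also verify the ISSN check digit before sending values to Sherpa Romeo or Crossref. The following is a minimal sketch of the standard ISSN check-digit rule (weights 8 down to 2, modulo 11); the function name is hypothetical and nothing like it appears in the lookup scripts as far as these notes show.

```python
import re

# Four digits, a hyphen, three digits, then a digit or X as the check character
ISSN_PATTERN = re.compile(r"^\d{4}-\d{3}[\dX]$")


def is_valid_issn(issn: str) -> bool:
    """Return True if the value looks like an ISSN and its check digit matches."""
    issn = issn.strip().upper()
    if not ISSN_PATTERN.match(issn):
        return False
    digits = issn.replace("-", "")
    # Weight the first seven digits by 8, 7, 6, 5, 4, 3, 2
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


print(is_valid_issn("0378-5955"))
print(is_valid_issn("0378-5954"))
```

Values that fail the check are probably typos or ISBN fragments and may be worth reviewing by hand rather than discarding automatically.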
<li>Then I updated the CSV headers for each and joined the CSVs on the issn column:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ sed -i <span style="color:#e6db74">&#39;1s/journal title/sherpa romeo journal title/&#39;</span> /tmp/2021-08-09-journals-sherpa-romeo.csv
$ sed -i <span style="color:#e6db74">&#39;1s/journal title/crossref journal title/&#39;</span> /tmp/2021-08-09-journals-crossref.csv
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv &gt; /tmp/2021-08-09-journals-all.csv
</code></pre><ul>
</code></pre></div><ul>
<li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,&quot;same&quot;,&quot;different&quot;)
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">if(cells[&#39;sherpa romeo journal title&#39;].value == cells[&#39;crossref journal title&#39;].value,&#34;same&#34;,&#34;different&#34;)
</code></pre></div><ul>
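The join and the GREL comparison above could also be done in one pass outside OpenRefine. This is a rough sketch, assuming both files have an <code>issn</code> column plus the renamed title columns from the <code>sed</code> commands; the sample rows and the function name are invented for illustration. It keeps only ISSNs present in both files, which matches the default inner join of <code>csvjoin</code>.

```python
import csv
import io


def compare_titles(sherpa_csv: str, crossref_csv: str) -> list[dict]:
    """Join two CSV texts on 'issn' and mark whether the journal titles agree."""
    crossref = {
        row["issn"]: row["crossref journal title"]
        for row in csv.DictReader(io.StringIO(crossref_csv))
    }
    rows = []
    for row in csv.DictReader(io.StringIO(sherpa_csv)):
        issn = row["issn"]
        if issn not in crossref:
            continue  # inner join: skip ISSNs missing from the Crossref file
        sherpa_title = row["sherpa romeo journal title"]
        crossref_title = crossref[issn]
        rows.append(
            {
                "issn": issn,
                "sherpa romeo journal title": sherpa_title,
                "crossref journal title": crossref_title,
                "status": "same" if sherpa_title == crossref_title else "different",
            }
        )
    return rows


# Invented sample data, not real lookup results
sherpa_sample = "issn,sherpa romeo journal title\n1111-1111,Journal A\n2222-2222,Journal B\n"
crossref_sample = "issn,crossref journal title\n1111-1111,Journal A\n2222-2222,Journal of B\n"

joined = compare_titles(sherpa_sample, crossref_sample)
for row in joined:
    print(row["issn"], row["status"])
```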
<li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections
<ul>
<li>I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it</li>
@ -332,15 +332,15 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
</li>
<li>I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">&#39;%s-vips.jpg[Q=85,optimize_coding,strip]&#39;</span>
39004:0.08
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -quality <span style="color:#ae81ff">85</span> -thumbnail x600 -flatten IPCC-gm.jpg
40932:0.53
$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
$ /usr/bin/time -f %M:%e convert IPCC.pdf<span style="color:#ae81ff">\[</span>0<span style="color:#ae81ff">\]</span> -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
41724:0.59
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality <span style="color:#ae81ff">85</span> -thumbnail 600x600 IPCC-im.jpg
24736:0.04
</code></pre><ul>
</code></pre></div><ul>
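For repeated runs, the same two numbers that <code>/usr/bin/time -f %M:%e</code> prints can be collected from Python. This is a minimal, Unix-only sketch and the function name is hypothetical. It relies on <code>os.wait4</code>, which returns resource usage for the specific child process; note that <code>ru_maxrss</code> is reported in kilobytes on Linux but in bytes on macOS. The example command is a trivial Python invocation standing in for the thumbnail commands, which are assumed to be installed separately.

```python
import os
import subprocess
import sys
import time


def measure(command: list[str]) -> tuple[int, float]:
    """Run a command and return (peak resident set size, elapsed seconds)."""
    start = time.perf_counter()
    proc = subprocess.Popen(command)
    # wait4 reaps this child and returns its resource usage (Unix only)
    _, status, usage = os.wait4(proc.pid, 0)
    elapsed = time.perf_counter() - start
    # Record the exit code so the Popen object knows the child has finished
    proc.returncode = os.waitstatus_to_exitcode(status)
    return usage.ru_maxrss, elapsed


# Stand-in command; replace with e.g. a vipsthumbnail or gm invocation
max_rss, elapsed = measure([sys.executable, "-c", "pass"])
print(f"{max_rss}:{elapsed:.2f}")
```

Running each tool several times and comparing medians would give a fairer picture than a single measurement, since the first run may include cold-cache effects.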
<li>The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
<ul>
<li>libvips does use less time and memory&hellip; I should do more tests!</li>
@ -359,17 +359,17 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d &gt; /tmp/journals1.txt
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d &gt; /tmp/journals2.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d &gt; /tmp/journals1.txt
$ csvcut -c <span style="color:#e6db74">&#39;sherpa romeo journal title&#39;</span> ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d &gt; /tmp/journals2.txt
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
1911
</code></pre><ul>
</code></pre></div><ul>
<li>Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine</li>
<li>I exported a list of all the journal titles we have in the <code>cg.journal</code> field:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT(text_value) AS &quot;cg.journal&quot; FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT(text_value) AS &#34;cg.journal&#34; FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
COPY 3245
</code></pre><ul>
</code></pre></div><ul>
<li>I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don&rsquo;t match, so I&rsquo;d have to go check many of them manually before selecting a match or fixing them&hellip;
<ul>
<li>I think it&rsquo;s better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way</li>
@ -421,10 +421,10 @@ COPY 3245
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/72600
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/35730
$ dspace community-filiator --set --parent<span style="color:#f92672">=</span>10568/114644 --child<span style="color:#f92672">=</span>10568/76451
</code></pre></div><ul>
<li>I made a minor fix to OpenRXV to prefix all image names with <code>docker.io</code> so it works with fewer changes on podman
<ul>
<li>Docker assumes the <code>docker.io</code> registry by default, but we should be explicit</li>
@ -446,40 +446,40 @@ $ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
</li>
<li>Lower case all AGROVOC metadata, as I had noticed a few in sentence case:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ &#39;[[:upper:]]&#39;;
UPDATE 484
</code></pre><ul>
</code></pre></div><ul>
<li>Also update some DOIs using the <code>dx.doi.org</code> format, just to keep things uniform:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
UPDATE 469
</code></pre><ul>
</code></pre></div><ul>
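The DOI rewrite can be previewed on exported values before touching the database. This is a small sketch with a hypothetical function name and a made-up DOI. Unlike the SQL above, where the <code>regexp_replace</code> pattern treats each <code>.</code> as a wildcard, this version only rewrites the literal <code>https://dx.doi.org</code> prefix.

```python
def normalize_doi_url(value: str) -> str:
    """Rewrite a leading https://dx.doi.org to https://doi.org, else return as-is."""
    old_prefix = "https://dx.doi.org"
    if value.startswith(old_prefix):
        return "https://doi.org" + value[len(old_prefix):]
    return value


# Made-up DOI values for illustration
examples = [
    "https://dx.doi.org/10.1000/example",
    "https://doi.org/10.1000/example",
]
normalized = [normalize_doi_url(v) for v in examples]
print(normalized)
```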
<li>Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 322m16.917s
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ time chrt -b <span style="color:#ae81ff">0</span> ionice -c2 -n7 nice -n19 dspace index-discovery -b
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>real 322m16.917s
user 226m43.121s
sys 3m17.469s
</code></pre><ul>
</code></pre></div><ul>
<li>I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
-H 'Content-Type: application/json' \
-d '{
&quot;size&quot;: 10,
&quot;query&quot;: {
&quot;bool&quot;: {
&quot;filter&quot;: {
&quot;term&quot;: {
&quot;repo.keyword&quot;: &quot;CGSpace&quot;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -X POST <span style="color:#e6db74">&#39;https://cgspace.cgiar.org/explorer/api/search?scroll=1d&#39;</span> <span style="color:#ae81ff">\
</span><span style="color:#ae81ff"></span> -H &#39;Content-Type: application/json&#39; \
-d &#39;{
&#34;size&#34;: 10,
&#34;query&#34;: {
&#34;bool&#34;: {
&#34;filter&#34;: {
&#34;term&#34;: {
&#34;repo.keyword&#34;: &#34;CGSpace&#34;
}
}
}
}
}'
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
</code></pre><ul>
}&#39;
$ curl -X POST <span style="color:#e6db74">&#39;https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ==&#39;</span>
</code></pre></div><ul>
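The two requests above form a loop: one initial search, then repeated scroll requests until a page comes back empty. Here is a minimal sketch of that loop, assuming the responses follow the usual Elasticsearch shape with <code>_scroll_id</code> and <code>hits.hits</code> keys, which I have not confirmed for the OpenRXV wrapper. The two callables are hypothetical stand-ins for the HTTP requests, so the example runs against fake pages rather than the live API.

```python
from typing import Callable, Iterator


def scroll_results(
    first_page: Callable[[], dict],
    next_page: Callable[[str], dict],
) -> Iterator[dict]:
    """Yield every hit, following the scroll ID until a page has no hits."""
    page = first_page()
    while True:
        hits = page.get("hits", {}).get("hits", [])
        if not hits:
            break
        yield from hits
        scroll_id = page.get("_scroll_id")
        if not scroll_id:
            break
        page = next_page(scroll_id)


# Fake pages standing in for API responses (assumed Elasticsearch shape)
fake_pages = iter(
    [
        {"_scroll_id": "abc", "hits": {"hits": [{"_id": 1}, {"_id": 2}]}},
        {"_scroll_id": "abc", "hits": {"hits": [{"_id": 3}]}},
        {"_scroll_id": "abc", "hits": {"hits": []}},
    ]
)

results = list(scroll_results(lambda: next(fake_pages), lambda scroll_id: next(fake_pages)))
print([hit["_id"] for hit in results])
```

In real use, <code>first_page</code> would send the POST with the query body and <code>next_page</code> would POST to the scroll endpoint with the returned ID.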
<li>This uses the Elasticsearch scroll ID to page through results
<ul>
<li>The second query doesn&rsquo;t need the request body because it is saved for 1 day as part of the first request</li>
@ -525,46 +525,46 @@ $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq &gt; /tmp/2021-08-25-combined-orcids.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE <span style="color:#e6db74">&#39;[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}&#39;</span> | sort | uniq &gt; /tmp/2021-08-25-combined-orcids.txt
$ wc -l /tmp/2021-08-25-combined-orcids.txt
1331
</code></pre><ul>
</code></pre></div><ul>
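The <code>grep</code> pattern above accepts any four groups of letters and digits, so a stricter pass could also verify the ORCID checksum (ISO 7064 MOD 11-2) while combining the lists. The sketch below is only an illustration with hypothetical function names; it is not how <code>resolve-orcids.py</code> works. The third identifier in the sample is deliberately altered so that it fails the check.

```python
import re

# Fifteen digits in groups of four, with a final digit or X as the check character
ORCID_PATTERN = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")


def orcid_checksum_ok(orcid: str) -> bool:
    """Return True if the last character matches the MOD 11-2 check digit."""
    digits = orcid.replace("-", "")
    total = 0
    for char in digits[:15]:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[15] == expected


def extract_orcids(text: str) -> list[str]:
    """Return the sorted, de-duplicated ORCID iDs in text that pass the checksum."""
    return sorted({m for m in ORCID_PATTERN.findall(text) if orcid_checksum_ok(m)})


sample_text = """Christine G.Kiria Chege: 0000-0001-8360-0279
Eric Rahn: 0000-0001-6280-7430
Altered on purpose: 0000-0001-6280-7431
Christine G.Kiria Chege: 0000-0001-8360-0279"""

orcids = extract_orcids(sample_text)
print(orcids)
```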
<li>After I combined them and removed duplicates, I resolved all the names using my <code>resolve-orcids.py</code> script:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
</code></pre></div><ul>
<li>Tag existing items from the Alliance&rsquo;s new authors with ORCID iDs using <code>add-orcid-identifiers-csv.py</code> (181 new metadata fields added):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-08-25-add-orcids.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-08-25-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&quot;Chege, Christine G. Kiria&quot;,&quot;Christine G.Kiria Chege: 0000-0001-8360-0279&quot;
&quot;Chege, Christine Kiria&quot;,&quot;Christine G.Kiria Chege: 0000-0001-8360-0279&quot;
&quot;Kiria, C.&quot;,&quot;Christine G.Kiria Chege: 0000-0001-8360-0279&quot;
&quot;Kinyua, Ivy&quot;,&quot;Ivy Kinyua :0000-0002-1978-8833&quot;
&quot;Rahn, E.&quot;,&quot;Eric Rahn: 0000-0001-6280-7430&quot;
&quot;Rahn, Eric&quot;,&quot;Eric Rahn: 0000-0001-6280-7430&quot;
&quot;Jager M.&quot;,&quot;Matthias Jager: 0000-0003-1059-3949&quot;
&quot;Jager, M.&quot;,&quot;Matthias Jager: 0000-0003-1059-3949&quot;
&quot;Jager, Matthias&quot;,&quot;Matthias Jager: 0000-0003-1059-3949&quot;
&quot;Waswa, Boaz&quot;,&quot;Boaz Waswa: 0000-0002-0066-0215&quot;
&quot;Waswa, Boaz S.&quot;,&quot;Boaz Waswa: 0000-0002-0066-0215&quot;
&quot;Rivera, Tatiana&quot;,&quot;Tatiana Rivera: 0000-0003-4876-5873&quot;
&quot;Andrade, Robert&quot;,&quot;Robert Andrade: 0000-0002-5764-3854&quot;
&quot;Ceccarelli, Viviana&quot;,&quot;Viviana Ceccarelli: 0000-0003-2160-9483&quot;
&quot;Ceccarellia, Viviana&quot;,&quot;Viviana Ceccarelli: 0000-0003-2160-9483&quot;
&quot;Nyawira, Sylvia&quot;,&quot;Sylvia Sarah Nyawira: 0000-0003-4913-1389&quot;
&quot;Nyawira, Sylvia S.&quot;,&quot;Sylvia Sarah Nyawira: 0000-0003-4913-1389&quot;
&quot;Nyawira, Sylvia Sarah&quot;,&quot;Sylvia Sarah Nyawira: 0000-0003-4913-1389&quot;
&quot;Groot, J.C.&quot;,&quot;Groot, J.C.J.: 0000-0001-6516-5170&quot;
&quot;Groot, J.C.J.&quot;,&quot;Groot, J.C.J.: 0000-0001-6516-5170&quot;
&quot;Groot, Jeroen C.J.&quot;,&quot;Groot, J.C.J.: 0000-0001-6516-5170&quot;
&quot;Groot, Jeroen CJ&quot;,&quot;Groot, J.C.J.: 0000-0001-6516-5170&quot;
&quot;Abera, W.&quot;,&quot;Wuletawu Abera: 0000-0002-3657-5223&quot;
&quot;Abera, Wuletawu&quot;,&quot;Wuletawu Abera: 0000-0002-3657-5223&quot;
&quot;Kanyenga Lubobo, Antoine&quot;,&quot;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&quot;
&quot;Lubobo Antoine, Kanyenga&quot;,&quot;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&quot;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><h2 id="2021-08-29">2021-08-29</h2>
&#34;Chege, Christine G. Kiria&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
&#34;Chege, Christine Kiria&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
&#34;Kiria, C.&#34;,&#34;Christine G.Kiria Chege: 0000-0001-8360-0279&#34;
&#34;Kinyua, Ivy&#34;,&#34;Ivy Kinyua :0000-0002-1978-8833&#34;
&#34;Rahn, E.&#34;,&#34;Eric Rahn: 0000-0001-6280-7430&#34;
&#34;Rahn, Eric&#34;,&#34;Eric Rahn: 0000-0001-6280-7430&#34;
&#34;Jager M.&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
&#34;Jager, M.&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
&#34;Jager, Matthias&#34;,&#34;Matthias Jager: 0000-0003-1059-3949&#34;
&#34;Waswa, Boaz&#34;,&#34;Boaz Waswa: 0000-0002-0066-0215&#34;
&#34;Waswa, Boaz S.&#34;,&#34;Boaz Waswa: 0000-0002-0066-0215&#34;
&#34;Rivera, Tatiana&#34;,&#34;Tatiana Rivera: 0000-0003-4876-5873&#34;
&#34;Andrade, Robert&#34;,&#34;Robert Andrade: 0000-0002-5764-3854&#34;
&#34;Ceccarelli, Viviana&#34;,&#34;Viviana Ceccarelli: 0000-0003-2160-9483&#34;
&#34;Ceccarellia, Viviana&#34;,&#34;Viviana Ceccarelli: 0000-0003-2160-9483&#34;
&#34;Nyawira, Sylvia&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
&#34;Nyawira, Sylvia S.&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
&#34;Nyawira, Sylvia Sarah&#34;,&#34;Sylvia Sarah Nyawira: 0000-0003-4913-1389&#34;
&#34;Groot, J.C.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Groot, J.C.J.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Groot, Jeroen C.J.&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Groot, Jeroen CJ&#34;,&#34;Groot, J.C.J.: 0000-0001-6516-5170&#34;
&#34;Abera, W.&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
&#34;Abera, Wuletawu&#34;,&#34;Wuletawu Abera: 0000-0002-3657-5223&#34;
&#34;Kanyenga Lubobo, Antoine&#34;,&#34;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&#34;
&#34;Lubobo Antoine, Kanyenga&#34;,&#34;Antoine Lubobo Kanyenga: 0000-0003-0806-9304&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</code></pre></div><h2 id="2021-08-29">2021-08-29</h2>
<ul>
<li>Run a full harvest on AReS</li>
<li>Also did more work on OpenRXV over the past few days

View File

@ -48,7 +48,7 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -154,9 +154,9 @@ The syntax Moayad showed me last month doesn&rsquo;t seem to honor the search qu
<ul>
<li>Update Docker images on AReS server (linode20) and rebuild OpenRXV:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
</code></pre><ul>
</code></pre></div><ul>
<li>Then run system updates and reboot the server
<ul>
<li>After the system came back up I started a fresh re-harvesting</li>
@ -201,8 +201,8 @@ $ docker-compose build
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o <span style="color:#e6db74">&#39;%s.jpg[Q=85,optimize_coding,strip]&#39;</span>
</code></pre></div><ul>
<li>Looking at the PDF&rsquo;s metadata I see:
<ul>
<li>Producer: iLovePDF</li>
@ -236,11 +236,11 @@ $ docker-compose build
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cat 2021-09-15-add-orcids.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cat 2021-09-15-add-orcids.csv
dc.contributor.author,cg.creator.identifier
&quot;Kotchofa, Pacem&quot;,&quot;Pacem Kotchofa: 0000-0002-1640-8807&quot;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p 'fuuuu'
</code></pre><ul>
&#34;Kotchofa, Pacem&#34;,&#34;Pacem Kotchofa: 0000-0002-1640-8807&#34;
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuuu&#39;</span>
</code></pre></div><ul>
<li>Meeting with Leroy Mwanzia and some other Alliance people about depositing to CGSpace via API
<ul>
<li>I gave them some technical information about the CGSpace API and links to the controlled vocabularies and metadata registries we are using</li>
@ -273,24 +273,24 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_stat_activity&#39;</span> | wc -l
63
</code></pre><ul>
</code></pre></div><ul>
<li>Load on the server is under 1.0, and there are only about 1,000 XMLUI sessions, which seems to be normal for this time of day according to Munin</li>
<li>But the DSpace log file shows tons of database issues:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c &quot;Timeout waiting for idle object&quot; dspace.log.2021-09-17
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#34;Timeout waiting for idle object&#34;</span> dspace.log.2021-09-17
14779
</code></pre><ul>
</code></pre></div><ul>
<li>The earliest one I see is around midnight (now is 2PM):</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
</code></pre><ul>
</code></pre></div><ul>
<li>But I was definitely logged into the site this morning so there were no issues then&hellip;</li>
<li>It seems that a few errors are normal, but there&rsquo;s obviously something wrong today:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c &quot;Timeout waiting for idle object&quot; dspace.log.2021-09-*
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#34;Timeout waiting for idle object&#34;</span> dspace.log.2021-09-*
dspace.log.2021-09-01:116
dspace.log.2021-09-02:163
dspace.log.2021-09-03:77
@ -308,7 +308,7 @@ dspace.log.2021-09-14:102
dspace.log.2021-09-15:542
dspace.log.2021-09-16:368
dspace.log.2021-09-17:15235
</code></pre><ul>
</code></pre></div><ul>
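To spot a day like this without reading the counts by eye, the <code>grep -c</code> output could be compared against the median for the month. This is a rough sketch: the function name is hypothetical, and the factor of ten is an arbitrary threshold chosen for illustration, not something derived from the logs. The sample uses only the counts visible above.

```python
from statistics import median


def flag_unusual_days(grep_output: str, factor: float = 10.0) -> list[str]:
    """Return log file names whose count exceeds factor times the median count."""
    counts: dict[str, int] = {}
    for line in grep_output.splitlines():
        # grep -c with several files prints "filename:count"
        name, _, count = line.strip().rpartition(":")
        if name and count.isdigit():
            counts[name] = int(count)
    if not counts:
        return []
    typical = median(counts.values())
    return [name for name, count in counts.items() if count > factor * typical]


sample_output = """dspace.log.2021-09-01:116
dspace.log.2021-09-02:163
dspace.log.2021-09-03:77
dspace.log.2021-09-14:102
dspace.log.2021-09-15:542
dspace.log.2021-09-16:368
dspace.log.2021-09-17:15235"""

unusual = flag_unusual_days(sample_output)
print(unusual)
```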
<li>I restarted the server and DSpace came up fine&hellip; so it must have been some kind of fluke</li>
<li>Continue working on cleaning up and annotating the metadata registry on CGSpace
<ul>
@ -338,9 +338,9 @@ dspace.log.2021-09-17:15235
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
</code></pre><h2 id="2021-09-20">2021-09-20</h2>
</code></pre></div><h2 id="2021-09-20">2021-09-20</h2>
<ul>
<li>I synchronized the production CGSpace PostgreSQL, Solr, and Assetstore data with DSpace Test</li>
<li>Over the weekend a few users reported that they could not log into CGSpace
@ -349,10 +349,10 @@ $ docker-compose build
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;cgspace-ldap-account@cgiarad.org&quot; -W &quot;(sAMAccountName=someaccountnametocheck)&quot;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;cgspace-ldap-account@cgiarad.org&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=someaccountnametocheck)&#34;</span>
Enter LDAP Password:
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
</code></pre><ul>
ldap_sasl_bind(SIMPLE): Can&#39;t contact LDAP server (-1)
</code></pre></div><ul>
<li>I sent a message to CGNET to ask about the server settings and see if our IP is still whitelisted
<ul>
<li>It turns out that CGNET created a new Active Directory server (AZCGNEROOT3.cgiarad.org) and decommissioned the old one last week</li>
@ -361,8 +361,8 @@ ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
</li>
<li>Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p <span style="color:#e6db74">&#39;fuuuuuuuu&#39;</span>
</code></pre></div><ul>
<li>I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
<ul>
<li>According to my notes from <a href="/cgspace-notes/2020-10/">2020-10</a> the account must be in the admin group in order to submit via the REST API</li>
@ -371,13 +371,13 @@ ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
<li>Run <code>dspace cleanup -v</code> process on CGSpace to clean up old bitstreams</li>
<li>Export lists of authors, donors, and affiliations for Peter Ballantyne to clean up:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;dc.contributor.author&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
COPY 80901
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.donor&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.donor&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
COPY 1274
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
COPY 8091
</code></pre><h2 id="2021-09-23">2021-09-23</h2>
</code></pre></div><h2 id="2021-09-23">2021-09-23</h2>
<ul>
<li>Peter sent me back the corrections for the affiliations
<ul>
@ -386,24 +386,24 @@ COPY 8091
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
$ csvgrep -c 'correct' -m 'DELETE' /tmp/affiliations.csv &gt; /tmp/affiliations-delete.csv
$ csvgrep -c 'correct' -r '^.+$' /tmp/affiliations.csv | csvgrep -i -c 'correct' -m 'DELETE' &gt; /tmp/affiliations-fix.csv
$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-delete.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
$ csvgrep -c <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#e6db74">&#39;DELETE&#39;</span> /tmp/affiliations.csv &gt; /tmp/affiliations-delete.csv
$ csvgrep -c <span style="color:#e6db74">&#39;correct&#39;</span> -r <span style="color:#e6db74">&#39;^.+$&#39;</span> /tmp/affiliations.csv | csvgrep -i -c <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#e6db74">&#39;DELETE&#39;</span> &gt; /tmp/affiliations-fix.csv
$ ./ilri/fix-metadata-values.py -i /tmp/affiliations-fix.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.contributor.affiliation -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">211</span>
$ ./ilri/delete-metadata-values.py -i /tmp/affiliations-delete.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.contributor.affiliation -m <span style="color:#ae81ff">211</span>
</code></pre></div><ul>
<li>Then I updated the controlled vocabulary for affiliations by exporting the top 1,000 used terms:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-09-23-affiliations.csv | sed 1d &gt; /tmp/affiliations.txt
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-09-23-affiliations.csv | sed 1d &gt; /tmp/affiliations.txt
</code></pre></div><ul>
<li>Peter also sent me 310 corrections and 234 deletions for donors so I applied those and updated the controlled vocabularies too</li>
<li>Move some One CGIAR-related collections around the CGSpace hierarchy for Peter Ballantyne</li>
<li>Mohammed Salem asked me for an ID to UUID mapping for CGSpace collections, so I generated one similar to the ID one I sent him in 2020-11:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
COPY 1139
</code></pre><h2 id="2021-09-24">2021-09-24</h2>
</code></pre></div><h2 id="2021-09-24">2021-09-24</h2>
<ul>
<li>Peter and Abenet agreed that we should consider converting more of our UPPER CASE metadata values to Title Case
<ul>
@ -435,33 +435,33 @@ COPY 1139
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=231;
UPDATE 2903
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.coverage.subregion&quot; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.coverage.subregion&#34; FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 231) to /tmp/2021-09-24-subregions.txt;
COPY 1200
</code></pre><ul>
</code></pre></div><ul>
<li>Then I processed the list for matches with my <code>subdivision-lookup.py</code> script, and extracted only the values that matched:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
$ csvgrep -c matched -m 'true' /tmp/subregions.csv | csvcut -c 1 | sed 1d &gt; /tmp/subregions-matched.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/subdivision-lookup.py -i /tmp/2021-09-24-subregions.txt -o /tmp/subregions.csv
$ csvgrep -c matched -m <span style="color:#e6db74">&#39;true&#39;</span> /tmp/subregions.csv | csvcut -c <span style="color:#ae81ff">1</span> | sed 1d &gt; /tmp/subregions-matched.txt
$ wc -l /tmp/subregions-matched.txt
81 /tmp/subregions-matched.txt
</code></pre><ul>
</code></pre></div><ul>
<li>Then I updated the controlled vocabulary in the submission forms</li>
<li>I did the same for <code>dcterms.audience</code>, taking special care with a few all-caps values:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != 'NGOS' AND text_value != 'CGIAR';
localhost/dspace63= &gt; UPDATE metadatavalue SET text_value='NGOs' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = 'NGOS';
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value != &#39;NGOS&#39; AND text_value != &#39;CGIAR&#39;;
localhost/dspace63= &gt; UPDATE metadatavalue SET text_value=&#39;NGOs&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value = &#39;NGOS&#39;;
</code></pre></div><ul>
<li>Update submission form comment for DOIs because it was still recommending people use the &ldquo;dx.doi.org&rdquo; format even though I batch updated all DOIs to the &ldquo;doi.org&rdquo; format a few times in the last year
<ul>
<li>Then I updated all existing metadata to the new format again:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, &#39;https://dx.doi.org&#39;, &#39;https://doi.org&#39;) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE &#39;https://dx.doi.org%&#39;;
UPDATE 49
</code></pre><h2 id="2021-09-26">2021-09-26</h2>
</code></pre></div><h2 id="2021-09-26">2021-09-26</h2>
<ul>
<li>Mohammed Salem told me last week that MELSpace and WorldFish have been upgraded to DSpace 6 so I updated the repository setup in AReS to use the UUID field instead of IDs
<ul>
@ -489,26 +489,26 @@ UPDATE 49
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,collection,dc.title[en_US]' ~/Downloads/10568-106990.csv &gt; /tmp/2021-09-28-alliance-reports.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,collection,dc.title[en_US]&#39;</span> ~/Downloads/10568-106990.csv &gt; /tmp/2021-09-28-alliance-reports.csv
</code></pre></div><ul>
<li>She sent it back fairly quickly with a new column marked &ldquo;Move&rdquo; so I extracted those items that matched and set them to the new owning collection:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvgrep -c Move -m 'Yes' ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed 's_10568/106990_10568/111506_' &gt; /tmp/alliance-move.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvgrep -c Move -m <span style="color:#e6db74">&#39;Yes&#39;</span> ~/Downloads/2021_28_09_alliance_reports_csv.csv | csvcut -c 1,2 | sed <span style="color:#e6db74">&#39;s_10568/106990_10568/111506_&#39;</span> &gt; /tmp/alliance-move.csv
</code></pre></div><ul>
<li>Maria from the Alliance emailed us to say that approving submissions was slow on CGSpace
<ul>
<li>I looked at the PostgreSQL activity and it seems low:</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_stat_activity&#39; | wc -l
59
</code></pre><ul>
</code></pre></div><ul>
<li>Locks look high though:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | sort | uniq -c | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#39;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;&#39; | sort | uniq -c | wc -l
1154
</code></pre><ul>
</code></pre></div><ul>
<li>Indeed it seems something started causing locks to increase yesterday:</li>
</ul>
<p><img src="/cgspace-notes/2021/09/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"></p>
@ -520,9 +520,9 @@ UPDATE 49
<li>The number of DSpace sessions is normal, hovering around 1,000&hellip;</li>
<li>Looking closer at the PostgreSQL activity log, I see the locks are all held by the <code>dspaceCli</code> user&hellip; which seems weird:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &quot;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';&quot; | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">postgres@linode18:~$ psql -c &#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name=&#39;dspaceCli&#39;;&#34; | wc -l
1096
</code></pre><ul>
</code></pre></div><ul>
<li>Now I&rsquo;m wondering why there are no connections from <code>dspaceApi</code> or <code>dspaceWeb</code>. Could it be that our Tomcat JDBC pooling via JNDI isn&rsquo;t working?
<ul>
<li>I see the same thing on DSpace Test hmmmm</li>
@ -536,14 +536,14 @@ UPDATE 49
<ul>
<li>Export a list of ILRI subjects from CGSpace to validate against AGROVOC for Peter and Abenet:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 203) to /tmp/2021-09-29-ilri-subject.txt;
COPY 149
</code></pre><ul>
</code></pre></div><ul>
<li>Then validate and format the matches:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
$ csvcut -c subject,'match type' /tmp/2021-09-29-ilri-subjects.csv | sed -e 's/match type/matched/' -e 's/\(alt\|pref\)Label/yes/' &gt; /tmp/2021-09-29-ilri-subjects2.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/agrovoc-lookup.py -i /tmp/2021-09-29-ilri-subject.txt -o /tmp/2021-09-29-ilri-subjects.csv -d
$ csvcut -c subject,<span style="color:#e6db74">&#39;match type&#39;</span> /tmp/2021-09-29-ilri-subjects.csv | sed -e <span style="color:#e6db74">&#39;s/match type/matched/&#39;</span> -e <span style="color:#e6db74">&#39;s/\(alt\|pref\)Label/yes/&#39;</span> &gt; /tmp/2021-09-29-ilri-subjects2.csv
</code></pre></div><ul>
<li>I talked to Salem about depositing from MEL to CGSpace
<ul>
<li>He mentioned that the one issue is that when you deposit to a workflow you don&rsquo;t get a Handle or any kind of identifier back!</li>

View File

@ -11,7 +11,7 @@
Export all affiliations on CGSpace and run them against the latest RoR data dump:
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
@ -35,7 +35,7 @@ So we have 1879/7100 (26.46%) matching already
Export all affiliations on CGSpace and run them against the latest RoR data dump:
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
@ -46,7 +46,7 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
So we have 1879/7100 (26.46%) matching already
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -136,15 +136,15 @@ So we have 1879/7100 (26.46%) matching already
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre><ul>
</code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
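<ul>
<li>A minimal Python sketch of the same match-rate arithmetic, assuming the lookup output is a CSV with a <code>matched</code> column holding <code>true</code> or <code>false</code> (as the csvgrep filter above implies); the <code>organization</code> column and the sample rows are made up for illustration, and unlike the commands above this counts the total from the CSV itself rather than from the text file:</li>
</ul>

```python
import csv
import io


def match_rate(csv_text: str, column: str = "matched") -> tuple[int, int, float]:
    """Count rows whose `column` is "true" and return (matched, total, percent)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    matched = sum(1 for row in rows if row[column].strip().lower() == "true")
    total = len(rows)
    percent = 100 * matched / total if total else 0.0
    return matched, total, percent


# Hypothetical sample data, not real lookup output
sample = (
    "organization,matched\n"
    "Example University,true\n"
    "Unknown Org,false\n"
    "Another Institute,true\n"
    "No Match Ltd,false\n"
)

matched, total, percent = match_rate(sample)
print(f"{matched}/{total} ({percent:.2f}%)")

# The figures from the notes: 1879 matched out of 7100 affiliations
print(f"1879/7100 ({100 * 1879 / 7100:.2f}%)")
```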
<h2 id="2021-10-03">2021-10-03</h2>
@ -185,19 +185,19 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/*.log* | grep 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)' | awk '{print $1}' | sort | uniq &gt; /tmp/mozilla-4.0-ips.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zcat --force /var/log/nginx/*.log* | grep <span style="color:#e6db74">&#39;Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort | uniq &gt; /tmp/mozilla-4.0-ips.txt
# wc -l /tmp/mozilla-4.0-ips.txt
543 /tmp/mozilla-4.0-ips.txt
</code></pre><ul>
</code></pre></div><ul>
<li>Then I resolved the IPs and extracted the ones belonging to Amazon:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k &quot;$ABUSEIPDB_API_KEY&quot; -o /tmp/mozilla-4.0-ips.csv
$ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/resolve-addresses-geoip2.py -i /tmp/mozilla-4.0-ips.txt -k <span style="color:#e6db74">&#34;</span>$ABUSEIPDB_API_KEY<span style="color:#e6db74">&#34;</span> -o /tmp/mozilla-4.0-ips.csv
$ csvgrep -c asn -m <span style="color:#ae81ff">14618</span> /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee /tmp/amazon-ips.txt | wc -l
</code></pre></div><ul>
<li>I am thinking I will purge them all, as I have several indicators that they are bots: mysterious user agent, IP owned by Amazon</li>
<li>Even more interesting, these requests are weighted VERY heavily on the CGIAR System community:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"> 1592 GET /handle/10947/2526
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"> 1592 GET /handle/10947/2526
1592 GET /handle/10947/2527
1592 GET /handle/10947/34
1593 GET /handle/10947/6
@ -215,7 +215,7 @@ $ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee
1600 GET /handle/10947/4467
1607 GET /handle/10568/103816
290382 GET /handle/10568/83389
</code></pre><ul>
</code></pre></div><ul>
<li>Before I purge all those I will ask Samuel Stacey from the System Office, hopefully to get some insight&hellip;</li>
<li>Meeting with Michael Victor, Peter, Jane, and Abenet about the future of repositories in the One CGIAR</li>
<li>Meeting with Michelle from Altmetric about their new CSV upload system
@ -231,10 +231,10 @@ $ csvgrep -c asn -m 14618 /tmp/mozilla-4.0-ips.csv | csvcut -c ip | sed 1d | tee
</code></pre><ul>
<li>Extract the AGROVOC subjects from IWMI&rsquo;s 292 publications to validate them against AGROVOC:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'dcterms.subject[en_US]' ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e 's/||/\n/g' -e 's/&quot;//g' | sort -u &gt; /tmp/agrovoc.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;dcterms.subject[en_US]&#39;</span> ~/Downloads/2021-10-03-non-IWMI-publications.csv | sed -e 1d -e <span style="color:#e6db74">&#39;s/||/\n/g&#39;</span> -e <span style="color:#e6db74">&#39;s/&#34;//g&#39;</span> | sort -u &gt; /tmp/agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/agrovoc.txt -o /tmp/agrovoc-matches.csv
$ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 &gt; /tmp/invalid-agrovoc.csv
</code></pre><h2 id="2021-10-05">2021-10-05</h2>
$ csvgrep -c <span style="color:#e6db74">&#39;number of matches&#39;</span> -m <span style="color:#e6db74">&#39;0&#39;</span> /tmp/agrovoc-matches.csv | csvcut -c <span style="color:#ae81ff">1</span> &gt; /tmp/invalid-agrovoc.csv
</code></pre></div><h2 id="2021-10-05">2021-10-05</h2>
<ul>
<li>Sam put me in touch with Dodi from the System Office web team and he confirmed that the Amazon requests are not theirs
<ul>
@ -243,11 +243,11 @@ $ csvgrep -c 'number of matches' -m '0' /tmp/agrovoc-matches.csv | csvcut -c 1 &
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
...
Total number of bot hits purged: 465119
</code></pre><h2 id="2021-10-06">2021-10-06</h2>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 465119
</code></pre></div><h2 id="2021-10-06">2021-10-06</h2>
<ul>
<li>Thinking about how we could check for duplicates before importing
<ul>
@ -255,14 +255,14 @@ Total number of bot hits purged: 465119
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; CREATE EXTENSION pg_trgm;
localhost/dspace63= &gt; SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') &gt; 0.5;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; CREATE EXTENSION pg_trgm;
localhost/dspace63= &gt; SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,&#39;Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines&#39;) &gt; 0.5;
metadata_value_id │ text_value │ dspace_object_id
───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
</code></pre><ul>
</code></pre></div><ul>
<li>I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)</li>
<li>I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
<ul>
@ -291,10 +291,10 @@ localhost/dspace63= &gt; SELECT metadata_value_id, text_value, dspace_object_id
</li>
<li>Then I ran this new version of csv-metadata-quality on an export of IWMI&rsquo;s community, minus some fields I don&rsquo;t want to check:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv &gt; /tmp/iwmi-to-check.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -C <span style="color:#e6db74">&#39;dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection&#39;</span> ~/Downloads/iwmi.csv &gt; /tmp/iwmi-to-check.csv
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi.csv
</code></pre><ul>
$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi.csv
</code></pre></div><ul>
<li>I noticed each CSV only had 10 or 20 corrections, and notably none of the duplicate metadata values were removed in the CSVs&hellip;
<ul>
<li>I cut a subset of the fields from the main CSV and tried again, but DSpace said &ldquo;no changes detected&rdquo;</li>
@ -319,18 +319,18 @@ Try doing it in two imports. In first import, remove all authors. In second impo
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' ~/Downloads/iwmi.csv &gt; /tmp/iwmi-duplicate-metadata.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.iwmilibrary[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],cg.river.basin[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]&#39;</span> ~/Downloads/iwmi.csv &gt; /tmp/iwmi-duplicate-metadata.csv
# Copy and blank columns in OpenRefine
$ csv-metadata-quality -i ~/Downloads/2021-10-07-IWMI-duplicate-metadata-csv.csv -o /tmp/iwmi-duplicates-cleaned.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
</code></pre><ul>
$ xsv split -s <span style="color:#ae81ff">2000</span> /tmp /tmp/iwmi-duplicates-cleaned.csv
</code></pre></div><ul>
<li>It takes a few hours per 2,000 items because DSpace processes them so slowly&hellip; sigh&hellip;</li>
</ul>
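<ul>
<li>For reference, a rough Python sketch of the batching idea behind the split step: cut one large CSV into fixed-size chunks that each repeat the header row, so every chunk can be imported on its own. This only illustrates the approach and is not how xsv is implemented; <code>split_rows</code> and the sample data are hypothetical:</li>
</ul>

```python
import csv
import io
from itertools import islice
from typing import Iterator


def split_rows(csv_text: str, size: int) -> Iterator[str]:
    """Yield CSV chunks of at most `size` data rows, each with the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to split
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            break
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(chunk)
        yield out.getvalue()


# Hypothetical sample: five items split into batches of two
sample = "id,title\n1,a\n2,b\n3,c\n4,d\n5,e\n"
chunks = list(split_rows(sample, 2))
for number, chunk in enumerate(chunks):
    data_rows = chunk.count("\n") - 1  # subtract the header line
    print(f"chunk {number}: {data_rows} data rows")
```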
<h2 id="2021-10-08">2021-10-08</h2>
<ul>
<li>I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2603711
@ -342,31 +342,31 @@ $ xsv split -s 2000 /tmp /tmp/iwmi-duplicates-cleaned.csv
| 0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
cgspace=# UPDATE metadatavalue SET text_lang=&#39;en_US&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN (&#39;en_Fu&#39;, &#39;en&#39;, &#39;&#39;);
UPDATE 129673
cgspace=# COMMIT;
</code></pre><ul>
</code></pre></div><ul>
<li>So all this effort to remove ~400 duplicate metadata values in the IWMI community hmmm:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c 'Removing duplicate value' /tmp/out.log
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
391
</code></pre><ul>
</code></pre></div><ul>
<li>I tried to export ILRI&rsquo;s community, but ran into the export bug (DS-4211)
<ul>
<li>After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed <span style="color:#e6db74">&#39;1d&#39;</span> | wc -l
19315
</code></pre><ul>
</code></pre></div><ul>
<li>It seems there are only about 200 duplicate values in this subset of fields in ILRI&rsquo;s community:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c 'Removing duplicate value' /tmp/out.log
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
220
</code></pre><ul>
</code></pre></div><ul>
<li>I found a cool way to select only the items with corrections
<ul>
<li>First, extract a handful of fields from the CSV with csvcut</li>
@ -376,11 +376,11 @@ $ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]' /tmp/ilri.csv | csvsort | uniq &gt; /tmp/ilri-deduplicated-items.csv
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,cg.contributor.affiliation[en_US],cg.coverage.country[en_US],cg.coverage.iso3166-alpha2[en_US],cg.coverage.subregion[en_US],cg.identifier.doi[en_US],cg.identifier.url[en_US],cg.isijournal[en_US],cg.issn[en_US],dc.contributor.author[en_US],dcterms.subject[en_US]&#39;</span> /tmp/ilri.csv | csvsort | uniq &gt; /tmp/ilri-deduplicated-items.csv
$ csv-metadata-quality -i /tmp/ilri-deduplicated-items.csv -o /tmp/ilri-deduplicated-items-cleaned.csv | tee /tmp/out.log
$ sed -i -e '1s/en_US/en_Fu/g' /tmp/ilri-deduplicated-items-cleaned.csv
$ sed -i -e <span style="color:#e6db74">&#39;1s/en_US/en_Fu/g&#39;</span> /tmp/ilri-deduplicated-items-cleaned.csv
$ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cleaned.csv &gt; /tmp/ilri-deduplicated-items-cleaned-joined.csv
</code></pre><ul>
</code></pre></div><ul>
<li>Then I imported the file into OpenRefine and used a custom text facet with a GREL like this to identify the rows with changes:</li>
</ul>
<pre tabindex="0"><code>if(cells['dcterms.subject[en_US]'].value == cells['dcterms.subject[en_Fu]'].value,&quot;same&quot;,&quot;different&quot;)
@ -392,9 +392,9 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
</li>
<li>I did the same for CIAT but there were over 7,000 duplicate metadata values! Hard to believe:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c 'Removing duplicate value' /tmp/out.log
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c <span style="color:#e6db74">&#39;Removing duplicate value&#39;</span> /tmp/out.log
7720
</code></pre><ul>
</code></pre></div><ul>
<li>I applied these to the CIAT community, so in total that&rsquo;s over 8,000 duplicate metadata values removed in a handful of fields&hellip;</li>
</ul>
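<p>The same/different check from the GREL facet above can also be sketched outside OpenRefine. This is a minimal Python sketch, not the workflow actually used here: it assumes a joined CSV whose header has paired <code>[en_US]</code> and <code>[en_Fu]</code> columns (as produced by the <code>sed</code> and <code>csvjoin</code> steps), and the function name <code>changed_ids</code> and the sample data are hypothetical.</p>

```python
import csv
import io


def changed_ids(joined_csv, field):
    """Return the ids of rows where the original and cleaned values differ."""
    original = f"{field}[en_US]"
    cleaned = f"{field}[en_Fu]"
    reader = csv.DictReader(joined_csv)
    return [row["id"] for row in reader if row[original] != row[cleaned]]


# Hypothetical sample standing in for the joined CSV
sample = io.StringIO(
    "id,dcterms.subject[en_US],dcterms.subject[en_Fu]\n"
    "1,maize||maize,maize\n"
    "2,livestock,livestock\n"
)
print(changed_ids(sample, "dcterms.subject"))
```

<p>Only the ids returned this way would need to go into the upload CSV, which keeps the batch small.</p>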
<h2 id="2021-10-09">2021-10-09</h2>
@ -402,14 +402,14 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
<li>I did similar metadata cleanups for CCAFS and IITA too, but there were only a few hundred duplicates there</li>
<li>Also of note, there are some other fixes too, for example in IITA&rsquo;s community:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep -c -E '(Fixing|Removing) (duplicate|excessive|invalid)' /tmp/out.log
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep -c -E <span style="color:#e6db74">&#39;(Fixing|Removing) (duplicate|excessive|invalid)&#39;</span> /tmp/out.log
249
</code></pre><ul>
</code></pre></div><ul>
<li>I ran a full Discovery re-indexing on CGSpace</li>
<li>Then I exported all of CGSpace and extracted the ISSNs and ISBNs:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ csvcut -c 'id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]' /tmp/cgspace.csv &gt; /tmp/cgspace-issn-isbn.csv
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c <span style="color:#e6db74">&#39;id,cg.issn[en_US],dc.identifier.issn[en_US],cg.isbn[en_US],dc.identifier.isbn[en_US]&#39;</span> /tmp/cgspace.csv &gt; /tmp/cgspace-issn-isbn.csv
</code></pre></div><ul>
<li>I did cleanups on about seventy items with invalid and mixed ISSNs/ISBNs</li>
</ul>
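<p>For the invalid ISSNs mentioned above, one way to flag candidates is the standard ISSN check digit (weights 8 down to 2 over the first seven digits, modulo 11, with X standing for 10). This is only a sketch of that checksum in Python, not necessarily how these seventy items were found; the function name <code>is_valid_issn</code> is hypothetical.</p>

```python
import re


def is_valid_issn(issn):
    """Check the NNNN-NNNC format and the ISSN check digit."""
    if not re.fullmatch(r"\d{4}-\d{3}[\dX]", issn):
        return False
    digits = issn.replace("-", "")
    # Weighted sum of the first seven digits with weights 8, 7, ..., 2
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    remainder = (11 - total % 11) % 11
    expected = "X" if remainder == 10 else str(remainder)
    return digits[7] == expected


for value in ["0378-5955", "2434-561X", "0378-5954", "978-0-306-40615-7"]:
    print(value, is_valid_issn(value))
```

<p>A value that fails this check might be a typo, an ISBN in the wrong field, or several identifiers mixed together, so it still needs a manual look.</p>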
<h2 id="2021-10-10">2021-10-10</h2>
@ -417,42 +417,42 @@ $ csvjoin -c id /tmp/ilri-deduplicated-items.csv /tmp/ilri-deduplicated-items-cl
<li>Start testing DSpace 7.1-SNAPSHOT to see if it has the duplicate item bug on <code>metadata-export</code> (DS-4211)</li>
<li>First create a new PostgreSQL 13 container:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5433:5432 -d postgres:13-alpine
$ createuser -h localhost -p 5433 -U postgres --pwprompt dspacetest
$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
$ psql -h localhost -p 5433 -U postgres dspace7 -c 'CREATE EXTENSION pgcrypto;'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ podman run --name dspacedb13 -v dspacedb13_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD<span style="color:#f92672">=</span>postgres -p 5433:5432 -d postgres:13-alpine
$ createuser -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres --pwprompt dspacetest
$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#39;CREATE EXTENSION pgcrypto;&#39;</span>
</code></pre></div><ul>
<li>Then edit the settings in <code>dspace/config/local.cfg</code> and build the backend server with Java 11:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ mvn package
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ mvn package
$ cd dspace/target/dspace-installer
$ ant fresh_install
# fix database not being fully ready, causing Tomcat to fail to start the server application
$ ~/dspace7/bin/dspace database migrate
</code></pre><ul>
</code></pre></div><ul>
<li>Copy Solr configs and start Solr:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ cp -Rv ~/dspace7/solr/* ~/src/solr-8.8.2/server/solr/configsets
$ ~/src/solr-8.8.2/bin/solr start
</code></pre><ul>
</code></pre></div><ul>
<li>Start my local Tomcat 9 instance:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ systemctl --user start tomcat9@dspace7
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ systemctl --user start tomcat9@dspace7
</code></pre></div><ul>
<li>This works, so now I will drop the default database and import a dump from CGSpace</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ systemctl --user stop tomcat9@dspace7
$ dropdb -h localhost -p 5433 -U postgres dspace7
$ createdb -h localhost -p 5433 -U postgres -O dspacetest --encoding=UNICODE dspace7
$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -p 5433 -U postgres -d dspace7 -O --role=dspacetest dspace-2021-10-09.backup
$ psql -h localhost -p 5433 -U postgres -c 'alter user dspacetest nosuperuser;'
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ systemctl --user stop tomcat9@dspace7
$ dropdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7
$ createdb -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -O dspacetest --encoding<span style="color:#f92672">=</span>UNICODE dspace7
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">&#39;alter user dspacetest superuser;&#39;</span>
$ pg_restore -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -d dspace7 -O --role<span style="color:#f92672">=</span>dspacetest dspace-2021-10-09.backup
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres -c <span style="color:#e6db74">&#39;alter user dspacetest nosuperuser;&#39;</span>
</code></pre></div><ul>
<li>Delete Atmire migrations and some others that were &ldquo;unresolved&rdquo;:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -h localhost -p 5433 -U postgres dspace7 -c &quot;DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';&quot;
$ psql -h localhost -p 5433 -U postgres dspace7 -c &quot;DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#34;DELETE FROM schema_version WHERE description LIKE &#39;%Atmire%&#39; OR description LIKE &#39;%CUA%&#39; OR description LIKE &#39;%cua%&#39;;&#34;</span>
$ psql -h localhost -p <span style="color:#ae81ff">5433</span> -U postgres dspace7 -c <span style="color:#e6db74">&#34;DELETE FROM schema_version WHERE version IN (&#39;5.0.2017.09.25&#39;, &#39;6.0.2017.01.30&#39;, &#39;6.0.2017.09.25&#39;);&#34;</span>
</code></pre></div><ul>
<li>Now DSpace 7 starts with my CGSpace data&hellip; nice
<ul>
<li>The Discovery indexing still takes seven hours&hellip; fuck</li>
@ -469,11 +469,11 @@ $ psql -h localhost -p 5433 -U postgres dspace7 -c &quot;DELETE FROM schema_vers
<ul>
<li>Start a full Discovery reindex on my local DSpace 6.3 instance:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e chrt -b 0 ~/dspace63/bin/dspace index-discovery -b
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ /usr/bin/time -f %M:%e chrt -b <span style="color:#ae81ff">0</span> ~/dspace63/bin/dspace index-discovery -b
Loading @mire database changes for module MQM
Changes have been processed
836140:6543.6
</code></pre><ul>
</code></pre></div><ul>
<li>So that&rsquo;s 1.8 hours versus 7 on DSpace 7, with the same database!</li>
<li>Several users wrote to me that CGSpace was slow recently
<ul>
@ -481,13 +481,13 @@ Changes have been processed
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ psql -c <span style="color:#e6db74">&#39;SELECT * FROM pg_stat_activity&#39;</span> | wc -l
53
$ psql -c &quot;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&quot; | wc -l
$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid&#34;</span> | wc -l
1697
$ psql -c &quot;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceWeb'&quot; | wc -l
$ psql -c <span style="color:#e6db74">&#34;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name=&#39;dspaceWeb&#39;&#34;</span> | wc -l
1681
</code></pre><ul>
</code></pre></div><ul>
<li>Looking at Munin, I see there is indeed a higher number of locks starting on the morning of 2021-10-07:</li>
</ul>
<p><img src="/cgspace-notes/2021/10/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"></p>
@ -516,71 +516,71 @@ $ psql -c &quot;SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.p
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.5;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.5;
</code></pre></div><ul>
<li>Next I experimented with using GIN or GiST indexes on <code>metadatavalue</code>, but they were slower than the existing DSpace indexes
<ul>
<li>I tested a few variations of the query I had been using and found it&rsquo;s <em>much</em> faster if I use the similarity operator and keep the condition that object IDs are in the item table&hellip;</li>
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 739.948 ms
</code></pre><ul>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 739.948 ms
</code></pre></div><ul>
<li>Now this script runs in four minutes (versus twenty-four!) and it still finds the same seven duplicates! Amazing!</li>
<li>I still don&rsquo;t understand the differences in the query plan well enough, but I see it is using the DSpace default indexes and the results are accurate</li>
<li>So to summarize, from the best to the worst query, all returning the same result:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 683.165 ms
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 683.165 ms
Time: 635.364 ms
Time: 674.666 ms
localhost/dspace= &gt; DISCARD ALL;
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= &gt; DISCARD ALL;
localhost/dspace= &gt; SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % &#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 1584.765 ms (00:01.585)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 1584.765 ms (00:01.585)
Time: 1665.594 ms (00:01.666)
Time: 1623.726 ms (00:01.624)
localhost/dspace= &gt; DISCARD ALL;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') &gt; 0.6;
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= &gt; DISCARD ALL;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,&#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;) &gt; 0.6;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 4028.939 ms (00:04.029)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 4028.939 ms (00:04.029)
Time: 4022.239 ms (00:04.022)
Time: 4061.820 ms (00:04.062)
localhost/dspace= &gt; DISCARD ALL;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') &gt; 0.6;
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>localhost/dspace= &gt; DISCARD ALL;
localhost/dspace= &gt; SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,&#39;Traditional knowledge affects soil management ability of smallholder farmers in marginal areas&#39;) &gt; 0.6;
text_value │ dspace_object_id
────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 4358.713 ms (00:04.359)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Time: 4358.713 ms (00:04.359)
Time: 4301.248 ms (00:04.301)
Time: 4417.909 ms (00:04.418)
</code></pre><h2 id="2021-10-13">2021-10-13</h2>
</code></pre></div><h2 id="2021-10-13">2021-10-13</h2>
<ul>
<li>I looked into the <a href="https://github.com/DSpace/DSpace/issues/7946">REST API issue where fields without qualifiers throw an HTTP 500</a>
<ul>
@ -640,11 +640,11 @@ Time: 4417.909 ms (00:04.418)
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;booo&quot; -W &quot;(sAMAccountName=fuuu)&quot;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ldapsearch -x -H ldaps://AZCGNEROOT3.CGIARAD.ORG:636/ -b <span style="color:#e6db74">&#34;dc=cgiarad,dc=org&#34;</span> -D <span style="color:#e6db74">&#34;booo&#34;</span> -W <span style="color:#e6db74">&#34;(sAMAccountName=fuuu)&#34;</span>
Enter LDAP Password:
ldap_bind: Invalid credentials (49)
additional info: 80090308: LdapErr: DSID-0C090447, comment: AcceptSecurityContext error, data 52e, v3839
</code></pre><ul>
</code></pre></div><ul>
<li>I sent a message to ILRI ICT to ask them to check the account
<ul>
<li>They reset the password so I ran all system updates and rebooted the server since users weren&rsquo;t able to log in anyways</li>
@ -664,17 +664,17 @@ ldap_bind: Invalid credentials (49)
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ http 'localhost:8081/solr/statistics/select?q=time%3A2021-04*&amp;fl=ip&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip&amp;facet.limit=200000&amp;facet.mincount=1' &gt; /tmp/2021-04-ips.json
# Ghetto way to extract the IPs using jq, but I can't figure out how to only print them and not the facet counts, so I just use sed
$ jq '.facet_counts.facet_fields.ip[]' /tmp/2021-04-ips.json | grep -E '^&quot;' | sed -e 's/&quot;//g' &gt; /tmp/ips.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ http <span style="color:#e6db74">&#39;localhost:8081/solr/statistics/select?q=time%3A2021-04*&amp;fl=ip&amp;wt=json&amp;indent=true&amp;facet=true&amp;facet.field=ip&amp;facet.limit=200000&amp;facet.mincount=1&#39;</span> &gt; /tmp/2021-04-ips.json
# Ghetto way to extract the IPs using jq, but I can<span style="color:#960050;background-color:#1e0010">&#39;</span>t figure out how to only print them and not the facet counts, so I just use sed
$ jq <span style="color:#e6db74">&#39;.facet_counts.facet_fields.ip[]&#39;</span> /tmp/2021-04-ips.json | grep -E <span style="color:#e6db74">&#39;^&#34;&#39;</span> | sed -e <span style="color:#e6db74">&#39;s/&#34;//g&#39;</span> &gt; /tmp/ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-04-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u &gt; /tmp/networks-to-block.txt
$ csvgrep -c asn -r <span style="color:#e6db74">&#39;^(49453|46844|206485|62282|36352|35913|35624|8100)$&#39;</span> /tmp/2021-04-ips.csv | csvcut -c network | sed 1d | sort -u &gt; /tmp/networks-to-block.txt
$ wc -l /tmp/networks-to-block.txt
125 /tmp/networks-to-block.txt
$ grepcidr -f /tmp/networks-to-block.txt /tmp/ips.txt &gt; /tmp/ips-to-purge.txt
$ wc -l /tmp/ips-to-purge.txt
202
</code></pre><ul>
</code></pre></div><ul>
<li>Attempting to purge those only shows about 3,500 hits, but I will do it anyways
<ul>
<li>Adding 64.39.108.48 from Qualys I get a total of 22631 hits purged</li>
@ -715,9 +715,9 @@ $ wc -l /tmp/ips-to-purge.txt
</code></pre><ul>
<li>Even more annoying, they are not re-using their session ID:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ grep 93.158.91.62 log/dspace.log.2021-10-29 | grep -oE <span style="color:#e6db74">&#39;session_id=[A-Z0-9]{32}:ip_addr=&#39;</span> | sort | uniq | wc -l
4888
</code></pre><ul>
</code></pre></div><ul>
<li>This IP has made 36,000 requests to CGSpace&hellip;</li>
<li>The IP is owned by <a href="https://internetvikings.com">Internet Vikings</a> in Sweden</li>
<li>I purged their statistics and set up a temporary HTTP 403 telling them to use a real user agent</li>
@ -733,13 +733,13 @@ $ wc -l /tmp/ips-to-purge.txt
</code></pre><ul>
<li>I reported them to AbuseIPDB.com and purged their hits:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ip.txt -p
Purging 6364 hits from 45.9.20.71 in statistics
Purging 8039 hits from 45.146.166.157 in statistics
Purging 3383 hits from 45.155.204.82 in statistics
Total number of bot hits purged: 17786
</code></pre><h2 id="2021-10-31">2021-10-31</h2>
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 17786
</code></pre></div><h2 id="2021-10-31">2021-10-31</h2>
<ul>
<li>Update Docker containers for AReS on linode20 and run a fresh harvest</li>
<li>Found some strange IP (94.71.3.44) making 51,000 requests today with the user agent &ldquo;Microsoft Internet Explorer&rdquo;
@ -757,13 +757,13 @@ Total number of bot hits purged: 17786
<li>That&rsquo;s from ASN 12552 (IPO-EU, SE), which is operated by Internet Vikings, though AbuseIPDB.com says it&rsquo;s <a href="https://availo.se">Availo Networks AB</a></li>
<li>There&rsquo;s another IP (3.225.28.105) that made a few thousand requests to the REST API from Amazon, though it&rsquo;s using a normal user agent</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zgrep 3.225.28.105 /var/log/nginx/rest.log.* | wc -l
3991
~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE 'GET /rest/(collections|handle|items)' | sort | uniq -c
~# zgrep 3.225.28.105 /var/log/nginx/rest.log.* | grep -oE &#39;GET /rest/(collections|handle|items)&#39; | sort | uniq -c
3154 GET /rest/collections
427 GET /rest/handle
410 GET /rest/items
</code></pre><ul>
</code></pre></div><ul>
<li>It requested the <a href="https://cgspace.cgiar.org/handle/10568/75560">CIAT Story Maps</a> collection over 3,000 times last month&hellip;
<ul>
<li>I will purge those hits</li>

View File

@ -18,7 +18,7 @@ $ zstd statistics-2019.json
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-11/" />
<meta property="article:published_time" content="2021-11-02T22:27:07+02:00" />
<meta property="article:modified_time" content="2021-11-03T15:56:15+02:00" />
<meta property="article:modified_time" content="2021-11-07T11:26:32+02:00" />
@ -32,7 +32,7 @@ First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f &#39;time:2019-*&#39; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -42,9 +42,9 @@ $ zstd statistics-2019.json
"@type": "BlogPosting",
"headline": "November, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-11/",
"wordCount": "468",
"wordCount": "722",
"datePublished": "2021-11-02T22:27:07+02:00",
"dateModified": "2021-11-03T15:56:15+02:00",
"dateModified": "2021-11-07T11:26:32+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -123,16 +123,16 @@ $ zstd statistics-2019.json
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre><ul>
</code></pre></div><ul>
<li>Then on DSpace Test I created a <code>statistics-2019</code> core with the same instance dir as the main <code>statistics</code> core (as <a href="https://wiki.lyrasis.org/display/DSDOC6x/Testing+Solr+Shards">illustrated in the DSpace docs</a>)</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ mkdir -p /home/dspacetest.cgiar.org/solr/statistics-2019/data
# create core in Solr admin
$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;time:2019-*&lt;/query&gt;&lt;/delete&gt;&quot;
$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;time:2019-*&lt;/query&gt;&lt;/delete&gt;&#34;</span>
$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics-2019.json -k uid
</code></pre><ul>
</code></pre></div><ul>
<li>The key thing above is that you create the core in the Solr admin UI, but the data directory must already exist so you have to do that first in the file system</li>
<li>I restarted the server after the import was done to see if the cores would come back up OK
<ul>
@ -165,13 +165,13 @@ $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics
</ul>
</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] &quot;HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&amp;isAllowed=y HTTP/1.1&quot; 200 0 &quot;https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf&quot; &quot;Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10&quot;
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">91.213.50.11 - - [03/Nov/2021:06:47:20 +0100] &#34;HEAD /bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf?sequence=1%60%20WHERE%206158%3D6158%20AND%204894%3D4741--%20kIlq&amp;isAllowed=y HTTP/1.1&#34; 200 0 &#34;https://cgspace.cgiar.org:443/bitstream/handle/10568/106239/U19ArtSimonikovaChromosomeInthomNodev.pdf&#34; &#34;Mozilla/5.0 (X11; U; Linux i686; en-CA; rv:1.8.0.10) Gecko/20070223 Fedora/1.5.0.10-1.fc5 Firefox/1.5.0.10&#34;
</code></pre></div><ul>
<li>Another is in China, and they grabbed 1,200 PDFs from the REST API in under an hour:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># zgrep 222.129.53.160 /var/log/nginx/rest.log.2.gz | wc -l
1178
</code></pre><ul>
</code></pre></div><ul>
<li>I will continue to split the Solr statistics back into year-shards on DSpace Test (linode26)
<ul>
<li>Today I did all 2018 stats&hellip;</li>
@ -183,11 +183,56 @@ $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a import -o statistics
<ul>
<li>Update all Docker containers on AReS and rebuild OpenRXV:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
</code></pre><ul>
</code></pre></div><ul>
<li>Then restart the server and start a fresh harvest</li>
<li>Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017 today)</li>
<li>Continue splitting the Solr statistics into yearly shards on DSpace Test (doing 2017, 2016, 2015, and 2014 today)</li>
<li>Several users wrote to me last week to say that workflow emails haven&rsquo;t been working since 2021-10-21 or so
<ul>
<li>I did a test on CGSpace and it&rsquo;s indeed broken:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ dspace test-email
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>About to send test email:
- To: fuuuu
- Subject: DSpace test email
- Server: smtp.office365.com
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Error sending email:
- Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 535 5.7.139 Authentication unsuccessful, the user credentials were incorrect. [AM5PR0701CA0005.eurprd07.prod.outlook.com]
)
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Please see the DSpace documentation for assistance.
</code></pre></div><ul>
<li>I sent a message to ILRI ICT to ask them to check the account/password</li>
<li>I want to do one last test of the Elasticsearch updates on OpenRXV so I got a snapshot of the latest Elasticsearch volume used on the production AReS instance:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># tar czf openrxv_esData_7.tar.xz /var/lib/docker/volumes/openrxv_esData_7
</code></pre></div><ul>
<li>Then on my local server:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ mv ~/.local/share/containers/storage/volumes/openrxv_esData_7/ ~/.local/share/containers/storage/volumes/openrxv_esData_7.2021-11-07.bak
$ tar xf /tmp/openrxv_esData_7.tar.xz -C ~/.local/share/containers/storage/volumes --strip-components<span style="color:#f92672">=</span><span style="color:#ae81ff">4</span>
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type f -exec chmod <span style="color:#ae81ff">660</span> <span style="color:#f92672">{}</span> <span style="color:#ae81ff">\;</span>
$ find ~/.local/share/containers/storage/volumes/openrxv_esData_7 -type d -exec chmod <span style="color:#ae81ff">770</span> <span style="color:#f92672">{}</span> <span style="color:#ae81ff">\;</span>
# copy backend/data to /tmp <span style="color:#66d9ef">for</span> the repository setup/layout
$ rsync -av --partial --progress --delete provisioning@ares:/tmp/data/ backend/data
</code></pre></div><ul>
<li>This seems to work: all items, stats, and repository setup/layout are OK</li>
<li>I merged my <a href="https://github.com/ilri/OpenRXV/pull/126">Elasticsearch pull request</a> from last month into OpenRXV</li>
</ul>
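<p>A minimal sketch of why <code>--strip-components=4</code> fits the restore above, assuming GNU tar (or bsdtar) and a throwaway directory standing in for the real Docker volume. tar drops the leading <code>/</code> when archiving, so members start with <code>var/lib/docker/volumes/</code>, and stripping those four components leaves <code>openrxv_esData_7/</code> at the top of the target directory. The <code>work</code> paths and the <code>_data/segment</code> file are made up for illustration.</p>

```shell
#!/bin/sh
# Hypothetical stand-in for /var/lib/docker/volumes on the production host
work=$(mktemp -d)
mkdir -p "$work/src/var/lib/docker/volumes/openrxv_esData_7/_data"
echo "demo" > "$work/src/var/lib/docker/volumes/openrxv_esData_7/_data/segment"

# "czf" writes gzip data even though the name ends in .xz, as in the notes
tar czf "$work/openrxv_esData_7.tar.xz" -C "$work/src" var/lib/docker/volumes/openrxv_esData_7

# "xf" detects the compression from the file contents, not from the name;
# stripping var/lib/docker/volumes (4 components) keeps the volume directory
mkdir -p "$work/dest"
tar xf "$work/openrxv_esData_7.tar.xz" -C "$work/dest" --strip-components=4

top=$(ls "$work/dest")
restored=$(cat "$work/dest/openrxv_esData_7/_data/segment")
echo "top-level entry: $top"
echo "restored file content: $restored"

rm -rf "$work"
```

<p>If the archive had been created from a different base path, the number of components to strip would change accordingly, so it may be worth listing the archive with <code>tar tf</code> before extracting.</p>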
<h2 id="2021-11-08">2021-11-08</h2>
<ul>
<li>File <a href="https://github.com/DSpace/dspace-angular/issues/1391">an issue for the Angular flash of unstyled content</a> on DSpace 7</li>
<li>Help Udana from IWMI with a question about CGSpace statistics
<ul>
<li>He found conflicting numbers when using the community and collection modes in Content and Usage Analysis</li>
<li>I sent him more numbers directly from the DSpace Statistics API</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

@ -17,7 +17,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="404 Page not found"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -95,9 +95,9 @@
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre>
</code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-11/'>Read more →</a>
</article>
@ -119,15 +119,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre><ul>
</code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-10/'>Read more →</a>
@ -184,8 +184,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-08/'>Read more →</a>
@ -209,9 +209,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre>
</code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>

@ -18,9 +18,9 @@
&lt;li&gt;I experimented with manually sharding the Solr statistics on DSpace Test&lt;/li&gt;
&lt;li&gt;First I exported all the 2019 stats from CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &#39;time:2019-*&#39; -a export -o statistics-2019.json -k uid
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;time:2019-*&amp;#39;&lt;/span&gt; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
&lt;/code&gt;&lt;/pre&gt;</description>
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -33,15 +33,15 @@ $ zstd statistics-2019.json
&lt;ul&gt;
&lt;li&gt;Export all affiliations on CGSpace and run them against the latest RoR data dump:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;quot;cg.contributor.affiliation&amp;quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;#34;cg.contributor.affiliation&amp;#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;So we have 1879/7100 (26.46%) matching already&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -80,8 +80,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Update Docker images on AReS server (linode20) and reboot the server:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;s/ \+/:/g&amp;#39;&lt;/span&gt; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;I decided to upgrade linode20 from Ubuntu 18.04 to 20.04&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -96,9 +96,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
&lt;/code&gt;&lt;/pre&gt;</description>
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -203,17 +203,17 @@ COPY 20994
&lt;li&gt;I had a call with CodeObia to discuss the work on OpenRXV&lt;/li&gt;
&lt;li&gt;Check the results of the AReS harvesting from last night:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&#39;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&amp;#39;&lt;/span&gt;
{
&amp;quot;count&amp;quot; : 100875,
&amp;quot;_shards&amp;quot; : {
&amp;quot;total&amp;quot; : 1,
&amp;quot;successful&amp;quot; : 1,
&amp;quot;skipped&amp;quot; : 0,
&amp;quot;failed&amp;quot; : 0
&amp;#34;count&amp;#34; : 100875,
&amp;#34;_shards&amp;#34; : {
&amp;#34;total&amp;#34; : 1,
&amp;#34;successful&amp;#34; : 1,
&amp;#34;skipped&amp;#34; : 0,
&amp;#34;failed&amp;#34; : 0
}
}
&lt;/code&gt;&lt;/pre&gt;</description>
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -101,17 +101,17 @@
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre>
</code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-02/'>Read more →</a>
</article>

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGIAR Library Migration"/>
<meta name="twitter:description" content="Notes on the migration of the CGIAR Library to CGSpace"/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace CG Core v2 Migration"/>
<meta name="twitter:description" content="Possible changes to CGSpace metadata fields to align more with DC, QDC, and DCTERMS as well as CG Core v2."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -18,7 +18,7 @@
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace DSpace 6 Upgrade"/>
<meta name="twitter:description" content="Documenting the DSpace 6 upgrade."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -129,20 +129,20 @@
</ul>
<h3 id="re-import-oai-with-clean-index">Re-import OAI with clean index</h3>
<p>After the upgrade is complete, re-index all items into OAI with a clean index:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx2048m&quot;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Dfile.encoding=UTF-8 -Xmx2048m&#34;</span>
$ dspace oai -c import
</code></pre><p>The process ran out of memory several times so I had to keep trying again with more JVM heap memory.</p>
</code></pre></div><p>The process ran out of memory several times so I had to keep trying again with more JVM heap memory.</p>
<h3 id="processing-solr-statistics-with-solr-upgrade-statistics-6x">Processing Solr Statistics With solr-upgrade-statistics-6x</h3>
<p>After the main upgrade process was finished and DSpace was running I started processing the Solr statistics with <code>solr-upgrade-statistics-6x</code> to migrate all IDs to UUIDs.</p>
<h2 id="statistics">statistics</h2>
<p>First process the current year&rsquo;s statistics core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx2048m&#39;</span>
$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics
...
=================================================================
*** Statistics Records with Legacy Id ***
3,817,407 Bistream View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 3,817,407 Bistream View
1,693,443 Item View
105,974 Collection View
62,383 Community View
@ -152,22 +152,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
--------------------------------------
6,475,268 TOTAL
=================================================================
</code></pre><p>After several rounds of processing it finished. Here are some statistics about unmigrated documents:</p>
</code></pre></div><p>After several rounds of processing it finished. Here are some statistics about unmigrated documents:</p>
<ul>
<li>227,000: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>471,000: <code>id:/.+-unmigrated/</code></li>
<li>698,000: <code>*:* NOT id:/.{36}/</code></li>
<li>Majority are <code>type: 5</code> (aka SITE, according to <code>Constants.java</code>) so we can purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2019">statistics-2019</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2019">statistics-2019</h2>
<p>Processing the statistics-2019 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2019
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2019
...
=================================================================
*** Statistics Records with Legacy Id ***
5,569,344 Bistream View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 5,569,344 Bistream View
2,179,105 Item View
117,194 Community View
104,091 Collection View
@ -177,22 +177,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
--------------------------------------
10,794,839 TOTAL
=================================================================
</code></pre><p>After several rounds of processing it finished. Here are some statistics about unmigrated documents:</p>
</code></pre></div><p>After several rounds of processing it finished. Here are some statistics about unmigrated documents:</p>
<ul>
<li>2,690,309: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>1,494,587: <code>id:/.+-unmigrated/</code></li>
<li>4,184,896: <code>*:* NOT id:/.{36}/</code></li>
<li>4,172,929 are <code>type: 5</code> (aka SITE) so we can purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2019/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2018">statistics-2018</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2019/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2018">statistics-2018</h2>
<p>Processing the statistics-2018 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2018
...
=================================================================
*** Statistics Records with Legacy Id ***
3,561,532 Bistream View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 3,561,532 Bistream View
1,129,326 Item View
97,401 Community View
63,508 Collection View
@ -202,25 +202,25 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
--------------------------------------
5,561,166 TOTAL
=================================================================
</code></pre><p>After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
</code></pre><p>Eventually the processing finished. Here are some statistics about unmigrated documents:</p>
</code></pre></div><p>After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ export JAVA_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;-Dfile.encoding=UTF-8 -Xmx4096m&#39;</span>
$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2018
</code></pre></div><p>Eventually the processing finished. Here are some statistics about unmigrated documents:</p>
<ul>
<li>365,473: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>546,955: <code>id:/.+-unmigrated/</code></li>
<li>923,158: <code>*:* NOT id:/.{36}/</code></li>
<li>823,293 are <code>type: 5</code> so we can purge them:</li>
</ul>
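The first two queries in these lists should, in principle, split the third one into two disjoint parts, so their counts are expected to add up to it. A small sketch of that arithmetic check follows; `check_partition` is a hypothetical helper, and the figures are the ones recorded in these notes with the thousands separators removed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether two disjoint counts add up to a total.
check_partition() {
    without_suffix="$1"
    with_suffix="$2"
    total="$3"
    sum=$((without_suffix + with_suffix))
    if [ "$sum" -eq "$total" ]; then
        echo "consistent: ${sum}"
    else
        echo "mismatch: ${sum} != ${total} (difference $((total - sum)))"
    fi
}

# Figures as recorded in these notes.
check_partition 2690309 1494587 4184896   # statistics-2019
check_partition 365473 546955 923158      # statistics-2018
```

With the recorded figures, the statistics-2019 counts add up, while the statistics-2018 counts differ by 10,730. That may simply mean the three queries were run at different points during processing.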
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2017">statistics-2017</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2018/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2017">statistics-2017</h2>
<p>Processing the statistics-2017 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2017
...
=================================================================
*** Statistics Records with Legacy Id ***
2,529,208 Bistream View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 2,529,208 Bistream View
1,618,717 Item View
144,945 Community View
74,249 Collection View
@ -230,22 +230,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
5,813,639 TOTAL
=================================================================
</code></pre><p>Eventually the processing finished. Here are some statistics about unmigrated documents:</p>
</code></pre></div><p>Eventually the processing finished. Here are some statistics about unmigrated documents:</p>
<ul>
<li>808,309: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>893,868: <code>id:/.+-unmigrated/</code></li>
<li>1,702,177: <code>*:* NOT id:/.{36}/</code></li>
<li>1,660,524 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2016">statistics-2016</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2017/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2016">statistics-2016</h2>
<p>Processing the statistics-2016 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2016
...
=================================================================
*** Statistics Records with Legacy Id ***
1,765,924 Bistream View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 1,765,924 Bistream View
1,151,575 Item View
187,110 Community View
51,204 Collection View
@ -255,21 +255,21 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
4,190,098 TOTAL
=================================================================
</code></pre><ul>
</code></pre></div><ul>
<li>849,408: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>627,747: <code>id:/.+-unmigrated/</code></li>
<li>1,477,155: <code>*:* NOT id:/.{36}/</code></li>
<li>1,469,706 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2016/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2015">statistics-2015</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2016/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2015">statistics-2015</h2>
<p>Processing the statistics-2015 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2015
...
=================================================================
*** Statistics Records with Legacy Id ***
990,916 Bistream View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 990,916 Bistream View
506,070 Item View
116,153 Community View
33,282 Collection View
@ -279,22 +279,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
1,730,378 TOTAL
=================================================================
</code></pre><p>Summary of stats after processing:</p>
</code></pre></div><p>Summary of stats after processing:</p>
<ul>
<li>195,293: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>67,146: <code>id:/.+-unmigrated/</code></li>
<li>262,439: <code>*:* NOT id:/.{36}/</code></li>
<li>247,400 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2014">statistics-2014</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2015/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2014">statistics-2014</h2>
<p>Processing the statistics-2014 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2014
...
=================================================================
*** Statistics Records with Legacy Id ***
2,381,603 Item View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 2,381,603 Item View
1,323,357 Bistream View
501,545 Community View
247,805 Collection View
@ -305,22 +305,22 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
4,465,716 TOTAL
=================================================================
</code></pre><p>Summary of unmigrated documents after processing:</p>
</code></pre></div><p>Summary of unmigrated documents after processing:</p>
<ul>
<li>182,131: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>39,947: <code>id:/.+-unmigrated/</code></li>
<li>222,078: <code>*:* NOT id:/.{36}/</code></li>
<li>188,791 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2014/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2013">statistics-2013</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2014/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2013">statistics-2013</h2>
<p>Processing the statistics-2013 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2013
...
=================================================================
*** Statistics Records with Legacy Id ***
2,352,124 Item View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 2,352,124 Item View
1,117,676 Bistream View
575,711 Community View
171,639 Collection View
@ -331,81 +331,81 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
--------------------------------------
4,218,862 TOTAL
=================================================================
</code></pre><p>Summary of unmigrated docs after processing:</p>
</code></pre></div><p>Summary of unmigrated docs after processing:</p>
<ul>
<li>2,548: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>29,772: <code>id:/.+-unmigrated/</code></li>
<li>32,320: <code>*:* NOT id:/.{36}/</code></li>
<li>15,691 are <code>type: 5</code> (SITE) so we can purge them:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2013/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2012">statistics-2012</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2013/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2012">statistics-2012</h2>
<p>Processing the statistics-2012 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2012
...
=================================================================
*** Statistics Records with Legacy Id ***
2,229,332 Item View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 2,229,332 Item View
913,577 Bistream View
215,577 Collection View
104,734 Community View
--------------------------------------
3,463,220 TOTAL
=================================================================
</code></pre><p>Summary of unmigrated docs after processing:</p>
</code></pre></div><p>Summary of unmigrated docs after processing:</p>
<ul>
<li>0: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>33,161: <code>id:/.+-unmigrated/</code></li>
<li>33,161: <code>*:* NOT id:/.{36}/</code></li>
<li>33,161 are <code>type: 3</code> (COLLECTION), which is different than I&rsquo;ve seen previously&hellip; but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2012/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2011">statistics-2011</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2012/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2011">statistics-2011</h2>
<p>Processing the statistics-2011 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2011
...
=================================================================
*** Statistics Records with Legacy Id ***
904,896 Item View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 904,896 Item View
385,789 Bistream View
154,356 Collection View
62,978 Community View
--------------------------------------
1,508,019 TOTAL
=================================================================
</code></pre><p>Summary of unmigrated docs after processing:</p>
</code></pre></div><p>Summary of unmigrated docs after processing:</p>
<ul>
<li>0: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>17,551: <code>id:/.+-unmigrated/</code></li>
<li>17,551: <code>*:* NOT id:/.{36}/</code></li>
<li>12,116 are <code>type: 3</code> (COLLECTION), which is different than I&rsquo;ve seen previously&hellip; but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2011/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h2 id="statistics-2010">statistics-2010</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2011/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h2 id="statistics-2010">statistics-2010</h2>
<p>Processing the statistics-2010 core:</p>
<pre tabindex="0"><code class="language-console" data-lang="console">$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2010
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ chrt -b <span style="color:#ae81ff">0</span> dspace solr-upgrade-statistics-6x -n <span style="color:#ae81ff">2500000</span> -i statistics-2010
...
=================================================================
*** Statistics Records with Legacy Id ***
26,067 Item View
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span> 26,067 Item View
15,615 Bistream View
4,116 Collection View
1,094 Community View
--------------------------------------
46,892 TOTAL
=================================================================
</code></pre><p>Summary of unmigrated docs after processing:</p>
</code></pre></div><p>Summary of unmigrated docs after processing:</p>
<ul>
<li>0: <code>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</code></li>
<li>1,012: <code>id:/.+-unmigrated/</code></li>
<li>1,012: <code>*:* NOT id:/.{36}/</code></li>
<li>654 are <code>type: 3</code> (COLLECTION), which is different than I&rsquo;ve seen previously&hellip; but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary &quot;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&quot;
</code></pre><h3 id="processing-solr-statistics-with-atomicstatisticsupdatecli">Processing Solr statistics with AtomicStatisticsUpdateCLI</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#34;http://localhost:8081/solr/statistics-2010/update?softCommit=true&#34;</span> -H <span style="color:#e6db74">&#34;Content-Type: text/xml&#34;</span> --data-binary <span style="color:#e6db74">&#34;&lt;delete&gt;&lt;query&gt;*:* NOT id:/.{36}/&lt;/query&gt;&lt;/delete&gt;&#34;</span>
</code></pre></div><h3 id="processing-solr-statistics-with-atomicstatisticsupdatecli">Processing Solr statistics with AtomicStatisticsUpdateCLI</h3>
<p>On 2020-11-18 I finished processing the Solr statistics with solr-upgrade-statistics-6x and I started processing them with AtomicStatisticsUpdateCLI.</p>
<h2 id="statistics-1">statistics</h2>
<p>First the current year&rsquo;s statistics core, in 12-hour batches:</p>

View File

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -110,9 +110,9 @@
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre>
</code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-11/'>Read more →</a>
</article>
@ -134,15 +134,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre><ul>
</code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-10/'>Read more →</a>
@ -199,8 +199,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-08/'>Read more →</a>
@ -224,9 +224,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre>
</code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>

View File

@ -18,9 +18,9 @@
&lt;li&gt;I experimented with manually sharding the Solr statistics on DSpace Test&lt;/li&gt;
&lt;li&gt;First I exported all the 2019 stats from CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &#39;time:2019-*&#39; -a export -o statistics-2019.json -k uid
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;time:2019-*&amp;#39;&lt;/span&gt; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
&lt;/code&gt;&lt;/pre&gt;</description>
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -33,15 +33,15 @@ $ zstd statistics-2019.json
&lt;ul&gt;
&lt;li&gt;Export all affiliations on CGSpace and run them against the latest RoR data dump:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;quot;cg.contributor.affiliation&amp;quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;#34;cg.contributor.affiliation&amp;#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;So we have 1879/7100 (26.46%) matching already&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -80,8 +80,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Update Docker images on AReS server (linode20) and reboot the server:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;s/ \+/:/g&amp;#39;&lt;/span&gt; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;I decided to upgrade linode20 from Ubuntu 18.04 to 20.04&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -96,9 +96,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
&lt;/code&gt;&lt;/pre&gt;</description>
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -203,17 +203,17 @@ COPY 20994
&lt;li&gt;I had a call with CodeObia to discuss the work on OpenRXV&lt;/li&gt;
&lt;li&gt;Check the results of the AReS harvesting from last night:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&#39;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&amp;#39;&lt;/span&gt;
{
&amp;quot;count&amp;quot; : 100875,
&amp;quot;_shards&amp;quot; : {
&amp;quot;total&amp;quot; : 1,
&amp;quot;successful&amp;quot; : 1,
&amp;quot;skipped&amp;quot; : 0,
&amp;quot;failed&amp;quot; : 0
&amp;#34;count&amp;#34; : 100875,
&amp;#34;_shards&amp;#34; : {
&amp;#34;total&amp;#34; : 1,
&amp;#34;successful&amp;#34; : 1,
&amp;#34;skipped&amp;#34; : 0,
&amp;#34;failed&amp;#34; : 0
}
}
&lt;/code&gt;&lt;/pre&gt;</description>
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -116,17 +116,17 @@
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre>
</code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-02/'>Read more →</a>
</article>
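A note on the `_count` check in this hunk: the response can be reduced to the bare number for use in scripts. This is a minimal sketch, assuming the pretty-printed layout shown in the hunk (one key per line). `extract_count` is a hypothetical helper name, not something from the notes or from Elasticsearch.

```shell
# extract_count: read a pretty-printed Elasticsearch _count response on stdin
# and print only the numeric value of the top-level "count" key.
# Hypothetical helper for illustration; assumes one key per line.
extract_count() {
    sed -n 's/^[[:space:]]*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Sample response copied from the hunk above.
printf '%s\n' \
    '{' \
    '  "count" : 100875,' \
    '  "_shards" : {' \
    '    "total" : 1,' \
    '    "successful" : 1,' \
    '    "skipped" : 0,' \
    '    "failed" : 0' \
    '  }' \
    '}' | extract_count
```

Against a live instance the same filter could be fed from `curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'`. Without `pretty` the response arrives on a single line and this pattern would not match.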

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -110,9 +110,9 @@
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre>
</code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-11/'>Read more →</a>
</article>
@ -134,15 +134,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre><ul>
</code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-10/'>Read more →</a>
@ -199,8 +199,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-08/'>Read more →</a>
@ -224,9 +224,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre>
</code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>
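A note on the RoR matching hunk in this file: the "1879/7100 (26.46%)" figure can be reproduced from the two counts printed by `wc -l`. This is a minimal sketch, assuming the counts are already known. `match_rate` is a hypothetical helper name and is not part of the notes or of `ror-lookup.py`.

```shell
# match_rate MATCHED TOTAL: print the matched share as a percentage
# with two decimal places. Hypothetical helper for illustration only.
match_rate() {
    awk -v m="$1" -v t="$2" 'BEGIN {
        if (t == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.2f\n", 100 * m / t
    }'
}

# Counts taken from the hunk above: 1879 matched out of 7100 affiliations.
match_rate 1879 7100
```

The two arguments could equally be captured from the `csvgrep ... | wc -l` and `wc -l` commands in the hunk rather than typed by hand.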

@ -18,9 +18,9 @@
&lt;li&gt;I experimented with manually sharding the Solr statistics on DSpace Test&lt;/li&gt;
&lt;li&gt;First I exported all the 2019 stats from CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &#39;time:2019-*&#39; -a export -o statistics-2019.json -k uid
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;time:2019-*&amp;#39;&lt;/span&gt; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
&lt;/code&gt;&lt;/pre&gt;</description>
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -33,15 +33,15 @@ $ zstd statistics-2019.json
&lt;ul&gt;
&lt;li&gt;Export all affiliations on CGSpace and run them against the latest RoR data dump:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;quot;cg.contributor.affiliation&amp;quot;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;#34;cg.contributor.affiliation&amp;#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;So we have 1879/7100 (26.46%) matching already&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -80,8 +80,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Update Docker images on AReS server (linode20) and reboot the server:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &#39;s/ \+/:/g&#39; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;s/ \+/:/g&amp;#39;&lt;/span&gt; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;I decided to upgrade linode20 from Ubuntu 18.04 to 20.04&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -96,9 +96,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
&lt;/code&gt;&lt;/pre&gt;</description>
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -203,17 +203,17 @@ COPY 20994
&lt;li&gt;I had a call with CodeObia to discuss the work on OpenRXV&lt;/li&gt;
&lt;li&gt;Check the results of the AReS harvesting from last night:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&#39;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&amp;#39;&lt;/span&gt;
{
&amp;quot;count&amp;quot; : 100875,
&amp;quot;_shards&amp;quot; : {
&amp;quot;total&amp;quot; : 1,
&amp;quot;successful&amp;quot; : 1,
&amp;quot;skipped&amp;quot; : 0,
&amp;quot;failed&amp;quot; : 0
&amp;#34;count&amp;#34; : 100875,
&amp;#34;_shards&amp;#34; : {
&amp;#34;total&amp;#34; : 1,
&amp;#34;successful&amp;#34; : 1,
&amp;#34;skipped&amp;#34; : 0,
&amp;#34;failed&amp;#34; : 0
}
}
&lt;/code&gt;&lt;/pre&gt;</description>
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
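A note on the `docker images | ... | xargs -L1 docker pull` pipeline in this file: the text-processing part can be tried on sample input without touching Docker. This is a minimal sketch, assuming the default column layout of `docker images`. `image_refs` is a hypothetical helper name, the sample rows are invented for illustration, and the sed expression uses the portable `s/  */:/g` form in place of the GNU-specific `\+`.

```shell
# image_refs: turn `docker images` table output (stdin) into repository:tag
# pairs, skipping the header row and untagged (<none>) images.
# Hypothetical helper; mirrors the filter stages of the pipeline in the notes.
image_refs() {
    grep -v '^REPO' | sed 's/  */:/g' | cut -d: -f1,2 | grep -v none
}

# Invented sample rows in the default `docker images` layout.
printf '%s\n' \
    'REPOSITORY   TAG      IMAGE ID       CREATED       SIZE' \
    'nginx        latest   0123456789ab   2 weeks ago   133MB' \
    '<none>       <none>   ba9876543210   3 weeks ago   100MB' | image_refs
```

Because the fields are split on `:`, a repository that includes a registry port (for example `localhost:5000/app`) would be cut in the wrong place. `docker images --format '{{.Repository}}:{{.Tag}}'` is an alternative that avoids parsing the table.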

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />
@ -116,17 +116,17 @@
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&quot;count&quot; : 100875,
&quot;_shards&quot; : {
&quot;total&quot; : 1,
&quot;successful&quot; : 1,
&quot;skipped&quot; : 0,
&quot;failed&quot; : 0
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre>
</code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-02/'>Read more →</a>
</article>

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-03T15:56:15+02:00" />
<meta property="og:updated_time" content="2021-11-07T11:26:32+02:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.88.1" />
<meta name="generator" content="Hugo 0.89.2" />

Some files were not shown because too many files have changed in this diff.