diff --git a/content/posts/2021-09.md b/content/posts/2021-09.md
index c7b9b0671..8c7ab65d8 100644
--- a/content/posts/2021-09.md
+++ b/content/posts/2021-09.md
@@ -29,4 +29,46 @@ $ docker-compose build
 - Then run system updates and reboot the server
 - After the system came back up I started a fresh re-harvesting
+## 2021-09-07
+
+- Checking last month's Solr statistics to see if there are any new bots that I need to purge and add to the list
+  - 78.203.225.68 made 50,000 requests on one day in August, and it is using this user agent: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36`
+    - It's a fixed-line ISP in Montpellier according to AbuseIPDB.com, and has not been flagged as abusive, so it must be some CGIAR SMO person doing some web application harvesting from the browser
+  - 130.255.162.154 is in Sweden and made 46,000 requests in August, using this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
+  - 35.174.144.154 is on Amazon and made 28,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36`
+  - 192.121.135.6 is in Sweden and made 9,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
+  - 185.38.40.66 is in Germany and made 6,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/1.10.2.4`
+  - 3.225.28.105 is on Amazon and made 3,000 requests with this user agent: `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36`
+  - I also noticed that we still have tons (25,000) of requests by MSNbot using this normal-looking user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
+    - I can identify them by their reverse DNS: msnbot-40-77-167-105.search.msn.com.
+    - I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again
+  - While looking at the MSN requests I noticed tons of requests from other strange hosts whose reverse DNS is malta2095.startdedicated.com., astra5139.startdedicated.com., and many others
+    - They must be related, because I see them all using the exact same user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
+    - So this startdedicated.com DNS is some Bing bot also...
+- I extracted all the IPs and purged them using my `check-spider-ip-hits.sh` script
+  - In total I purged 225,000 hits...
+
+## 2021-09-12
+
+- Start a harvest on AReS
+
+## 2021-09-13
+
+- Mishell Portilla asked me about thumbnails on CGSpace being small
+  - For example, [10568/114576](https://cgspace.cgiar.org/handle/10568/114576) has a lot of white space on the left side
+  - I created a new thumbnail with vipsthumbnail:
+
+```console
+$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
+```
+
+- Looking at the PDF's metadata I see:
+  - Producer: iLovePDF
+  - Creator: Adobe InDesign 15.0 (Windows)
+  - Format: PDF-1.7
+- Eventually I should do more tests on this and perhaps file a bug with DSpace...
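As an aside (not part of the original notes): those Producer/Creator/Format fields can be checked on the command line with pdfinfo from poppler-utils. A minimal sketch, assuming the same ARRTB2020ST.pdf file as above:

```console
$ pdfinfo ARRTB2020ST.pdf | grep -E 'Creator|Producer|PDF version'
```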
+- Some Alliance people contacted me about getting access to the CGSpace API to deposit with their TIP tool
+  - I told them I can give them access to DSpace Test and that we should have a meeting soon
+  - We need to figure out what controlled vocabularies they should use
+
diff --git a/docs/2015-11/index.html b/docs/2015-11/index.html
index 3cb6f6312..c39a250c6 100644
--- a/docs/2015-11/index.html
+++ b/docs/2015-11/index.html
@@ -34,7 +34,7 @@ Last week I had increased the limit from 30 to 60, which seemed to help, but now
 $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
 78
 "/>
-
+
@@ -126,7 +126,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac
  • Looks like DSpace exhausted its PostgreSQL connection pool
  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
  • -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
     
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     96
     
    • For some reason the number of idle connections is very high since we upgraded to DSpace 5
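A sketch (not from the original notes) of how those connections could be broken down by state instead of just counted, which works on PostgreSQL 9.2 and later:

$ psql -c "SELECT datname, state, count(*) FROM pg_stat_activity GROUP BY datname, state ORDER BY count DESC;"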
    • @@ -147,7 +147,7 @@ $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspac
    • Troubleshoot the DSpace 5 OAI breakage caused by nginx routing config
    • The OAI application requests stylesheets and javascript files with the path /oai/static/css, which gets matched here:
    -
    # static assets we can load from the file system directly with nginx
    +
    # static assets we can load from the file system directly with nginx
     location ~ /(themes|static|aspects/ReportingSuite) {
         try_files $uri @tomcat;
     ...
    @@ -158,21 +158,21 @@ location ~ /(themes|static|aspects/ReportingSuite) {
     
  • We simply need to add include extra-security.conf; to the above location block (but research and test first)
  • We should add WOFF assets to the list of things to set expires for:
  • -
    location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
    +
    location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
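A sketch of how those two tweaks might look together (untested; it assumes extra-security.conf contains the security headers, and the one-month expires value is arbitrary):

location ~ /(themes|static|aspects/ReportingSuite) {
    try_files $uri @tomcat;
    include extra-security.conf;
}

location ~* \.(?:ico|css|js|gif|jpe?g|png|woff)$ {
    try_files $uri @tomcat;
    expires 1M;
    add_header Cache-Control "public";
}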
     
    • We should also add aspects/Statistics to the location block for static assets (minus static from above):
    -
    location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
    +
    location ~ /(themes|aspects/ReportingSuite|aspects/Statistics) {
     
    • Need to check /about on CGSpace, as it’s blank on my local test server and we might need to add something there
    • CGSpace has been up and down all day due to PostgreSQL idle connections (current DSpace pool is 90):
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     93
     
    • I looked closer at the idle connections and saw that many have been idle for hours (current time on server is 2015-11-25T20:20:42+0000):
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | less -S
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | less -S
     datid | datname  |  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |
     -------+----------+-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+---
     20951 | cgspace  | 10966 |    18205 | cgspace  |                  | 127.0.0.1   |                 |       37731 | 2015-11-25 13:13:02.837624+00 |                               | 20
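To see how long connections have actually been idle, a query like this (a sketch, not from the original notes) sorts them by idle time:

$ psql -c "SELECT pid, usename, state, now() - state_change AS idle_for FROM pg_stat_activity WHERE state = 'idle' ORDER BY idle_for DESC LIMIT 10;"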
    @@ -191,13 +191,13 @@ datid | datname  |  pid  | usesysid | usename  | application_name | client_addr
     
  • CCAFS colleagues mentioned that the REST API is very slow, 24 seconds for one item
  • Not as bad for me, but still unsustainable if you have to get many:
  • -
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
    +
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     8.415
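To get a feel for what harvesting many items would cost, the same request can be timed in a loop (a sketch):

$ for i in $(seq 1 5); do curl -o /dev/null -s -w '%{time_total}\n' 'https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all'; done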
     
    • Monitoring e-mailed in the evening to say CGSpace was down
    • Idle connections in PostgreSQL again:
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     66
     
    • At the time, the current DSpace pool size was 50…
    • @@ -208,14 +208,14 @@ datid | datname | pid | usesysid | usename | application_name | client_addr
    • Still more alerts that CGSpace has been up and down all day
    • Current database settings for DSpace:
    -
    db.maxconnections = 30
    +
    db.maxconnections = 30
     db.maxwait = 5000
     db.maxidle = 8
     db.statementpool = true
     
    • And idle connections:
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     49
     
    • Perhaps I need to start drastically increasing the connection limits—like to 300—to see if DSpace’s thirst can ever be quenched
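Whatever value db.maxconnections ends up at, it has to stay below PostgreSQL's own connection limit, which can be checked like this (a sketch):

$ psql -c 'SHOW max_connections;'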
    • diff --git a/docs/2015-12/index.html b/docs/2015-12/index.html index 5831fb5b9..587298078 100644 --- a/docs/2015-12/index.html +++ b/docs/2015-12/index.html @@ -36,7 +36,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo -rw-rw-r-- 1 tomcat7 tomcat7 169K Nov 18 23:59 dspace.log.2015-11-18.xz "/> - + @@ -126,7 +126,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
      • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:
      -
      # cd /home/dspacetest.cgiar.org/log
      +
      # cd /home/dspacetest.cgiar.org/log
       # ls -lh dspace.log.2015-11-18*
       -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
       -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
      @@ -137,20 +137,20 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
       
    • CGSpace went down again (due to PostgreSQL idle connections of course)
    • Current database settings for DSpace are db.maxconnections = 30 and db.maxidle = 8, yet idle connections are exceeding this:
    -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     39
     
    • I restarted PostgreSQL and Tomcat and it’s back
    • On a related note of why CGSpace is so slow, I decided to finally try the pgtune script to tune the postgres settings:
    -
    # apt-get install pgtune
    +
    # apt-get install pgtune
     # pgtune -i /etc/postgresql/9.3/main/postgresql.conf -o postgresql.conf-pgtune
     # mv /etc/postgresql/9.3/main/postgresql.conf /etc/postgresql/9.3/main/postgresql.conf.orig 
     # mv postgresql.conf-pgtune /etc/postgresql/9.3/main/postgresql.conf
     
    • It introduced the following new settings:
    -
    default_statistics_target = 50
    +
    default_statistics_target = 50
     maintenance_work_mem = 480MB
     constraint_exclusion = on
     checkpoint_completion_target = 0.9
    @@ -164,7 +164,7 @@ max_connections = 80
     
  • Now I need to go read PostgreSQL docs about these options, and watch memory settings in munin etc
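One way to confirm the new values are actually live after restarting PostgreSQL (a sketch, not from the original notes):

$ psql -c "SELECT name, setting, unit FROM pg_settings WHERE name IN ('default_statistics_target', 'maintenance_work_mem', 'checkpoint_completion_target', 'max_connections');"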
  • For what it’s worth, now the REST API should be faster (because of these PostgreSQL tweaks):
  • -
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
    +
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     1.474
     $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     2.141
    @@ -189,7 +189,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
     
  • CGSpace very slow, and monitoring emailing me to say it's down, even though I can load the page (very slowly)
  • Idle postgres connections look like this (with no change in DSpace db settings lately):
  • -
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    $ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     29
     
    • I restarted Tomcat and postgres…
    • @@ -197,7 +197,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
    • We weren’t out of heap yet, but it’s probably fair enough that the DSpace 5 upgrade (and new Atmire modules) requires more memory so it’s ok
    • A possible side effect is that I see that the REST API is twice as fast for the request above now:
    -
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
    +
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     1.368
     $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.968
    @@ -214,7 +214,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
     
  • CGSpace has been up and down all day and REST API is completely unresponsive
  • PostgreSQL idle connections are currently:
  • -
    postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
    +
    postgres@linode01:~$ psql -c 'SELECT * from pg_stat_activity;' | grep cgspace | grep -c idle
     28
     
    • I have reverted all the pgtune tweaks from the other day, as they didn’t fix the stability issues, so I’d rather not have them introducing more variables into the equation
    • @@ -229,7 +229,7 @@ $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle
    • Atmire sent some fixes to DSpace’s REST API code that was leaving contexts open (causing the slow performance and database issues)
    • After deploying the fix to CGSpace the REST API is consistently faster:
    -
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
    +
    $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.675
     $ curl -o /dev/null -s -w %{time_total}\\n https://cgspace.cgiar.org/rest/handle/10568/32802?expand=all
     0.599
    diff --git a/docs/2016-01/index.html b/docs/2016-01/index.html
    index 3b3c1261b..79b5b7694 100644
    --- a/docs/2016-01/index.html
    +++ b/docs/2016-01/index.html
    @@ -28,7 +28,7 @@ Move ILRI collection 10568/12503 from 10568/27869 to 10568/27629 using the move_
     I realized it is only necessary to clear the Cocoon cache after moving collections—rather than reindexing—as no metadata has changed, and therefore no search or browse indexes need to be updated.
     Update GitHub wiki for documentation of maintenance tasks.
     "/>
    -
    +
     
     
         
    diff --git a/docs/2016-02/index.html b/docs/2016-02/index.html
    index 6d6755b5a..0f22cef25 100644
    --- a/docs/2016-02/index.html
    +++ b/docs/2016-02/index.html
    @@ -38,7 +38,7 @@ I noticed we have a very interesting list of countries on CGSpace:
     Not only are there 49,000 countries, we have some blanks (25)…
     Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
     "/>
    -
    +
     
     
         
    @@ -140,20 +140,20 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
     
  • Found a way to get items with null/empty metadata values from SQL
  • First, find the metadata_field_id for the field you want from the metadatafieldregistry table:
  • -
    dspacetest=# select * from metadatafieldregistry;
    +
    dspacetest=# select * from metadatafieldregistry;
     
    • In this case our country field is 78
    • Now find all resources with type 2 (item) that have null/empty values for that field:
    -
    dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
    +
    dspacetest=# select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=78 and (text_value='' OR text_value IS NULL);
     
    • Then you can find the handle that owns it from its resource_id:
    -
    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
    +
    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
     
    • It’s 25 items so editing in the web UI is annoying, let’s try SQL!
    -
    dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
    +
    dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
     DELETE 25
     
    • After that perhaps a regular dspace index-discovery (no -b) should suffice…
    • @@ -171,7 +171,7 @@ DELETE 25
    • I need to start running DSpace in Mac OS X instead of a Linux VM
    • Install PostgreSQL from homebrew, then configure and import CGSpace database dump:
    -
    $ postgres -D /opt/brew/var/postgres
    +
    $ postgres -D /opt/brew/var/postgres
     $ createuser --superuser postgres
     $ createuser --pwprompt dspacetest
     $ createdb -O dspacetest --encoding=UNICODE dspacetest
    @@ -187,7 +187,7 @@ $ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sq
     
    • After building and running a fresh_install I symlinked the webapps into Tomcat’s webapps folder:
    -
    $ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
    +
    $ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
     $ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
     $ ln -sfv ~/dspace/webapps/rest /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/rest
     $ ln -sfv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/jspui
    @@ -198,11 +198,11 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
     
  • Add CATALINA_OPTS in /opt/brew/Cellar/tomcat/8.0.30/libexec/bin/setenv.sh, as this script is sourced by the catalina startup script
  • For example:
  • -
    CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
    +
    CATALINA_OPTS="-Djava.awt.headless=true -Xms2048m -Xmx2048m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8"
     
    • After verifying that the site is working, start a full index:
    -
    $ ~/dspace/bin/dspace index-discovery -b
    +
    $ ~/dspace/bin/dspace index-discovery -b
     

    2016-02-08

    • Finish cleaning up and importing ~400 DAGRIS items into CGSpace
    • @@ -216,7 +216,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
    • Help Sisay with OpenRefine
    • Enable HTTPS on DSpace Test using Let’s Encrypt:
    -
    $ cd ~/src/git
    +
    $ cd ~/src/git
     $ git clone https://github.com/letsencrypt/letsencrypt
     $ cd letsencrypt
     $ sudo service nginx stop
    @@ -231,15 +231,15 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
     
  • Getting more and more hangs on DSpace Test, seemingly random but also during CSV import
  • Logs don’t always show anything right when it fails, but eventually one of these appears:
  • -
    org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
    +
    org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
     
    • or
    -
    Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
    +
    Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
     
    • Right now DSpace Test’s Tomcat heap is set to 1536m and we have quite a bit of free RAM:
    -
    # free -m
    +
    # free -m
                  total       used       free     shared    buffers     cached
     Mem:          3950       3902         48          9         37       1311
     -/+ buffers/cache:       2552       1397
    @@ -253,11 +253,11 @@ Swap:          255         57        198
     
  • There are 1200 records that have PDFs, and will need to be imported into CGSpace
  • I created a filename column based on the dc.identifier.url column using the following transform:
  • -
    value.split('/')[-1]
    +
    value.split('/')[-1]
     
    • Then I wrote a tool called generate-thumbnails.py to download the PDFs and generate thumbnails for them, for example:
    -
    $ ./generate-thumbnails.py ciat-reports.csv
    +
    $ ./generate-thumbnails.py ciat-reports.csv
     Processing 64661.pdf
     > Downloading 64661.pdf
     > Creating thumbnail for 64661.pdf
    @@ -278,13 +278,13 @@ Processing 64195.pdf
     
  • Looking at CIAT’s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I’m not sure if we can use those
  • 265 items have dirty, URL-encoded filenames:
  • -
    $ ls | grep -c -E "%"
    +
    $ ls | grep -c -E "%"
     265
     
    • I suggest that we import ~850 or so of the clean ones first, then do the rest after I can find a clean/reliable way to decode the filenames
    • This python2 snippet seems to work in the CLI, but not so well in OpenRefine:
    -
    $ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
    +
    $ python -c "import urllib, sys; print urllib.unquote(sys.argv[1])" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
     CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
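For reference, the Python 3 equivalent of that snippet would be (a sketch; the notes used Python 2 at the time):

$ python3 -c "import urllib.parse, sys; print(urllib.parse.unquote(sys.argv[1]))" CIAT_COLOMBIA_000169_T%C3%A9cnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf
CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_yuca.pdf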
     
    • Merge pull requests for submission form theming (#178) and missing center subjects in XMLUI item views (#176)
    • @@ -294,7 +294,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
      • Turns out OpenRefine has an unescape function!
      -
      value.unescape("url")
      +
      value.unescape("url")
       
      • This turns the URLs into human-readable versions that we can use as proper filenames
      • Run web server and system updates on DSpace Test and reboot
      • @@ -316,7 +316,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
      • Turns out the “bug” in SAFBuilder isn’t a bug, it’s a feature that allows you to encode extra information like the destination bundle in the filename
      • Also, it seems DSpace’s SAF import tool doesn’t like importing filenames that have accents in them:
      -
      java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
      +
      java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
       
      • Need to rename files to have no accents or umlauts, etc…
      • Useful custom text facet for URLs ending with “.pdf”: value.endsWith(".pdf")
      • @@ -325,12 +325,12 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
        • To change Spanish accents to ASCII in OpenRefine:
        -
        value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
        +
        value.replace('ó','o').replace('í','i').replace('á','a').replace('é','e').replace('ñ','n')
         
        • But actually, the accents might not be an issue, as I can successfully import files containing Spanish accents on my Mac
        • On closer inspection, I can import files with the following names on Linux (DSpace Test):
        -
        Bitstream: tést.pdf
        +
        Bitstream: tést.pdf
         Bitstream: tést señora.pdf
         Bitstream: tést señora alimentación.pdf
         
          @@ -353,7 +353,7 @@ Bitstream: tést señora alimentación.pdf
        • Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: ' or , or = or [ or ] or ( or ) or _.pdf or ._ etc
        • It’s tricky to parse those things in some programming languages so I’d rather just get rid of the weird stuff now in OpenRefine:
        -
        value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
        +
        value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
         
        • Finally import the 1127 CIAT items into CGSpace: https://cgspace.cgiar.org/handle/10568/35710
        • Re-deploy CGSpace with the Google Scholar fix, but I’m waiting on the Atmire fixes for now, as the branch history is ugly
        • diff --git a/docs/2016-03/index.html b/docs/2016-03/index.html index ed99e8b29..f1dfaf0e2 100644 --- a/docs/2016-03/index.html +++ b/docs/2016-03/index.html @@ -28,7 +28,7 @@ Looking at issues with author authorities on CGSpace For some reason we still have the index-lucene-update cron job active on CGSpace, but I’m pretty sure we don’t need it as of the latest few versions of Atmire’s Listings and Reports module Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Java JDK 1.7 to match environment on CGSpace server "/> - + @@ -128,7 +128,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
        • I identified one commit that causes the issue and let them know
        • Restart DSpace Test, as it seems to have crashed after Sisay tried to import some CSV or zip or something:
        -
        Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
        +
        Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
         

        2016-03-08

        • Add a few new filters to Atmire’s Listings and Reports module (#180)
        • @@ -175,7 +175,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
        • Help Sisay with some PostgreSQL queries to clean up the incorrect dc.contributor.corporateauthor field
        • I noticed that we have some weird values in dc.language:
        -
        # select * from metadatavalue where metadata_field_id=37;
        +
        # select * from metadatavalue where metadata_field_id=37;
          metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
         -------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
                    1942571 |       35342 |                37 | hi         |           |     1 |           |         -1 |                2
        @@ -215,7 +215,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
         
        • Command used:
        -
        $ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
        +
        $ gm convert -trim -quality 82 -thumbnail x300 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg
         
        • Also, it looks like adding -sharpen 0x1.0 really improves the quality of the image for only a few KB
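With the sharpening added, the command would look something like this (a sketch based on the command above):

$ gm convert -trim -quality 82 -thumbnail x300 -sharpen 0x1.0 -flatten Descriptor\ for\ Butia_EN-2015_2021.pdf\[0\] cover.jpg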
        @@ -261,7 +261,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja -
        Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
        +
        Can't find method org.dspace.app.xmlui.aspect.administrative.FlowGroupUtils.processSaveGroup(org.dspace.core.Context,number,string,[Ljava.lang.String;,[Ljava.lang.String;,org.apache.cocoon.environment.wrapper.RequestWrapper). (resource://aspects/Administrative/administrative.js#967)
         
        • I can reproduce the same error on DSpace Test and on my Mac
        • Looks to be an issue with the Atmire modules, I’ve submitted a ticket to their tracker.
        • diff --git a/docs/2016-04/index.html b/docs/2016-04/index.html index e8ba1e05f..31d488aa9 100644 --- a/docs/2016-04/index.html +++ b/docs/2016-04/index.html @@ -32,7 +32,7 @@ After running DSpace for over five years I’ve never needed to look in any This will save us a few gigs of backup space we’re paying for on S3 Also, I noticed the checker log has some errors we should pay attention to: "/> - + @@ -126,7 +126,7 @@ Also, I noticed the checker log has some errors we should pay attention to:
        • This will save us a few gigs of backup space we’re paying for on S3
        • Also, I noticed the checker log has some errors we should pay attention to:
        -
        Run start time: 03/06/2016 04:00:22
        +
        Run start time: 03/06/2016 04:00:22
         Error retrieving bitstream ID 71274 from asset store.
         java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290601546459645925328536011917633626 (Too many open files)
                 at java.io.FileInputStream.open(Native Method)
        @@ -158,7 +158,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
         
        • Reduce Amazon S3 storage used for logs from 46 GB to 6GB by deleting a bunch of logs we don’t need!
        -
        # s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
        +
        # s3cmd ls s3://cgspace.cgiar.org/log/ > /tmp/s3-logs.txt
         # grep checker.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
         # grep cocoon.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
         # grep handle-plugin.log /tmp/s3-logs.txt | awk '{print $4}' | xargs s3cmd del
        @@ -171,7 +171,7 @@ java.io.FileNotFoundException: /home/cgspace.cgiar.org/assetstore/64/29/06/64290
         
        • A better way to move metadata on this scale is via SQL, for example dc.type.output → dc.type (their IDs in the metadatafieldregistry are 66 and 109, respectively):
        -
        dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
        +
        dspacetest=# update metadatavalue set metadata_field_id=109 where metadata_field_id=66;
         UPDATE 40852
         
        -
        $ ./migrate-fields.sh
        +
        $ ./migrate-fields.sh
         UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
         UPDATE 40883
         UPDATE metadatavalue SET metadata_field_id=202 WHERE metadata_field_id=72
        @@ -199,7 +199,7 @@ UPDATE 51258
         
      • Looking at the DOI issue reported by Leroy from CIAT a few weeks ago
      • It seems the dx.doi.org URLs are much more proper in our repository!
      -
      dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
      +
      dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and text_value like 'http://dx.doi.org%';
        count
       -------
         5638
      @@ -221,7 +221,7 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
       
      • Looking at quality of WLE data (cg.subject.iwmi) in SQL:
      -
      dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
      +
      dspacetest=# select text_value, count(*) from metadatavalue where metadata_field_id=217 group by text_value order by count(*) desc;
       
      • Listings and Reports is still not returning reliable data for dc.type
      • I think we need to ask Atmire, as their documentation isn’t too clear on the format of the filter configs
      • @@ -231,11 +231,11 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=74 and t
      • I decided to keep the set of subjects that had FMD and RANGELANDS added, as it appears those were requested additions and it is probably the newer list
      • I found 226 blank metadatavalues:
      -
      dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
      +
      dspacetest# select * from metadatavalue where resource_type_id=2 and text_value='';
       
      • I think we should delete them and do a full re-index:
      -
      dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
      +
      dspacetest=# delete from metadatavalue where resource_type_id=2 and text_value='';
       DELETE 226
       
      • I deleted them on CGSpace but I’ll wait to do the re-index as we’re going to be doing one in a few days for the metadata changes anyways
      • @@ -281,7 +281,7 @@ DELETE 226
      • Test metadata migration on local instance again:
      -
      $ ./migrate-fields.sh
      +
      $ ./migrate-fields.sh
       UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
       UPDATE 40885
       UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
      @@ -298,7 +298,7 @@ $ JAVA_OPTS="-Xms512m -Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dsp
       
      • CGSpace was down but I’m not sure why, this was in catalina.out:
      -
      Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
      +
      Apr 18, 2016 7:32:26 PM com.sun.jersey.spi.container.ContainerResponse logException
       SEVERE: Mapped exception to response: 500 (Internal Server Error)
       javax.ws.rs.WebApplicationException
               at org.dspace.rest.Resource.processFinally(Resource.java:163)
      @@ -328,7 +328,7 @@ javax.ws.rs.WebApplicationException
       
      • Get handles for items that are using a given metadata field, ie dc.Species.animal (105):
      -
      # select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
      +
      # select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select resource_id from metadatavalue where resource_type_id=2 and metadata_field_id=105);
          handle
       -------------
        10568/10298
      @@ -338,26 +338,26 @@ javax.ws.rs.WebApplicationException
       
      • Delete metadata values for dc.GRP and dc.icsubject.icrafsubject:
      -
      # delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
      +
      # delete from metadatavalue where resource_type_id=2 and metadata_field_id=96;
       # delete from metadatavalue where resource_type_id=2 and metadata_field_id=83;
       
      • They are old ICRAF fields and we haven’t used them since 2011 or so
      • Also delete them from the metadata registry
      • CGSpace went down again, dspace.log had this:
      -
      2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
      +
      2016-04-19 15:02:17,025 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
       org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
       
      • I restarted Tomcat and PostgreSQL and now it’s back up
      • I bet this is the same crash as yesterday, but I only saw the errors in catalina.out
      • Looks to be related to this, from dspace.log:
      -
      2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
      +
      2016-04-19 15:16:34,670 ERROR org.dspace.rest.Resource @ Something get wrong. Aborting context in finally statement.
       
      • We have 18,000 of these errors right now…
      • Delete a few more old metadata values: dc.Species.animal, dc.type.journal, and dc.publicationcategory:
      -
      # delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
      +
      # delete from metadatavalue where resource_type_id=2 and metadata_field_id=105;
       # delete from metadatavalue where resource_type_id=2 and metadata_field_id=85;
       # delete from metadatavalue where resource_type_id=2 and metadata_field_id=95;
       
        @@ -369,7 +369,7 @@ org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error
      • Migrate fields and re-deploy CGSpace with the new subject and type fields, run all system updates, and reboot the server
      • Field migration went well:
      -
      $ ./migrate-fields.sh
      +
      $ ./migrate-fields.sh
       UPDATE metadatavalue SET metadata_field_id=109 WHERE metadata_field_id=66
       UPDATE 40909
       UPDATE metadatavalue SET metadata_field_id=203 WHERE metadata_field_id=76
      @@ -387,7 +387,7 @@ UPDATE 46075
       
    • Basically, this gives us the ability to use the latest upstream stable 9.3.x release (currently 9.3.12)
    • Looking into the REST API errors again, it looks like these started appearing a few days ago in the tens of thousands:
    -
    $ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
    +
    $ grep -c "Aborting context in finally statement" dspace.log.2016-04-20
     21252
     
    • I found a recent discussion on the DSpace mailing list and I’ve asked for advice there
    • @@ -423,7 +423,7 @@ UPDATE 46075
    • Looks like the last one was “down” from about four hours ago
    • I think there must be something with this REST stuff:
    -
    # grep -c "Aborting context in finally statement" dspace.log.2016-04-*
    +
    # grep -c "Aborting context in finally statement" dspace.log.2016-04-*
     dspace.log.2016-04-01:0
     dspace.log.2016-04-02:0
     dspace.log.2016-04-03:0
    @@ -468,7 +468,7 @@ dspace.log.2016-04-27:7271
     
    • Logs for today and yesterday have zero references to this REST error, so I’m going to open back up the REST API but log all requests
    -
    location /rest {
    +
    location /rest {
     	access_log /var/log/nginx/rest.log;
     	proxy_pass http://127.0.0.1:8443;
     }
    diff --git a/docs/2016-05/index.html b/docs/2016-05/index.html
    index 12b0bfeee..071bef4f0 100644
    --- a/docs/2016-05/index.html
    +++ b/docs/2016-05/index.html
    @@ -34,7 +34,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
     # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
     "/>
    -
    +
     
     
         
    @@ -126,13 +126,13 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
     
  • I have blocked access to the API now
  • There are 3,000 IPs accessing the REST API in a 24-hour period!
  • -
    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
    +
    # awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
     
    • The two most frequent requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29
    • 100% of the requests coming from Ethiopia are like this and result in an HTTP 500:
    -
    GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
    +
    GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
     
    • For now I’ll block just the Ethiopian IP
    • The owner of that application has said that the NaN (not a number) is an error in his code and he’ll fix it
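Blocking a single IP is just a deny directive in the relevant nginx location block; a sketch, assuming the /rest location proxies to Tomcat as shown elsewhere in these notes:

location /rest {
    deny 213.55.99.121;
    proxy_pass http://127.0.0.1:8443;
}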
    • @@ -152,7 +152,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
    • I will re-generate the Discovery indexes after re-deploying
    • Testing renew-letsencrypt.sh script for nginx
    -
    #!/usr/bin/env bash
    +
    #!/usr/bin/env bash
     
     readonly SERVICE_BIN=/usr/sbin/service
     readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto
    @@ -214,7 +214,7 @@ fi
     

    After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:

    -
    [ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]
    +
    [ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]
     
    • I’ve sent them a question about it
    • A user mentioned having problems with uploading a 33 MB PDF
    • @@ -240,7 +240,7 @@ fi
    • Found ~200 messed up CIAT values in dc.publisher:
    -
    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "%  %";
    +
    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "%  %";
     

    2016-05-13

    • More theorizing about CGcore
    • @@ -259,7 +259,7 @@ fi
    • They have thumbnails on Flickr and elsewhere
    • In OpenRefine I created a new filename column based on the thumbnail column with the following GREL:
    -
    if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
    +
    if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
     
    • Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL
    • So for the hqdefault.jpg ones I just take the UUID (-2) and use it as the filename
    • @@ -269,7 +269,7 @@ fi
      • More quality control on filename field of CCAFS records to make processing in shell and SAFBuilder more reliable:
      -
      value.replace('_','').replace('-','')
      +
      value.replace('_','').replace('-','')
       
      • We need to hold off on moving dc.Species to cg.species because it is only used for plants, and might be better to move it to something like cg.species.plant
      • And dc.identifier.fund is MOSTLY used for CPWF project identifier but has some other sponsorship things @@ -281,17 +281,17 @@ fi
    -
    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
    +
    # select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
     

    2016-05-20

    • More work on CCAFS Video and Images records
    • For SAFBuilder we need to modify filename column to have the thumbnail bundle:
    -
    value + "__bundle:THUMBNAIL"
    +
    value + "__bundle:THUMBNAIL"
     
    • Also, I fixed some weird characters using OpenRefine’s transform with the following GREL:
    -
    value.replace(/\u0081/,'')
    +
    value.replace(/\u0081/,'')
     
    • Write shell script to resize thumbnails with height larger than 400: https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256
    • Upload 707 CCAFS records to DSpace Test
    • @@ -309,12 +309,12 @@ fi
      • Export CCAFS video and image records from DSpace Test using the migrate option (-m):
      -
      $ mkdir ~/ccafs-images
      +
      $ mkdir ~/ccafs-images
       $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
       
      • And then import to CGSpace:
      -
      $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
      +
      $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
       
      • But now we have double authors for “CGIAR Research Program on Climate Change, Agriculture and Food Security” in the authority
      • I’m trying to do a Discovery index before messing with the authority index
      • @@ -322,19 +322,19 @@ $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~
      • Run system updates on DSpace Test, re-deploy code, and reboot the server
      • Clean up and import ~200 CTA records to CGSpace via CSV like:
      -
      $ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
      +
      $ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
       $ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &> ~/CTA-May30/CTA-42229.log
       
      • Discovery indexing took a few hours for some reason, and after that I started the index-authority script
      -
      $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
      +
      $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
       

      2016-05-31

      • The index-authority script ran over night and was finished in the morning
      • Hopefully this was because we haven’t been running it regularly and it will speed up next time
      • I am running it again with a timer to see:
      -
      $ time /home/cgspace.cgiar.org/bin/dspace index-authority
      +
      $ time /home/cgspace.cgiar.org/bin/dspace index-authority
       Retrieving all data
       Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
       Cleaning the old index
      diff --git a/docs/2016-06/index.html b/docs/2016-06/index.html
      index 4fee2b001..c965e7389 100644
      --- a/docs/2016-06/index.html
      +++ b/docs/2016-06/index.html
      @@ -34,7 +34,7 @@ This is their publications set: http://ebrary.ifpri.org/oai/oai.php?verb=ListRec
       You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
       Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
       "/>
      -
      +
       
       
           
      @@ -129,7 +129,7 @@ Working on second phase of metadata migration, looks like this will work for mov
       
    • You can see the others by using the OAI ListSets verb: http://ebrary.ifpri.org/oai/oai.php?verb=ListSets
    • Working on second phase of metadata migration, looks like this will work for moving CPWF-specific data in dc.identifier.fund to cg.identifier.cpwfproject and then the rest to dc.description.sponsorship
    -
    dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
    +
    dspacetest=# update metadatavalue set metadata_field_id=130 where metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
     UPDATE 497
     dspacetest=# update metadatavalue set metadata_field_id=29 where metadata_field_id=75;
     UPDATE 14
    @@ -141,7 +141,7 @@ UPDATE 14
     
  • Testing the configuration and theme changes for the upcoming metadata migration and I found some issues with cg.coverage.admin-unit
  • Seems that the Browse configuration in dspace.cfg can’t handle the ‘-’ in the field name:
  • -
    webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
    +
    webui.browse.index.12 = subregion:metadata:cg.coverage.admin-unit:text
     
    • But actually, I think since DSpace 4 or 5 (we are 5.1) the Browse indexes come from Discovery (defined in discovery.xml) so this is really just a parsing error
    • I’ve sent a message to the DSpace mailing list to ask about the Browse index definition
    • @@ -154,13 +154,13 @@ UPDATE 14
    • Investigating the CCAFS authority issue, I exported the metadata for the Videos collection
    • The top two authors are:
    -
    CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
    +
    CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::500
     CGIAR Research Program on Climate Change, Agriculture and Food Security::acd00765-02f1-4b5b-92fa-bfa3877229ce::600
     
    • So the only difference is the “confidence”
    • Ok, well THAT is interesting:
    -
    dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspacetest=# select text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence
     ------------+--------------------------------------+------------
      Orth, A.   | ab606e3a-2b04-4c7d-9423-14beccf54257 |         -1
    @@ -180,7 +180,7 @@ CGIAR Research Program on Climate Change, Agriculture and Food Security::acd0076
     
    • And now an actually relevant example:
    -
    dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
    +
    dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security' and confidence = 500;
      count
     -------
        707
    @@ -194,14 +194,14 @@ dspacetest=# select count(*) from metadatavalue where metadata_field_id=3 and te
     
    • Trying something experimental:
    -
    dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
    +
    dspacetest=# update metadatavalue set confidence=500 where metadata_field_id=3 and text_value like 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
     UPDATE 960
     
    • And then re-indexing authority and Discovery…?
    • After Discovery reindex the CCAFS authors are all together in the Authors sidebar facet
    • The docs for the ORCiD and Authority stuff for DSpace 5 mention changing the browse indexes to use the Authority as well:
    -
    webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
    +
    webui.browse.index.2 = author:metadataAuthority:dc.contributor.author:authority
     
    • That would only be for the “Browse by” function… so we’ll have to see what effect that has later
    @@ -215,7 +215,7 @@ UPDATE 960
    • Figured out how to export a list of the unique values from a metadata field ordered by count:
    -
    dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
    +
    dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=29 group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
     
    -
    $ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
    +
    $ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
     
    • Write to Atmire about the use of atmire.orcid.id to see if we can change it
    • Seems to be a virtual field that is queried from the authority cache… hmm
    • @@ -263,7 +263,7 @@ UPDATE 960
    • It looks like the values are documented in Choices.java
    • Experiment with setting all 960 CCAFS author values to be 500:
    -
    dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
    +
    dspacetest=# SELECT authority, confidence FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
     
     dspacetest=# UPDATE metadatavalue set confidence = 500 where resource_type_id=2 AND metadata_field_id=3 AND text_value = 'CGIAR Research Program on Climate Change, Agriculture and Food Security';
     UPDATE 960
    @@ -320,7 +320,7 @@ UPDATE 960
     
    • CGSpace’s HTTPS certificate expired last night and I didn’t notice, had to renew:
    -
    # /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
    +
    # /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start"
     
    • I really need to fix that cron job…
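A sketch of the kind of crontab entry that would automate the renewal command shown above (the weekly schedule and log path are arbitrary):

0 3 * * 0 /opt/letsencrypt/letsencrypt-auto renew --standalone --pre-hook "/usr/bin/service nginx stop" --post-hook "/usr/bin/service nginx start" > /var/log/letsencrypt-renew.log 2>&1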
    @@ -328,7 +328,7 @@ UPDATE 960
    • Run the replacements/deletes for dc.description.sponsorship (investors) on CGSpace:
    -
    $ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
    +
    $ ./fix-metadata-values.py -i investors-not-blank-not-delete-85.csv -f dc.description.sponsorship -t 'correct investor' -m 29 -d cgspace -p 'fuuu' -u cgspace
     $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.sponsorship -m 29 -d cgspace -p 'fuuu' -u cgspace
     
    • The scripts for this are here: @@ -346,7 +346,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
    • There are still ~97 fields that weren’t indicated to do anything
    • After the above deletions and replacements I regenerated a CSV and sent it to Peter et al to have a look
    -
    dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
    +
    dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=126 group by text_value order by count desc) to /tmp/contributors-june28.csv with csv;
     
    • Re-evaluate dc.contributor.corporate and it seems we will move it to dc.contributor.author as this is more in line with how editors are actually using it
    @@ -354,7 +354,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
    • Test run of migrate-fields.sh with the following re-mappings:
    -
    72  55  #dc.source
    +
    72  55  #dc.source
     86  230 #cg.contributor.crp
     91  211 #cg.contributor.affiliation
     94  212 #cg.species
    @@ -367,7 +367,7 @@ $ ./delete-metadata-values.py -i investors-delete-82.csv -f dc.description.spons
     
    • Run all cleanups and deletions of dc.contributor.corporate on CGSpace:
    -
    $ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i Corporate-Authors-Fix-121.csv -f dc.contributor.corporate -t 'Correct style' -m 126 -d cgspace -u cgspace -p 'fuuu'
     $ ./fix-metadata-values.py -i Corporate-Authors-Fix-PB.csv -f dc.contributor.corporate -t 'should be' -m 126 -d cgspace -u cgspace -p 'fuuu'
     $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-Delete-13.csv -m 126 -u cgspace -d cgspace -p 'fuuu'
     
      @@ -383,11 +383,11 @@ $ ./delete-metadata-values.py -f dc.contributor.corporate -i Corporate-Authors-D
      • Wow, there are 95 authors in the database who have ‘,’ at the end of their name:
      -
      # select text_value from  metadatavalue where metadata_field_id=3 and text_value like '%,';
      +
      # select text_value from  metadatavalue where metadata_field_id=3 and text_value like '%,';
       
      • We need to use something like this to fix them, need to write a proper regex later:
      -
      # update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
      +
      # update metadatavalue set text_value = regexp_replace(text_value, '(Poole, J),', '\1') where metadata_field_id=3 and text_value = 'Poole, J,';
       
      diff --git a/docs/2016-07/index.html b/docs/2016-07/index.html index a4c816296..d37202bde 100644 --- a/docs/2016-07/index.html +++ b/docs/2016-07/index.html @@ -44,7 +44,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and In this case the select query was showing 95 results before the update "/> - + @@ -135,7 +135,7 @@ In this case the select query was showing 95 results before the update
    • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
    • I think this query should find and replace all authors that have “,” at the end of their names:
    dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
      text_value
     
  • We really only need statistics and authority but meh
  • Fix metadata for species on DSpace Test:
    $ ./fix-metadata-values.py -i /tmp/Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 94 -d dspacetest -u dspacetest -p 'fuuu'
     
    • Will run later on CGSpace
    • A user is still having problems with Sherpa/Romeo causing crashes during the submission process when the journal is “ungraded”
      • Delete 23 blank metadata values from CGSpace:
      cgspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
       DELETE 23
       
• Complete phase three of metadata migration, for the following fields:
      • Also, run fixes and deletes for species and author affiliations (over 1000 corrections!)
      $ ./fix-metadata-values.py -i Species-Peter-Fix.csv -f dc.Species -t CORRECT -m 212 -d dspace -u dspace -p 'fuuu'
       $ ./fix-metadata-values.py -i Affiliations-Fix-1045-Peter-Abenet.csv -f dc.contributor.affiliation -t Correct -m 211 -d dspace -u dspace -p 'fuuu'
       $ ./delete-metadata-values.py -f dc.contributor.affiliation -i Affiliations-Delete-Peter-Abenet.csv -m 211 -u dspace -d dspace -p 'fuuu'
       
        • Doing some author cleanups from Peter and Abenet:
        $ ./fix-metadata-values.py -i /tmp/Authors-Fix-205-UTF8.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
         $ ./delete-metadata-values.py -f dc.contributor.author -i /tmp/Authors-Delete-UTF8.csv -m 3 -u dspacetest -d dspacetest -p fuuu
         

        2016-07-13

        • Add species and breed to the XMLUI item display
        • CGSpace crashed late at night and the DSpace logs were showing:
        2016-07-18 20:26:30,941 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
         org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
         ...
         
        • I suspect it’s someone hitting REST too much:
        # awk '{print $1}' /var/log/nginx/rest.log  | sort -n | uniq -c | sort -h | tail -n 3
             710 66.249.78.38
            1781 181.118.144.29
           24904 70.32.99.142
         
        • I just blocked access to /rest for that last IP for now:
             # log rest requests
              location /rest {
                  access_log /var/log/nginx/rest.log;
                  proxy_pass http://127.0.0.1:8443;
         
      • We might need to use index.authority.ignore-prefered=true to tell the Discovery index to prefer the variation that exists in the metadatavalue rather than what it finds in the authority cache.
      • Trying these on DSpace Test after a discussion by Daniel Scharon on the dspace-tech mailing list:
      index.authority.ignore-prefered.dc.contributor.author=true
       index.authority.ignore-variants.dc.contributor.author=false
       
      • After reindexing I don’t see any change in Discovery’s display of authors, and still have entries like:
      Grace, D. (464)
       Grace, D. (62)
       
      • I asked for clarification of the following options on the DSpace mailing list:
      index.authority.ignore
       index.authority.ignore-prefered
       index.authority.ignore-variants
       
      • In the mean time, I will try these on DSpace Test (plus a reindex):
      index.authority.ignore=true
       index.authority.ignore-prefered=true
       index.authority.ignore-variants=true
       
      • It was misconfigured and disabled, but already working for some reason sigh
      • … no luck. Trying with just:
      index.authority.ignore=true
       
      • After re-indexing and clearing the XMLUI cache nothing has changed
      • Trying a few more settings (plus reindex) for Discovery on DSpace Test:
      index.authority.ignore-prefered.dc.contributor.author=true
       index.authority.ignore-variants=true
       
      • Run all OS updates and reboot DSpace Test server
        • The DSpace source code mentions the configuration key discovery.index.authority.ignore-prefered.* (with prefix of discovery, despite the docs saying otherwise), so I’m trying the following on DSpace Test:
        discovery.index.authority.ignore-prefered.dc.contributor.author=true
         discovery.index.authority.ignore-variants=true
         
        • Still no change!
diff --git a/docs/2016-08/index.html b/docs/2016-08/index.html
        • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
        • Start working on DSpace 5.1 → 5.5 port:
        $ git checkout -b 55new 5_x-prod
         $ git reset --hard ilri/5_x-prod
         $ git rebase -i dspace-5.5
         
        • Fix item display incorrectly displaying Species when Breeds were present (#260)
        • Experiment with fixing more authors, like Delia Grace:
        dspacetest=# update metadatavalue set authority='0b4fcbc1-d930-4319-9b4d-ea1553cca70b', confidence=600 where metadata_field_id=3 and text_value='Grace, D.';
         

        2016-08-06

        • Finally figured out how to remove “View/Open” and “Bitstreams” from the item view
        • Install latest Oracle Java 8 JDK
        • Create setenv.sh in Tomcat 8 libexec/bin directory:
        CATALINA_OPTS="-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -Dfile.encoding=UTF-8"
         CATALINA_OPTS="$CATALINA_OPTS -Djava.library.path=/opt/brew/Cellar/tomcat-native/1.2.8/lib"
         
         JRE_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
         
      • Edit Tomcat 8 server.xml to add regular HTTP listener for solr
      • Symlink webapps:
      $ rm -rf /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
       $ ln -sv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/ROOT
       $ ln -sv ~/dspace/webapps/oai /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/oai
       $ ln -sv ~/dspace/webapps/jspui /opt/brew/Cellar/tomcat/8.5.4/libexec/webapps/jspui
       
    • Fix “CONGO,DR” country name in input-forms.xml (#264)
    • Also need to fix existing records using the incorrect form in the database:
    dspace=# update metadatavalue set text_value='CONGO, DR' where resource_type_id=2 and metadata_field_id=228 and text_value='CONGO,DR';
     
    • I asked a question on the DSpace mailing list about updating “preferred” forms of author names from ORCID
    • Database migrations are fine on DSpace 5.1:
    $ ~/dspace/bin/dspace database info
     
     Database URL: jdbc:postgresql://localhost:5432/dspacetest
     Database Schema: public
     
  • Talk to Atmire about the DSpace 5.5 issue, and it seems to be caused by a bug in FlywayDB
  • They said I should delete the Atmire migrations
    dspacetest=# delete from schema_version where description =  'Atmire CUA 4 migration' and version='5.1.2015.12.03.2';
     dspacetest=# delete from schema_version where description =  'Atmire MQM migration' and version='5.1.2015.12.03.3';
     
• After that DSpace starts up but XMLUI now has unrelated issues that I need to solve!
    org.apache.avalon.framework.configuration.ConfigurationException: Type 'ThemeResourceReader' does not exist for 'map:read' at jndi:/localhost/themes/0_CGIAR/sitemap.xmap:136:77
     context:/jndi:/localhost/themes/0_CGIAR/sitemap.xmap - 136:77
     
    • Looks like we’re missing some stuff in the XMLUI module’s sitemap.xmap, as well as in each of our XMLUI themes
    • Clean up and import 48 CCAFS records into DSpace Test
    • SQL to get all journal titles from dc.source (55), since it’s apparently used for internal DSpace filename shit, but we moved all our journal titles there a few months ago:
    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id=55 and text_value !~ '.*(\.pdf|\.png|\.PDF|\.Pdf|\.JPEG|\.jpg|\.JPG|\.jpeg|\.xls|\.rtf|\.docx?|\.potx|\.dotx|\.eqa|\.tiff|\.mp4|\.mp3|\.gif|\.zip|\.txt|\.pptx|\.indd|\.PNG|\.bmp|\.exe|org\.dspace\.app\.mediafilter).*';
     

    2016-08-25

    • Atmire suggested adding a missing bean to dspace/config/spring/api/atmire-cua.xml but it doesn’t help:
    ...
     Error creating bean with name 'MetadataStorageInfoService'
     ...
     
    • Atmire sent an updated version of dspace/config/spring/api/atmire-cua.xml and now XMLUI starts but gives a null pointer exception:
    Java stacktrace: java.lang.NullPointerException
         at org.dspace.app.xmlui.aspect.statistics.Navigation.addOptions(Navigation.java:129)
         at org.dspace.app.xmlui.wing.AbstractWingTransformer.startElement(AbstractWingTransformer.java:228)
         at sun.reflect.GeneratedMethodAccessor126.invoke(Unknown Source)
     
    • Import the 47 CCAFS records to CGSpace, creating the SimpleArchiveFormat bundles and importing like:
    $ ./safbuilder.sh -c /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/3546.csv
     $ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/3546 -s /tmp/Thumbnails\ to\ Upload\ to\ CGSpace/SimpleArchiveFormat -m 3546.map
     
    • Finally got DSpace 5.5 working with the Atmire modules after a few rounds of back and forth with Atmire devs
    • CGSpace had issues tonight, not entirely crashing, but becoming unresponsive
    • The dspace log had this:
    2016-08-26 20:48:05,040 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -                                                               org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
     
    • Related to /rest no doubt
diff --git a/docs/2016-09/index.html b/docs/2016-09/index.html
  • We had been using DC=ILRI to determine whether a user was ILRI or not
  • It looks like we might be able to use OUs now, instead of DCs:
    $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
     
    • User who has been migrated to the root vs user still in the hierarchical structure:
    distinguishedName: CN=Last\, First (ILRI),OU=ILRI Kenya Employees,OU=ILRI Kenya,OU=ILRIHUB,DC=CGIARAD,DC=ORG
     distinguishedName: CN=Last\, First (ILRI),OU=ILRI Ethiopia Employees,OU=ILRI Ethiopia,DC=ILRI,DC=CGIARAD,DC=ORG
     
    • Changing the DSpace LDAP config to use OU=ILRIHUB seems to work:
      • Notes for local PostgreSQL database recreation from production snapshot:
      $ dropdb dspacetest
       $ createdb -O dspacetest --encoding=UNICODE dspacetest
       $ psql dspacetest -c 'alter user dspacetest createuser;'
       $ pg_restore -O -U dspacetest -d dspacetest ~/Downloads/cgspace_2016-09-01.backup
$ vacuumdb dspacetest
       
      • Some names that I thought I fixed in July seem not to be:
      dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
             text_value       |              authority               | confidence
       -----------------------+--------------------------------------+------------
        Poole, Elizabeth Jane | b6efa27f-8829-4b92-80fe-bc63e03e3ccb |        600
       
      • At least a few of these actually have the correct ORCID, but I will unify the authority to be c3a22456-8d6a-41f9-bba0-de51ef564d45
      dspacetest=# update metadatavalue set authority='c3a22456-8d6a-41f9-bba0-de51ef564d45', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Poole, %';
       UPDATE 69
       
      • And for Peter Ballantyne:
      dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
           text_value     |              authority               | confidence
       -------------------+--------------------------------------+------------
        Ballantyne, Peter | 2dcbcc7b-47b0-4fd7-bef9-39d554494081 |        600
       
      • Again, a few have the correct ORCID, but there should only be one authority…
      dspacetest=# update metadatavalue set authority='4f04ca06-9a76-4206-bd9c-917ca75d278e', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Ballantyne, %';
       UPDATE 58
       
      • And for me:
      dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Orth, A%';
        text_value |              authority               | confidence
       ------------+--------------------------------------+------------
        Orth, Alan | 4884def0-4d7e-4256-9dd4-018cd60a5871 |        600
       
      • And for CCAFS author Bruce Campbell that I had discussed with CCAFS earlier this week:
      dspacetest=# update metadatavalue set authority='0e414b4c-4671-4a23-b570-6077aca647d8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
       UPDATE 166
       dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Campbell, B%';
              text_value       |              authority               | confidence
       
      • After one week of logging TLS connections on CGSpace:
      # zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
       217
       # zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
       1164376
       
    • So this represents 0.02% of 1.16M connections over a one-week period
    • Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:
    value + "__description:" + cells["dc.type"].value
     
    • This gives you, for example: Mainstreaming gender in agricultural R&D.pdf__description:Brief
  • If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8
  • We should definitely clean filenames so they don’t use characters that are tricky to process in CSV and shell scripts, like: ,, ', and "
    value.replace("'","").replace(",","").replace('"','')
     
• I need to write a Python script to match that for renaming files in the file system (see the sketch below)
    • When importing SAF bundles it seems you can specify the target collection on the command line using -c 10568/4003 or in the collections file inside each item in the bundle
    • Import CIAT Gender Network records to CGSpace, first creating the SAF bundles as my user, then importing as the tomcat7 user, and deleting the bundle, for each collection’s items:
    $ ./safbuilder.sh -c /home/aorth/ciat-gender-2016-09-06/66601.csv
     $ JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" /home/cgspace.cgiar.org/bin/dspace import -a -e aorth@mjanja.ch -c 10568/66601 -s /home/aorth/ciat-gender-2016-09-06/SimpleArchiveFormat -m 66601.map
     $ rm -rf ~/ciat-gender-2016-09-06/SimpleArchiveFormat/
     

    2016-09-07

  • See: https://www.postgresql.org/docs/9.3/static/routine-vacuuming.html
  • CGSpace went down and the error seems to be the same as always (lately):
    2016-09-07 11:39:23,162 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
     ...
     
      • CGSpace crashed twice today, errors from catalina.out:
      org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
               at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
       
      • I enabled logging of requests to /rest again
        • CGSpace crashed again, errors from catalina.out:
        org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
                 at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
         
        • I restarted Tomcat and it was ok again
        • CGSpace crashed a few hours later, errors from catalina.out:
        Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
                 at java.lang.StringCoding.decode(StringCoding.java:215)
         
        • We haven’t seen that in quite a while…
        • Indeed, in a month of logs it only occurs 15 times:
        # grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
         15
         
        • I also see a bunch of errors from dspace.log:
        2016-09-14 12:23:07,981 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
         org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
         
        • Looking at REST requests, it seems there is one IP hitting us nonstop:
        # awk '{print $1}' /var/log/nginx/rest.log  | sort -n | uniq -c | sort -h | tail -n 3
             820 50.87.54.15
           12872 70.32.99.142
           25744 70.32.83.92
         
      • I think the stability issues are definitely from REST
      • Crashed AGAIN, errors from dspace.log:
      2016-09-14 14:31:43,069 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
       org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
       
      • And more heap space errors:
      # grep -rsI "OutOfMemoryError" /var/log/tomcat7/catalina.* | wc -l
       19
       
      • There are no more rest requests since the last crash, so maybe there are other things causing this.
      • Hmm, I noticed a shitload of IPs from 180.76.0.0/16 are connecting to both CGSpace and DSpace Test (58 unique IPs concurrently!)
• They seem to be coming from Baidu (a quick reverse DNS check is sketched below), and so far during today alone account for 1/6 of every connection:
      # grep -c ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
       29084
       # grep -c ip_addr=180.76.15 /home/cgspace.cgiar.org/log/dspace.log.2016-09-14
       5192
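
• To double check that these are really Baidu, one could reverse-resolve a few of the addresses; their spider IPs normally have PTR records under crawl.baidu.com (the specific IPs below are hypothetical examples from the 180.76.15.x range seen in the logs):

```python
#!/usr/bin/env python3
# Hypothetical sketch: reverse-resolve a few 180.76.0.0/16 addresses to see
# whether they carry Baidu crawler PTR records.
import socket

ips = ["180.76.15.5", "180.76.15.32", "180.76.15.143"]  # hypothetical examples

for ip in ips:
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        hostname = "(no PTR record)"
    print(f"{ip} -> {hostname}")
```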
       
    • From the activity control panel I can see 58 unique IPs hitting the site concurrently, which has GOT to hurt our stability
    • A list of all 2000 unique IPs from CGSpace logs today:
    # grep ip_addr= /home/cgspace.cgiar.org/log/dspace.log.2016-09-11 | awk -F: '{print $5}' | sort -n | uniq -c | sort -h | tail -n 100
     
    • Looking at the top 20 IPs or so, most are Yahoo, MSN, Google, Baidu, TurnitIn (iParadigm), etc… do we have any real users?
    • Generate a list of all author affiliations for Peter Ballantyne to go through, make corrections, and create a lookup list from:
    dspacetest=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     
    • Looking into the Catalina logs again around the time of the first crash, I see:
    Wed Sep 14 09:47:27 UTC 2016 | Query:id: 78581 AND type:2
     Wed Sep 14 09:47:28 UTC 2016 | Updating : 6/6 docs.
     Commit
     Commit done
     
  • And after that I see a bunch of “pool error Timeout waiting for idle object”
  • Later, near the time of the next crash I see:
    dn:CN=Haman\, Magdalena  (CIAT-CCAFS),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
     Wed Sep 14 11:29:55 UTC 2016 | Query:id: 79078 AND type:2
     Wed Sep 14 11:30:20 UTC 2016 | Updating : 6/6 docs.
     Commit
     
    • Then 20 minutes later another outOfMemoryError:
    Exception in thread "http-bio-127.0.0.1-8081-exec-25" java.lang.OutOfMemoryError: Java heap space
             at java.lang.StringCoding.decode(StringCoding.java:215)
     
    • Perhaps these particular issues are memory issues, the munin graphs definitely show some weird purging/allocating behavior starting this week
    • Oh great, the configuration on the actual server is different than in configuration management!
    • Seems we added a bunch of settings to the /etc/default/tomcat7 in December, 2015 and never updated our ansible repository:
    JAVA_OPTS="-Djava.awt.headless=true -Xms3584m -Xmx3584m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -XX:-UseGCOverheadLimit -XX:MaxGCPauseMillis=250 -XX:GCTimeRatio=9 -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts"
     
    • So I’m going to bump the heap +512m and remove all the other experimental shit (and update ansible!)
    • Increased JVM heap to 4096m on CGSpace (linode01)
      • CGSpace crashed again, and there are TONS of heap space errors but the datestamps aren’t on those lines so I’m not sure if they were yesterday:
      dn:CN=Orentlicher\, Natalie (CIAT),OU=Standard,OU=Users,OU=HQ,OU=CIATHUB,dc=cgiarad,dc=org
       Thu Sep 15 18:45:25 UTC 2016 | Query:id: 55785 AND type:2
       Thu Sep 15 18:45:26 UTC 2016 | Updating : 100/218 docs.
       Thu Sep 15 18:45:26 UTC 2016 | Updating : 200/218 docs.
       
• I bumped the heap space from 4096m to 5120m to see if this is really about heap space or not.
    • Looking into some of these errors that I’ve seen this week but haven’t noticed before:
    # zcat -f -- /var/log/tomcat7/catalina.* | grep -c 'Failed to generate the schema for the JAX-B elements'
     113
     
    • I’ve sent a message to Atmire about the Solr error to see if it’s related to their batch update module
      • Work on cleanups for author affiliations after Peter sent me his list of corrections/deletions:
      $ ./fix-metadata-values.py -i affiliations_pb-322-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p fuuu
       $ ./delete-metadata-values.py -f cg.contributor.affiliation -i affiliations_pb-2-deletions.csv -m 211 -u dspace -d dspace -p fuuu
       
      • After that we need to take the top ~300 and make a controlled vocabulary for it
      • Turns out the Solr search logic switched from OR to AND in DSpace 6.0 and the change is easy to backport: https://jira.duraspace.org/browse/DS-2809
      • We just need to set this in dspace/solr/search/conf/schema.xml:
      <solrQueryParser defaultOperator="AND"/>
       
• It actually works really well, and search results return far fewer hits now (before, after):
      • Found a way to improve the configuration of Atmire’s Content and Usage Analysis (CUA) module for date fields
      -content.analysis.dataset.option.8=metadata:dateAccessioned:discovery
       +content.analysis.dataset.option.8=metadata:dc.date.accessioned:date(month)
       
      • This allows the module to treat the field as a date rather than a text string, so we can interrogate it more intelligently
      • 45 minutes of downtime!
      • Start processing the fixes to dc.description.sponsorship from Peter Ballantyne:
      $ ./fix-metadata-values.py -i sponsors-fix-23.csv -f dc.description.sponsorship -t correct -m 29 -d dspace -u dspace -p fuuu
       $ ./delete-metadata-values.py -i sponsors-delete-8.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
       
      • I need to run these and the others from a few days ago on CGSpace the next time we run updates
      • Not sure if it’s something like we already have too many filters there (30), or the filter name is reserved, etc…
      • Generate a list of ILRI subjects for Peter and Abenet to look through/fix:
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=203 group by text_value order by count desc) to /tmp/ilrisubjects.csv with csv;
       
      • Regenerate Discovery indexes a few times after playing with discovery.xml index definitions (syntax, parameters, etc).
      • Merge changes to boolean logic in Solr search (#274)
      • Run all sponsorship and affiliation fixes on CGSpace, deploy latest 5_x-prod branch, and re-index Discovery on CGSpace
      • Tested OCSP stapling on DSpace Test’s nginx and it works:
      $ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
       ...
       OCSP response:
       ======================================
       
    • Discuss fixing some ORCIDs for CCAFS author Sonja Vermeulen with Magdalena Haman
    • This author has a few variations:
    dspacetest=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeu
     len, S%';
     
    • And it looks like fe4b719f-6cc4-4d65-8504-7a83130b9f83 is the authority with the correct ORCID linked
    dspacetest=# update metadatavalue set authority='fe4b719f-6cc4-4d65-8504-7a83130b9f83w', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
     UPDATE 101
     
    • Hmm, now her name is missing from the authors facet and only shows the authority ID
    • On a clean snapshot of the database I see the correct authority should be f01f7b7b-be3f-4df7-a61d-b73c067de88d, not fe4b719f-6cc4-4d65-8504-7a83130b9f83
    • Updating her authorities again and reindexing:
    dspacetest=# update metadatavalue set authority='f01f7b7b-be3f-4df7-a61d-b73c067de88d', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
     UPDATE 101
     
    • Use GitHub icon from Font Awesome instead of a PNG to save one extra network request
    • Minor fix to a string in Atmire’s CUA module (#280)
    • This seems to be what I’ll need to do for Sonja Vermeulen (but with 2b4166b7-6e4d-4f66-9d8b-ddfbec9a6ae0 instead on the live site):
    dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen, S%';
     dspacetest=# update metadatavalue set authority='09e4da69-33a3-45ca-b110-7d3f82d2d6d2', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Vermeulen SJ%';
     
    • And then update Discovery and Authority indexes
    • Minor fix for “Subject” string in Discovery search and Atmire modules (#281)
    • Start testing batch fixes for ILRI subject from Peter:
    $ ./fix-metadata-values.py -i ilrisubjects-fix-32.csv -f cg.subject.ilri -t correct -m 203 -d dspace -u dspace -p fuuuu
     $ ./delete-metadata-values.py -i ilrisubjects-delete-13.csv -f cg.subject.ilri -m 203 -d dspace -u dspace -p fuuu
     

    2016-09-29

    • DSpace Test (linode02) became unresponsive for some reason, I had to hard reboot it from the Linode console
    • People on DSpace mailing list gave me a query to get authors from certain collections:
    dspacetest=# select distinct text_value from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/5472', '10568/5473')));
     

    2016-09-30

    • Deny access to REST API’s find-by-metadata-field endpoint to protect against an upstream security issue (DS-3250)
diff --git a/docs/2016-10/index.html b/docs/2016-10/index.html
• I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
    0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
     
    • Hmm, with the dc.contributor.author column removed, DSpace doesn’t detect any changes
    • With a blank dc.contributor.author column, DSpace wants to remove all non-ORCID authors and add the new ORCID authors
    • That left us with 3,180 valid corrections and 3 deletions:
    $ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
     $ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -m 3 -d dspacetest -u dspacetest -p fuuu
     
    • Remove old about page (#284)
    • CGSpace crashed a few times today
    • Generate list of unique authors in CCAFS collections:
    dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
     

    2016-10-05

    • Work on more infrastructure cleanups for Ansible DSpace role
    • Re-deploy CGSpace with latest changes from late September and early October
    • Run fixes for ILRI subjects and delete blank metadata values:
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     DELETE 11
     
    • Run all system updates and reboot CGSpace
    • Delete ten gigs of old 2015 Tomcat logs that never got rotated (WTF?):
    root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
     47
     
    • Delete 2GB cron-filter-media.log file, as it is just a log from a cron job and it doesn’t get rotated like normal log files (almost a year now maybe)
      • A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:
      $ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
       
      • One observation is that there are still some old versions of names in the author lookup because authors appear in other communities (as we only corrected authors from CCAFS for this round)
      • Start working on DSpace 5.5 porting work again:
      $ git checkout -b 5_x-55 5_x-prod
       $ git rebase -i dspace-5.5
       
      • Have to fix about ten merge conflicts, mostly in the SCSS for the CGIAR theme
        • Move the LIVES community from the top level to the ILRI projects community
        $ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101
         
        • Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA
        • Start looking at batch fixing of “old” ILRI website links without www or https, for example:
        dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
         
        • Also CCAFS has HTTPS and their links should use it where possible:
        dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
         
        • And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):
        dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
         
        • Turns out there are shit tons of varieties of this, like with http, https, www, separate </img> tags, alignments, etc
        • Had to find all variations and replace them individually:
        dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>','<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
         dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"/>%';
         dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
         dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"/>%';
         
        • Run Font Awesome fixes on DSpace Test:
        dspace=# \i /tmp/font-awesome-text-replace.sql
         UPDATE 17
         UPDATE 17
         UPDATE 3
         
        • Fix some messed up authors on CGSpace:
        dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
         UPDATE 10
         dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
         UPDATE 36
         
      • Talk to Carlos Quiros about CG Core metadata in CGSpace
      • Get a list of countries from CGSpace so I can do some batch corrections:
      dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv;
       
      • Fix a bunch of countries in Open Refine and run the corrections on CGSpace:
      $ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
       $ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
       
      • Run a shit ton of author fixes from Peter Ballantyne that we’ve been cleaning up for two months:
      $ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu
       
      • Run a few URL corrections for ilri.org and doi.org, etc:
      dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
       dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
       dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
       dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);
diff --git a/docs/2016-11/index.html b/docs/2016-11/index.html

• Add dc.type to the output options for Atmire’s Listings and Reports module (#286)
       
    • Indexing Discovery on DSpace Test took 332 minutes, which is like five times as long as it usually takes
    • At the end it appeared to finish correctly but there were lots of errors right after it finished:
    2016-11-02 15:09:48,578 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76454 to Index
     2016-11-02 15:09:48,584 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/3202 to Index
     2016-11-02 15:09:48,589 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76455 to Index
     2016-11-02 15:09:48,590 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/51693 to Index
     
  • DSpace is still up, and a few minutes later I see the default DSpace indexer is still running
  • Sure enough, looking back before the first one finished, I see output from both indexers interleaved in the log:
    2016-11-02 15:09:28,545 INFO  org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index
     2016-11-02 15:09:28,633 INFO  org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index
     2016-11-02 15:09:28,678 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557
     2016-11-02 15:09:28,688 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476
     
    • Horrible one liner to get Linode ID from certain Ansible host vars:
    $ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
     
    • I noticed some weird CRPs in the database, and they don’t show up in Discovery for some reason, perhaps the :
    • I’ll export these and fix them in batch:
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=230 group by text_value order by count desc) to /tmp/crp.csv with csv;
     COPY 22
     
    • Test running the replacements:
$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
     
    • Add AMR to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary (#288)
    @@ -188,7 +188,7 @@ COPY 22
• The way to go is probably to have a CSV of author names and authority IDs, then to batch update them in PostgreSQL (a sketch of such an update follows the author dump below)
  • Dump of the top ~200 authors in CGSpace:
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=3 group by text_value order by count desc limit 210) to /tmp/210-authors.csv with csv;
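• A hedged sketch of what one row of such a CSV could turn into in PostgreSQL (the author name and authority UUID below are placeholders, not real CGSpace values):
dspace=# update metadatavalue set authority='f0000000-0000-0000-0000-000000000000', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'Example, Author A.';
• Each CSV row would become one such UPDATE, followed by an authority and Discovery reindex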
     

    2016-11-09

    • CGSpace crashed so I quickly ran system updates, applied one or two of the waiting changes from the 5_x-prod branch, and rebooted the server
@@ -200,11 +200,11 @@ COPY 22
    • Helping Megan Zandstra and CIAT with some questions about the REST API
    • Playing with find-by-metadata-field, this works:
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'
     
    • But the results are deceiving because metadata fields can have text languages and your query must match exactly!
dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
      text_value | text_lang
     ------------+-----------
      SEEDS      |
    @@ -215,7 +215,7 @@ COPY 22
     
  • So basically, the text language here could be null, blank, or en_US
  • To query metadata with these properties, you can do:
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
     55
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
     34
    @@ -223,7 +223,7 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
     
    • The results (55+34=89) don’t seem to match those from the database:
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
      count
     -------
         15
    @@ -241,7 +241,7 @@ dspace=# select count(text_value) from metadatavalue where resource_type_id=2 an
     
  • I’ll ask a question on the dspace-tech mailing list
  • And speaking of text_lang, this is interesting:
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
      text_lang
     -----------
     
    @@ -262,28 +262,28 @@ dspace=# select count(text_value) from metadatavalue where resource_type_id=2 an
     
    • Generate a list of all these so I can maybe fix them in batch:
dspace=# \copy (select distinct text_lang, count(*) from metadatavalue where resource_type_id=2 group by text_lang order by count desc) to /tmp/text-langs.csv with csv;
     COPY 14
     
    • Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
     UPDATE 85
     
    • The fix-metadata.py script I have is meant for specific metadata values, so if I want to update some text_lang values I should just do it directly in the database
    • For example, on a limited set:
dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
     UPDATE 420
     
    • And assuming I want to do it for all fields:
dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
     UPDATE 183726
     
• After that I restarted Tomcat and PostgreSQL (because I’m superstitious about caches) and now I see the following in the REST API query:
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
     71
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
     0
    @@ -298,7 +298,7 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
     
  • So there is apparently this Tomcat native way to limit web crawlers to one session: Crawler Session Manager
  • After adding that to server.xml bots matching the pattern in the configuration will all use ONE session, just like normal users:
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Content-Encoding: gzip
    @@ -336,7 +336,7 @@ X-Cocoon-Version: 2.2.0
     
    • Seems the default regex doesn’t catch Baidu, though:
$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Content-Encoding: gzip
    @@ -364,13 +364,13 @@ X-Cocoon-Version: 2.2.0
     
    • Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:
<!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers -->
     <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
            crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
     
    • Looking at the bots that were active yesterday it seems the above regex should be sufficient:
$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
     Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
     Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
     Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
    @@ -379,7 +379,7 @@ Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "
     
    • Neat maven trick to exclude some modules from being built:
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
     
    • We absolutely don’t use those modules, so we shouldn’t build them in the first place
    @@ -387,13 +387,13 @@ Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "
    • Generate a list of journal titles for Peter and Abenet to look through so we can make a controlled vocabulary out of them:
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc) to /tmp/journal-titles.csv with csv;
     COPY 2515
     
    • Send a message to users of the CGSpace REST API to notify them of upcoming upgrade so they can test their apps against DSpace Test
• Test updating old, non-HTTPS links to the CCAFS website in CGSpace metadata:
dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
     UPDATE 164
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://ccafs.cgiar.org','https://ccafs.cgiar.org') where resource_type_id=2 and text_value like '%http://ccafs.cgiar.org%';
     UPDATE 7
    @@ -404,11 +404,11 @@ UPDATE 7
     
  • I’m debating forcing the re-generation of ALL thumbnails, since some come from DSpace 3 and 4 when the thumbnailing wasn’t as good
  • The results were very good, I think that after we upgrade to 5.5 I will do it, perhaps one community / collection at a time:
$ [dspace]/bin/dspace filter-media -f -i 10568/67156 -p "ImageMagick PDF Thumbnail"
     
    • In related news, I’m looking at thumbnails of thumbnails (the ones we uploaded manually before, and now DSpace’s media filter has made thumbnails of THEM):
dspace=# select text_value from metadatavalue where text_value like '%.jpg.jpg';
     
    • I’m not sure if there’s anything we can do, actually, because we would have to remove those from the thumbnail bundles, and replace them with the regular JPGs from the content bundle, and then remove them from the assetstore…
    @@ -464,7 +464,7 @@ UPDATE 7
  • One user says they are still getting a blank page when he logs in (just CGSpace header, but no community list)
• Looking at the Catalina logs I see there is some super long-running indexing process going on:
INFO: FrameworkServlet 'oai': initialization completed in 2600 ms
     [>                                                  ] 0% time remaining: Calculating... timestamp: 2016-11-28 03:00:18
     [>                                                  ] 0% time remaining: 11 hour(s) 57 minute(s) 46 seconds. timestamp: 2016-11-28 03:00:19
     [>                                                  ] 0% time remaining: 23 hour(s) 4 minute(s) 28 seconds. timestamp: 2016-11-28 03:00:19
    @@ -477,7 +477,7 @@ UPDATE 7
     
  • Double checking the DSpace 5.x upgrade notes for anything I missed, or troubleshooting tips
  • Running some manual processes just in case:
$ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dcterms-types.xml
     $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/dublin-core-types.xml
     $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/eperson-types.xml
     $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacetest.cgiar.org/config/registries/workflow-types.xml
    @@ -491,7 +491,7 @@ $ /home/dspacetest.cgiar.org/bin/dspace registry-loader -metadata /home/dspacete
     
  • Sisay tried deleting and re-creating Goshu’s account but he still can’t see any communities on the homepage after he logs in
  • Around the time of his login I see this in the DSpace logs:
2016-11-29 07:56:36,350 INFO  org.dspace.authenticate.LDAPAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:no DN found for user g.cherinet@cgiar.org
     2016-11-29 07:56:36,350 INFO  org.dspace.authenticate.PasswordAuthentication @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:authenticate:attempting password auth of user=g.cherinet@cgiar.org
     2016-11-29 07:56:36,352 INFO  org.dspace.app.xmlui.utils.AuthenticationUtil @ g.cherinet@cgiar.org:session_id=F628E13AB4EF2BA949198A99EFD8EBE4:ip_addr=213.55.99.121:failed_login:email=g.cherinet@cgiar.org, realm=null, result=2
     2016-11-29 07:56:36,545 INFO  com.atmire.utils.UpdateSolrStatsMetadata @ Start processing item 10568/50391 id:51744
    @@ -513,7 +513,7 @@ org.dspace.discovery.SearchServiceException: Error executing query
     
    • At about the same time in the solr log I see a super long query:
    2016-11-29 07:56:36,734 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&fl=dateIssued.year,handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=dateIssued.year:[*+TO+*]&fq=read:(g0+OR+e574+OR+g0+OR+g3+OR+g9+OR+g10+OR+g14+OR+g16+OR+g18+OR+g20+OR+g23+OR+g24+OR+g2072+OR+g2074+OR+g28+OR+g2076+OR+g29+OR+g2078+OR+g2080+OR+g34+OR+g2082+OR+g2084+OR+g38+OR+g2086+OR+g2088+OR+g2091+OR+g43+OR+g2092+OR+g2093+OR+g2095+OR+g2097+OR+g50+OR+g2099+OR+g51+OR+g2103+OR+g62+OR+g65+OR+g2115+OR+g2117+OR+g2119+OR+g2121+OR+g2123+OR+g2125+OR+g77+OR+g78+OR+g79+OR+g2127+OR+g80+OR+g2129+OR+g2131+OR+g2133+OR+g2134+OR+g2135+OR+g2136+OR+g2137+OR+g2138+OR+g2139+OR+g2140+OR+g2141+OR+g2142+OR+g2148+OR+g2149+OR+g2150+OR+g2151+OR+g2152+OR+g2153+OR+g2154+OR+g2156+OR+g2165+OR+g2167+OR+g2171+OR+g2174+OR+g2175+OR+g129+OR+g2182+OR+g2186+OR+g2189+OR+g153+OR+g158+OR+g166+OR+g167+OR+g168+OR+g169+OR+g2225+OR+g179+OR+g2227+OR+g2229+OR+g183+OR+g2231+OR+g184+OR+g2233+OR+g186+OR+g2235+OR+g2237+OR+g191+OR+g192+OR+g193+OR+g202+OR+g203+OR+g204+OR+g205+OR+g207+OR+g208+OR+g218+OR+g219+OR+g222+OR+g223+OR+g230+OR+g231+OR+g238+OR+g241+OR+g244+OR+g254+OR+g255+OR+g262+OR+g265+OR+g268+OR+g269+OR+g273+OR+g276+OR+g277+OR+g279+OR+g282+OR+g2332+OR+g2335+OR+g2338+OR+g292+OR+g293+OR+g2341+OR+g296+OR+g2344+OR+g297+OR+g2347+OR+g301+OR+g2350+OR+g303+OR+g305+OR+g2356+OR+g310+OR+g311+OR+g2359+OR+g313+OR+g2362+OR+g2365+OR+g2368+OR+g321+OR+g2371+OR+g325+OR+g2374+OR+g328+OR+g2377+OR+g2380+OR+g333+OR+g2383+OR+g2386+OR+g2389+OR+g342+OR+g343+OR+g2392+OR+g345+OR+g2395+OR+g348+OR+g2398+OR+g2401+OR+g2404+OR+g2407+OR+g364+OR+g366+OR+g2425+OR+g2427+OR+g385+OR+g387+OR+g388+OR+g389+OR+g2442+OR+g395+OR+g2443+OR+g2444+OR+g401+OR+g403+OR+g405+OR+g408+OR+g2457+OR+g2458+OR+g411+OR+g2459+OR+g414+OR+g2463+OR+g417+OR+g2465+OR+g2467+OR+g421+OR+g2469+OR+g2471+OR+g424+OR+g2473+OR+g2475+OR+g2476+OR+g429+OR+g433+OR+g2481+OR+g2482+OR+g2483+OR+g443+OR+g444+OR+g445+OR+g446+OR+g448+OR+g453+OR+g455+OR+g456+OR+g457+OR+g458+OR+g459+OR+g461+OR+g462+OR+g463+OR+g464+OR+g465+OR+g467+OR+g468+OR+g469+OR+g474+OR+g476+OR+g477+OR+g480+OR+g483+OR+g484+OR+g493+OR+g496+OR+g497+OR+g498+OR+g500+OR+g502+OR+g504+OR+g505+OR+g2559+OR+g2560+OR+g513+OR+g2561+OR+g515+OR+g516+OR+g518+OR+g519+OR+g2567+OR+g520+OR+g521+OR+g522+OR+g2570+OR+g523+OR+g2571+OR+g524+OR+g525+OR+g2573+OR+g526+OR+g2574+OR+g527+OR+g528+OR+g2576+OR+g529+OR+g531+OR+g2579+OR+g533+OR+g534+OR+g2582+OR+g535+OR+g2584+OR+g538+OR+g2586+OR+g540+OR+g2588+OR+g541+OR+g543+OR+g544+OR+g545+OR+g546+OR+g548+OR+g2596+OR+g549+OR+g551+OR+g555+OR+g556+OR+g558+OR+g561+OR+g569+OR+g570+OR+g571+OR+g2619+OR+g572+OR+g2620+OR+g573+OR+g2621+OR+g2622+OR+g575+OR+g578+OR+g581+OR+g582+OR+g584+OR+g585+OR+g586+OR+g587+OR+g588+OR+g590+OR+g591+OR+g593+OR+g595+OR+g596+OR+g598+OR+g599+OR+g601+OR+g602+OR+g603+OR+g604+OR+g605+OR+g606+OR+g608+OR+g609+OR+g610+OR+g612+OR+g614+OR+g616+OR+g620+OR+g621+OR+g623+OR+g630+OR+g635+OR+g636+OR+g646+OR+g649+OR+g683+OR+g684+OR+g687+OR+g689+OR+g691+OR+g695+OR+g697+OR+g698+OR+g699+OR+g700+OR+g701+OR+g707+OR+g708+OR+g709+OR+g710+OR+g711+OR+g712+OR+g713+OR+g714+OR+g715+OR+g716+OR+g717+OR+g719+OR+g720+OR+g729+OR+g732+OR+g733+OR+g734+OR+g736+OR+g737+OR+g738+OR+g2786+OR+g752+OR+g754+OR+g2804+OR+g757+OR+g2805+OR+g2806+OR+g760+OR+g761+OR+g2810+OR+g2815+OR+g769+OR+g771+OR+g773+OR+g776+OR+g786+OR+g787+OR+g788+OR+g789+OR+g791+OR+g792+OR+g793+OR+g794+OR+g795+OR+g796+OR+g798+OR+g800+OR+g802+OR+g803+OR+g806+OR+g808+OR+g810+OR+g814+OR+g815+OR+g817+OR+g829+O
R+g830+OR+g849+OR+g893+OR+g895+OR+g898+OR+g902+OR+g903+OR+g917+OR+g919+OR+g921+OR+g922+OR+g923+OR+g924+OR+g925+OR+g926+OR+g927+OR+g928+OR+g929+OR+g930+OR+g932+OR+g933+OR+g934+OR+g938+OR+g939+OR+g944+OR+g945+OR+g946+OR+g947+OR+g948+OR+g949+OR+g950+OR+g951+OR+g953+OR+g954+OR+g955+OR+g956+OR+g958+OR+g959+OR+g960+OR+g963+OR+g964+OR+g965+OR+g968+OR+g969+OR+g970+OR+g971+OR+g972+OR+g973+OR+g974+OR+g976+OR+g978+OR+g979+OR+g984+OR+g985+OR+g987+OR+g988+OR+g991+OR+g993+OR+g994+OR+g999+OR+g1000+OR+g1003+OR+g1005+OR+g1006+OR+g1007+OR+g1012+OR+g1013+OR+g1015+OR+g1016+OR+g1018+OR+g1023+OR+g1024+OR+g1026+OR+g1028+OR+g1030+OR+g1032+OR+g1033+OR+g1035+OR+g1036+OR+g1038+OR+g1039+OR+g1041+OR+g1042+OR+g1044+OR+g1045+OR+g1047+OR+g1048+OR+g1050+OR+g1051+OR+g1053+OR+g1054+OR+g1056+OR+g1057+OR+g1058+OR+g1059+OR+g1060+OR+g1061+OR+g1062+OR+g1063+OR+g1064+OR+g1065+OR+g1066+OR+g1068+OR+g1071+OR+g1072+OR+g1074+OR+g1075+OR+g1076+OR+g1077+OR+g1078+OR+g1080+OR+g1081+OR+g1082+OR+g1084+OR+g1085+OR+g1087+OR+g1088+OR+g1089+OR+g1090+OR+g1091+OR+g1092+OR+g1093+OR+g1094+OR+g1095+OR+g1096+OR+g1097+OR+g1106+OR+g1108+OR+g1110+OR+g1112+OR+g1114+OR+g1117+OR+g1120+OR+g1121+OR+g1126+OR+g1128+OR+g1129+OR+g1131+OR+g1136+OR+g1138+OR+g1140+OR+g1141+OR+g1143+OR+g1145+OR+g1146+OR+g1148+OR+g1152+OR+g1154+OR+g1156+OR+g1158+OR+g1159+OR+g1160+OR+g1162+OR+g1163+OR+g1165+OR+g1166+OR+g1168+OR+g1170+OR+g1172+OR+g1175+OR+g1177+OR+g1179+OR+g1181+OR+g1185+OR+g1191+OR+g1193+OR+g1197+OR+g1199+OR+g1201+OR+g1203+OR+g1204+OR+g1215+OR+g1217+OR+g1219+OR+g1221+OR+g1224+OR+g1226+OR+g1227+OR+g1228+OR+g1230+OR+g1231+OR+g1232+OR+g1233+OR+g1234+OR+g1235+OR+g1236+OR+g1237+OR+g1238+OR+g1240+OR+g1241+OR+g1242+OR+g1243+OR+g1244+OR+g1246+OR+g1248+OR+g1250+OR+g1252+OR+g1254+OR+g1256+OR+g1257+OR+g1259+OR+g1261+OR+g1263+OR+g1275+OR+g1276+OR+g1277+OR+g1278+OR+g1279+OR+g1282+OR+g1284+OR+g1288+OR+g1290+OR+g1293+OR+g1296+OR+g1297+OR+g1299+OR+g1303+OR+g1304+OR+g1306+OR+g1309+OR+g1310+OR+g1311+OR+g1312+OR+g1313+OR+g1316+OR+g1318+OR+g1320+OR+g1322+OR+g1323+OR+g1324+OR+g1325+OR+g1326+OR+g1329+OR+g1331+OR+g1347+OR+g1348+OR+g1361+OR+g1362+OR+g1363+OR+g1364+OR+g1367+OR+g1368+OR+g1369+OR+g1370+OR+g1371+OR+g1374+OR+g1376+OR+g1377+OR+g1378+OR+g1380+OR+g1381+OR+g1386+OR+g1389+OR+g1391+OR+g1392+OR+g1393+OR+g1395+OR+g1396+OR+g1397+OR+g1400+OR+g1402+OR+g1406+OR+g1408+OR+g1415+OR+g1417+OR+g1433+OR+g1435+OR+g1441+OR+g1442+OR+g1443+OR+g1444+OR+g1446+OR+g1448+OR+g1450+OR+g1451+OR+g1452+OR+g1453+OR+g1454+OR+g1456+OR+g1458+OR+g1460+OR+g1462+OR+g1464+OR+g1466+OR+g1468+OR+g1470+OR+g1471+OR+g1475+OR+g1476+OR+g1477+OR+g1478+OR+g1479+OR+g1481+OR+g1482+OR+g1483+OR+g1484+OR+g1485+OR+g1486+OR+g1487+OR+g1488+OR+g1489+OR+g1490+OR+g1491+OR+g1492+OR+g1493+OR+g1495+OR+g1497+OR+g1499+OR+g1501+OR+g1503+OR+g1504+OR+g1506+OR+g1508+OR+g1511+OR+g1512+OR+g1513+OR+g1516+OR+g1522+OR+g1535+OR+g1536+OR+g1537+OR+g1539+OR+g1540+OR+g1541+OR+g1542+OR+g1547+OR+g1549+OR+g1551+OR+g1553+OR+g1555+OR+g1557+OR+g1559+OR+g1561+OR+g1563+OR+g1565+OR+g1567+OR+g1569+OR+g1571+OR+g1573+OR+g1580+OR+g1583+OR+g1588+OR+g1590+OR+g1592+OR+g1594+OR+g1595+OR+g1596+OR+g1598+OR+g1599+OR+g1600+OR+g1601+OR+g1602+OR+g1604+OR+g1606+OR+g1610+OR+g1611+OR+g1612+OR+g1613+OR+g1616+OR+g1619+OR+g1622+OR+g1624+OR+g1625+OR+g1626+OR+g1628+OR+g1629+OR+g1631+OR+g1632+OR+g1692+OR+g1694+OR+g1695+OR+g1697+OR+g1705+OR+g1706+OR+g1707+OR+g1708+OR+g1711+OR+g1715+OR+g1717+OR+g1719+OR+g1721+OR+g1722+OR+g1723+OR+g1724+OR+g1725+OR+g1726+OR+g1727+OR+g1731+OR+g1732+OR+g1736+OR+g1737+OR+g1738+OR+g1740+OR+g1742+OR+g1743+OR+g1753+OR+g1755+OR+g1758+OR+g1759+OR+g1764+OR+g1766+OR+g176
9+OR+g1774+OR+g1782+OR+g1794+OR+g1796+OR+g1797+OR+g1814+OR+g1818+OR+g1826+OR+g1853+OR+g1855+OR+g1857+OR+g1858+OR+g1859+OR+g1860+OR+g1861+OR+g1863+OR+g1864+OR+g1865+OR+g1867+OR+g1869+OR+g1871+OR+g1873+OR+g1875+OR+g1877+OR+g1879+OR+g1881+OR+g1883+OR+g1884+OR+g1885+OR+g1887+OR+g1889+OR+g1891+OR+g1892+OR+g1894+OR+g1896+OR+g1898+OR+g1900+OR+g1902+OR+g1907+OR+g1910+OR+g1915+OR+g1916+OR+g1917+OR+g1918+OR+g1929+OR+g1931+OR+g1932+OR+g1933+OR+g1934+OR+g1936+OR+g1937+OR+g1938+OR+g1939+OR+g1940+OR+g1942+OR+g1944+OR+g1945+OR+g1948+OR+g1950+OR+g1955+OR+g1961+OR+g1962+OR+g1964+OR+g1966+OR+g1968+OR+g1970+OR+g1972+OR+g1974+OR+g1976+OR+g1979+OR+g1982+OR+g1984+OR+g1985+OR+g1986+OR+g1987+OR+g1989+OR+g1991+OR+g1996+OR+g2003+OR+g2007+OR+g2011+OR+g2019+OR+g2020+OR+g2046)&sort=dateIssued.year_sort+desc&rows=1&wt=javabin&version=2} hits=56080 status=0 QTime=3
     
• Which, according to some old threads on DSpace Tech, means that the user has a lot of permissions (from groups or on the individual eperson), which increases the Solr query size / query URL length
    • It might be fixed by increasing the Tomcat maxHttpHeaderSize, which is 8192 (or 8KB) by default
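• The change would presumably be a one-attribute tweak on the HTTP Connector in Tomcat’s server.xml, roughly like this (the path and the other attributes shown are illustrative, not copied from our config):
# grep maxHttpHeaderSize /etc/tomcat7/server.xml
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" maxHttpHeaderSize="16384" URIEncoding="UTF-8"/>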
diff --git a/docs/2016-12/index.html b/docs/2016-12/index.html
index 59cad3bde..7f7894753 100644
--- a/docs/2016-12/index.html
+++ b/docs/2016-12/index.html
@@ -46,7 +46,7 @@ I see thousands of them in the logs for the last few months, so it’s not r
I’ve raised a ticket with Atmire to ask
Another worrying error from dspace.log is:
@@ -137,7 +137,7 @@ Another worrying error from dspace.log is:
    • CGSpace was down for five hours in the morning while I was sleeping
    • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
    @@ -147,7 +147,7 @@ Another worrying error from dspace.log is:
     
  • I’ve raised a ticket with Atmire to ask
  • Another worrying error from dspace.log is:
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
             at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
             at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
             at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:882)
    @@ -236,13 +236,13 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
     
    • The first error I see in dspace.log this morning is:
2016-12-02 03:00:46,656 ERROR org.dspace.authority.AuthorityValueFinder @ anonymous::Error while retrieving AuthorityValue from solr:query\colon; id\colon;"b0b541c1-ec15-48bf-9209-6dbe8e338cdc"
     org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8081/solr/authority
     
    • Looking through DSpace’s solr log I see that about 20 seconds before this, there were a few 30+ KiB solr queries
    • The last logs here right before Solr became unresponsive (and right after I restarted it five hours later) were:
2016-12-02 03:00:42,606 INFO  org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/select params={q=containerItem:72828+AND+type:0&shards=localhost:8081/solr/statistics-2010,localhost:8081/solr/statistics&fq=-isInternal:true&fq=-(author_mtdt:"CGIAR\+Institutional\+Learning\+and\+Change\+Initiative"++AND+subject_mtdt:"PARTNERSHIPS"+AND+subject_mtdt:"RESEARCH"+AND+subject_mtdt:"AGRICULTURE"+AND+subject_mtdt:"DEVELOPMENT"++AND+iso_mtdt:"en"+)&rows=0&wt=javabin&version=2} hits=0 status=0 QTime=19
     2016-12-02 08:28:23,908 INFO  org.apache.solr.servlet.SolrDispatchFilter @ SolrDispatchFilter.init()
     
    • DSpace’s own Solr logs don’t give IP addresses, so I will have to enable Nginx’s logging of /solr so I can see where this request came from
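• A minimal sketch of that nginx change, assuming nginx is set up to proxy /solr to the local Solr instance (file name and log path are illustrative):
# cat /etc/nginx/snippets/solr-logging.conf
# goes inside the existing server { } block
location /solr {
    access_log /var/log/nginx/solr.log;
    proxy_pass http://localhost:8081;
}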
@@ -255,7 +255,7 @@ org.apache.solr.client.solrj.SolrServerException: Server refused connection at:
    • I got a weird report from the CGSpace checksum checker this morning
    • It says 732 bitstreams have potential issues, for example:
------------------------------------------------
     Bitstream Id = 6
     Process Start Date = Dec 4, 2016
     Process End Date = Dec 4, 2016
    @@ -278,7 +278,7 @@ Result = The bitstream could not be found
     
  • For what it’s worth, there is no item on DSpace Test or S3 backups with that checksum either…
  • In other news, I’m looking at JVM settings from the Solr 4.10.2 release, from bin/solr.in.sh:
# These GC settings have shown to work well for a number of common Solr workloads
     GC_TUNE="-XX:-UseSuperWord \
     -XX:NewRatio=3 \
     -XX:SurvivorRatio=4 \
    @@ -311,7 +311,7 @@ GC_TUNE="-XX:-UseSuperWord \
     
  • Atmire responded about the MQM warnings in the DSpace logs
  • Apparently we need to change the batch edit consumers in dspace/config/dspace.cfg:
event.consumer.batchedit.filters = Community|Collection+Create
     
    • I haven’t tested it yet, but I created a pull request: #289
    @@ -319,7 +319,7 @@ GC_TUNE="-XX:-UseSuperWord \
    • Some author authority corrections and name standardizations for Peter:
dspace=# update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
     UPDATE 11
     dspace=# update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
     UPDATE 36
    @@ -343,7 +343,7 @@ UPDATE 561
     
  • The docs say a good starting point for a dedicated server is 25% of the system RAM, and our server isn’t dedicated (also runs Solr, which can benefit from OS cache) so let’s try 1024MB
  • In other news, the authority reindexing keeps crashing (I was manually running it after the author updates above):
$ time JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace index-authority
     Retrieving all data
     Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
     Exception: null
    @@ -376,7 +376,7 @@ sys     0m22.647s
     
  • For example, do a Solr query for “first_name:Grace” and look at the results
  • Querying that ID shows the fields that need to be changed:
{
       "responseHeader": {
         "status": 0,
         "QTime": 1,
    @@ -409,7 +409,7 @@ sys     0m22.647s
     
  • I think I can just update the value, first_name, and last_name fields…
• The update syntax should be something like this, but I’m getting errors from Solr (a sketch adapted to our authority record follows below):
$ curl 'localhost:8081/solr/authority/update?commit=true&wt=json&indent=true' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
     {
       "responseHeader":{
         "status":400,
    @@ -421,13 +421,13 @@ sys     0m22.647s
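• For our authority record the atomic update would presumably look more like this, using the field names from the query response above (the id is a placeholder; as noted below, atomic updates also need the updateLog enabled on the core):
$ curl 'localhost:8081/solr/authority/update?commit=true&wt=json&indent=true' -H 'Content-type:application/json' -d '[{"id":"some-authority-uuid","value":{"set":"Grace, Delia"},"first_name":{"set":"Delia"},"last_name":{"set":"Grace"}}]'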
     
  • When I try using the XML format I get an error that the updateLog needs to be configured for that core
  • Maybe I can just remove the authority UUID from the records, run the indexing again so it creates a new one for each name variant, then match them correctly?
dspace=# update metadatavalue set authority=null, confidence=-1 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     UPDATE 561
     
    • Then I’ll reindex discovery and authority and see how the authority Solr core looks
    • After this, now there are authorities for some of the “Grace, D.” and “Grace, Delia” text_values in the database (the first version is actually the same authority that already exists in the core, so it was just added back to some text_values, but the second one is new):
$ curl 'localhost:8081/solr/authority/select?q=id%3A18ea1525-2513-430a-8817-a834cd733fbc&wt=json&indent=true'
     {
       "responseHeader":{
         "status":0,
    @@ -453,7 +453,7 @@ UPDATE 561
     
  • In this case it seems that since there were also two different IDs in the original database, I just picked the wrong one!
  • Better to use:
dspace#= update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     
    • This proves that unifying author name varieties in authorities is easy, but fixing the name in the authority is tricky!
    • Perhaps another way is to just add our own UUID to the authority field for the text_value we like, then re-index authority so they get synced from PostgreSQL to Solr, then set the other text_values to use that authority ID
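• A hedged sketch of that approach (the name and UUID are placeholders): stamp a fresh UUID on the preferred text_value, reindex authority so it gets synced into the Solr authority core, then point the other variants at it:
$ uuidgen | tr '[A-Z]' '[a-z]'
dspace=# update metadatavalue set authority='<new-uuid>', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'Preferred, Name';
$ /home/dspacetest.cgiar.org/bin/dspace index-authority
dspace=# update metadatavalue set authority='<new-uuid>', text_value='Preferred, Name', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Preferred, N%';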
@@ -461,7 +461,7 @@ UPDATE 561
    • Deploy “take task” hack/fix on CGSpace (#290)
    • I ran the following author corrections and then reindexed discovery:
update metadatavalue set authority='b041f2f4-19e7-4113-b774-0439baabd197', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Mora Benard%';
     update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Hoek, R%';
     update metadatavalue set text_value = 'Hoek, Rein van der', authority='4d6cbce2-6fd5-4b43-9363-58d18e7952c9', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like '%an der Hoek%' and text_value !~ '^.*W\.?$';
     update metadatavalue set authority='18349f29-61b1-44d7-ac60-89e55546e812', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne, P%';
    @@ -471,7 +471,7 @@ update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-417
     
    • Something weird happened and Peter Thorne’s names all ended up as “Thorne”, I guess because the original authority had that as its name value:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Thorne%';
         text_value    |              authority               | confidence
     ------------------+--------------------------------------+------------
      Thorne, P.J.     | 18349f29-61b1-44d7-ac60-89e55546e812 |        600
    @@ -484,12 +484,12 @@ update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-417
     
• I generated a new UUID using uuidgen | tr [A-Z] [a-z] and set it along with the correct name variation for all records:
dspace=# update metadatavalue set authority='b2f7603d-2fb5-4018-923a-c4ec8d85b3bb', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812';
     UPDATE 43
     
    • Apparently we also need to normalize Phil Thornton’s names to Thornton, Philip K.:
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
          text_value      |              authority               | confidence
     ---------------------+--------------------------------------+------------
      Thornton, P         | 0d8369bb-57f7-4b2f-92aa-af820b183aca |        600
    @@ -506,7 +506,7 @@ UPDATE 43
     
    • Seems his original authorities are using an incorrect version of the name so I need to generate another UUID and tie it to the correct name, then reindex:
dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
     UPDATE 362
     
    • It seems that, when you are messing with authority and author text values in the database, it is better to run authority reindex first (postgres→solr authority core) and then Discovery reindex (postgres→solr Discovery core)
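• Concretely that order would be something like the following (paths as used on DSpace Test elsewhere in these notes; index-discovery -b forces a full rebuild):
$ /home/dspacetest.cgiar.org/bin/dspace index-authority
$ /home/dspacetest.cgiar.org/bin/dspace index-discovery -b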
@@ -520,7 +520,7 @@ UPDATE 362
    • Set PostgreSQL’s shared_buffers on CGSpace to 10% of system RAM (1200MB)
    • Run the following author corrections on CGSpace:
dspace=# update metadatavalue set authority='34df639a-42d8-4867-a3f2-1892075fcb3f', text_value='Thorne, P.J.' where resource_type_id=2 and metadata_field_id=3 and authority='18349f29-61b1-44d7-ac60-89e55546e812' or authority='021cd183-946b-42bb-964e-522ebff02993';
     dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab764', text_value='Thornton, Philip K.', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^Thornton[,\.]? P.*';
     
    • The authority IDs were different now than when I was looking a few days ago so I had to adjust them here
@@ -534,7 +534,7 @@ dspace=# update metadatavalue set authority='2df8136e-d8f4-4142-b58c-562337cab76
      • Looking at CIAT records from last week again, they have a lot of double authors like:
International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::600
       International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::500
       International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024eb2::0
       
        @@ -542,7 +542,7 @@ International Center for Tropical Agriculture::3026b1de-9302-4f3e-85ab-ef48da024
      • Removing the duplicates in OpenRefine and uploading a CSV to DSpace says “no changes detected”
• Seems like the only way to sort of clean these up would be to start in SQL (a sketch of the follow-up normalization update is below):
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Center for Tropical Agriculture';
                         text_value                   |              authority               | confidence
       -----------------------------------------------+--------------------------------------+------------
        International Center for Tropical Agriculture | cc726b78-a2f4-4ee9-af98-855c2ea31c36 |         -1
      @@ -577,14 +577,14 @@ UPDATE 35
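• The follow-up would presumably be a normalization UPDATE along these lines, pinning everything to one of the authority UUIDs shown above (to be double checked against the authority core before running):
dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';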
       
    • So basically, new cron jobs for logs should look something like this:
    • Find any file named *.log* that isn’t dspace.log*, isn’t already zipped, and is older than one day, and zip it:
# find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*" ! -iregex ".*dspace\.log.*" ! -iregex ".*\.(gz|lrz|lzo|xz)" ! -newermt "Yesterday" -exec schedtool -B -e ionice -c2 -n7 xz {} \;
     
    • Since there is xzgrep and xzless we can actually just zip them after one day, why not?!
• We can keep the zipped ones for two weeks just in case we need to look for errors, etc, and delete them after that (a sketch of that cleanup follows below)
    • I use schedtool -B and ionice -c2 -n7 to set the CPU scheduling to SCHED_BATCH and the IO to best effort which should, in theory, impact important system processes like Tomcat and PostgreSQL less
    • When the tasks are running you can see that the policies do apply:
$ schedtool $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}') && ionice -p $(ps aux | grep "xz /home" | grep -v grep | awk '{print $2}')
     PID 17049: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0xf
     best-effort: prio 7
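• A hedged sketch of the corresponding cleanup job and of reading the compressed logs afterwards (the retention period and log file name are illustrative):
# find /home/dspacetest.cgiar.org/log -regextype posix-extended -iregex ".*\.log.*\.xz" -mtime +14 -delete
$ xzgrep -c ERROR /home/dspacetest.cgiar.org/log/solr.log.2016-12-01.xz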
     
      @@ -594,7 +594,7 @@ best-effort: prio 7
    • Some users pointed out issues with the “most popular” stats on a community or collection
    • This error appears in the logs when you try to view them:
2016-12-13 21:17:37,486 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceObjectDatasetGenerator.toDatasetQuery(Lorg/dspace/core/Context;)Lcom/atmire/statistics/content/DatasetQuery;
     	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:972)
     	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:852)
    @@ -679,7 +679,7 @@ Caused by: java.lang.NoSuchMethodError: com.atmire.statistics.generator.DSpaceOb
     
  • None of our users in African institutes will have IPv6, but some Europeans might, so I need to check if any submissions have been added since then
  • Update some names and authorities in the database:
dspace=# update metadatavalue set authority='5ff35043-942e-4d0a-b377-4daed6e3c1a3', confidence=600, text_value='Duncan, Alan' where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*Duncan,? A.*';
     UPDATE 204
     dspace=# update metadatavalue set authority='46804b53-ea30-4a85-9ccf-b79a35816fa9', confidence=600, text_value='Mekonnen, Kindu' where resource_type_id=2 and metadata_field_id=3 and text_value like '%Mekonnen, K%';
     UPDATE 89
    @@ -692,7 +692,7 @@ UPDATE 140
     
• Enable OCSP stapling for hosts >= Ubuntu 16.04 in our Ansible playbooks (#76); the relevant nginx directives are sketched after the openssl checks below
  • Working for DSpace Test on the second response:
$ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
     ...
     OCSP response: no response sent
     $ openssl s_client -connect dspacetest.cgiar.org:443 -servername dspacetest.cgiar.org -tls1_2 -tlsextdebug -status
    @@ -704,12 +704,12 @@ OCSP Response Data:
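• For reference, the nginx side of OCSP stapling boils down to a few directives like these (certificate path and resolver are illustrative; ours are templated in the Ansible playbooks):
# grep -E 'ssl_stapling|ssl_trusted_certificate|resolver' /etc/nginx/nginx.conf
ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/ssl/certs/chain.pem;
resolver 8.8.8.8 valid=300s;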
     
  • Migrate CGSpace to new server, roughly following these steps:
  • On old server:
# service tomcat7 stop
     # /home/backup/scripts/postgres_backup.sh
     
    • On new server:
# systemctl stop tomcat7
     # rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/assetstore/ /home/cgspace.cgiar.org/assetstore/
     # rsync -4 -av --delete 178.79.187.182:/home/backup/ /home/backup/
     # rsync -4 -av --delete 178.79.187.182:/home/cgspace.cgiar.org/solr/ /home/cgspace.cgiar.org/solr
    @@ -750,7 +750,7 @@ $ exit
     
  • Abenet wanted a CSV of the IITA community, but the web export doesn’t include the dc.date.accessioned field
  • I had to export it from the command line using the -a flag:
$ [dspace]/bin/dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
     

    2016-12-28

    • We’ve been getting two alerts per day about CPU usage on the new server from Linode
diff --git a/docs/2017-01/index.html b/docs/2017-01/index.html
index 42f4f8421..d14827e10 100644
--- a/docs/2017-01/index.html
+++ b/docs/2017-01/index.html
@@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
I tested on DSpace Test as well and it doesn’t work there either
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
@@ -124,7 +124,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
      • I tried to shard my local dev instance and it fails the same way:
      -
      $ JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace stats-util -s
      +
      $ JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace stats-util -s
       Moving: 9318 into core statistics-2016
       Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
       org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
      @@ -171,7 +171,7 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
       
      • And the DSpace log shows:
      -
      2017-01-04 22:39:05,412 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
      +
      2017-01-04 22:39:05,412 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
       2017-01-04 22:39:05,412 INFO  org.dspace.statistics.SolrLogger @ Moving: 9318 records into core statistics-2016
       2017-01-04 22:39:07,310 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (java.net.SocketException) caught when processing request to {}->http://localhost:8081: Broken pipe (Write failed)
       2017-01-04 22:39:07,310 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}->http://localhost:8081
      @@ -179,7 +179,7 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
       
    • Despite failing instantly, a statistics-2016 directory was created, but it only has a data dir (no conf)
    • The Tomcat access logs show more:
    -
    127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
    +
    127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
     127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-17YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 423
     127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 77
     127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
    @@ -208,11 +208,11 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
     
  • I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help
  • For example, this shows 186 mappings for the item, the first three of which are real:
  • -
    dspace=#  select * from collection2item where item_id = '80596';
    +
    dspace=#  select * from collection2item where item_id = '80596';
     
    • Then I deleted the others:
    -
    dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
    +
    dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
     
    • And in the item view it now shows the correct mappings
    • I will have to ask the DSpace people if this is a valid approach
    • @@ -223,24 +223,24 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
    • Maria found another item with duplicate mappings: https://cgspace.cgiar.org/handle/10568/78658
    • Error in fix-metadata-values.py when it tries to print the value for Entwicklung & Ländlicher Raum:
    -
    Traceback (most recent call last):
    +
    Traceback (most recent call last):
       File "./fix-metadata-values.py", line 80, in <module>
         print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
     UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
     
    • Seems we need to encode as UTF-8 before printing to screen, ie:
    -
    print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
    +
    print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
     
    • See: http://stackoverflow.com/a/36427358/487333
    • I’m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database… I’ve never had this issue before
    • Now back to cleaning up some journal titles so we can make the controlled vocabulary:
    -
    $ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
     
    • Now get the top 500 journal titles:
    -
    dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
    +
    dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
     
    • The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November
    • I will have to go through these and fix some more before making the controlled vocabulary
    • @@ -254,7 +254,7 @@ UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15:
      • Fix the two items Maria found with duplicate mappings with this script:
      -
      /* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
      +
      /* 184 in correct mappings: https://cgspace.cgiar.org/handle/10568/78596 */
       delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
       /* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
       delete from collection2item where id = '91082';
      @@ -266,20 +266,20 @@ delete from collection2item where id = '91082';
       
    • And the file names don’t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore
    • Seems like the only ones I should replace are the ' apostrophe characters, as %27:
    -
    value.replace("'",'%27')
    +
    value.replace("'",'%27')
     
    • Add the item’s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:
    -
    value + "__description:" + cells["dc.type"].value
    +
    value + "__description:" + cells["dc.type"].value
     
    • Test importing of the new CIAT records (actually there are 232, not 234):
    -
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
    +
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
     
    • Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB
• These are scanned from paper and likely have no compression, so we should test whether these compression techniques help without compromising the quality too much:
    -
    $ convert -compress Zip -density 150x150 input.pdf output.pdf
    +
    $ convert -compress Zip -density 150x150 input.pdf output.pdf
     $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
     
    • Somewhere on the Internet suggested using a DPI of 144
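• That would just be a small tweak to the density flag in the same command, something like:
  $ convert -compress Zip -density 144x144 input.pdf output.pdf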
    • @@ -289,7 +289,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
• In testing a random sample of CIAT’s PDFs for compressibility, it looks like all of these methods generally increase the file size, so we will just import them as they are
    • Import 232 CIAT records into CGSpace:
    -
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
    +
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
     

    2017-01-22

    • Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel’s CSV exporter)
    • @@ -300,22 +300,22 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
    • I merged Atmire’s pull request into the development branch so they can deploy it on DSpace Test
    • Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):
    -
    $ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
    +
    $ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
     
    -
    10568/42161 10568/171 10568/79341
    +
    10568/42161 10568/171 10568/79341
     10568/41914 10568/171 10568/79340
     

    2017-01-24

    • Run all updates on DSpace Test and reboot the server
    • Run fixes for Journal titles on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
    +
    $ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
     
    • Create a new list of the top 500 journal titles from the database:
    -
    dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
    +
    dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
     
    • Then sort them in OpenRefine and create a controlled vocabulary by manually adding the XML markup, pull request (#298)
    • This would be the last issue remaining to close the meta issue about switching to controlled vocabularies (#69)
    • diff --git a/docs/2017-02/index.html b/docs/2017-02/index.html index 9722a3f8c..01f9c63fc 100644 --- a/docs/2017-02/index.html +++ b/docs/2017-02/index.html @@ -50,7 +50,7 @@ DELETE 1 Create issue on GitHub to track the addition of CCAFS Phase II project tags (#301) Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name "/> - + @@ -140,7 +140,7 @@ Looks like we’ll be using cg.identifier.ccafsprojectpii as the field name
      • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
      -
      dspace=# select * from collection2item where item_id = '80278';
      +
      dspace=# select * from collection2item where item_id = '80278';
         id   | collection_id | item_id
       -------+---------------+---------
        92551 |           313 |   80278
      @@ -166,7 +166,7 @@ DELETE 1
       
    • The climate risk management one doesn’t exist, so I will have to ask Magdalena if they want me to add it to the input forms
• Start testing the nearly 500 author corrections that CCAFS sent me:
    -
    $ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
    +
    $ ./fix-metadata-values.py -i /tmp/CCAFS-Authors-Feb-7.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
     

    2017-02-09

    • More work on CCAFS Phase II stuff
    • @@ -175,7 +175,7 @@ DELETE 1
    • It’s not a very good way to manage the registry, though, as removing one there doesn’t cause it to be removed from the registry, and we always restore from database backups so there would never be a scenario when we needed these to be created
    • Testing some corrections on CCAFS Phase II flagships (cg.subject.ccafs):
    -
    $ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
    +
    $ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
     

    2017-02-10

    • CCAFS said they want to wait on the flagship updates (cg.subject.ccafs) on CGSpace, perhaps for a month or so
    • @@ -215,46 +215,46 @@ DELETE 1
• Fix issue with a duplicate declaration in atmire-dspace-xmlui pom.xml (causing non-fatal warnings during the maven build)
• Experiment with making DSpace generate HTTPS handle links, first with a change in dspace.cfg or the site’s properties file:
    -
    handle.canonical.prefix = https://hdl.handle.net/
    +
    handle.canonical.prefix = https://hdl.handle.net/
     
    • And then a SQL command to update existing records:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'uri');
     UPDATE 58193
     
    • Seems to work fine!
    • I noticed a few items that have incorrect DOI links (dc.identifier.doi), and after looking in the database I see there are over 100 that are missing the scheme or are just plain wrong:
    -
    dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
    +
    dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';
     
    • This will replace any that begin with 10. and change them to https://dx.doi.org/10.:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^10\..+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like '10.%';
     
    • This will get any that begin with doi:10. and change them to https://dx.doi.org/10.x:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^doi:(10\..+$)', 'https://dx.doi.org/\1') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'doi:10%';
     
    • Fix DOIs like dx.doi.org/10. to be https://dx.doi.org/10.:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org/%';
     
    • Fix DOIs like http//:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '^http//(dx.doi.org/.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http//%';
     
    • Fix DOIs like dx.doi.org./:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%'
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, '(^dx.doi.org\./.+$)', 'https://dx.doi.org/\1') where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'dx.doi.org./%'
     
     
    • Delete some invalid DOIs:
    -
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value in ('DOI','CPWF Mekong','Bulawayo, Zimbabwe','bb');
     
    • Fix some other random outliers:
    -
    dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
    +
    dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.1016/j.aquaculture.2015.09.003' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:/dx.doi.org/10.1016/j.aquaculture.2015.09.003';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.5337/2016.200' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'doi: https://dx.doi.org/10.5337/2016.200';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.org/doi:10.1371/journal.pone.0062898' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'Http://dx.doi.org/doi:10.1371/journal.pone.0062898';
     dspace=# update metadatavalue set text_value = 'https://dx.doi.10.1016/j.cosust.2013.11.012' where metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value = 'http:dx.doi.10.1016/j.cosust.2013.11.012';
    @@ -263,13 +263,13 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agro
     
    • And do another round of http:// → https:// cleanups:
    -
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
    +
    dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value like 'http://dx.doi.org%';
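• After running all of these, re-running the earlier “missing scheme” select is a quick sanity check; anything it still prints needs a manual fix:
  dspace=# select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'identifier' and qualifier = 'doi') and text_value not like 'http%://%';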
     
    • Run all DOI corrections on CGSpace
    • Something to think about here is to write a Curation Task in Java to do these sanity checks / corrections every night
    • Then we could add a cron job for them and run them from the command line like:
    -
    [dspace]/bin/dspace curate -t noop -i 10568/79891
    +
    [dspace]/bin/dspace curate -t noop -i 10568/79891
     

    2017-02-20

    • Run all system updates on DSpace Test and reboot the server
    • @@ -280,7 +280,7 @@ dspace=# update metadatavalue set text_value = 'https://dx.doi.org/10.15446/agro
    • Testing the fix-metadata-values.py script on macOS and it seems like we don’t need to use .encode('utf-8') anymore when printing strings to the screen
• It seems this might have only been a temporary problem, as both Python 3.5.2 and 3.6.0 are able to print the problematic string “Entwicklung & Ländlicher Raum” without the encode() call, but print it as a bytes object when encode() is used:
    -
    $ python
    +
    $ python
     Python 3.6.0 (default, Dec 25 2016, 17:30:53)
     >>> print('Entwicklung & Ländlicher Raum')
     Entwicklung & Ländlicher Raum
    @@ -294,7 +294,7 @@ b'Entwicklung & L\xc3\xa4ndlicher Raum'
     
  • Testing regenerating PDF thumbnails, like I started in 2016-11
  • It seems there is a bug in filter-media that causes it to process formats that aren’t part of its configuration:
  • -
    $ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p "ImageMagick PDF Thumbnail"
    +
    $ [dspace]/bin/dspace filter-media -f -i 10568/16856 -p "ImageMagick PDF Thumbnail"
     File: earlywinproposal_esa_postharvest.pdf.jpg
     FILTERED: bitstream 13787 (item: 10568/16881) and created 'earlywinproposal_esa_postharvest.pdf.jpg'
     File: postHarvest.jpg.jpg
    @@ -302,7 +302,7 @@ FILTERED: bitstream 16524 (item: 10568/24655) and created 'postHarvest.jpg.jpg'
     
    • According to dspace.cfg the ImageMagick PDF Thumbnail plugin should only process PDFs:
    -
    filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
    +
    filter.org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter.inputFormats = BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000
     filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = Adobe PDF
     
    • I’ve sent a message to the mailing list and might file a Jira issue
    • @@ -317,7 +317,7 @@ filter.org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter.inputFormats = A
      • Find all fields with “http://hdl.handle.net” values (most are in dc.identifier.uri, but some are in other URL-related fields like cg.link.reference, cg.identifier.dataurl, and cg.identifier.url):
      -
      dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
      +
      dspace=# select distinct metadata_field_id from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net%';
       dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://hdl.handle.net', 'https://hdl.handle.net') where resource_type_id=2 and metadata_field_id IN (25, 113, 179, 219, 220, 223) and text_value like 'http://hdl.handle.net%';
       UPDATE 58633
       
        @@ -328,7 +328,7 @@ UPDATE 58633
        • LDAP users cannot log in today, looks to be an issue with CGIAR’s LDAP server:
        -
        $ openssl s_client -connect svcgroot2.cgiarad.org:3269
        +
        $ openssl s_client -connect svcgroot2.cgiarad.org:3269
         CONNECTED(00000003)
         depth=0 CN = SVCGROOT2.CGIARAD.ORG
         verify error:num=20:unable to get local issuer certificate
        @@ -345,7 +345,7 @@ Certificate chain
         
      • For some reason it is now signed by a private certificate authority
      • This error seems to have started on 2017-02-25:
      -
      $ grep -c "unable to find valid certification path" [dspace]/log/dspace.log.2017-02-*
      +
      $ grep -c "unable to find valid certification path" [dspace]/log/dspace.log.2017-02-*
       [dspace]/log/dspace.log.2017-02-01:0
       [dspace]/log/dspace.log.2017-02-02:0
       [dspace]/log/dspace.log.2017-02-03:0
      @@ -381,7 +381,7 @@ Certificate chain
       
    • The problem likely lies in the logic of ImageMagickThumbnailFilter.java, as ImageMagickPdfThumbnailFilter.java extends it
    • Run CIAT corrections on CGSpace
    -
    dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
    +
    dspace=# update metadatavalue set authority='3026b1de-9302-4f3e-85ab-ef48da024eb2', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value = 'International Center for Tropical Agriculture';
     
    • CGNET has fixed the certificate chain on their LDAP server
• Redeploy CGSpace and DSpace Test on the latest 5_x-prod branch with fixes for the LDAP bind user
    • @@ -393,16 +393,16 @@ Certificate chain
    • Ah, this is probably because some items have the International Center for Tropical Agriculture author twice, which I first noticed in 2016-12 but couldn’t figure out how to fix
    • I think I can do it by first exporting all metadatavalues that have the author International Center for Tropical Agriculture
    -
    dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
    +
    dspace=# \copy (select resource_id, metadata_value_id from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='International Center for Tropical Agriculture') to /tmp/ciat.csv with csv;
     COPY 1968
     
    • And then use awk to print the duplicate lines to a separate file:
    -
    $ awk -F',' 'seen[$1]++' /tmp/ciat.csv > /tmp/ciat-dupes.csv
    +
    $ awk -F',' 'seen[$1]++' /tmp/ciat.csv > /tmp/ciat-dupes.csv
     
    • From that file I can create a list of 279 deletes and put them in a batch script like:
    -
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061;
    +
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=2742061;
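• One way to generate that batch script is with awk over the dupes file (a sketch, assuming /tmp/ciat-dupes.csv still has the resource_id,metadata_value_id columns from the \copy above):
  $ awk -F',' '{print "delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and metadata_value_id=" $2 ";"}' /tmp/ciat-dupes.csv > /tmp/ciat-deletes.sql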
     
    diff --git a/docs/2017-03/index.html b/docs/2017-03/index.html index ea4f8c2be..0b50a1de3 100644 --- a/docs/2017-03/index.html +++ b/docs/2017-03/index.html @@ -54,7 +54,7 @@ Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing reg $ identify ~/Desktop/alc_contrastes_desafios.jpg /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000 "/> - + @@ -156,7 +156,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
  • Discovered that the ImageMagic filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
  • -
    $ identify ~/Desktop/alc_contrastes_desafios.jpg
    +
    $ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
     
    • This results in discolored thumbnails when compared to the original PDF, for example sRGB and CMYK:
    • @@ -171,7 +171,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
    • I created a patch for DS-3517 and made a pull request against upstream dspace-5_x: https://github.com/DSpace/DSpace/pull/1669
    • Looks like -colorspace sRGB alone isn’t enough, we need to use profiles:
    -
    $ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg
    +
    $ convert alc_contrastes_desafios.pdf\[0\] -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_cmyk.icc -thumbnail 300x300 -flatten -profile /opt/brew/Cellar/ghostscript/9.20/share/ghostscript/9.20/iccprofiles/default_rgb.icc alc_contrastes_desafios.pdf.jpg
     
    • This reads the input file, applies the CMYK profile, applies the RGB profile, then writes the file
    • Note that you should set the first profile immediately after the input file
    • @@ -180,7 +180,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
    • Somehow we need to detect the color system being used by the input file and handle each case differently (with profiles)
    • This is trivial with identify (even by the Java ImageMagick API):
    -
    $ identify -format '%r\n' alc_contrastes_desafios.pdf\[0\]
    +
    $ identify -format '%r\n' alc_contrastes_desafios.pdf\[0\]
     DirectClass CMYK
     $ identify -format '%r\n' Africa\ group\ of\ negotiators.pdf\[0\]
     DirectClass sRGB Alpha
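• A quick way to branch on that from the shell (just a sketch; input.pdf is a placeholder and the real logic would live in the Java filter):
  $ if identify -format '%r' input.pdf\[0\] | grep -q CMYK; then echo "CMYK, apply the sRGB conversion profiles"; else echo "plain thumbnail is fine"; fi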
    @@ -196,7 +196,7 @@ DirectClass sRGB Alpha
     
  • They want something like the items that are returned by the general “LAND” query in the search interface, but we cannot do that
  • We can only return specific results for metadata fields, like:
  • -
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "LAND REFORM", "language": null}' | json_pp
    +
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "LAND REFORM", "language": null}' | json_pp
     
    • But there are hundreds of combinations of fields and values (like dc.subject and all the center subjects), and we can’t use wildcards in REST!
    • Reading about enabling multiple handle prefixes in DSpace
    • @@ -204,7 +204,7 @@ DirectClass sRGB Alpha
    • And a comment from Atmire’s Bram about it on the DSpace wiki: https://wiki.lyrasis.org/display/DSDOC5x/Installing+DSpace?focusedCommentId=78163296#comment-78163296
    • Bram mentions an undocumented configuration option handle.plugin.checknameauthority, but I noticed another one in dspace.cfg:
    -
    # List any additional prefixes that need to be managed by this handle server
    +
    # List any additional prefixes that need to be managed by this handle server
     # (as for examle handle prefix coming from old dspace repository merged in
     # that repository)
     # handle.additional.prefixes = prefix1[, prefix2]
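• So if we ever need to serve handles from another prefix this is where it would go, for example (hypothetical value, just to show the syntax):
  handle.additional.prefixes = 10947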
    @@ -212,20 +212,20 @@ DirectClass sRGB Alpha
     
  • Because of this I noticed that our Handle server’s config.dct was potentially misconfigured!
  • We had some default values still present:
  • -
    "300:0.NA/YOUR_NAMING_AUTHORITY"
    +
    "300:0.NA/YOUR_NAMING_AUTHORITY"
     
    • I’ve changed them to the following and restarted the handle server:
    -
    "300:0.NA/10568"
    +
    "300:0.NA/10568"
     
    • In looking at all the configs I just noticed that we are not providing a DOI in the Google-specific metadata crosswalk
    • From dspace/config/crosswalks/google-metadata.properties:
    -
    google.citation_doi = cg.identifier.doi
    +
    google.citation_doi = cg.identifier.doi
     
    • This works, and makes DSpace output the following metadata on the item view page:
    -
    <meta content="https://dx.doi.org/10.1186/s13059-017-1153-y" name="citation_doi">
    +
    <meta content="https://dx.doi.org/10.1186/s13059-017-1153-y" name="citation_doi">
     
    • Submitted and merged pull request for this: https://github.com/ilri/DSpace/pull/305
• Submit pull request to set the author separator for XMLUI item lists to a semicolon instead of “,”: https://github.com/ilri/DSpace/pull/306
    • @@ -260,18 +260,18 @@ DirectClass sRGB Alpha
      • Export list of sponsors so Peter can clean it up:
      -
      dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
      +
      dspace=# \copy (select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship') group by text_value order by count desc) to /tmp/sponsorship.csv with csv;
       COPY 285
       

      2017-03-12

      • Test the sponsorship fixes and deletes from Peter:
      -
      $ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
      +
      $ ./fix-metadata-values.py -i Investors-Fix-51.csv -f dc.description.sponsorship -t Action -m 29 -d dspace -u dspace -p fuuuu
       $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.sponsorship -m 29 -d dspace -u dspace -p fuuu
       
      • Generate a new list of unique sponsors so we can update the controlled vocabulary:
      -
      dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
      +
      dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id IN (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'sponsorship')) to /tmp/sponsorship.csv with csv;
       
      • Pull request for controlled vocabulary if Peter approves: https://github.com/ilri/DSpace/pull/308
      • Review Sisay’s roots, tubers, and bananas (RTB) theme, which still needs some fixes to work properly: https://github.com/ilri/DSpace/pull/307
      • @@ -311,12 +311,12 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
        • CCAFS said they are ready for the flagship updates for Phase II to be run (cg.subject.ccafs), so I ran them on CGSpace:
        -
        $ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
        +
        $ ./fix-metadata-values.py -i ccafs-flagships-feb7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
         
        • We’ve been waiting since February to run these
        • Also, I generated a list of all CCAFS flagships because there are a dozen or so more than there should be:
        -
        dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;
        +
        dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=210 group by text_value order by count desc) to /tmp/ccafs.csv with csv;
         
        • I sent a list to CCAFS people so they can tell me if some should be deleted or moved, etc
        • Test, squash, and merge Sisay’s RTB theme into 5_x-prod: https://github.com/ilri/DSpace/pull/316
        • @@ -325,11 +325,11 @@ $ ./delete-metadata-values.py -i Investors-Delete-121.csv -f dc.description.spon
          • Dump a list of fields in the DC and CG schemas to compare with CG Core:
          -
          dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
          +
          dspace=# select case when metadata_schema_id=1 then 'dc' else 'cg' end as schema, element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
           
          • Ooh, a better one!
          -
          dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
          +
          dspace=# select coalesce(case when metadata_schema_id=1 then 'dc.' else 'cg.' end) || concat_ws('.', element, qualifier) as field, scope_note from metadatafieldregistry where metadata_schema_id in (1, 2);
           

          2017-03-30

• Adjust the Linode CPU usage alerts for the CGSpace server from 150% to 200%, as the nightly Solr indexing generally causes usage around 150–190%, so this should make the alerts less frequent
          • diff --git a/docs/2017-04/index.html b/docs/2017-04/index.html index a6bfda487..48e4673e9 100644 --- a/docs/2017-04/index.html +++ b/docs/2017-04/index.html @@ -40,7 +40,7 @@ Testing the CMYK patch on a collection with 650 items: $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt "/> - + @@ -136,16 +136,16 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
          • Remove redundant/duplicate text in the DSpace submission license
          • Testing the CMYK patch on a collection with 650 items:
          -
          $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
          +
          $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
           

          2017-04-03

          • Continue testing the CMYK patch on more communities:
          -
          $ [dspace]/bin/dspace filter-media -f -i 10568/1 -p "ImageMagick PDF Thumbnail" -v >> /tmp/filter-media-cmyk.txt 2>&1
          +
          $ [dspace]/bin/dspace filter-media -f -i 10568/1 -p "ImageMagick PDF Thumbnail" -v >> /tmp/filter-media-cmyk.txt 2>&1
           
          • So far there are almost 500:
          -
          $ grep -c profile /tmp/filter-media-cmyk.txt
          +
          $ grep -c profile /tmp/filter-media-cmyk.txt
           484
           
          • Looking at the CG Core document again, I’ll send some feedback to Peter and Abenet: @@ -157,39 +157,39 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
          • Also, I’m noticing some weird outliers in cg.coverage.region, need to remember to go correct these later:
          -
          dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
          +
          dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
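• Grouping and counting the values makes the outliers easier to spot, for example (same field ID as above):
  dspace=# select text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=227 group by text_value order by count desc;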
           

          2017-04-04

          • The filter-media script has been running on more large communities and now there are many more CMYK PDFs that have been fixed:
          -
          $ grep -c profile /tmp/filter-media-cmyk.txt
          +
          $ grep -c profile /tmp/filter-media-cmyk.txt
           1584
           
          • Trying to find a way to get the number of items submitted by a certain user in 2016
• It’s not possible in the DSpace search / module interfaces, but it might be derivable from dc.description.provenance, as that field contains the name and email of the submitter/approver, ie:
          -
          Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
          +
          Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
           No. of bitstreams: 1^M
           ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0 (MD5)
           
          • This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a “checksum” (ie, there was a bitstream in the submission):
          -
          dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
          +
          dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
           
          • Then this one does the same, but for fields that don’t contain checksums (ie, there was no bitstream in the submission):
          -
          dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
          +
          dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
           
          • For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.
          • It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled…
          • In that case it might just be better to see how many the user submitted (both with and without bitstreams):
          -
          dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
          +
          dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
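• To get just the number rather than the full rows, the same condition works with count(*):
  dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';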
           

          2017-04-05

          • After doing a few more large communities it seems this is the final count of CMYK PDFs:
          -
          $ grep -c profile /tmp/filter-media-cmyk.txt
          +
          $ grep -c profile /tmp/filter-media-cmyk.txt
           2505
           

          2017-04-06

            @@ -260,7 +260,7 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
          • I will have to check the OAI cron scripts on DSpace Test, and then run them on CGSpace
          • Running dspace oai import and dspace oai clean-cache have zero effect, but this seems to rebuild the cache from scratch:
          -
          $ /home/dspacetest.cgiar.org/bin/dspace oai import -c
          +
          $ /home/dspacetest.cgiar.org/bin/dspace oai import -c
           ...
           63900 items imported so far...
           64000 items imported so far...
          @@ -273,7 +273,7 @@ OAI 2.0 manager action ended. It took 829 seconds.
           
        • The import command should theoretically catch situations like this where an item’s metadata was updated, but in this case we changed the metadata schema and it doesn’t seem to catch it (could be a bug!)
        • Attempting a full rebuild of OAI on CGSpace:
        -
        $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
        +
        $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
         $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
         ...
         58700 items imported so far...
        @@ -326,14 +326,14 @@ sys     1m29.310s
         
      • One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see harvester.autoStart in dspace/config/modules/oai.cfg)
      • Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:
      -
      Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      +
      Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
         Detail: Key (bitstream_id)=(435) is still referenced from table "bundle".
       

      2017-04-18

      -
      $ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
      +
      $ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
       $ cd ckm-cgspace-rest-api/app
       $ gem install bundler
       $ bundle
      @@ -342,12 +342,12 @@ $ rails -s
       
      • I used Ansible to create a PostgreSQL user that only has SELECT privileges on the tables it needs:
      -
      $ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present
      +
      $ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present
       
      -
      $ bundle binstubs puma --path ./sbin
      +
      $ bundle binstubs puma --path ./sbin
       

      2017-04-19

      -
      value.replace(" ||","||").replace("|| ","||").replace(" || ","||")
      +
      value.replace(" ||","||").replace("|| ","||").replace(" || ","||")
       
      • Also, all the filenames have spaces and URL encoded characters in them, so I decoded them from URL encoding:
      -
      unescape(value,"url")
      +
      unescape(value,"url")
       
      • Then create the filename column using the following transform from URL:
      -
      value.split('/')[-1].replace(/#.*$/,"")
      +
      value.split('/')[-1].replace(/#.*$/,"")
       
      • The replace part is because some URLs have an anchor like #page=14 which we obviously don’t want on the filename
      • Also, we need to only use the PDF on the item corresponding with page 1, so we don’t end up with literally hundreds of duplicate PDFs
      • @@ -381,7 +381,7 @@ $ rails -s
      • Looking at the CIAT data again, a bunch of items have metadata values ending in ||, which might cause blank fields to be added at import time
      • Cleaning them up with OpenRefine:
      -
      value.replace(/\|\|$/,"")
      +
      value.replace(/\|\|$/,"")
       
      • Working with the CIAT data in OpenRefine to remove the filename column from all but the first item which requires a particular PDF, as there are many items pointing to the same PDF, which would cause hundreds of duplicates to be added if we included them in the SAF bundle
      • I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items
      • @@ -391,15 +391,15 @@ $ rails -s
      • Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace
      • Unbelievable, there are also metadata values like:
      -
      COLLETOTRICHUM LINDEMUTHIANUM||                  FUSARIUM||GERMPLASM
      +
      COLLETOTRICHUM LINDEMUTHIANUM||                  FUSARIUM||GERMPLASM
       
      • Add a description to the file names using:
      -
      value + "__description:" + cells["dc.type"].value
      +
      value + "__description:" + cells["dc.type"].value
       
      • Test import of 933 records:
      -
      $ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
      +
      $ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
       $ wc -l /tmp/ciat
       933 /tmp/ciat
       
        @@ -409,7 +409,7 @@ $ wc -l /tmp/ciat
      • More work on Ansible infrastructure stuff for Tsega’s CKM DSpace REST API
      • I’m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:
      -
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
      +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
       $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
       

      2017-04-22

        @@ -417,13 +417,13 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media
      • The solution is to remove the ID (ie set to NULL) from the primary_bitstream_id column in the bundle table
      • After doing that and running the cleanup task again I find more bitstreams that are affected and end up with a long list of IDs that need to be fixed:
      -
      dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
      +
      dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
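• Since each run of the cleanup task only reports one conflicting ID, something like this could pull the ID out of the error message to append to that list (a sketch, assuming the error text keeps the same format):
  $ [dspace]/bin/dspace cleanup -v 2>&1 | grep -oE 'bitstream_id\)=\([0-9]+' | grep -oE '[0-9]+'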
       

      2017-04-24

      • Two users mentioned some items they recently approved not showing up in the search / XMLUI
      • I looked at the logs from yesterday and it seems the Discovery indexing has been crashing:
      -
      2017-04-24 00:00:15,578 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
      +
      2017-04-24 00:00:15,578 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
       2017-04-24 00:00:15,586 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (56 of 58853): 74507
       2017-04-24 00:00:15,614 ERROR com.atmire.dspace.discovery.AtmireSolrService @ this IndexWriter is closed
       org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this IndexWriter is closed
      @@ -447,7 +447,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
       
      • Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:
      -
      # grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
      +
      # grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
       [dspace]/log/dspace.log.2017-04-01:0
       [dspace]/log/dspace.log.2017-04-02:0
       [dspace]/log/dspace.log.2017-04-03:0
      @@ -475,12 +475,12 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
       
      • I restarted Tomcat and re-ran the discovery process manually:
      -
      [dspace]/bin/dspace index-discovery
      +
      [dspace]/bin/dspace index-discovery
       
      • Now everything is ok
      • Finally finished manually running the cleanup task over and over and null’ing the conflicting IDs:
      -
      dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
      +
      dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
       
      • Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it’s likely we haven’t had a cleanup task complete successfully in years…
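• For reference, that is just the normal cleanup task in verbose mode:
  $ [dspace]/bin/dspace cleanup -v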
      @@ -489,12 +489,12 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
    • Finally finished running the PDF thumbnail re-processing on CGSpace, the final count of CMYK PDFs is about 2751
    • Preparing to run the cleanup task on CGSpace, I want to see how many files are in the assetstore:
    -
    # find [dspace]/assetstore/ -type f | wc -l
    +
    # find [dspace]/assetstore/ -type f | wc -l
     113104
     
• Troubleshooting the Atmire Solr update process that runs at 3:00 AM every morning; after finishing at 100% it has this error:
    -
    [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
    +
    [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
     [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
     [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
     [=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:13
    @@ -557,7 +557,7 @@ Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpac
     
  • The size of the CGSpace database dump went from 111MB to 96MB, not sure about actual database size though
  • Update RVM’s Ruby from 2.3.0 to 2.4.0 on DSpace Test:
  • -
    $ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
    +
    $ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
     $ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
     ... reload shell to get new Ruby
     $ gem install sass -v 3.3.14
    diff --git a/docs/2017-05/index.html b/docs/2017-05/index.html
    index c81a295d1..7bf713511 100644
    --- a/docs/2017-05/index.html
    +++ b/docs/2017-05/index.html
    @@ -18,7 +18,7 @@
     
     
     
    -
    +
     
     
         
    @@ -131,7 +131,7 @@
     
  • Discovered that CGSpace has ~700 items that are missing the cg.identifier.status field
• Perhaps I need to try using the “required metadata” curation task to find the items that are missing these fields:
  • -
    $ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - > /tmp/curation.out
    +
    $ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - > /tmp/curation.out
     
    • It seems the curation task dies when it finds an item which has missing metadata
    @@ -145,7 +145,7 @@
    • Testing one replacement for CCAFS Flagships (cg.subject.ccafs), first changed in the submission forms, and then in the database:
    -
    $ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
    +
    $ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu
     
    • Also, CCAFS wants to re-order their flagships to prioritize the Phase II ones
    • Waiting for feedback from CCAFS, then I can merge #320
    • @@ -159,7 +159,7 @@
    • This leads to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using dspace cleanup -v, or else you’ll run out of disk space
    • In the end I realized it’s better to use submission mode (-s) to ingest the community object as a single AIP without its children, followed by each of the collections:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
     $ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
     $ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
     $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done
    @@ -184,13 +184,13 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
     
  • The CGIAR Library metadata has some blank metadata values, which leads to ||| in the Discovery facets
  • Clean these up in the database using:
  • -
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
     
    • I ended up running into issues during data cleaning and decided to wipe out the entire community and re-sync DSpace Test assetstore and database from CGSpace rather than waiting for the cleanup task to clean up
    • Hours into the re-ingestion I ran into more errors, and had to erase everything and start over again!
    • Now, no matter what I do I keep getting foreign key errors…
    -
    Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
    +
    Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
       Detail: Key (handle_id)=(80928) already exists.
     
    • I think those errors actually come from me running the update-sequences.sql script while Tomcat/DSpace are running
    • @@ -202,7 +202,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
    • I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via the REST API like other fields
    • Finally finished importing all the CGIAR Library content, final method was:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit"
     $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
     $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2515/10947-2515.zip
     $ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2516/10947-2516.zip
    @@ -215,7 +215,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
     
  • The -XX:-UseGCOverheadLimit JVM option helps with some issues in large imports
  • After this I ran the update-sequences.sql script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:
  • -
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
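• A sketch of how that update-sequences.sql step can be run while Tomcat is stopped (the path and the dspace database and user names are the usual DSpace 5 defaults and may differ here):

```console
$ psql -U dspace -f [dspace]/etc/postgres/update-sequences.sql dspace
```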
     

    2017-05-13

    • After quite a bit of troubleshooting with importing cleaned up data as CSV, it seems that there are actually NUL characters in the dc.description.abstract field (at least) on the lines where CSV importing was failing
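• A sketch of how such NUL bytes could be detected and then stripped before re-importing, where the first command counts lines containing a NUL byte (-a forces grep to treat the file as text) and the second writes a copy with them removed (the file names are hypothetical):

```console
$ grep -caP '\x00' /tmp/cgiar-library-export.csv
$ tr -d '\000' < /tmp/cgiar-library-export.csv > /tmp/cgiar-library-export-clean.csv
```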
    • @@ -230,7 +230,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
    • Merge changes to CCAFS project identifiers and flagships: #320
    • Run updates for CCAFS flagships on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'
     
    • These include:

      @@ -258,19 +258,19 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
      • Looking into the error I get when trying to create a new collection on DSpace Test:
      -
      ERROR: duplicate key value violates unique constraint "handle_pkey" Detail: Key (handle_id)=(84834) already exists.
      +
      ERROR: duplicate key value violates unique constraint "handle_pkey" Detail: Key (handle_id)=(84834) already exists.
       
      • I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn’t helped
      • It appears item with handle_id 84834 is one of the imported CGIAR Library items:
      -
      dspace=# select * from handle where handle_id=84834;
      +
      dspace=# select * from handle where handle_id=84834;
        handle_id |   handle   | resource_type_id | resource_id
       -----------+------------+------------------+-------------
            84834 | 10947/1332 |                2 |       87113
       
      • Looks like the max handle_id is actually much higher:
      -
      dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
      +
      dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
        handle_id |  handle  | resource_type_id | resource_id
       -----------+----------+------------------+-------------
            86873 | 10947/99 |                2 |       89153
      @@ -279,7 +279,7 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
       
    • I’ve posted on the dspace-tech mailing list to see if I can just manually set the handle_seq to that value
    • Actually, it seems I can manually set the handle sequence using:
    -
    dspace=# select setval('handle_seq',86873);
    +
    dspace=# select setval('handle_seq',86873);
     
    • After that I can create collections just fine, though I’m not sure if it has other side effects
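• An equivalent one-liner that sets the sequence from the current maximum, in case this ever needs to be repeated (the database and user names are illustrative):

```console
$ psql -U dspace dspace -c "SELECT setval('handle_seq', (SELECT MAX(handle_id) FROM handle));"
```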
    @@ -294,11 +294,11 @@ $ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager
  • Do some cleanups of community and collection names in CGIAR System Management Office community on DSpace Test, as well as move some items as Peter requested
  • Peter wanted a list of authors in here, so I generated a list of collections using the “View Source” on each community and this hacky awk:
  • -
    $ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3"/"$4}' | awk -F\" '{print $1}' | vim -
    +
    $ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3"/"$4}' | awk -F\" '{print $1}' | vim -
     
    • Then I joined them together and ran this old SQL query from the dspace-tech mailing list which gives you authors for items in those collections:
    -
    dspace=# select distinct text_value
    +
    dspace=# select distinct text_value
     from metadatavalue
     where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
     AND resource_type_id = 2
    @@ -314,7 +314,7 @@ AND resource_id IN (select item_id from collection2item where collection_id IN (
     
    • To get a CSV (with counts) from that:
    -
    dspace=# \copy (select distinct text_value, count(*)
    +
    dspace=# \copy (select distinct text_value, count(*)
     from metadatavalue
     where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
     AND resource_type_id = 2
    @@ -326,7 +326,7 @@ AND resource_id IN (select item_id from collection2item where collection_id IN (
     
  • For now I’ve suggested that they just change the collection names and that we fix their metadata manually afterwards
  • Also, they have a lot of messed up values in their cg.subject.wle field so I will clean up some of those first:
  • -
    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
    +
    dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
     COPY 111
     
    • Respond to Atmire message about ORCIDs, saying that right now we’d prefer to just have them available via REST API like any other metadata field, and that I’m available for a Skype
    • @@ -343,21 +343,21 @@ COPY 111
    • Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the June, 2017 DCAT meeting
    • Find all of Amos Omore’s author name variations so I can link them to his authority entry that has an ORCID:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';
     
    • Set the authority for all variations to one containing an ORCID:
    -
    dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
    +
    dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
     UPDATE 187
     
    • Next I need to do Edgar Twine:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';
     
    • But it doesn’t look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via “Edit this Item” and looked up his ORCID and linked it there
    • Now I should be able to set his name variations to the new authority:
    -
    dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
    +
    dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
     
    • Run the corrections on CGSpace and then update discovery / authority
    • I notice that there are a handful of java.lang.OutOfMemoryError: Java heap space errors in the Catalina logs on CGSpace, I should go look into that…
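• A quick way to gauge how frequent those errors are would be something like this (the Catalina log path is a guess at the usual Tomcat location and may differ on this server):

```console
# grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
```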
diff --git a/docs/2017-06/index.html b/docs/2017-06/index.html
index cf74fa938..ad3b748e3 100644
--- a/docs/2017-06/index.html
+++ b/docs/2017-06/index.html
@@ -18,7 +18,7 @@
-
+
@@ -153,7 +153,7 @@
    • 17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF
    • I’ve flagged them and proceeded without them (752 total) on DSpace Test:
    -
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
    +
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
     
    • I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)
    • Total items in CIAT Book Chapters is 914, with the others being flagged for some reason, and we should send that back to CIAT
    • @@ -167,7 +167,7 @@
    • Created a new branch with just the relevant changes, so I can send it to them
    • One thing I noticed is that there is a failed database migration related to CUA:
    -
    +----------------+----------------------------+---------------------+---------+
    +
    +----------------+----------------------------+---------------------+---------+
     | Version        | Description                | Installed on        | State   |
     +----------------+----------------------------+---------------------+---------+
     | 1.1            | Initial DSpace 1.1 databas |                     | PreInit |
    @@ -213,7 +213,7 @@
     
     
  • Finally import 914 CIAT Book Chapters to CGSpace in two batches:
  • -
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
    +
    $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log
     $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/35701 --source /home/aorth/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books2.map &> /tmp/ciat-books2.log
     

    2017-06-25

      @@ -221,7 +221,7 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace impo
    • Pull request with the changes to input-forms.xml: #329
    • As of now it doesn’t look like there are any items using this research theme so we don’t need to do any updates:
    -
    dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
    +
    dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=237 and text_value like 'Regenerating Degraded Landscapes%';
      text_value
     ------------
     (0 rows)
    @@ -233,7 +233,7 @@ $ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace impo
     
    • CGSpace went down briefly, I see lots of these errors in the dspace logs:
    -
    Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
    +
    Java stacktrace: java.util.NoSuchElementException: Timeout waiting for idle object
     
    • After looking at the Tomcat logs, Munin graphs, and PostgreSQL connection stats, it seems there is just a high load
    • Might be a good time to adjust DSpace’s database connection settings, like I first mentioned in April, 2017 after reading the 2017-04 DCAT comments
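• A rough way to see how often the connection pool is being exhausted on a given day is to count the timeout errors in the DSpace logs (the log file names here are illustrative):

```console
# grep -c 'Timeout waiting for idle object' dspace.log.2017-06-*
```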
diff --git a/docs/2017-07/index.html b/docs/2017-07/index.html
index bad70b205..6258ba4a6 100644
--- a/docs/2017-07/index.html
+++ b/docs/2017-07/index.html
@@ -36,7 +36,7 @@ Merge changes for WLE Phase II theme rename (#329)
Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
"/>
-
+
@@ -132,7 +132,7 @@ We can use PostgreSQL’s extended output format (-x) plus sed to format the
    • Looking at extracting the metadata registries from ICARDA’s MEL DSpace database so we can compare fields with CGSpace
    • We can use PostgreSQL’s extended output format (-x) plus sed to format the output into quasi XML:
    -
    $ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::'
    +
    $ psql dspacenew -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=5 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::'
     
    • The sed script is from a post on the PostgreSQL mailing list
    • Abenet says the ILRI board wants to be able to have “lead author” for every item, so I’ve whipped up a WIP test in the 5_x-lead-author branch
    • @@ -151,11 +151,11 @@ We can use PostgreSQL’s extended output format (-x) plus sed to format the
    • Adjust WLE Research Theme to include both Phase I and II on the submission form according to editor feedback (#330)
    • Generate list of fields in the current CGSpace cg scheme so we can record them properly in the metadata registry:
    -
    $ psql dspace -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::' > cg-types.xml
    +
    $ psql dspace -x -c 'select element, qualifier, scope_note from metadatafieldregistry where metadata_schema_id=2 order by element, qualifier;' | sed -r 's:^-\[ RECORD (.*) \]-+$:</dc-type>\n<dc-type>\n<schema>cg</schema>:;s:([^ ]*) +\| (.*):  <\1>\2</\1>:;s:^$:</dc-type>:;1s:</dc-type>\n::' > cg-types.xml
     
    • CGSpace was unavailable briefly, and I saw this error in the DSpace log file:
    -
    2017-07-05 13:05:36,452 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
    +
    2017-07-05 13:05:36,452 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
     
    • Looking at the pg_stat_activity table I saw there were indeed 98 active connections to PostgreSQL, and at this time the limit is 100, so that makes sense
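• For the record, the raw connection count can be had with a one-liner like this, run as the postgres user (just a sketch):

```console
$ psql -c 'SELECT count(*) FROM pg_stat_activity;'
```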
    • @@ -163,7 +163,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
    • Abenet said she was generating a report with Atmire’s CUA module, so it could be due to that?
    • Looking in the logs I see this random error again that I should report to DSpace:
    -
    2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU
    +
    2017-07-05 13:50:07,196 ERROR org.dspace.statistics.SolrLogger @ COUNTRY ERROR: EU
     
    • Seems to come from dspace-api/src/main/java/org/dspace/statistics/SolrLogger.java
    @@ -211,7 +211,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
    • Move two top-level communities to be sub-communities of ILRI Projects
    -
    $ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child="$community"; done
    +
    $ for community in 10568/2347 10568/25209; do /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child="$community"; done
     
    • Discuss CGIAR Library data cleanup with Sisay and Abenet
    @@ -241,7 +241,7 @@ org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserve
    • Looks like the final list of metadata corrections for CCAFS project tags will be:
    -
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
    +
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
     update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
     update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
    @@ -250,7 +250,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and
     
  • Temporarily increase the nginx upload limit to 200MB for Sisay to upload the CIAT presentations
  • Looking at the CGSpace activity page, there are 52 Baidu bots concurrently crawling our website (I copied the activity page to a text file and grepped it)!
  • -
    $ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
    +
    $ grep 180.76. /tmp/status | awk '{print $5}' | sort | uniq | wc -l
     52
     
    • From looking at the dspace.log I see they are all using the same session, which means our Crawler Session Manager Valve is working
diff --git a/docs/2017-08/index.html b/docs/2017-08/index.html
index 511fb62da..a18bdda26 100644
--- a/docs/2017-08/index.html
+++ b/docs/2017-08/index.html
@@ -60,7 +60,7 @@ This was due to newline characters in the dc.description.abstract column, which
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
"/>
-
+
@@ -215,7 +215,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
    • I need to get an author list from the database for only the CGIAR Library community to send to Peter
    • It turns out that I had already used this SQL query in May, 2017 to get the authors from CGIAR Library:
    -
    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
    +
    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9'))) group by text_value order by count desc) to /tmp/cgiar-library-authors.csv with csv;
     
    • Meeting with Peter and CGSpace team
        @@ -242,7 +242,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
      • I sent a message to the mailing list about the duplicate content issue with /rest and /bitstream URLs
      • Looking at the logs for the REST API on /rest, it looks like someone is hammering it, doing testing or something…
      -
      # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
      +
      # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 5
           140 66.249.66.91
           404 66.249.66.90
          1479 50.116.102.77
      @@ -252,7 +252,7 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
       
    • The top offender is 70.32.83.92 which is actually the same IP as ccafs.cgiar.org, so I will email the Macaroni Bros to see if they can test on DSpace Test instead
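• One way to double-check that the hostname and the IP really line up is a forward and a reverse DNS lookup (output omitted, as it depends on the current DNS records):

```console
$ dig +short ccafs.cgiar.org
$ dig +short -x 70.32.83.92
```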
    • I’ve enabled logging of /oai requests on nginx as well so we can potentially determine bad actors here (also to see if anyone is actually using OAI!)
    -
        # log oai requests
    +
        # log oai requests
         location /oai {
             access_log /var/log/nginx/oai.log;
             proxy_pass http://tomcat_http;
    @@ -266,11 +266,11 @@ Then I cleaned up the author authorities and HTML characters in OpenRefine and s
     
    • Run author corrections on CGIAR Library community from Peter
    -
    $ ./fix-metadata-values.py -i /tmp/authors-fix-523.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p fuuuu
    +
    $ ./fix-metadata-values.py -i /tmp/authors-fix-523.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p fuuuu
     
    • There were only three deletions so I just did them manually:
    -
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='C';
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='C';
     DELETE 1
     dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='WSSD';
     
      @@ -279,7 +279,7 @@ dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_i
    • In that thread Chris Wilper suggests a new default of 35 max connections for db.maxconnections (from the current default of 30), knowing that each DSpace web application gets to use up to this many on its own
    • It would be good to approximate what the theoretical maximum number of connections on a busy server would be, perhaps by looking to see which apps use SQL:
    -
    $ grep -rsI SQLException dspace-jspui | wc -l          
    +
    $ grep -rsI SQLException dspace-jspui | wc -l          
     473
     $ grep -rsI SQLException dspace-oai | wc -l  
     63
    @@ -320,37 +320,37 @@ $ grep -rsI SQLException dspace-xmlui | wc -l
     
    • I wanted to merge the various field variations like cg.subject.system and cg.subject.system[en_US] in OpenRefine but I realized it would be easier in PostgreSQL:
    -
    dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=254;
    +
    dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=254;
     
    • And actually, we can do it for other generic fields for items in those collections, for example dc.description.abstract:
    -
    dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
    +
    dspace=# update metadatavalue set text_lang='en_US' where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'description' and qualifier = 'abstract') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')))
     
    • And on others like dc.language.iso, dc.relation.ispartofseries, dc.type, dc.title, etc…
    • Also, to move fields from dc.identifier.url to cg.identifier.url[en_US] (because we don’t use the Dublin Core one for some reason):
    -
    dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
    +
    dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
     UPDATE 15
     
    • Set the text_lang of all dc.identifier.uri (Handle) fields to be NULL, just like default DSpace does:
    -
    dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
    +
    dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
     UPDATE 4248
     
    • Also update the text_lang of dc.contributor.author fields for metadata in these collections:
    -
    dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')));
    +
    dspace=# update metadatavalue set text_lang=NULL where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/93761', '10947/1', '10947/10', '10947/11', '10947/12', '10947/13', '10947/14', '10947/15', '10947/16', '10947/17', '10947/18', '10947/19', '10947/2', '10947/20', '10947/21', '10947/22', '10947/23', '10947/24', '10947/25', '10947/2512', '10947/2515', '10947/2516', '10947/2517', '10947/2518', '10947/2519', '10947/2520', '10947/2521', '10947/2522', '10947/2523', '10947/2524', '10947/2525', '10947/2526', '10947/2527', '10947/2528', '10947/2529', '10947/2530', '10947/2531', '10947/2532', '10947/2533', '10947/2534', '10947/2535', '10947/2536', '10947/2537', '10947/2538', '10947/2539', '10947/2540', '10947/2541', '10947/2589', '10947/26', '10947/2631', '10947/27', '10947/2708', '10947/2776', '10947/2782', '10947/2784', '10947/2786', '10947/2790', '10947/28', '10947/2805', '10947/2836', '10947/2871', '10947/2878', '10947/29', '10947/2900', '10947/2919', '10947/3', '10947/30', '10947/31', '10947/32', '10947/33', '10947/34', '10947/3457', '10947/35', '10947/36', '10947/37', '10947/38', '10947/39', '10947/4', '10947/40', '10947/4052', '10947/4054', '10947/4056', '10947/4068', '10947/41', '10947/42', '10947/43', '10947/4368', '10947/44', '10947/4467', '10947/45', '10947/4508', '10947/4509', '10947/4510', '10947/4573', '10947/46', '10947/4635', '10947/4636', '10947/4637', '10947/4638', '10947/4639', '10947/4651', '10947/4657', '10947/47', '10947/48', '10947/49', '10947/5', '10947/50', '10947/51', '10947/5308', '10947/5322', '10947/5324', '10947/5326', '10947/6', '10947/7', '10947/8', '10947/9')));
     UPDATE 4899
     
    • Wow, I just wrote this baller regex facet to find duplicate authors:
    -
    isNotNull(value.match(/(CGIAR .+?)\|\|\1/))
    +
    isNotNull(value.match(/(CGIAR .+?)\|\|\1/))
     
    • This would be true if the authors were like CGIAR System Management Office||CGIAR System Management Office, which some of the CGIAR Library’s were
    • Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn’t detect any changes, so you have to edit them all manually via DSpace’s “Edit Item”
    • Ooh! And an even more interesting regex would match any duplicated author:
    -
    isNotNull(value.match(/(.+?)\|\|\1/))
    +
    isNotNull(value.match(/(.+?)\|\|\1/))
     
    • Which means it can also be used to find items with duplicate dc.subject fields…
    • Finally sent Peter the final dump of the CGIAR System Organization community so he can have a last look at it
    • @@ -365,12 +365,12 @@ UPDATE 4899
    • Uptime Robot said CGSpace went down for 1 minute, not sure why
    • Looking in dspace.log.2017-08-17 I see some weird errors that might be related?
    -
    2017-08-17 07:55:31,396 ERROR net.sf.ehcache.store.DiskStore @ cocoon-ehcacheCache: Could not read disk store element for key PK_G-aspect-cocoon://DRI/12/handle/10568/65885?pipelinehash=823411183535858997_T-Navigation-3368194896954203241. Error was invalid stream header: 00000000
    +
    2017-08-17 07:55:31,396 ERROR net.sf.ehcache.store.DiskStore @ cocoon-ehcacheCache: Could not read disk store element for key PK_G-aspect-cocoon://DRI/12/handle/10568/65885?pipelinehash=823411183535858997_T-Navigation-3368194896954203241. Error was invalid stream header: 00000000
     java.io.StreamCorruptedException: invalid stream header: 00000000
     
    • Weird that these errors seem to have started on August 11th, the same day we had capacity issues with PostgreSQL:
    -
    # grep -c "ERROR net.sf.ehcache.store.DiskStore" dspace.log.2017-08-*
    +
    # grep -c "ERROR net.sf.ehcache.store.DiskStore" dspace.log.2017-08-*
     dspace.log.2017-08-01:0
     dspace.log.2017-08-02:0
     dspace.log.2017-08-03:0
    @@ -412,7 +412,7 @@ dspace.log.2017-08-17:584
     
  • More information about authority framework: https://wiki.lyrasis.org/display/DSPACE/Authority+Control+of+Metadata+Values
  • Wow, I’m playing with the AGROVOC SPARQL endpoint using the sparql-query tool:
  • -
    $ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
    +
    $ ./sparql-query http://202.45.139.84:10035/catalogs/fao/repositories/agrovoc
     sparql$ PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
     SELECT 
         ?label 
    @@ -452,7 +452,7 @@ WHERE {
     
  • Since I cleared the XMLUI cache on 2017-08-17 there haven’t been any more ERROR net.sf.ehcache.store.DiskStore errors
  • Look at the CGIAR Library to see if I can find the items that have been submitted since May:
  • -
    dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z';
    +
    dspace=# select * from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z';
      metadata_value_id | item_id | metadata_field_id |      text_value      | text_lang | place | authority | confidence 
     -------------------+---------+-------------------+----------------------+-----------+-------+-----------+------------
                 123117 |    5872 |                11 | 2017-06-28T13:05:18Z |           |     1 |           |         -1
    @@ -465,7 +465,7 @@ WHERE {
     
  • According to dc.date.accessioned (metadata field id 11) there have only been five items submitted since May
  • These are their handles:
  • -
    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
    +
    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
        handle   
     ------------
      10947/4658
    @@ -490,7 +490,7 @@ WHERE {
     
  • I asked Sisay about this and hinted that he should go back and fix these things, but let’s see what he says
  • Saw CGSpace go down briefly today and noticed SQL connection pool errors in the dspace log file:
  • -
    ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
    +
    ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error
     org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
     
    • Looking at the logs I see we have been having hundreds or thousands of these errors a few times per week in 2017-07 and almost every day in 2017-08
diff --git a/docs/2017-09/index.html b/docs/2017-09/index.html
index 6b9d5906e..126f51ce5 100644
--- a/docs/2017-09/index.html
+++ b/docs/2017-09/index.html
@@ -32,7 +32,7 @@ Linode sent an alert that CGSpace (linode18) was using 261% CPU for the past two
Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account is both in the approvers step as well as the group
"/>
-
+
@@ -130,7 +130,7 @@ Ask Sisay to clean up the WLE approvers a bit, as Marianne’s user account
      • Delete 58 blank metadata values from the CGSpace database:
      -
      dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
      +
      dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
       DELETE 58
       
      • I also ran it on DSpace Test because we’ll be migrating the CGIAR Library soon and it would be good to catch these before we migrate
      • @@ -145,7 +145,7 @@ DELETE 58
      • There will need to be some metadata updates for that as well (though if I recall correctly it is only about seven records); I had made some notes about it in 2017-07, but I’ve asked for more clarification from Lili just in case
      • Looking at the DSpace logs to see if we’ve had a change in the “Cannot get a connection” errors since last month when we adjusted the db.maxconnections parameter on CGSpace:
      -
      # grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
      +
      # grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-09-*
       dspace.log.2017-09-01:0
       dspace.log.2017-09-02:0
       dspace.log.2017-09-03:9
      @@ -174,14 +174,14 @@ dspace.log.2017-09-10:0
       
    • The import process takes the same amount of time with and without the caching
    • Also, I captured TCP packets destined for port 80 and both imports only captured ONE packet (an update check from some component in Java):
    -
    $ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
    +
    $ sudo tcpdump -i en0 -w without-cached-xsd.dump dst port 80 and 'tcp[32:4] = 0x47455420'
     
    • Great TCP dump guide here: https://danielmiessler.com/study/tcpdump
    • The last part of that command filters for HTTP GET requests, of which there should have been many to fetch all the XSD files for validation
    • I sent a message to the mailing list to see if anyone knows more about this
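• The filter works because those four bytes are simply the ASCII encoding of GET followed by a space, which is easy to verify:

```console
$ printf 'GET ' | xxd -p
47455420
```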
    • In looking at the tcpdump results I notice that there is an update check to the ehcache server on every iteration of the ingest loop, for example:
    -
    09:39:36.008956 IP 192.168.8.124.50515 > 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&pageID=update.properties&id=2130706433&os-name=Mac+OS+X&jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&jvm-version=1.8.0_144&platform=x86_64&tc-version=UNKNOWN&tc-product=Ehcache+Core+1.7.2&source=Ehcache+Core&uptime-secs=0&patch=UNKNOWN HTTP/1.1
    +
    09:39:36.008956 IP 192.168.8.124.50515 > 157.189.192.67.http: Flags [P.], seq 1736833672:1736834103, ack 147469926, win 4120, options [nop,nop,TS val 1175113331 ecr 550028064], length 431: HTTP: GET /kit/reflector?kitID=ehcache.default&pageID=update.properties&id=2130706433&os-name=Mac+OS+X&jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&jvm-version=1.8.0_144&platform=x86_64&tc-version=UNKNOWN&tc-product=Ehcache+Core+1.7.2&source=Ehcache+Core&uptime-secs=0&patch=UNKNOWN HTTP/1.1
     
    • Turns out this is a known issue and Ehcache has refused to make it opt-in: https://jira.terracotta.org/jira/browse/EHC-461
    • But we can disable it by adding an updateCheck="false" attribute to the main <ehcache > tag in dspace-services/src/main/resources/caching/ehcache-config.xml
    • @@ -204,7 +204,7 @@ dspace.log.2017-09-10:0
    • I wonder what was going on, and looking into the nginx logs I think maybe it’s OAI…
    • Here is yesterday’s top ten IP addresses making requests to /oai:
    -
    # awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
    +
    # awk '{print $1}' /var/log/nginx/oai.log | sort -n | uniq -c | sort -h | tail -n 10
           1 213.136.89.78
           1 66.249.66.90
           1 66.249.66.92
    @@ -217,7 +217,7 @@ dspace.log.2017-09-10:0
     
    • Compared to the previous day’s logs it looks VERY high:
    -
    # awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
    +
    # awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
           1 207.46.13.39
           1 66.249.66.93
           2 66.249.66.91
    @@ -234,7 +234,7 @@ dspace.log.2017-09-10:0
     
     
  • And this user agent has never been seen before today (or at least recently!):
  • -
    # grep -c "API scraper" /var/log/nginx/oai.log
    +
    # grep -c "API scraper" /var/log/nginx/oai.log
     62088
     # zgrep -c "API scraper" /var/log/nginx/oai.log.*.gz
     /var/log/nginx/oai.log.10.gz:0
    @@ -270,19 +270,19 @@ dspace.log.2017-09-10:0
     
  • Some of these heavy users are also using XMLUI, and their user agent isn’t matched by the Tomcat Session Crawler valve, so each request uses a different session
  • Yesterday alone the IP addresses using the API scraper user agent were responsible for 16,000 sessions in XMLUI:
  • -
    # grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    # grep -a -E "(54.70.51.7|35.161.215.53|34.211.17.113|54.70.175.86)" /home/cgspace.cgiar.org/log/dspace.log.2017-09-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     15924
     
    • If this continues I will definitely need to figure out who is responsible for this scraper and add their user agent to the session crawler valve regex
    • A search for “API scraper” user agent on Google returns a robots.txt with a comment that this is the Yewno bot: http://www.escholarship.org/robots.txt
    • Also, in looking at the DSpace logs I noticed a warning from OAI that I should look into:
    -
    WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
    +
    WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
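• The property it is complaining about would normally be set in the OAI module configuration, so a quick check is whether it is defined there at all (the path is the standard DSpace 5 location):

```console
$ grep dspace.oai.url [dspace]/config/modules/oai.cfg
```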
     
    • Looking at the spreadsheet with deletions and corrections that CCAFS sent last week
    • It appears they want to delete a lot of metadata, which I’m not sure they realize the implications of:
    -
    dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;                                                                                                                                                                                                                  
    +
    dspace=# select text_value, count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange') group by text_value;                                                                                                                                                                                                                  
             text_value        | count                              
     --------------------------+-------                             
      FP4_ClimateModels        |     6                              
    @@ -309,14 +309,14 @@ dspace.log.2017-09-10:0
     
  • I sent CCAFS people an email to ask if they really want to remove these 200+ tags
  • She responded yes, so I’ll at least need to do these deletes in PostgreSQL:
  • -
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134, 235) and text_value in ('EA_PAR','FP1_CSAEvidence','FP2_CRMWestAfrica','FP3_Gender','FP4_Baseline','FP4_CCPAG','FP4_CCPG','FP4_CIATLAM IMPACT','FP4_ClimateData','FP4_ClimateModels','FP4_GenderPolicy','FP4_GenderToolbox','FP4_Livestock','FP4_PolicyEngagement','FP_GII','SA_Biodiversity','SA_CSV','SA_GHGMeasurement','SEA_mitigationSAMPLES','SEA_UpscalingInnovation','WA_Partnership','WA_SciencePolicyExchange','FP_GII');
     DELETE 207
     
    • When we discussed this in late July there were some other renames they had requested, but I don’t see them in the current spreadsheet so I will have to follow that up
    • I talked to Macaroni Bros and they said to just go ahead with the other corrections as well, since their spreadsheet evolved organically rather than systematically!
    • The final list of corrections and deletes should therefore be:
    -
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
    +
    delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-FP4_CRMWestAfrica';
     update metadatavalue set text_value='FP3_VietnamLED' where resource_type_id=2 and metadata_field_id=134 and text_value='FP3_VeitnamLED';
     update metadatavalue set text_value='PII-FP1_PIRCCA' where resource_type_id=2 and metadata_field_id=235 and text_value='PII-SEA_PIRCCA';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=235 and text_value='PII-WA_IntegratedInterventions';
    @@ -332,7 +332,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
  • Testing to see how we end up with all these new authorities after we keep cleaning and merging them in the database
  • Here are all my distinct authority combinations in the database before:
  • -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence 
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -347,7 +347,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
    • And then after adding a new item and selecting an existing “Orth, Alan” with an ORCID in the author lookup:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence 
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -363,7 +363,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
    • It created a new authority… let’s try to add another item and select the same existing author and see what happens in the database:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence 
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -379,7 +379,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
    • No new one… so now let me try to add another item and select the italicized result from the ORCID lookup and see what happens in the database:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';
      text_value |              authority               | confidence 
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -396,7 +396,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
    • Shit, it created another authority! Let’s try it again!
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';                                                                                             
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Orth, %';                                                                                             
      text_value |              authority               | confidence
     ------------+--------------------------------------+------------
      Orth, Alan | 7c2bffb8-58c9-4bc8-b102-ebe8aec200ad |         -1
    @@ -427,7 +427,7 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id in (134
     
    • Apply CCAFS project tag corrections on CGSpace:
    -
    dspace=# \i /tmp/ccafs-projects.sql 
    +
    dspace=# \i /tmp/ccafs-projects.sql 
     DELETE 5
     UPDATE 4
     UPDATE 1
    @@ -439,7 +439,7 @@ DELETE 207
     
  • We still need to do the changes to config.dct and regenerate the sitebndl.zip to send to the Handle.net admins
  • According to this dspace-tech mailing list entry from 2011, we need to add the extra handle prefixes to config.dct like this:
  • -
    "server_admins" = (
    +
    "server_admins" = (
     "300:0.NA/10568"
     "300:0.NA/10947"
     )
    @@ -458,7 +458,7 @@ DELETE 207
     
  • The problem was that we remapped the items to new collections after the initial import, so the items were using the 10947 prefix but the community and collection were using 10568
  • I ended up having to read the AIP Backup and Restore closely a few times and then explicitly preserve handles and ignore parents:
  • -
    $ for item in 10568-93759/ITEM@10947-46*; do ~/dspace/bin/dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/87738 $item; done
    +
    $ for item in 10568-93759/ITEM@10947-46*; do ~/dspace/bin/dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/87738 $item; done
     
    • Also, this was in replace mode (-r) rather than submit mode (-s), because submit mode always generated a new handle even if I told it not to!
    • I decided to start the import process in the evening rather than waiting for the morning, and right as the first community was finished importing I started seeing Timeout waiting for idle object errors
    • @@ -478,7 +478,7 @@ DELETE 207
      • Nightly Solr indexing is working again, and it appears to be pretty quick actually:
      -
      2017-09-19 00:00:14,953 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (0 of 65808): 17607
      +
      2017-09-19 00:00:14,953 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (0 of 65808): 17607
       ...
       2017-09-19 00:04:18,017 INFO  com.atmire.dspace.discovery.AtmireSolrService @ Processing (65807 of 65808): 83753
       
        @@ -494,7 +494,7 @@ DELETE 207
      • Abenet and I noticed that hdl.handle.net is blocked by ETC at ILRI Addis so I asked Biruk Debebe to route it over the satellite
      • Force thumbnail regeneration for the CGIAR System Organization’s Historic Archive community (2000 items):
      -
      $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p "ImageMagick PDF Thumbnail"
      +
      $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -f -i 10947/1 -p "ImageMagick PDF Thumbnail"
       
      • I’m still waiting (over 1 day later) to hear back from the CGIAR System Organization about updating the DNS for library.cgiar.org
      @@ -540,7 +540,7 @@ DELETE 207
    • Turns out he had already mapped some, but requested that I finish the rest
    • With this GREL in OpenRefine I can find items that are mapped, ie they have 10568/3|| or 10568/3$ in their collection field:
    -
    isNotNull(value.match(/.+?10568\/3(\|\|.+|$)/))
    +
    isNotNull(value.match(/.+?10568\/3(\|\|.+|$)/))
     
    • Peter also made a lot of changes to the data in the Archives collections while I was attempting to import the changes, so we were essentially competing for PostgreSQL and Solr connections
    • I ended up having to kill the import and wait until he was done
    • @@ -552,7 +552,7 @@ DELETE 207
    • Communicate (finally) with Tania and Tunji from the CGIAR System Organization office to tell them to request CGNET make the DNS updates for library.cgiar.org
    • Peter wants me to clean up the text values for Delia Grace’s metadata, as the authorities are all messed up again since we cleaned them up in 2016-12:
    -
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';                                  
    +
    dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';                                  
       text_value  |              authority               | confidence              
     --------------+--------------------------------------+------------             
      Grace, Delia |                                      |        600              
    @@ -563,12 +563,12 @@ DELETE 207
     
  • Strangely, none of her authority entries have ORCIDs anymore…
  • I’ll just fix the text values and forget about it for now:
  • -
    dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
    +
    dspace=# update metadatavalue set text_value='Grace, Delia', authority='bfa61d7c-7583-4175-991c-2e7315000f0c', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Grace, D%';
     UPDATE 610
     
    • After this we have to reindex the Discovery and Authority cores (as tomcat7 user):
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     
     real    83m56.895s
    @@ -603,7 +603,7 @@ sys     0m12.113s
     
  • The index-authority script always seems to fail; I think it’s the same old bug
  • Something interesting for my notes about JNDI database pool—since I couldn’t determine if it was working or not when I tried it locally the other day—is this error message that I just saw in the DSpace logs today:
  • -
    ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
    +
    ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspaceLocal
     ...
     INFO  org.dspace.storage.rdbms.DatabaseManager @ Unable to locate JNDI dataSource: jdbc/dspaceLocal
     INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Database pool
    @@ -627,13 +627,13 @@ INFO  org.dspace.storage.rdbms.DatabaseManager @ Falling back to creating own Da
     
  • Now the redirects work
  • I quickly registered a Let’s Encrypt certificate for the domain:
  • -
    # systemctl stop nginx
    +
    # systemctl stop nginx
     # /opt/certbot-auto certonly --standalone --email aorth@mjanja.ch -d library.cgiar.org
     # systemctl start nginx
     
    • I modified the nginx configuration of the Ansible playbooks to use this new certificate and now the certificate is enabled and OCSP stapling is working (the relevant nginx directives are sketched after the output):
    -
    $ openssl s_client -connect cgspace.cgiar.org:443 -servername library.cgiar.org  -tls1_2 -tlsextdebug -status
    +
    $ openssl s_client -connect cgspace.cgiar.org:443 -servername library.cgiar.org  -tls1_2 -tlsextdebug -status
     ...
     OCSP Response Data:
     ...
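    • For reference, the nginx directives involved are roughly these, assuming certbot’s standard live/ paths for the domain (our real config is templated in the Ansible playbooks, so this is only a sketch):
      ssl_certificate         /etc/letsencrypt/live/library.cgiar.org/fullchain.pem;
      ssl_certificate_key     /etc/letsencrypt/live/library.cgiar.org/privkey.pem;
      ssl_trusted_certificate /etc/letsencrypt/live/library.cgiar.org/chain.pem;
      ssl_stapling            on;
      ssl_stapling_verify     on;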
    diff --git a/docs/2017-10/index.html b/docs/2017-10/index.html
    index a0c1af30e..f0b181497 100644
    --- a/docs/2017-10/index.html
    +++ b/docs/2017-10/index.html
    @@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
     There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
     Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
     "/>
    -
    +
     
     
         
    @@ -124,7 +124,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
     
    -
    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    +
    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
     
    • There appears to be a pattern, but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine (a first-pass query is sketched after this list)
    • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
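    • As a first pass I could probably find the affected values in PostgreSQL with something like this (a sketch only, since I haven’t narrowed down which metadata field these URLs live in):
      dspace=# select resource_id, metadata_field_id, text_value from metadatavalue where resource_type_id=2 and text_value like 'http://hdl.handle.net/%||%';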
    • @@ -134,13 +134,13 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    • Peter Ballantyne said he was having problems logging into CGSpace with “both” of his accounts (CGIAR LDAP and personal, apparently)
    • I looked in the logs and saw some LDAP lookup failures due to timeout but also strangely a “no DN found” error:
    -
    2017-10-01 20:24:57,928 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
    +
    2017-10-01 20:24:57,928 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
     2017-10-01 20:22:37,982 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
     
    • I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today
    • The logs for yesterday show fourteen errors related to LDAP auth failures:
    -
    $ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
    +
    $ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
     14
     
    • For what it’s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET’s LDAP server
    • @@ -152,7 +152,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    • Communicate with Sam from the CGIAR System Organization about some broken links coming from their CGIAR Library domain to CGSpace
    • The first is a link to a browse page that should be handled better in nginx:
    -
    http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
    +
    http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
     
    • We’ll need to check for browse links and handle them properly, including swapping the subject parameter for systemsubject (which doesn’t exist in Discovery yet, but we’ll need to add it), since we moved their poorly curated subjects from dc.subject to cg.subject.system (see the nginx sketch after this list)
    • The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead
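    • A minimal sketch of the kind of nginx rule the browse links might need in the library.cgiar.org server block (untested, and the parameter names are only the ones visible in the example link above):
      location /browse {
          # sketch: swap the old subject filter for the new systemsubject one
          if ($arg_type = "subject") {
              return 301 https://cgspace.cgiar.org/browse?value=$arg_value&type=systemsubject;
          }
          return 301 https://cgspace.cgiar.org$request_uri;
      }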
    • @@ -165,7 +165,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    • Twice in the past twenty-four hours Linode has warned that CGSpace’s outbound traffic rate was exceeding the notification threshold
    • I had a look at yesterday’s OAI and REST logs in /var/log/nginx but didn’t see anything unusual:
    -
    # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
    +
    # awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
         141 157.55.39.240
         145 40.77.167.85
         162 66.249.66.92
    @@ -225,7 +225,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
     
  • Delete Community 10568/102 (ILRI Research and Development Issues)
  • Move five collections to 10568/27629 (ILRI Projects) using move-collections.sh with the following configuration:
  • -
    10568/1637 10568/174 10568/27629
    +
    10568/1637 10568/174 10568/27629
     10568/1642 10568/174 10568/27629
     10568/1614 10568/174 10568/27629
     10568/75561 10568/150 10568/27629
    @@ -270,12 +270,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
     
  • Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again
  • Still not sure where the load is coming from right now, but it’s clear why there were so many alerts yesterday on the 25th!
  • -
    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
    +
    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
     18022
     
    • Compared to other days there were two or three times the number of requests yesterday!
    -
    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
    +
    # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
     3141
     # grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
     7851
    @@ -302,7 +302,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
     
  • I’m still not sure why this started causing alerts so repeatedly over the past week
  • I don’t see any telltale signs in the REST or OAI logs, so I’m trying to do some rudimentary analysis of the DSpace logs:
  • -
    # grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    # grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2049
     
    • So there were 2049 unique sessions during the hour of 2AM
    • @@ -310,7 +310,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    • I think I’ll need to enable access logging in nginx to figure out what’s going on
    • After enabling logging on requests to XMLUI on / I see some new bot I’ve never seen before:
    -
    137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
    +
    137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
     
    • CORE seems to be some bot that is “Aggregating the world’s open access research papers”
    • The contact address listed in their bot’s user agent is incorrect; the correct page is simply: https://core.ac.uk/contact
    • @@ -323,39 +323,39 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
    • Like clock work, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)
    • Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:
    -
    dspace=# SELECT * FROM pg_stat_activity;
    +
    dspace=# SELECT * FROM pg_stat_activity;
     ...
     (93 rows)
     
    • Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:
    -
    # grep -c "CORE/0.6" /var/log/nginx/access.log 
    +
    # grep -c "CORE/0.6" /var/log/nginx/access.log 
     26475
     # grep -c "CORE/0.6" /var/log/nginx/access.log.1
     135083
     
    • IP addresses for this bot currently seem to be:
    -
    # grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
    +
    # grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
     137.108.70.6
     137.108.70.7
     
    • I will add their user agent to the Tomcat Crawler Session Manager Valve but it won’t help much because they are only using two sessions:
    -
    # grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
    +
    # grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
     session_id=5771742CABA3D0780860B8DA81E0551B
     session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     
    • … and most of their requests are for dynamic discover pages:
    -
    # grep -c 137.108.70 /var/log/nginx/access.log
    +
    # grep -c 137.108.70 /var/log/nginx/access.log
     26622
     # grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
     24055
     
    • Just because I’m curious who the top IPs are:
    -
    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
    +
    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
         496 62.210.247.93
         571 46.4.94.226
         651 40.77.167.39
    @@ -371,7 +371,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     
  • 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine
  • Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!
  • -
    # grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    # grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1419
     # grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2811
    @@ -382,11 +382,11 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     
  • Ah, wait, it looks like crawlerIps only came in 2017-06, so it probably isn’t in Ubuntu 16.04’s Tomcat 7.0.68 build!
  • That would explain the errors I was getting when trying to set it (the valve configuration I was attempting is sketched after the warning):
  • -
    WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
    +
    WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
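  • For the record, this is roughly the valve configuration I was trying in Tomcat’s server.xml (the crawlerIps attribute only exists in newer Tomcat builds, hence the warning above):
      <!-- sketch: group these scrapers' sessions by IP as well as by user agent -->
      <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
             crawlerIps="190\.19\.92\.5|104\.196\.152\.243" />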
     
    • As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:
    -
    # grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
    +
    # grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
         410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
         574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
        1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
    @@ -399,7 +399,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
     
  • Ask on the dspace-tech mailing list if it’s possible to use an existing item as a template for a new item
  • To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:
  • -
    # grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
    +
    # grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
      139109 137.108.70.6
      139253 137.108.70.7
     
      @@ -408,7 +408,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
    • I added GoAccess to the list of packages to install in the DSpace role of the Ansible infrastructure scripts
    • It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:
    -
    # goaccess /var/log/nginx/access.log --log-format=COMBINED
    +
    # goaccess /var/log/nginx/access.log --log-format=COMBINED
     
    • According to Uptime Robot CGSpace went down and up a few times
    • I had a look at goaccess and I saw that CORE was actively indexing
    • @@ -416,7 +416,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
    • I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable
    • Actually, come to think of it, they aren’t even obeying robots.txt, because we actually disallow /discover and /search-filter URLs but they are hitting those massively:
    -
    # grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn 
    +
    # grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn 
      158058 GET /discover
       14260 GET /search-filter
     
      diff --git a/docs/2017-11/index.html b/docs/2017-11/index.html index ba68dbc51..023733939 100644 --- a/docs/2017-11/index.html +++ b/docs/2017-11/index.html @@ -48,7 +48,7 @@ Generate list of authors on CGSpace for Peter to go through and correct: dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv; COPY 54701 "/> - + @@ -142,12 +142,12 @@ COPY 54701
      • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
      -
      # grep -c "CORE" /var/log/nginx/access.log
      +
      # grep -c "CORE" /var/log/nginx/access.log
       0
       
      • Generate list of authors on CGSpace for Peter to go through and correct:
      -
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
      +
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
       COPY 54701
       
      • Abenet asked if it would be possible to generate a report of items in Listing and Reports that had “International Fund for Agricultural Development” as the only investor
      • @@ -155,7 +155,7 @@ COPY 54701
      • Work on making the thumbnails in the item view clickable
      • Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link
      -
      //mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
      +
      //mets:fileSec/mets:fileGrp[@USE='CONTENT']/mets:file/mets:FLocat[@LOCTYPE='URL']/@xlink:href
       
      • METS XML is available for all items with this pattern: /metadata/handle/10568/95947/mets.xml (a quick command line lookup is sketched below)
      • I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!
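      • For example, pulling the content bitstream URL out of the METS on the command line looks something like this (just a sketch using xmllint, with local-name() to sidestep the METS/XLink namespaces):
        $ curl -s 'https://cgspace.cgiar.org/metadata/handle/10568/95947/mets.xml' | xmllint --xpath "string(//*[local-name()='fileGrp'][@USE='CONTENT']//*[local-name()='FLocat']/@*[local-name()='href'])" -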
      • @@ -177,7 +177,7 @@ COPY 54701
      • It’s the first time in a few days that this has happened
      • I had a look to see what was going on, but it isn’t the CORE bot:
      -
      # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
      +
      # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
           306 68.180.229.31
           323 61.148.244.116
           414 66.249.66.91
      @@ -191,7 +191,7 @@ COPY 54701
       
      • 138.201.52.218 is from some Hetzner server, and I see it making 40,000 requests yesterday too, but none before that:
      -
      # zgrep -c 138.201.52.218 /var/log/nginx/access.log*
      +
      # zgrep -c 138.201.52.218 /var/log/nginx/access.log*
       /var/log/nginx/access.log:24403
       /var/log/nginx/access.log.1:45958
       /var/log/nginx/access.log.2.gz:0
      @@ -202,7 +202,7 @@ COPY 54701
       
      • It’s clearly a bot as it’s making tens of thousands of requests, but it’s using a “normal” user agent:
      -
      Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
      +
      Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
       
      • For now I don’t know what this user is!
      @@ -216,7 +216,7 @@ COPY 54701
      • But in the database the authors are correct (none with weird , / characters):
      -
      dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
      +
      dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like 'International Livestock Research Institute%';
                        text_value                 |              authority               | confidence 
       --------------------------------------------+--------------------------------------+------------
        International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c |          0
      @@ -240,7 +240,7 @@ COPY 54701
       
    • Tsega had to restart Tomcat 7 to fix it temporarily
    • I will start by looking at bot usage (access.log.1 includes usage until 6AM today):
    -
    # cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         619 65.49.68.184
         840 65.49.68.199
         924 66.249.66.91
    @@ -254,7 +254,7 @@ COPY 54701
     
    • 104.196.152.243 seems to be a top scraper for a few weeks now:
    -
    # zgrep -c 104.196.152.243 /var/log/nginx/access.log*
    +
    # zgrep -c 104.196.152.243 /var/log/nginx/access.log*
     /var/log/nginx/access.log:336
     /var/log/nginx/access.log.1:4681
     /var/log/nginx/access.log.2.gz:3531
    @@ -268,7 +268,7 @@ COPY 54701
     
    • This user is responsible for hundreds and sometimes thousands of Tomcat sessions:
    -
    $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     954
     $ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     6199
    @@ -278,7 +278,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • The worst thing is that this user never specifies a user agent string so we can’t lump it in with the other bots using the Tomcat Crawler Session Manager Valve
  • They don’t request dynamic URLs like “/discover” but they seem to be fetching handles from XMLUI instead of REST (and some with //handle, note the regex below):
  • -
    # grep -c 104.196.152.243 /var/log/nginx/access.log.1
    +
    # grep -c 104.196.152.243 /var/log/nginx/access.log.1
     4681
     # grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P 'GET //?handle'
     4618
    @@ -286,19 +286,19 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • I just realized that ciat.cgiar.org points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior
  • The next IP (207.46.13.36) seems to be Microsoft’s bingbot, but all its requests specify the “bingbot” user agent and there are no requests for dynamic URLs that are forbidden, like “/discover”:
  • -
    $ grep -c 207.46.13.36 /var/log/nginx/access.log.1 
    +
    $ grep -c 207.46.13.36 /var/log/nginx/access.log.1 
     2034
     # grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
     
    • The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:
    -
    # grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c "GET /discover"
    +
    # grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
     
    • The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like “/discover”:
    -
    # grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 
    +
    # grep -c -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 
     5997
     # grep -E '207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c "bingbot"
     5988
    @@ -307,7 +307,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
    • The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like “/discover”:
    -
    # grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 
    +
    # grep -c -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 
     3048
     # grep -E '66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c Google
     3048
    @@ -316,14 +316,14 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
    • The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like “/discover”:
    -
    # grep -c 68.180.229.254 /var/log/nginx/access.log.1 
    +
    # grep -c 68.180.229.254 /var/log/nginx/access.log.1 
     1131
     # grep  68.180.229.254 /var/log/nginx/access.log.1 | grep -c "GET /discover"
     0
     
    • The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:
    -
    # grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 
    +
    # grep -c -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 
     2950
     # grep -E '65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c "GET /discover"
     330
    @@ -338,7 +338,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • I’ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs
  • While it’s not in the top ten, Baidu is one bot that seems to not give a fuck:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep -c Baiduspider
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep -c Baiduspider
     8912
     # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "7/Nov/2017" | grep Baiduspider | grep -c -E "GET /(browse|discover|search-filter)"
     2521
    @@ -349,7 +349,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • I should look in nginx access.log, rest.log, oai.log, and DSpace’s dspace.log.2017-11-07
  • Here are the top IPs making requests to XMLUI from 2 to 8 AM:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         279 66.249.66.91
         373 65.49.68.199
         446 68.180.229.254
    @@ -364,7 +364,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot
  • Here are the top IPs making requests to REST from 2 to 8 AM:
  • -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           8 207.241.229.237
          10 66.249.66.90
          16 104.196.152.243
    @@ -377,14 +377,14 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
    • The OAI requests during that same time period are nothing to worry about:
    -
    # cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E '07/Nov/2017:0[2-8]' | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           1 66.249.66.92
           4 66.249.66.90
           6 68.180.229.254
     
    • The top IPs from dspace.log during the 2–8 AM period:
    -
    $ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
    +
    $ grep -E '2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E 'ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
         143 ip_addr=213.55.99.121
         181 ip_addr=66.249.66.91
         223 ip_addr=157.55.39.161
    @@ -400,7 +400,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • The number of requests isn’t even that high to be honest
  • As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:
  • -
    # zgrep -c 124.17.34.59 /var/log/nginx/access.log*
    +
    # zgrep -c 124.17.34.59 /var/log/nginx/access.log*
     /var/log/nginx/access.log:22581
     /var/log/nginx/access.log.1:0
     /var/log/nginx/access.log.2.gz:14
    @@ -414,7 +414,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
    • The whois data shows the IP is from China, but the user agent doesn’t really give any clues:
    -
    # grep 124.17.34.59 /var/log/nginx/access.log | awk -F'" ' '{print $3}' | sort | uniq -c | sort -h
    +
    # grep 124.17.34.59 /var/log/nginx/access.log | awk -F'" ' '{print $3}' | sort | uniq -c | sort -h
         210 "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
       22610 "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"
     
      @@ -424,7 +424,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
    • And as we speak Linode alerted that the outbound traffic rate has been very high for the past two hours (roughly 12:00 to 14:00)
    • At least for now it seems to be that new Chinese IP (124.17.34.59):
    -
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         198 207.46.13.103
         203 207.46.13.80
         205 207.46.13.36
    @@ -438,7 +438,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
    • Seems 124.17.34.59 really is downloading all our PDFs, compared to the next most active IPs during this time!
    -
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
    +
    # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
     5948
     # grep -E "07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
     0
    @@ -446,7 +446,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reusing their Tomcat sessions and they are creating thousands of sessions per day
  • All CIAT requests vs unique ones:
  • -
    $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
    +
    $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
     3506
     $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
     3506
    @@ -459,18 +459,18 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
     
    • But they literally just made this request today:
    -
    180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] "GET /discover?filtertype_0=crpsubject&filter_relational_operator_0=equals&filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&filtertype=subject&filter_relational_operator=equals&filter=WATER+RESOURCES HTTP/1.1" 200 82265 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
    +
    180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] "GET /discover?filtertype_0=crpsubject&filter_relational_operator_0=equals&filter_0=WATER%2C+LAND+AND+ECOSYSTEMS&filtertype=subject&filter_relational_operator=equals&filter=WATER+RESOURCES HTTP/1.1" 200 82265 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
     
    • Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:
    -
    # grep -c Baiduspider /var/log/nginx/access.log
    +
    # grep -c Baiduspider /var/log/nginx/access.log
     3806
     # grep Baiduspider /var/log/nginx/access.log | grep -c -E "GET /(browse|discover|search-filter)"
     1085
     
    • I will think about blocking their IPs but they have 164 of them!
    -
    # grep "Baiduspider/2.0" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
    +
    # grep "Baiduspider/2.0" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq | wc -l
     164
     

    2017-11-08

      @@ -478,12 +478,12 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
    • Linode sent another alert about CPU usage in the morning at 6:12AM
    • Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
     24981
     
    • This is about 20,000 Tomcat sessions:
    -
    $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
    +
    $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
     20733
     
    • I’m getting really sick of this
    • @@ -496,7 +496,7 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
    • Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process
    • Basically, we modify the nginx config to add a mapping with a modified user agent $ua:
    -
    map $remote_addr $ua {
    +
    map $remote_addr $ua {
         # 2017-11-08 Random Chinese host grabbing 20,000 PDFs
         124.17.34.59     'ChineseBot';
         default          $http_user_agent;
    @@ -505,7 +505,7 @@ $ grep -Io -E 'session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-
     
  • If the client’s address matches then the user agent is set, otherwise the default $http_user_agent variable is used
  • Then, in the server’s / block we pass this header to Tomcat:
  • -
    proxy_pass http://tomcat_http;
    +
    proxy_pass http://tomcat_http;
     proxy_set_header User-Agent $ua;
     
    • Note to self: the $ua variable won’t show up in nginx access logs because the default combined log format doesn’t include it, so don’t run around pulling your hair out wondering why the modified user agents aren’t showing in the logs! (A custom log format that does include it is sketched below.)
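    • A minimal sketch of such a log format (the stock combined format with $ua swapped in at the end):
      log_format bots '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" "$ua"';
      access_log /var/log/nginx/access.log bots;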
    • @@ -516,14 +516,14 @@ proxy_set_header User-Agent $ua;
    • I merged the clickable thumbnails code to 5_x-prod (#347) and will deploy it later along with the new bot mapping stuff (and re-run the Ansible nginx and tomcat tags)
    • I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in robots.txt:
    -
    # zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
    +
    # zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
     22229
     # zgrep Googlebot /var/log/nginx/access.log* | grep -c -E "GET /(browse|discover|search-filter)"
     0
     
    • It seems that they rarely even bother checking robots.txt, but Google does multiple times per day!
    -
    # zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
    +
    # zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
     14
     # zgrep Googlebot  /var/log/nginx/access.log* | grep -c robots.txt
     1134
    @@ -538,14 +538,14 @@ proxy_set_header User-Agent $ua;
     
    • Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today, total requests and unique sessions:
    -
    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
    +
    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Nov/2017' | grep -c 104.196.152.243
     8956
     $ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     223
     
    • Versus the same stats for yesterday and the day before:
    -
    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243 
    +
    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '08/Nov/2017' | grep -c 104.196.152.243 
     10216
     $ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2592
    @@ -569,7 +569,7 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
     
  • Update the Ansible infrastructure templates to be a little more modular and flexible
  • Looking at the top client IPs on CGSpace so far this morning, even though it’s only been eight hours:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "12/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "12/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         243 5.83.120.111
         335 40.77.167.103
         424 66.249.66.91
    @@ -583,12 +583,12 @@ $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E 'session_id=[A-Z0-9]{3
     
    • 5.9.6.51 seems to be a Russian bot:
    -
    # grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
    +
    # grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
     5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] "GET /handle/10568/16515/recent-submissions HTTP/1.1" 200 5097 "-" "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
     
    • What’s amazing is that it seems to reuse its Java session across all requests:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
     1558
     $ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     1
    @@ -596,14 +596,14 @@ $ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | s
     
  • Bravo to MegaIndex.ru!
  • The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat’s Crawler Session Manager valve regex should match ‘YandexBot’:
  • -
    # grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
    +
    # grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
     95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] "GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
     991
     
    • Move some items and collections on CGSpace for Peter Ballantyne, running move_collections.sh with the following configuration:
    -
    10947/6    10947/1 10568/83389
    +
    10947/6    10947/1 10568/83389
     10947/34   10947/1 10568/83389
     10947/2512 10947/1 10568/83389
     
      @@ -612,7 +612,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-
    • The solution I came up with uses tricks from both of those
    • I deployed the limit on CGSpace and DSpace Test and it seems to work well (a rough sketch of the rate limiting config is below the response headers):
    -
    $ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
    +
    $ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
     HTTP/1.1 200 OK
     Connection: keep-alive
     Content-Encoding: gzip
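    • The rate limiting itself is along these lines (a sketch only; the zone name, rate, and burst here are illustrative rather than the exact values deployed via Ansible):
      # key the limit on the client address only when the user agent looks like Baidu
      map $http_user_agent $limit_bots {
          default        '';
          ~Baiduspider   $binary_remote_addr;
      }
      limit_req_zone $limit_bots zone=badbots:10m rate=1r/s;
      # and then inside the server's location / block:
      limit_req zone=badbots burst=5;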
    @@ -642,7 +642,7 @@ Server: nginx
     
    • At the end of the day I checked the logs and it really looks like the Baidu rate limiting is working, HTTP 200 vs 503:
    -
    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 200 "
    +
    # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 200 "
     1132
     # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "13/Nov/2017" | grep "Baiduspider" | grep -c " 503 "
     10105
    @@ -675,7 +675,7 @@ Server: nginx
     
  • Started testing DSpace 6.2 and a few things have changed
  • Now PostgreSQL needs pgcrypto:
  • -
    $ psql dspace6
    +
    $ psql dspace6
     dspace6=# CREATE EXTENSION pgcrypto;
     
    • Also, local settings are no longer in build.properties, they are now in local.cfg
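    • A minimal local.cfg for a test instance looks something like this (values here are placeholders, not our real settings):
      dspace.dir = /home/aorth/dspace6
      dspace.hostname = localhost
      db.url = jdbc:postgresql://localhost:5432/dspace6
      db.username = dspace6
      db.password = dspace6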
    • @@ -695,7 +695,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
    • After a few minutes the connections went down to 44 and CGSpace was kind of back up; it seems like Tsega restarted Tomcat
    • Looking at the REST and XMLUI log files, I don’t see anything too crazy:
    -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep "17/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          13 66.249.66.223
          14 207.46.13.36
          17 207.46.13.137
    @@ -721,7 +721,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
     
  • I need to look into using JMX to analyze active sessions I think, rather than looking at log files
  • After adding the appropriate JMX listener options to Tomcat’s JAVA_OPTS and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777, for example, and then start jconsole locally like this (the JMX options themselves are sketched after the command):
  • -
    $ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
    +
    $ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
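    • The “appropriate JMX listener options” are roughly the standard ones below, plus the SSH dynamic forward itself (port 9000 matches the jconsole URL above; authentication and SSL are off only because access is tunnelled over SSH, and the hostname is a placeholder):
      # added to Tomcat's JAVA_OPTS, e.g. in /etc/default/tomcat7
      JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9000 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
      $ ssh -D 7777 aorth@dspace-test-host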
     
    • Looking at the MBeans you can drill down in Catalina→Manager→webapp→localhost→Attributes and see active sessions, etc
    • I want to enable JMX listener on CGSpace but I need to do some more testing on DSpace Test and see if it causes any performance impact, for example
    • @@ -737,7 +737,7 @@ dspace6=# CREATE EXTENSION pgcrypto;
    • Linode sent an alert that CGSpace was using a lot of CPU around 4–6 AM
    • Looking in the nginx access logs I see the most active XMLUI users between 4 and 6 AM:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "19/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "19/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         111 66.249.66.155
         171 5.9.6.51
         188 54.162.241.40
    @@ -751,12 +751,12 @@ dspace6=# CREATE EXTENSION pgcrypto;
     
    • 66.249.66.153 appears to be Googlebot (a reverse DNS check is sketched below the log line):
    -
    66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] "GET /handle/10568/2203 HTTP/1.1" 200 6309 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    +
    66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] "GET /handle/10568/2203 HTTP/1.1" 200 6309 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
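    • To be sure it’s really Google and not something spoofing the user agent, a reverse DNS lookup should show a googlebot.com host, something like:
      $ host 66.249.66.153
      153.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-153.googlebot.com.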
     
    • We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity
    • In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)
    -
    $ wc -l dspace.log.2017-11-19 
    +
    $ wc -l dspace.log.2017-11-19 
     388472 dspace.log.2017-11-19
     $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19 
     267494
    @@ -764,7 +764,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
  • WTF is this process doing every day, and for so many hours?
  • In unrelated news, when I was looking at the DSpace logs I saw a bunch of errors like this:
  • -
    2017-11-19 03:00:32,806 INFO  org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
    +
    2017-11-19 03:00:32,806 INFO  org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
     2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
     
    • It’s been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:
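    • Concretely that just meant adding the G1 collector flag to Tomcat’s JAVA_OPTS on DSpace Test, roughly like this (heap sizes here are placeholders, not the real ones):
      JAVA_OPTS="-Xms1920m -Xmx1920m -XX:+UseG1GC -Dfile.encoding=UTF-8"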
    • @@ -780,13 +780,13 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
      • Magdalena was having problems logging in via LDAP and it seems to be a problem with the CGIAR LDAP server:
      -
      2017-11-21 11:11:09,621 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
      +
      2017-11-21 11:11:09,621 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
       

      2017-11-22

      • Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM
      • The logs don’t show anything particularly abnormal between those hours:
      -
      # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "22/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      +
      # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "22/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           136 31.6.77.23
           174 68.180.229.254
           217 66.249.66.91
      @@ -807,7 +807,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
       
    • Linode alerted again that CPU usage was high on CGSpace from 4:13 to 6:13 AM
    • I see a lot of Googlebot (66.249.66.90) in the XMLUI access logs
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          88 66.249.66.91
         140 68.180.229.254
         155 54.196.2.131
    @@ -821,7 +821,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
    • … and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):
    -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "23/Nov/2017:0[456]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           5 190.120.6.219
           6 104.198.9.108
          14 104.196.152.243
    @@ -836,7 +836,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
  • These IPs crawling the REST API don’t specify user agents and I’d assume they are creating many Tomcat sessions
  • I would catch them in nginx to assign a “bot” user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they don’t seem to create many sessions anyway (at least not according to the dspace.log):
  • -
    $ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
     
    • I’m wondering if REST works differently, or just doesn’t log these sessions?
    • @@ -861,7 +861,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
    • In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)
    • I also noticed that CGNET appears to be monitoring the old domain every few minutes:
    -
    192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] "HEAD / HTTP/1.1" 301 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
    +
    192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] "HEAD / HTTP/1.1" 301 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
     
    • I should probably tell CGIAR people to have CGNET stop that
    @@ -870,7 +870,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
  • Linode alerted that CGSpace server was using too much CPU from 5:18 to 7:18 AM
  • Yet another mystery because the load for all domains looks fine at that time:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "26/Nov/2017:0[567]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "26/Nov/2017:0[567]" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         190 66.249.66.83
         195 104.196.152.243
         220 40.77.167.82
    @@ -887,7 +887,7 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
  • About an hour later Uptime Robot said that the server was down
  • Here are all the top XMLUI and REST users from today:
  • -
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "29/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "29/Nov/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         540 66.249.66.83
         659 40.77.167.36
         663 157.55.39.214
    @@ -905,12 +905,12 @@ $ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
     
  • I don’t see much activity in the logs but there are 87 PostgreSQL connections
  • But shit, there were 10,000 unique Tomcat sessions today:
  • -
    $ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ cat dspace.log.2017-11-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     10037
     
    • Although maybe that’s not much, as the previous two days had more:
    -
    $ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ cat dspace.log.2017-11-27 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     12377
     $ cat dspace.log.2017-11-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     16984
    diff --git a/docs/2017-12/index.html b/docs/2017-12/index.html
    index ea34be3e0..308448fe5 100644
    --- a/docs/2017-12/index.html
    +++ b/docs/2017-12/index.html
    @@ -30,7 +30,7 @@ The logs say “Timeout waiting for idle object”
     PostgreSQL activity says there are 115 connections currently
     The list of connections to XMLUI and REST API for today:
     "/>
    -
    +
     
     
         
    @@ -123,7 +123,7 @@ The list of connections to XMLUI and REST API for today:
     
  • PostgreSQL activity says there are 115 connections currently
  • The list of connections to XMLUI and REST API for today:
  • -
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         763 2.86.122.76
         907 207.46.13.94
        1018 157.55.39.206
    @@ -137,12 +137,12 @@ The list of connections to XMLUI and REST API for today:
     
    • The number of DSpace sessions isn’t even that high:
    -
    $ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     5815
     
    • Connections in the last two hours:
    -
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017:(09|10)" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail                                                      
    +
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017:(09|10)" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail                                                      
          78 93.160.60.22
         101 40.77.167.122
         113 66.249.66.70
    @@ -157,18 +157,18 @@ The list of connections to XMLUI and REST API for today:
     
  • What the fuck is going on?
  • I’ve never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today:
  • -
    $ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     822
     
    • Appears to be some new bot:
    -
    2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] "GET /handle/10568/78444?show=full HTTP/1.1" 200 29307 "-" "Mozilla/3.0 (compatible; Indy Library)"
    +
    2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] "GET /handle/10568/78444?show=full HTTP/1.1" 200 29307 "-" "Mozilla/3.0 (compatible; Indy Library)"
     
    • I restarted Tomcat and everything came back up
    • I can add Indy Library to the Tomcat crawler session manager valve but it would be nice if I could simply remap the useragent in nginx
    • I will also add ‘Drupal’ to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots
    -
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log  /var/log/nginx/rest.log.1  /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           3 54.75.205.145
           6 70.32.83.92
          14 2a01:7e00::f03c:91ff:fe18:7396
    @@ -206,7 +206,7 @@ The list of connections to XMLUI and REST API for today:
     
  • I don’t see any errors in the DSpace logs but I see in nginx’s access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request)
  • Looking at the REST API logs I see some new client IP I haven’t noticed before:
  • -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          18 95.108.181.88
          19 68.180.229.254
          30 207.46.13.151
    @@ -228,7 +228,7 @@ The list of connections to XMLUI and REST API for today:
     
  • I looked just now and see that there are 121 PostgreSQL connections!
  • The top users right now are:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "7/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "7/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 
         838 40.77.167.11
         939 66.249.66.223
        1149 66.249.66.206
    @@ -243,24 +243,24 @@ The list of connections to XMLUI and REST API for today:
     
  • We’ve never seen 124.17.34.60 yet, but it’s really hammering us!
  • Apparently it is from China, and here is one of its user agents:
  • -
    Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
    +
    Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)
     
    • It is responsible for 4,500 Tomcat sessions today alone:
    -
    $ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     4574
     
    • I’ve adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it’s the same bot on the same subnet
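  • Before relying on that regex, a rough sanity check with the same nginx logs to confirm that only those two addresses from that subnet are hitting us:

```console
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '124\.17\.34\.(59|60)' | awk '{print $1}' | sort | uniq -c
```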
    • I was running the DSpace cleanup task manually and it hit an error:
    -
    $ /home/cgspace.cgiar.org/bin/dspace cleanup -v
    +
    $ /home/cgspace.cgiar.org/bin/dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(144666) is still referenced from table "bundle".
     
    • The solution is like I discovered in 2017-04, to set the primary_bitstream_id to null:
    -
    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
    +
    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666);
     UPDATE 1
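  • Since this comes up every time the cleanup task hits an orphaned primary bitstream, here is a rough sketch of scripting the two steps together; it assumes the error always has the exact “Key (bitstream_id)=(NNN)” format shown above and that psql can reach the dspace database locally:

```console
$ # pull the offending bitstream ID out of the cleanup error (adjust if more than one ID appears)
$ id=$(/home/cgspace.cgiar.org/bin/dspace cleanup -v 2>&1 | grep -oE '\(bitstream_id\)=\([0-9]+\)' | grep -oE '[0-9]+')
$ # null out the primary bitstream reference, then re-run the cleanup
$ psql -d dspace -c "UPDATE bundle SET primary_bitstream_id=NULL WHERE primary_bitstream_id IN ($id);"
$ /home/cgspace.cgiar.org/bin/dspace cleanup -v
```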
     

    2017-12-13

      @@ -294,11 +294,11 @@ UPDATE 1
    • I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the collection field)
    -
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log
    +
    $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log
     
    • It’s the same on DSpace Test, I can’t import the SAF bundle without specifying the collection:
    -
    $ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
    +
    $ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat
     No collections given. Assuming 'collections' file inside item directory
     Adding items from directory: /tmp/ccafs-2016/SimpleArchiveFormat
     Generating mapfile: /tmp/ccafs.map
    @@ -321,14 +321,14 @@ Elapsed time: 2 secs (2559 msecs)
     
    • I even tried to debug it by adding verbose logging to the JAVA_OPTS:
    -
    -Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
    +
    -Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true
     
    • … but the error message was the same, just with more INFO noise around it
    • For now I’ll import into a collection in DSpace Test but I’m really not sure what’s up with this!
    • Linode alerted that CGSpace was using high CPU from 4 to 6 PM
    • The logs for today show the CORE bot (137.108.70.7) being active in XMLUI:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         671 66.249.66.70
         885 95.108.181.88
         904 157.55.39.96
    @@ -342,7 +342,7 @@ Elapsed time: 2 secs (2559 msecs)
     
    • And then some CIAT bot (45.5.184.196) is actively hitting API endpoints:
    -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          33 68.180.229.254
          48 157.55.39.96
          51 157.55.39.179
    @@ -371,7 +371,7 @@ Elapsed time: 2 secs (2559 msecs)
     
  • Linode alerted this morning that there was high outbound traffic from 6 to 8 AM
  • The XMLUI logs show that the CORE bot from last night (137.108.70.7) is very active still:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         190 207.46.13.146
         191 197.210.168.174
         202 86.101.203.216
    @@ -385,7 +385,7 @@ Elapsed time: 2 secs (2559 msecs)
     
    • On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning:
    -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           7 104.198.9.108
           8 185.29.8.111
           8 40.77.167.176
    @@ -402,7 +402,7 @@ Elapsed time: 2 secs (2559 msecs)
     
  • Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM
  • The REST and OAI API logs look pretty much the same as earlier this morning, but there’s a new IP harvesting XMLUI:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail            
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail            
         360 95.108.181.88
         477 66.249.66.90
         526 86.101.203.216
    @@ -416,17 +416,17 @@ Elapsed time: 2 secs (2559 msecs)
     
    • 2.86.72.181 appears to be from Greece, and has the following user agent:
    -
    Mozilla/3.0 (compatible; Indy Library)
    +
    Mozilla/3.0 (compatible; Indy Library)
     
    • Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests:
    -
    $ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                          
    +
    $ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                          
     1
     
    • I guess there’s nothing I can do to them for now
    • In other news, I am curious how many PostgreSQL connection pool errors we’ve had in the last month:
    -
    $ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0
    +
    $ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0
     dspace.log.2017-11-07:15695
     dspace.log.2017-11-08:135
     dspace.log.2017-11-17:1298
    @@ -456,7 +456,7 @@ dspace.log.2017-12-07:2769
     
  • So I restarted Tomcat 7 and restarted the imports
  • I assume the PostgreSQL transactions were fine but I will remove the Discovery index for their community and re-run the light-weight indexing to hopefully re-construct everything:
  • -
    $ dspace index-discovery -r 10568/42211
    +
    $ dspace index-discovery -r 10568/42211
     $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
     
    • The PostgreSQL issues are getting out of control, I need to figure out how to enable connection pools in Tomcat!
    • @@ -476,7 +476,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
    • I re-deployed the 5_x-prod branch on CGSpace, applied all system updates, and restarted the server
    • Looking through the dspace.log I see this error:
    -
    2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
    +
    2017-12-19 08:17:15,740 ERROR org.dspace.statistics.SolrLogger @ Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
     
    • I don’t have time now to look into this but the Solr sharding has long been an issue!
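  • When I do get back to it, one quick thing to check is whether there are stale Lucene lock files left over (the path comes from the error above); note that a lock is also held legitimately while Solr has a core open, so this is only a hint:

```console
$ find /home/cgspace.cgiar.org/solr -name write.lock
```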
    • Looking into using JDBC / JNDI to provide a database pool to DSpace
    • @@ -484,7 +484,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
    • First, I uncomment db.jndi in dspace/config/dspace.cfg
    • Then I create a global Resource in the main Tomcat server.xml (inside GlobalNamingResources):
    -
    <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
    +
    <Resource name="jdbc/dspace" auth="Container" type="javax.sql.DataSource"
     	  driverClassName="org.postgresql.Driver"
     	  url="jdbc:postgresql://localhost:5432/dspace"
     	  username="dspace"
    @@ -500,12 +500,12 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
     
  • Most of the parameters are from comments by Mark Wood about his JNDI setup: https://jira.duraspace.org/browse/DS-3564
  • Then I add a ResourceLink to each web application context:
  • -
    <ResourceLink global="jdbc/dspace" name="jdbc/dspace" type="javax.sql.DataSource"/>
    +
    <ResourceLink global="jdbc/dspace" name="jdbc/dspace" type="javax.sql.DataSource"/>
     
    • I am not sure why several guides show configuration snippets for server.xml and web application contexts that use a Local and Global jdbc…
    • When DSpace can’t find the JNDI context (for whatever reason) you will see this in the dspace logs:
    -
    2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
    +
    2017-12-19 13:12:08,796 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
     javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Context. Unable to find [jdbc].
             at org.apache.naming.NamingContext.lookup(NamingContext.java:825)
             at org.apache.naming.NamingContext.lookup(NamingContext.java:173)
    @@ -535,11 +535,11 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
     
    • And indeed the Catalina logs show that it failed to set up the JDBC driver:
    -
    org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
    +
    org.apache.tomcat.dbcp.dbcp.SQLNestedException: Cannot load JDBC driver class 'org.postgresql.Driver'
     
    • There are several copies of the PostgreSQL driver installed by DSpace:
    -
    $ find ~/dspace/ -iname "postgresql*jdbc*.jar"
    +
    $ find ~/dspace/ -iname "postgresql*jdbc*.jar"
     /Users/aorth/dspace/webapps/jspui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/webapps/oai/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
     /Users/aorth/dspace/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar
    @@ -548,7 +548,7 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
     
    • These apparently come from the main DSpace pom.xml:
    -
    <dependency>
    +
    <dependency>
        <groupId>postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <version>9.1-901-1.jdbc4</version>
    @@ -556,12 +556,12 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
     
    • So WTF? Let’s try copying one to Tomcat’s lib folder and restarting Tomcat:
    -
    $ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
    +
    $ cp ~/dspace/lib/postgresql-9.1-901-1.jdbc4.jar /usr/local/opt/tomcat@7/libexec/lib
     
    • Oh that’s fantastic, now at least Tomcat doesn’t print an error during startup so I guess it succeeds in creating the JNDI pool
    • DSpace starts up but I have no idea if it’s using the JNDI configuration because I see this in the logs:
    -
    2017-12-19 13:26:54,271 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
    +
    2017-12-19 13:26:54,271 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS is '{}'PostgreSQL
     2017-12-19 13:26:54,277 INFO  org.dspace.storage.rdbms.DatabaseManager @ DBMS driver version is '{}'9.5.10
     2017-12-19 13:26:54,293 INFO  org.dspace.storage.rdbms.DatabaseUtils @ Loading Flyway DB migrations from: filesystem:/Users/aorth/dspace/etc/postgres, classpath:org.dspace.storage.rdbms.sqlmigration.postgres, classpath:org.dspace.storage.rdbms.migration
     2017-12-19 13:26:54,306 INFO  org.flywaydb.core.internal.dbsupport.DbSupportFactory @ Database: jdbc:postgresql://localhost:5432/dspacetest (PostgreSQL 9.5)
    @@ -580,7 +580,7 @@ javax.naming.NameNotFoundException: Name [jdbc/dspace] is not bound in this Cont
     
     
  • After adding the Resource to server.xml on Ubuntu I get this in Catalina’s logs:
  • -
    SEVERE: Unable to create initial connections of pool.
    +
    SEVERE: Unable to create initial connections of pool.
     java.sql.SQLException: org.postgresql.Driver
     ...
     Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
    @@ -589,17 +589,17 @@ Caused by: java.lang.ClassNotFoundException: org.postgresql.Driver
     
  • I tried installing Ubuntu’s libpostgresql-jdbc-java package but Tomcat still can’t find the class
  • Let me try to symlink the lib into Tomcat’s libs:
  • -
    # ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
    +
    # ln -sv /usr/share/java/postgresql.jar /usr/share/tomcat7/lib
     
    • Now Tomcat starts but the localhost container has errors:
    -
    SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
    +
    SEVERE: Exception sending context initialized event to listener instance of class org.dspace.app.util.DSpaceContextListener
     java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClosed()Z is abstract
     
    • Could be a version issue or something since the Ubuntu package provides 9.2 and DSpace’s are 9.1…
    • Let me try to remove it and copy in DSpace’s:
    -
    # rm /usr/share/tomcat7/lib/postgresql.jar
    +
    # rm /usr/share/tomcat7/lib/postgresql.jar
     # cp [dspace]/webapps/xmlui/WEB-INF/lib/postgresql-9.1-901-1.jdbc4.jar /usr/share/tomcat7/lib/
     
    • Wow, I think that actually works…
    • @@ -608,12 +608,12 @@ java.lang.AbstractMethodError: Method org/postgresql/jdbc3/Jdbc3ResultSet.isClos
    • Also, since I commented out all the db parameters in DSpace.cfg, how does the command line dspace tool work?
    • Let’s try the upstream JDBC driver first:
    -
    # rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
    +
    # rm /usr/share/tomcat7/lib/postgresql-9.1-901-1.jdbc4.jar
     # wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar -O /usr/share/tomcat7/lib/postgresql-42.1.4.jar
     
    • DSpace command line fails unless db settings are present in dspace.cfg:
    -
    $ dspace database info
    +
    $ dspace database info
     Caught exception:
     java.sql.SQLException: java.lang.ClassNotFoundException: 
             at org.dspace.storage.rdbms.DataSourceInit.getDatasource(DataSourceInit.java:171)
    @@ -633,7 +633,7 @@ Caused by: java.lang.ClassNotFoundException:
     
    • And in the logs:
    -
    2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
    +
    2017-12-19 18:26:56,971 ERROR org.dspace.storage.rdbms.DatabaseManager @ Error retrieving JNDI context: jdbc/dspace
     javax.naming.NoInitialContextException: Need to specify class name in environment or system property, or as an applet parameter, or in an application resource file:  java.naming.factory.initial
             at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:662)
             at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
    @@ -669,7 +669,7 @@ javax.naming.NoInitialContextException: Need to specify class name in environmen
     
  • There are short bursts of connections up to 10, but it generally stays around 5
  • Test and import 13 records to CGSpace for Abenet:
  • -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log
     
    • The fucking database went from 47 to 72 to 121 connections while I was importing so it stalled.
    • @@ -677,7 +677,7 @@ $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchi
    • There was an initial connection storm of 50 PostgreSQL connections, but then it settled down to 7
    • After that CGSpace came up fine and I was able to import the 13 items just fine:
    -
    $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log
    +
    $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/cg_system_20Dec/SimpleArchiveFormat -m systemoffice.map &> systemoffice.log
     $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
     
    -
    # find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -
    +
    # find /var/log/nginx -type f -newermt "2017-12-01" | xargs zcat --force | goaccess --log-format=COMBINED -
     
    • I can see interesting things using this approach, for example:
        @@ -708,7 +708,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace filter-media -i 10568/89287
        • Looking at some old notes for metadata to clean up, I found a few hundred corrections in cg.fulltextstatus and dc.language.iso:
        -
        # update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
        +
        # update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
         UPDATE 5
         # delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
         DELETE 17
        @@ -735,7 +735,7 @@ DELETE 20
         
      • Uptime Robot noticed that the server went down for 1 minute a few hours later, around 9AM
      • Here’s the XMLUI logs:
      -
      # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "30/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
      +
      # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "30/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
           637 207.46.13.106
           641 157.55.39.186
           715 68.180.229.254
      @@ -751,7 +751,7 @@ DELETE 20
       
    • They identify as “com.plumanalytics”, which Google says is associated with Elsevier
    • They only seem to have used one Tomcat session so that’s good, I guess I don’t need to add them to the Tomcat Crawler Session Manager valve:
    -
    $ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l          
    +
    $ grep 54.175.208.220 dspace.log.2017-12-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l          
     1 
     
    • 216.244.66.245 seems to be moz.com’s DotBot
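  • To see how much traffic DotBot is actually sending, something like this should work, assuming moz.com’s crawler identifies itself with “DotBot” in its user agent (worth verifying in the logs first):

```console
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c -i dotbot
```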
    • @@ -761,7 +761,7 @@ DELETE 20
    • I finished working on the 42 records for CCAFS after Magdalena sent the remaining corrections
    • After that I uploaded them to CGSpace:
    -
    $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat -m ccafs.map &> ccafs.log
    +
    $ dspace import -a -e aorth@mjanja.ch -s /home/aorth/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat -m ccafs.map &> ccafs.log
     
diff --git a/docs/2018-01/index.html b/docs/2018-01/index.html
index 8b3a33beb..4c7893daa 100644
--- a/docs/2018-01/index.html
+++ b/docs/2018-01/index.html
@@ -150,7 +150,7 @@ dspace.log.2018-01-02:34
 Danny wrote to ask for help renewing the wildcard ilri.org certificate and I advised that we should probably use Let’s Encrypt if it’s just a handful of domains
 "/>
-
+
@@ -244,19 +244,19 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
  • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
  • And just before that I see this:
  • -
    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    +
    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
     
    • Ah hah! So the pool was actually empty!
    • I need to increase that, let’s try to bump it up from 50 to 75
    • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
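  • To keep track of what the pool is actually set to after each bump, a quick grep of the JNDI Resource in Tomcat’s server.xml; the /etc/tomcat7/server.xml path is Ubuntu’s default and an assumption here, and maxActive is the attribute that sizes the pool:

```console
# grep -A8 'name="jdbc/dspace' /etc/tomcat7/server.xml | grep -E 'maxActive|maxIdle'
```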
    • I notice this error quite a few times in dspace.log:
    -
    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
    +
    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
     
    • And there are many of these errors every day for the past month:
    -
    $ grep -c "Error while searching for sidebar facets" dspace.log.*
    +
    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
     dspace.log.2017-11-23:4
    @@ -308,7 +308,7 @@ dspace.log.2018-01-02:34
     
  • I woke up to more up and down of CGSpace, this time UptimeRobot noticed a few rounds of up and down of a few minutes each and Linode also notified of high CPU load from 12 to 2 PM
  • Looks like I need to increase the database pool size again:
  • -
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
    +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
    @@ -319,7 +319,7 @@ dspace.log.2018-01-03:1909
     
    • The active IPs in XMLUI are:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         607 40.77.167.141
         611 2a00:23c3:8c94:7800:392c:a491:e796:9c50
         663 188.226.169.37
    @@ -336,12 +336,12 @@ dspace.log.2018-01-03:1909
     
  • This appears to be the Internet Archive’s open source bot
  • They seem to be re-using their Tomcat session so I don’t need to do anything to them just yet:
  • -
    $ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +
    $ grep 134.155.96.78 dspace.log.2018-01-03 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     2
     
    • The API logs show the normal users:
    -
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "3/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
          32 207.46.13.182
          38 40.77.167.132
          38 68.180.229.254
    @@ -356,12 +356,12 @@ dspace.log.2018-01-03:1909
     
  • In other related news I see a sizeable number of requests coming from python-requests
  • For example, just in the last day there were 1700!
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -c python-requests
     1773
     
    • But they come from hundreds of IPs, many of which are 54.x.x.x (see the distinct-IP count after the listing below):
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -n | uniq -c | sort -h | tail -n 30
           9 54.144.87.92
           9 54.146.222.143
           9 54.146.249.249
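    • Since they are spread over so many addresses, the per-IP tallies are less useful than the total number of distinct IPs; same logs, one extra step:

```console
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep python-requests | awk '{print $1}' | sort -u | wc -l
```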
    @@ -402,7 +402,7 @@ dspace.log.2018-01-03:1909
     
  • CGSpace went down and up a bunch of times last night and ILRI staff were complaining a lot last night
  • The XMLUI logs show this activity:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "4/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "4/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         968 197.211.63.81
         981 213.55.99.121
        1039 66.249.64.93
    @@ -416,12 +416,12 @@ dspace.log.2018-01-03:1909
     
    • Again we ran out of PostgreSQL database connections, even after bumping the pool max active limit from 50 to 75 to 125 yesterday!
    -
    2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
    +
    2018-01-04 07:36:08,089 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-256] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:125; busy:125; idle:0; lastwait:5000].
     
    • So for this week that is the number one problem!
    -
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
    +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
    @@ -436,7 +436,7 @@ dspace.log.2018-01-04:1559
     
  • Peter said that CGSpace was down last night and Tsega restarted Tomcat
  • I don’t see any alerts from Linode or UptimeRobot, and there are no PostgreSQL connection errors in the dspace logs for today:
  • -
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
    +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-*
     dspace.log.2018-01-01:0
     dspace.log.2018-01-02:1972
     dspace.log.2018-01-03:1909
    @@ -446,13 +446,13 @@ dspace.log.2018-01-05:0
     
  • Daniel asked for help with their DAGRIS server (linode2328112) that has no disk space
  • I had a look and there is one Apache 2 log file that is 73GB, with lots of this:
  • -
    [Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
    +
    [Fri Jan 05 09:31:22.965398 2018] [:error] [pid 9340] [client 213.55.99.121:64476] WARNING: Unable to find a match for "9-16-1-RV.doc" in "/home/files/journals/6//articles/9/". Skipping this file., referer: http://dagris.info/reviewtool/index.php/index/install/upgrade
     
    • I will delete the log file for now and tell Danny
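  • For next time, a quick way to spot which log is eating the disk before deleting anything; the /var/log/apache2 path is an assumption based on it being a stock Apache 2 setup:

```console
# du -sh /var/log/apache2/* | sort -h | tail
```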
    • Also, I’m still seeing a hundred or so of the “ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer” errors in dspace logs, I need to search the dspace-tech mailing list to see what the cause is
    • I will run a full Discovery reindex in the mean time to see if it’s something wrong with the Discovery Solr core
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     
     real    110m43.985s
    @@ -465,7 +465,7 @@ sys     3m14.890s
     
    • I’m still seeing Solr errors in the DSpace logs even after the full reindex yesterday:
    -
    org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered " "]" "] "" at line 1, column 32.
    +
    org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1983+TO+1989]': Encountered " "]" "] "" at line 1, column 32.
     
    • I posted a message to the dspace-tech mailing list to see if anyone can help
    @@ -474,13 +474,13 @@ sys 3m14.890s
  • Advise Sisay about blank lines in some IITA records
  • Generate a list of author affiliations for Peter to clean up:
  • -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 4515
     

    2018-01-10

    • I looked to see what happened to this year’s Solr statistics sharding task that should have run on 2018-01-01 and of course it failed:
    -
    Moving: 81742 into core statistics-2010
    +
    Moving: 81742 into core statistics-2010
     Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
     org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2010
             at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
    @@ -526,7 +526,7 @@ Caused by: java.net.SocketException: Connection reset
     
    • DSpace Test has the same error but with creating the 2017 core:
    -
    Moving: 2243021 into core statistics-2017
    +
    Moving: 2243021 into core statistics-2017
     Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
     org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2017
             at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
    @@ -553,7 +553,7 @@ Caused by: org.apache.http.client.ClientProtocolException
     
  • I can apparently search for records in the Solr stats core that have an empty owningColl field using this in the Solr admin query: -owningColl:*
  • On CGSpace I see 48,000,000 records that have an owningColl field and 34,000,000 that don’t:
  • -
    $ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&wt=json&indent=true' | grep numFound 
    +
    $ http 'http://localhost:3000/solr/statistics/select?q=owningColl%3A*&wt=json&indent=true' | grep numFound 
       "response":{"numFound":48476327,"start":0,"docs":[
     $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=json&indent=true' | grep numFound
       "response":{"numFound":34879872,"start":0,"docs":[
    @@ -561,19 +561,19 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
     
  • I tested the dspace stats-util -s process on my local machine and it failed the same way
  • It doesn’t seem to be helpful, but the dspace log shows this:
  • -
    2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
    +
    2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
     2018-01-10 10:51:19,301 INFO  org.dspace.statistics.SolrLogger @ Moving: 3821 records into core statistics-2016
     
    -
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-10 
    +
    $ grep -c "Timeout: Pool empty." dspace.log.2018-01-10 
     0
     
    • The XMLUI logs show quite a bit of activity today:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         951 207.46.13.159
         954 157.55.39.123
        1217 95.108.181.88
    @@ -587,17 +587,17 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
     
    • The user agent for the top six or so IPs are all the same:
    -
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
    +
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36"
     
    • whois says they come from Perfect IP
    • I’ve never seen those top IPs before, but they have created 50,000 Tomcat sessions today:
    -
    $ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                                                                                                                                  
    +
    $ grep -E '(2607:fa98:40:9:26b6:fdff:feff:1888|2607:fa98:40:9:26b6:fdff:feff:195d|2607:fa98:40:9:26b6:fdff:feff:1c96|70.36.107.49|70.36.107.190|70.36.107.50)' /home/cgspace.cgiar.org/log/dspace.log.2018-01-10 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l                                                                                                                                                                                                  
     49096
     
    • Rather than blocking their IPs, I think I might just add their user agent to the “badbots” zone with Baidu, because they seem to be the only ones using that user agent:
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari
     /537.36" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
        6796 70.36.107.50
       11870 70.36.107.190
    @@ -608,13 +608,13 @@ $ http 'http://localhost:3000/solr/statistics/select?q=-owningColl%3A*&wt=js
     
    • I added the user agent to nginx’s badbots limit req zone but upon testing the config I got an error:
    -
    # nginx -t
    +
    # nginx -t
     nginx: [emerg] could not build map_hash, you should increase map_hash_bucket_size: 64
     nginx: configuration file /etc/nginx/nginx.conf test failed
     
    -
    # cat /proc/cpuinfo | grep cache_alignment | head -n1
    +
    # cat /proc/cpuinfo | grep cache_alignment | head -n1
     cache_alignment : 64
     
    • On our servers that is 64, so I increased this parameter to 128 and deployed the changes to nginx
    • @@ -637,7 +637,7 @@ cache_alignment : 64
    • Linode rebooted DSpace Test and CGSpace for their host hypervisor kernel updates
    • Following up with the Solr sharding issue on the dspace-tech mailing list, I noticed this interesting snippet in the Tomcat localhost_access_log at the time of my sharding attempt on my test machine:
    -
    127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
    +
    127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-18YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 447
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 76
     127.0.0.1 - - [10/Jan/2018:10:51:19 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
    @@ -649,7 +649,7 @@ cache_alignment : 64
     
  • This is apparently a common Solr error code that means “version conflict”: http://yonik.com/solr/optimistic-concurrency/
  • Looks like that bot from the PerfectIP.net host ended up making about 450,000 requests to XMLUI alone yesterday:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36" | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36" | grep "10/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
       21572 70.36.107.50
       30722 70.36.107.190
       34566 70.36.107.49
    @@ -659,7 +659,7 @@ cache_alignment : 64
     
    • Wow, I just figured out how to set the application name of each database pool in the JNDI config of Tomcat’s server.xml:
    -
    <Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
    +
    <Resource name="jdbc/dspaceWeb" auth="Container" type="javax.sql.DataSource"
               driverClassName="org.postgresql.Driver"
               url="jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceWeb"
               username="dspace"
    @@ -677,7 +677,7 @@ cache_alignment : 64
     
  • Also, I realized that the db.jndi parameter in dspace.cfg needs to match the name value in your application’s context, not the global one
  • Ah hah! Also, I can name the default DSpace connection pool in dspace.cfg as well, like:
  • -
    db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault
    +
    db.url = jdbc:postgresql://localhost:5432/dspacetest?ApplicationName=dspaceDefault
     
    • With that it is super easy to see where PostgreSQL connections are coming from in pg_stat_activity
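  • For example, a rough query to see the connection count per application name (assuming local psql access to the dspacetest database used in the JDBC URL above):

```console
$ psql -d dspacetest -c 'SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY 2 DESC;'
```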
    @@ -685,7 +685,7 @@ cache_alignment : 64
    • I’m looking at the DSpace 6.0 Install docs and notice they tweak the number of threads in their Tomcat connector:
    -
    <!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
    +
    <!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
     <Connector port="8080"
                maxThreads="150"
                minSpareThreads="25"
    @@ -702,7 +702,7 @@ cache_alignment : 64
     
  • Looks like in Tomcat 8.5 the default URIEncoding for Connectors is UTF-8, so we don’t need to specify that manually anymore: https://tomcat.apache.org/tomcat-8.5-doc/config/http.html
  • Ooh, I just saw the acceptorThreadCount setting (in Tomcat 7 and 8.5):
  • -
    The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.
    +
    The number of threads to be used to accept connections. Increase this value on a multi CPU machine, although you would never really need more than 2. Also, with a lot of non keep alive connections, you might want to increase this value as well. Default value is 1.
     
    • That could be very interesting
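  • To see whether it would even help, the first thing to check is how many CPUs the server actually has, using the same /proc/cpuinfo trick as with cache_alignment above:

```console
# grep -c '^processor' /proc/cpuinfo
```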
    @@ -711,7 +711,7 @@ cache_alignment : 64
  • Still testing DSpace 6.2 on Tomcat 8.5.24
  • Catalina errors at Tomcat 8.5 startup:
  • -
    13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
    +
    13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxActive is not used in DBCP2, use maxTotal instead. maxTotal default value is 8. You have set value of "35" for "maxActive" property, which is being ignored.
     13-Jan-2018 13:59:05.245 WARNING [main] org.apache.tomcat.dbcp.dbcp2.BasicDataSourceFactory.getObjectInstance Name = dspace6 Property maxWait is not used in DBCP2 , use maxWaitMillis instead. maxWaitMillis default value is -1. You have set value of "5000" for "maxWait" property, which is being ignored.
     
    • I looked in my Tomcat 7.0.82 logs and I don’t see anything about DBCP2 errors, so I guess this a Tomcat 8.0.x or 8.5.x thing
    • @@ -719,7 +719,7 @@ cache_alignment : 64
    • I have updated our Ansible infrastructure scripts so that it will be ready whenever we switch to Tomcat 8 (probably with Ubuntu 18.04 later this year)
    • When I enable the ResourceLink in the ROOT.xml context I get the following error in the Tomcat localhost log:
    -
    13-Jan-2018 14:14:36.017 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.app.util.DSpaceWebappListener]
    +
    13-Jan-2018 14:14:36.017 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.app.util.DSpaceWebappListener]
      java.lang.ExceptionInInitializerError
             at org.dspace.app.util.AbstractDSpaceWebapp.register(AbstractDSpaceWebapp.java:74)
             at org.dspace.app.util.DSpaceWebappListener.contextInitialized(DSpaceWebappListener.java:31)
    @@ -761,7 +761,7 @@ Caused by: java.lang.NullPointerException
     
  • Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload
  • I’m going to apply these ~130 corrections on CGSpace:
  • -
    update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
    +
    update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
     delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
     update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
     update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
    @@ -777,11 +777,11 @@ delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and
     
    -
    $ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
     
    • In looking at some of the values to delete or check I found some metadata values that I could not resolve their handle via SQL:
    -
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
      metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
     -------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
                2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
    @@ -796,7 +796,7 @@ dspace=# select handle from item, handle where handle.resource_id = item.item_id
     
  • Otherwise, the DSpace 5 SQL Helper Functions provide ds5_item2itemhandle(), which is much easier than my long query above that I always have to go search for
  • For example, to find the Handle for an item that has the author “Erni”:
  • -
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
      metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id 
     -------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
                2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
dspace=# select ds5_item2itemhandle(70308);
     
    • Next I apply the author deletions:
$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
     
    • Now working on the affiliation corrections from Peter:
$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
     $ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
     
    • Now I made a new list of affiliations for Peter to look through:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 4552
     
    • Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)
    • Help Sisay with some thumbnails for book chapters in Open Refine and SAFBuilder
    • CGSpace users were having problems logging in, I think something’s wrong with LDAP because I see this in the logs:
2018-01-15 12:53:15,810 WARN  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2386749547D03E0AA4EC7E44181A7552:ip_addr=x.x.x.x:ldap_authentication:type=failed_auth javax.naming.AuthenticationException\colon; [LDAP\colon; error code 49 - 80090308\colon; LdapErr\colon; DSID-0C090400, comment\colon; AcceptSecurityContext error, data 775, v1db1^@]
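
• To get a rough count of the failed logins, grepping for the failed_auth marker from the log line above works (a quick sketch):
$ grep -c 'ldap_authentication:type=failed_auth' dspace.log.2018-01-15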
     
    • Looks like we processed 2.9 million requests on CGSpace in 2017-12:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Dec/2017"
     2890041
     
     real    0m25.756s
sys     0m2.210s
     
  • Also, there are whitespace issues in many columns, and the items are mapped to the LIVES and ILRI articles collections, not Theses
  • In any case, importing them like this:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives.map &> lives.log
     
    • And fantastic, before I started the import there were 10 PostgreSQL connections, and then CGSpace crashed during the upload
    • When I looked there were 210 PostgreSQL connections!
    • I don’t see any high load in XMLUI or REST/OAI:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Jan/2018" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail
         381 40.77.167.124
         403 213.55.99.121
         431 207.46.13.60
     
    • But I do see this strange message in the dspace log:
2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://localhost:8081: The target server failed to respond
     2018-01-17 07:59:25,856 INFO  org.apache.http.impl.client.SystemDefaultHttpClient @ Retrying request to {}->http://localhost:8081
     
    • I have NEVER seen this error before, and there is no error before or after that in DSpace’s solr.log
    • Tomcat’s catalina.out does show something interesting, though, right at that time:
[====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:02
     [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 45 seconds. timestamp: 2018-01-17 07:57:11
     [====================>                              ]40% time remaining: 7 hour(s) 14 minute(s) 44 seconds. timestamp: 2018-01-17 07:57:37
     [====================>                              ]40% time remaining: 7 hour(s) 16 minute(s) 5 seconds. timestamp: 2018-01-17 07:57:49
Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOfMemoryError: Java heap space
     
  • You can see the timestamp above, which is some Atmire nightly task I think, but I can’t figure out which one
  • So I restarted Tomcat and tried the import again, which finished very quickly and without errors!
$ dspace import -a -e aorth@mjanja.ch -s /tmp/2018-01-16\ LIVES/SimpleArchiveFormat -m lives2.map &> lives2.log
     
    • Looking at the JVM graphs from Munin it does look like the heap ran out of memory (see the blue dip just before the green spike when I restarted Tomcat):
    @@ -951,7 +951,7 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-627" java.lang.OutOf -
    $ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
    +
    $ docker pull docker.bintray.io/jfrog/artifactory-oss:latest
     $ docker volume create --name artifactory5_data
     $ docker network create dspace-build
     $ docker run --network dspace-build --name artifactory -d -v artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss:latest
     
  • Wow, I even managed to add the Atmire repository as a remote and map it into the libs-release virtual repository, then tell maven to use it for atmire.com-releases in settings.xml!
  • Hmm, some maven dependencies for the SWORDv2 web application in DSpace 5.5 are broken:
[ERROR] Failed to execute goal on project dspace-swordv2: Could not resolve dependencies for project org.dspace:dspace-swordv2:war:5.5: Failed to collect dependencies at org.swordapp:sword2-server:jar:classes:1.0 -> org.apache.abdera:abdera-client:jar:1.1.1 -> org.apache.abdera:abdera-core:jar:1.1.1 -> org.apache.abdera:abdera-i18n:jar:1.1.1 -> org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Failed to read artifact descriptor for org.apache.geronimo.specs:geronimo-activation_1.0.2_spec:jar:1.1: Could not find artifact org.apache.geronimo.specs:specs:pom:1.1 in central (http://localhost:8081/artifactory/libs-release) -> [Help 1]
     
    • I never noticed because I build with that web application disabled:
$ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=localhost -P \!dspace-sword,\!dspace-swordv2 clean package
     
    • UptimeRobot said CGSpace went down for a few minutes
    • I didn’t do anything but it came back up on its own
    • Now Linode alert says the CPU load is high, sigh
    • Regarding the heap space error earlier today, it looks like it does happen a few times a week or month (I’m not sure how far these logs go back, as they are not strictly daily):
# zgrep -c java.lang.OutOfMemoryError /var/log/tomcat7/catalina.out* | grep -v :0
     /var/log/tomcat7/catalina.out:2
     /var/log/tomcat7/catalina.out.10.gz:7
     /var/log/tomcat7/catalina.out.11.gz:1
     
  • I don’t see any errors in the nginx or catalina logs, so I guess UptimeRobot just got impatient and closed the request, which caused nginx to send an HTTP 499
  • I realize I never did a full re-index after the SQL author and affiliation updates last week, so I should force one now:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
     
    • Maria from Bioversity asked if I could remove the abstracts from all of their Limited Access items in the Bioversity Journal Articles collection
    • Use this GREL in OpenRefine after isolating all the Limited Access items: value.startsWith("10568/35501")
    • UptimeRobot said CGSpace went down AGAIN and both Sisay and Danny immediately logged in and restarted Tomcat without talking to me or each other!
Jan 18 07:01:22 linode18 sudo[10805]: dhmichael : TTY=pts/5 ; PWD=/home/dhmichael ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
     Jan 18 07:01:22 linode18 sudo[10805]: pam_unix(sudo:session): session opened for user root by dhmichael(uid=0)
     Jan 18 07:01:22 linode18 systemd[1]: Stopping LSB: Start Tomcat....
     Jan 18 07:01:22 linode18 sudo[10812]: swebshet : TTY=pts/3 ; PWD=/home/swebshet ; USER=root ; COMMAND=/bin/systemctl restart tomcat7
     
  • Linode alerted and said that the CPU load was 264.1% on CGSpace
  • Start the Discovery indexing again:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-discovery -b
     
    • Linode alerted again and said that CGSpace was using 301% CPU
    • Peter emailed to ask why this item doesn’t have an Altmetric badge on CGSpace but does have one on the Altmetric dashboard
    • Looks like our badge code calls the handle endpoint which doesn’t exist:
https://api.altmetric.com/v1/handle/10568/88090
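
• A quick way to confirm from the shell is to check just the HTTP status code of that endpoint (a sketch; it should show the failing status):
$ curl -s -o /dev/null -w '%{http_code}\n' 'https://api.altmetric.com/v1/handle/10568/88090'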
     
    • I told Peter we should keep an eye out and try again next week
    • Run the authority indexing script on CGSpace and of course it died:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace index-authority
     Retrieving all data 
     Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer 
     Exception: null
     
  • In the end there were 324 items in the collection that were Limited Access, but only 199 had abstracts
  • I want to document the workflow of adding a production PostgreSQL database to a development instance of DSpace in Docker:
$ docker exec dspace_db dropdb -U postgres dspace
     $ docker exec dspace_db createdb -U postgres -O dspace --encoding=UNICODE dspace
     $ docker exec dspace_db psql -U postgres dspace -c 'alter user dspace createuser;'
     $ docker cp test.dump dspace_db:/tmp/test.dump
$ docker exec dspace_db psql -U dspace -f /tmp/update-sequences.sql dspace
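
• The dump itself gets restored between the docker cp and the update-sequences step; roughly like this (a sketch, the exact pg_restore flags may differ):
$ docker exec dspace_db pg_restore -U postgres -d dspace -O --role=dspace /tmp/test.dump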
     
  • The source code is here: rest-find-collections.py
• Peter had said that he found a bunch of ILRI collections that were called “untitled”, but I don’t see any:
$ ./rest-find-collections.py 10568/1 | wc -l
     308
     $ ./rest-find-collections.py 10568/1 | grep -i untitled
     
    • Thinking about generating a jmeter test plan for DSpace, along the lines of Georgetown’s dspace-performance-test
    • I got a list of all the GET requests on CGSpace for January 21st (the last time Linode complained the load was high), excluding admin calls:
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -c -v "/admin"
     56405
     
    • Apparently about 28% of these requests were for bitstreams, 30% for the REST API, and 30% for handles:
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo "^/(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
          38 /oai/
       14406 /bitstream/
       15179 /rest/
     
    • And 3% were to the homepage or search:
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -Eo '^/($|open-search|discover)' | sort | uniq -c
        1050 /
         413 /discover
         170 /open-search
     
    • The last 10% or so seem to be for static assets that would be served by nginx anyways:
# zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz /var/log/nginx/library-access.log.2.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/rest.log.2.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/oai.log.2.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/error.log.2.gz /var/log/nginx/error.log.3.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -v bitstream | grep -Eo '\.(js|css|png|jpg|jpeg|php|svg|gif|txt|map)$' | sort | uniq -c | sort -n
           2 .gif
           7 .css
          84 .js
     
    • Looking at the REST requests, most of them are to expand all or metadata, but 5% are for retrieving bitstreams:
# zcat --force /var/log/nginx/access.log.3.gz /var/log/nginx/access.log.4.gz /var/log/nginx/library-access.log.3.gz /var/log/nginx/library-access.log.4.gz /var/log/nginx/rest.log.3.gz /var/log/nginx/rest.log.4.gz /var/log/nginx/oai.log.3.gz /var/log/nginx/oai.log.4.gz /var/log/nginx/error.log.3.gz /var/log/nginx/error.log.4.gz | grep "21/Jan/2018" | grep "GET " | grep -v "/admin" | awk '{print $7}' | grep -E "^/rest" | grep -Eo "(retrieve|expand=[a-z].*)" | sort | uniq -c | sort -n
           1 expand=collections
          16 expand=all&limit=1
          45 expand=items
     
    • I finished creating the test plan for DSpace Test and ran it from my Linode with:
$ jmeter -n -t DSpacePerfTest-dspacetest.cgiar.org.jmx -l 2018-01-24-1.jtl
     
    • Atmire responded to my issue from two weeks ago and said they will start looking into DSpace 5.8 compatibility for CGSpace
    • I set up a new Arch Linux Linode instance with 8192 MB of RAM and ran the test plan a few times to get a baseline:
# lscpu
     Architecture:        x86_64
     CPU op-mode(s):      32-bit, 64-bit
     
    • Then I generated reports for these runs like this:
$ jmeter -g 2018-01-24-linode5451120-baseline.jtl -o 2018-01-24-linode5451120-baseline
     

    2018-01-25

    • Run another round of tests on DSpace Test with jmeter after changing Tomcat’s minSpareThreads to 20 (default is 10) and acceptorThreadCount to 2 (default is 1):
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads2.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-tomcat-threads3.log
     
    • I changed the parameters back to the baseline ones and switched the Tomcat JVM garbage collector to G1GC and re-ran the tests
    • JVM options for Tomcat changed from -Xms3072m -Xmx3072m -XX:+UseConcMarkSweepGC to -Xms3072m -Xmx3072m -XX:+UseG1GC -XX:+PerfDisableSharedMem
$ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc2.log
     $ ./jmeter -n -t ~/dspace-performance-test/DSpacePerfTest-dspacetest.cgiar.org.jmx -l ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.jtl -j ~/dspace-performance-test/2018-01-25-linode5451120-g1gc3.log
     
    • The problem is that Peter wanted to use two questions, one for CG centers and one for other, but using the same metadata value, which isn’t possible (?)
    • So I used some creativity and made several fields display values, but not store any, ie:
<pair>
       <displayed-value>For products published by another party:</displayed-value>
       <stored-value></stored-value>
     </pair>
     
  • CGSpace went down this morning for a few minutes, according to UptimeRobot
  • Looking at the DSpace logs I see this error happened just before UptimeRobot noticed it going down:
2018-01-29 05:30:22,226 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=3775D4125D28EF0C691B08345D905141:ip_addr=68.180.229.254:view_item:handle=10568/71890
     2018-01-29 05:30:22,322 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractSearch @ org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1994+TO+1999]': Encountered " "]" "] "" at line 1, column 32.
     Was expecting one of:
         "TO" ...
     
• I see a few dozen HTTP 499 errors in the nginx access log for a few minutes before this happened, but HTTP 499 is just when nginx says that the client closed the request early (see the sketch below for a per-client count)
  • Perhaps this from the nginx error log is relevant?
2018/01/29 05:26:34 [warn] 26895#26895: *944759 an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/6/16/0000026166 while reading upstream, client: 180.76.15.34, server: cgspace.cgiar.org, request: "GET /bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12 HTTP/1.1", upstream: "http://127.0.0.1:8443/bitstream/handle/10947/4658/FISH%20Leaflet.pdf?sequence=12", host: "cgspace.cgiar.org"
     
# awk '($9 ~ /200/) { i++;sum+=$10;max=$10>max?$10:max; } END { printf("Maximum: %d\nAverage: %d\n",max,i?sum/i:0); }' /var/log/nginx/access.log
     Maximum: 2771268
     Average: 210483
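
• Going back to those HTTP 499s, counting them per client IP uses the same status field as the awk above (a sketch):
# awk '($9 == 499) {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -n | tail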
     
    • My best guess is that the Solr search error is related somehow but I can’t figure it out
    • We definitely have enough database connections, as I haven’t seen a pool error in weeks:
$ grep -c "Timeout: Pool empty." dspace.log.2018-01-2*
     dspace.log.2018-01-20:0
     dspace.log.2018-01-21:0
     dspace.log.2018-01-22:0
dspace.log.2018-01-29:0
     
  • Wow, so apparently you need to specify which connector to check if you want any of the Munin Tomcat plugins besides “tomcat_jvm” to work (the connector name can be seen in the Catalina logs)
  • I modified /etc/munin/plugin-conf.d/tomcat to add the connector (with surrounding quotes!) and now the other plugins work (obviously the credentials are incorrect):
[tomcat_*]
         env.host 127.0.0.1
         env.port 8081
         env.connector "http-bio-127.0.0.1-8443"
     
    • For example, I can see the threads:
# munin-run tomcat_threads
     busy.value 0
     idle.value 20
     max.value 400
     
  • Although following the logic of /usr/share/munin/plugins/jmx_tomcat_dbpools could be useful for getting the active Tomcat sessions
  • From debugging the jmx_tomcat_db_pools script from the munin-plugins-java package, I see that this is how you call arbitrary mbeans:
# port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=DataSource,class=javax.sql.DataSource,name=* maxActive
     Catalina:type=DataSource,class=javax.sql.DataSource,name="jdbc/dspace"  maxActive       300
     
[===================>                               ]38% time remaining: 5 hour(s) 21 minute(s) 47 seconds. timestamp: 2018-01-29 06:25:16
     
    • There are millions of these status lines, for example in just this one log file:
# zgrep -c "time remaining" /var/log/tomcat7/catalina.out.1.gz
     1084741
     
# munin-run tomcat_threads
     busy.value 400
     idle.value 0
     max.value 400
     
    • And wow, we finally exhausted the database connections, from dspace.log:
2018-01-31 08:05:28,964 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error -
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-451] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:300; busy:300; idle:0; lastwait:5000].
     
    • Now even the nightly Atmire background thing is getting HTTP 500 error:
Jan 31, 2018 8:16:05 AM com.sun.jersey.spi.container.ContainerResponse logException
     SEVERE: Mapped exception to response: 500 (Internal Server Error)
     javax.ws.rs.WebApplicationException
     
    • For now I will restart Tomcat to clear this shit and bring the site back up
    • The top IPs from this morning, during 7 and 8AM in XMLUI and REST/OAI:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "31/Jan/2018:(07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          67 66.249.66.70
          70 207.46.13.12
          71 197.210.168.174
     
  • I should make separate database pools for the web applications and the API applications like REST and OAI
  • Ok, so this is interesting: I figured out how to get the MBean path to query Tomcat’s activeSessions from JMX (using munin-plugins-java):
# port=5400 ip="127.0.0.1" /usr/bin/java -cp /usr/share/munin/munin-jmx-plugins.jar org.munin.plugin.jmx.Beans Catalina:type=Manager,context=/,host=localhost activeSessions
     Catalina:type=Manager,context=/,host=localhost  activeSessions  8
     
    • If you connect to Tomcat in jvisualvm it’s pretty obvious when you hover over the elements

2018-02

• We don’t need to distinguish between internal and external works
• Yesterday I figured out how to monitor DSpace sessions using JMX
• I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01
    • Run all system updates and reboot DSpace Test
    • Wow, I packaged up the jmx_dspace_sessions stuff in the Ansible infrastructure scripts and deployed it on CGSpace and it totally works:
# munin-run jmx_dspace_sessions
     v_.value 223
     v_jspui.value 1
     v_oai.value 0
     
  • I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January
  • After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:
$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
     $ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
     
    • Then I started a full Discovery reindex:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     
     real    96m39.823s
     user    14m10.975s
sys     2m29.088s
     
    • Generate a new list of affiliations for Peter to sort through:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
     COPY 3723
     
    • Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in December:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
     3126109
     
     real    0m23.839s
sys     0m1.905s
     
    • Toying with correcting authors with trailing spaces via PostgreSQL:
dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
     UPDATE 20
     
• I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn’t go away (see the sketch below)
    • This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.
    • Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
     COPY 55630
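
• For reference, the TRIM attempt mentioned above was roughly this (a sketch; the exact WHERE clause may have differed):
dspace=# update metadatavalue set text_value=TRIM(TRAILING FROM text_value) where resource_type_id=2 and metadata_field_id=3 and text_value ~ '\s+$';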
     

    2018-02-06

    • I see 308 PostgreSQL connections in pg_stat_activity
    • The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:
# date
     Tue Feb  6 09:30:32 UTC 2018
     # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           2 223.185.41.40
     
  • CGSpace crashed again, this time around Wed Feb 7 11:20:28 UTC 2018
• I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on; the connections were very high at first but reduced on their own:
$ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
     $ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
     /tmp/pg_stat_activity1.txt:300
     /tmp/pg_stat_activity2.txt:272
     
    • Interestingly, all of those 751 connections were idle!
$ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
     751
     
    • Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps
      • Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:
$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
       1828
       
      • CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)
      • What’s interesting is that the DSpace log says the connections are all busy:
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
       
      • … but in PostgreSQL I see them idle or idle in transaction:
$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
       250
       $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
       250
       
    • I will try testOnReturn='true' too, just to add more validation, because I’m fucking grasping at straws
    • Also, WTF, there was a heap space error randomly in catalina.out:
Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
     Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
     
    • I’m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!
    • Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:
$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
          34 ip_addr=46.229.168.67
          34 ip_addr=46.229.168.73
          37 ip_addr=46.229.168.76
     
    • These IPs made thousands of sessions today:
$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     530
     $ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
     859
     
  • What in the actual fuck, why is our load doing this? It’s gotta be something fucked up with the database pool being “busy” but everything is fucking idle
  • One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:
BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
     
    • This one makes two thousand requests per day or so recently:
# grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
     /var/log/nginx/access.log:1925
     /var/log/nginx/access.log.1:2029
     
    • Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker
    • This is how the connections looked when it crashed this afternoon:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
         290 dspaceWeb
     
    • This is how it is right now:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           5 dspaceWeb
     
    • Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn’t show up on the item
    • Leave all settings but change choices.presentation to lookup and ORCID badge is there and item submission uses LC Name Authority and it breaks with this error:
Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
     
    • If I change choices.presentation to suggest it give this error:
xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
     
    • So I don’t think we can disable the ORCID lookup function and keep the ORCID badges
    • I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:
$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg
     

    Manual thumbnail

    • Peter sent me corrected author names last week but the file encoding is messed up:
$ isutf8 authors-2018-02-05.csv
     authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
     
    • The isutf8 program comes from moreutils
    • I updated my fix-metadata-values.py and delete-metadata-values.py scripts on the scripts page: https://github.com/ilri/DSpace/wiki/Scripts
    • I ran the 342 author corrections (after trimming whitespace and excluding those with || and other syntax errors) on CGSpace:
$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
     
    • Then I ran a full Discovery re-indexing:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
        text_value    |              authority               | confidence 
     -----------------+--------------------------------------+------------
      Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |        600
     
  • I see that in April, 2017 I just used a SQL query to get a user’s submissions by checking the dc.description.provenance field
  • So for Abenet, I can check her submissions in December, 2017 with:
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';
     
    • I emailed Peter to ask whether we can move DSpace Test to a new Linode server and attach 300 GB of disk space to it
    • This would be using Linode’s new block storage volumes
    • Peter said he was getting a “socket closed” error on CGSpace
    • I looked in the dspace.log.2018-02-13 and saw one recent one:
2018-02-13 12:50:13,656 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
     ...
     Caused by: java.net.SocketException: Socket closed
     
    • Could be because of the removeAbandoned="true" that I enabled in the JDBC connection pool last week?
$ grep -c "java.net.SocketException: Socket closed" dspace.log.2018-02-*
     dspace.log.2018-02-01:0
     dspace.log.2018-02-02:0
     dspace.log.2018-02-03:0
dspace.log.2018-02-13:4
     
  • I will increase the removeAbandonedTimeout from its default of 60 to 90 and enable logAbandoned
  • Peter hit this issue one more time, and this is apparently what Tomcat’s catalina.out log says when an abandoned connection is removed:
Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
     WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgConnection@22e107be]:java.lang.Exception
     

    2018-02-14

• Atmire responded on the DSpace 5.8 compatibility ticket and said they will let me know if they want me to give them a clean 5.8 branch
    • I formatted my list of ORCID IDs as a controlled vocabulary, sorted alphabetically, then ran through XML tidy:
$ sort cgspace-orcids.txt > dspace/config/controlled-vocabularies/cg-creator-id.xml
     $ add XML formatting...
     $ tidy -xml -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
     
• It seems the tidy fucks up accents, for example it turns Adriana Tofiño (0000-0001-7115-7169) into Adriana Tofiño (0000-0001-7115-7169)
    • We need to force UTF-8:
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
     
    • This preserves special accent characters
    • I tested the display and store of these in the XMLUI and PostgreSQL and it looks good
    • Sisay exported all ILRI, CIAT, etc authors from ORCID and sent a list of 600+
    • Peter combined it with mine and we have 1204 unique ORCIDs!
$ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
     1204
     $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
     1204
     
  • Also, save that regex for the future because it will be very useful!
  • CIAT sent a list of their authors' ORCIDs and combined with ours there are now 1227:
$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1227
     
    • There are some formatting issues with names in Peter’s list, so I should remember to re-generate the list of names from ORCID’s API once we’re done
    • The dspace cleanup -v currently fails on CGSpace with the following:
 - Deleting bitstream record from database (ID: 149473)
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(149473) is still referenced from table "bundle".
     
    • The solution is to update the bitstream table, as I’ve discovered several other times in 2016 and 2017:
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
     UPDATE 1
     
• Then the cleanup process will continue for a while and hit another foreign key conflict, and eventually it will complete after you manually resolve them all
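• Rather than resolving them one at a time, something like this should clear all the potential conflicts in one go (a sketch; it assumes the deleted flag on the bitstream table marks the rows that cleanup is trying to remove):
$ psql dspace -c "update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (select bitstream_id from bitstream where deleted is true);"
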
    • I only looked quickly in the logs but saw a bunch of database errors
    • PostgreSQL connections are currently:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
           2 dspaceApi
           1 dspaceWeb
           3 dspaceApi
     
    • I see shitloads of memory errors in Tomcat’s logs:
# grep -c "Java heap space" /var/log/tomcat7/catalina.out
     56
     
    • And shit tons of database connections abandoned:
# grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
     612
     
    • I have no fucking idea why it crashed
    • The XMLUI activity looks like:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "15/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         715 63.143.42.244
         746 213.55.99.121
         886 68.180.228.157
     
• I made a pull request to fix it (#354: https://github.com/ilri/DSpace/pull/354)
  • I should remember to update existing values in PostgreSQL too:
dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
     UPDATE 2
     

    2018-02-18

    • Run system updates on DSpace Test (linode02) and reboot the server
    • Looking back at the system errors on 2018-02-15, I wonder what the fuck caused this:
$ wc -l dspace.log.2018-02-1{0..8}
        383483 dspace.log.2018-02-10
        275022 dspace.log.2018-02-11
        249557 dspace.log.2018-02-12
     
  • From an average of a few hundred thousand to over four million lines in DSpace log?
  • Using grep’s -B1 I can see the line before the heap space error, which has the time, ie:
2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
     
    • So these errors happened at hours 16, 18, 19, and 20
    • Let’s see what was going on in nginx then:
# zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
     168571
     # zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | wc -l
     8188
     
  • Only 8,000 requests during those four hours, out of 170,000 the whole day!
  • And the usage of XMLUI, REST, and OAI looks SUPER boring:
# zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         111 95.108.181.88
         158 45.5.184.221
         201 104.196.152.243
    @@ -677,20 +677,20 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
     
    • Combined list of CGIAR author ORCID iDs is up to 1,500:
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l  
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l  
     1571
     
    • I updated my resolve-orcids-from-solr.py script to be able to resolve ORCID identifiers from a text file so I renamed it to resolve-orcids.py
    • Also, I updated it so it uses several new options:
    -
    $ ./resolve-orcids.py -i input.txt -o output.txt
    +
    $ ./resolve-orcids.py -i input.txt -o output.txt
     $ cat output.txt 
     Ali Ramadhan: 0000-0001-5019-1368
     Ahmad Maryudi: 0000-0001-5051-7217
     
    • I was running this on the new list of 1571 and found an error:
    -
    Looking up the name associated with ORCID iD: 0000-0001-9634-1958
    +
    Looking up the name associated with ORCID iD: 0000-0001-9634-1958
     Traceback (most recent call last):
       File "./resolve-orcids.py", line 111, in <module>
         read_identifiers_from_file()
    @@ -704,7 +704,7 @@ TypeError: 'NoneType' object is not subscriptable
     
  • I fixed the script so that it checks if the family name is null
  • Now another:
  • -
    Looking up the name associated with ORCID iD: 0000-0002-1300-3636
    +
    Looking up the name associated with ORCID iD: 0000-0002-1300-3636
     Traceback (most recent call last):
       File "./resolve-orcids.py", line 117, in <module>
         read_identifiers_from_file()
    @@ -722,13 +722,13 @@ TypeError: 'NoneType' object is not subscriptable
     
• Discuss some of the issues with null values and poor-quality names in some ORCID identifiers with Abenet, and I think we’ll now only use ORCID iDs that have been sent to us by partners, not those extracted via keyword searches on orcid.org
  • This should be the version we use (the existing controlled vocabulary generated from CGSpace’s Solr authority core plus the IDs sent to us so far by partners):
  • -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-02-20-combined.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-02-20-combined.txt
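• As a sanity check it might be useful to see how many of the combined identifiers are not already in the controlled vocabulary, for example with comm (a sketch; the /tmp/existing-orcids.txt file name is just for illustration, and both inputs are sorted as comm requires):

```console
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort -u > /tmp/existing-orcids.txt
$ comm -13 /tmp/existing-orcids.txt 2018-02-20-combined.txt | wc -l
```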
     
    • I updated the resolve-orcids.py to use the “credit-name” if it exists in a profile, falling back to “given-names” + “family-name”
• Also, I added color-coded output to the debug messages and added a “quiet” mode that suppresses the normal behavior of printing results to the screen
    • I’m using this as the test input for resolve-orcids.py:
    -
    $ cat orcid-test-values.txt 
    +
    $ cat orcid-test-values.txt 
     # valid identifier with 'given-names' and 'family-name'
     0000-0001-5019-1368
     
    @@ -770,7 +770,7 @@ TypeError: 'NoneType' object is not subscriptable
     
  • It looks like Sisay restarted Tomcat because I was offline
  • There was absolutely nothing interesting going on at 13:00 on the server, WTF?
  • -
    # cat /var/log/nginx/*.log | grep -E "22/Feb/2018:13" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # cat /var/log/nginx/*.log | grep -E "22/Feb/2018:13" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          55 192.99.39.235
          60 207.46.13.26
          62 40.77.167.38
    @@ -784,7 +784,7 @@ TypeError: 'NoneType' object is not subscriptable
     
    • Otherwise there was pretty normal traffic the rest of the day:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         839 216.244.66.245
        1074 68.180.228.117
        1114 157.55.39.100
    @@ -798,7 +798,7 @@ TypeError: 'NoneType' object is not subscriptable
     
• So I don’t see any definite cause for this crash, but I do see a shit ton of abandoned PostgreSQL connections today around 1PM!
    -
    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
    +
    # grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
     729
     # grep 'Feb 22, 2018 1' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' 
     519
    @@ -807,7 +807,7 @@ TypeError: 'NoneType' object is not subscriptable
     
• Abandoned connections are not a cause but a symptom, though perhaps something more like a few minutes would be better?
  • Also, while looking at the logs I see some new bot:
  • -
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.4.2661.102 Safari/537.36; 360Spider
    +
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.4.2661.102 Safari/537.36; 360Spider
     
    • It seems to re-use its user agent but makes tons of useless requests and I wonder if I should add “.spider.” to the Tomcat Crawler Session Manager valve?
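• Before touching the valve it would be worth checking whether this bot actually creates new Tomcat sessions, using the same kind of grep I use for other bots (a sketch; the IP address and log date are placeholders since I only noted the user agent here):

```console
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=123.123.123.123' dspace.log.2018-02-23
```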
    @@ -820,19 +820,19 @@ TypeError: 'NoneType' object is not subscriptable
  • A few days ago Abenet sent me the list of ORCID iDs from CCAFS
  • We currently have 988 unique identifiers:
  • -
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l          
    +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l          
     988
     
    • After adding the ones from CCAFS we now have 1004:
    -
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
    +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1004
     
• I will add them to DSpace Test but Abenet says she’s still waiting to send us ILRI’s list
    • I will tell her that we should proceed on sharing our work on DSpace Test with the partners this week anyways and we can update the list later
    • While regenerating the names for these ORCID identifiers I saw one that has a weird value for its names:
    -
    Looking up the names associated with ORCID iD: 0000-0002-2614-426X
    +
    Looking up the names associated with ORCID iD: 0000-0002-2614-426X
     Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
     
    • I don’t know if the user accidentally entered this as their name or if that’s how ORCID behaves when the name is private?
    • @@ -843,7 +843,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
    • Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace
    • We have over 60,000 unique author + authority combinations on CGSpace:
    -
    dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
    +
    dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
      count 
     -------
      62464
    @@ -853,7 +853,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
     
  • The query in Solr would simply be orcid_id:*
• Assuming I know an authority record with id d7ef744b-bbd4-4171-b449-00e37e1b776f, I could query PostgreSQL for all metadata records using that authority:
  • -
    dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
    +
    dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
      metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id 
     -------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
                2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
    @@ -862,13 +862,13 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
     
  • Then I suppose I can use the resource_id to identify the item?
  • Actually, resource_id is the same id we use in CSV, so I could simply build something like this for a metadata import!
  • -
    id,cg.creator.id
    +
    id,cg.creator.id
     93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
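• Such a CSV could then presumably be applied with DSpace’s batch metadata editing tool (a sketch; the file name is hypothetical and I would test it on DSpace Test first):

```console
$ dspace metadata-import -f /tmp/2018-02-orcid-metadata.csv
```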
     
    • I just discovered that requests-cache can transparently cache HTTP requests
    • Running resolve-orcids.py with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!
    -
    $ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
    +
    $ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
     Ali Ramadhan: 0000-0001-5019-1368
     Alan S. Orth: 0000-0002-1735-7458
     Ibrahim Mohammed: 0000-0001-5199-5528
    @@ -896,7 +896,7 @@ Nor Azwadi: 0000-0001-9634-1958
     
  • I need to see which SQL queries are run during that time
  • And only a few hours after I disabled the removeAbandoned thing CGSpace went down and lo and behold, there were 264 connections, most of which were idle:
  • -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
         279 dspaceWeb
     $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
    @@ -905,7 +905,7 @@ $ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle
     
  • So I’m re-enabling the removeAbandoned setting
  • I grabbed a snapshot of the active connections in pg_stat_activity for all queries running longer than 2 minutes:
  • -
    dspace=# \copy (SELECT now() - query_start as "runtime", application_name, usename, datname, waiting, state, query
    +
    dspace=# \copy (SELECT now() - query_start as "runtime", application_name, usename, datname, waiting, state, query
       FROM  pg_stat_activity
       WHERE now() - query_start > '2 minutes'::interval
      ORDER BY runtime DESC) to /tmp/2018-02-27-postgresql.txt
    @@ -913,11 +913,11 @@ COPY 263
     
• 100 of these idle-in-transaction connections show the following query:
    -
    SELECT * FROM resourcepolicy WHERE resource_type_id= $1 AND resource_id= $2 AND action_id= $3
    +
    SELECT * FROM resourcepolicy WHERE resource_type_id= $1 AND resource_id= $2 AND action_id= $3
     
    • … but according to the pg_locks documentation I should have done this to correlate the locks with the activity:
    -
    SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;
    +
    SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;
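• A variation on that join that might be handy is grouping the locks by application name to see who is holding them (a sketch, run from the shell with psql):

```console
$ psql dspace -c 'SELECT psa.application_name, COUNT(*) FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid GROUP BY psa.application_name ORDER BY count DESC;'
```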
     
    • Tom Desair from Atmire shared some extra JDBC pool parameters that might be useful on my thread on the dspace-tech mailing list:
        @@ -936,7 +936,7 @@ COPY 263
      • CGSpace crashed today, the first HTTP 499 in nginx’s access.log was around 09:12
      • There’s nothing interesting going on in nginx’s logs around that time:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Feb/2018:09:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Feb/2018:09:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
            65 197.210.168.174
            74 213.55.99.121
            74 66.249.66.90
      @@ -950,12 +950,12 @@ COPY 263
       
• Looking in dspace.log.2018-02-28 I see this, though:
      -
      2018-02-28 09:19:29,692 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
      +
      2018-02-28 09:19:29,692 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
       org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
       
      • Memory issues seem to be common this month:
      -
      $ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-* 
      +
      $ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-* 
       dspace.log.2018-02-01:0
       dspace.log.2018-02-02:0
       dspace.log.2018-02-03:0
      @@ -987,7 +987,7 @@ dspace.log.2018-02-28:1
       
      • Top ten users by session during the first twenty minutes of 9AM:
      -
      $ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
      +
      $ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
            18 session_id=F2DFF64D3D707CD66AE3A873CEC80C49
            19 session_id=92E61C64A79F0812BE62A3882DA8F4BA
            21 session_id=57417F5CB2F9E3871E609CEEBF4E001F
      @@ -1006,13 +1006,13 @@ dspace.log.2018-02-28:1
       
    • I think I’ll increase the JVM heap size on CGSpace from 6144m to 8192m because I’m sick of this random crashing shit and the server has memory and I’d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work
    • Run the few corrections from earlier this month for sponsor on CGSpace:
    -
    cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
    +
    cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
     UPDATE 3
     
    • I finally got a CGIAR account so I logged into CGSpace with it and tried to delete my old unfinished submissions (22 of them)
    • Eventually it succeeded, but it took about five minutes and I noticed LOTS of locks happening with this query:
    -
    dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid) to /tmp/locks-aorth.txt;
    +
    dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid) to /tmp/locks-aorth.txt;
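• Next time it might be easier to just watch the total number of locks while a heavy operation like this is running (a sketch; the two-second interval is arbitrary):

```console
$ watch -n 2 "psql dspace -c 'SELECT COUNT(*) FROM pg_locks;'"
```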
     
• I took a few snapshots during the process and noticed 500, 800, and even 2,000 locks at certain times
    • Afterwards I looked a few times and saw only 150 or 200 locks
diff --git a/docs/2018-03/index.html b/docs/2018-03/index.html
index 6656ecfa6..8e4b6462d 100644
--- a/docs/2018-03/index.html
+++ b/docs/2018-03/index.html
@@ -24,7 +24,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
 Export a CSV of the IITA community metadata for Martin Mueller
 "/>
-
+
@@ -122,7 +122,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
    • There were some records using a non-breaking space in their AGROVOC subject field
    • I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace
    -
    $ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3      
    +
    $ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3      
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
     
    • This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character
    • @@ -132,16 +132,16 @@ $ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u d
    • Run all system updates on DSpace Test and reboot server
    • I ran the orcid-authority-to-item.py script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata
    -
    $ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
    +
    $ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
     
    • I ran the DSpace cleanup script on CGSpace and it threw an error (as always):
    -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
     UPDATE 1
     
    • Apply the proposed PostgreSQL indexes from DS-3636 (pull request #1791 on CGSpace (linode18)
    • @@ -159,7 +159,7 @@ UPDATE 1
    • This makes the CSV have tons of columns, for example dc.title, dc.title[], dc.title[en], dc.title[eng], dc.title[en_US] and so on!
    • I think I can fix — or at least normalize — them in the database:
    -
    dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
    +
    dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
      text_lang 
     -----------
      
    @@ -199,7 +199,7 @@ dspacetest=# select distinct text_lang from metadatavalue where resource_type_id
     
  • On second inspection it looks like dc.description.provenance fields use the text_lang “en” so that’s probably why there are over 100,000 fields changed…
• If I skip that, there are about 2,000, which seems like a more reasonable number of fields for users to have edited manually, or fucked up during CSV import, etc:
  • -
    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
    +
    dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
     UPDATE 2309
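• Before running the same update on CGSpace it would be worth checking how many rows each text_lang variant actually has (a sketch against the same metadatavalue table as above):

```console
$ psql dspace -c 'SELECT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;'
```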
     
    • I will apply this on CGSpace right now
    • @@ -207,18 +207,18 @@ UPDATE 2309
    • Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the cg.creator.id field
    • For example, a GREL expression in a custom text facet to get all items with dc.contributor.author[en_US] of a certain author with several name variations (this is how you use a logical OR in OpenRefine):
    -
    or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
    +
    or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
     
    • Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:
    -
    if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
    +
    if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
     
    • One thing that bothers me is that this won’t honor author order
    • It might be better to do batches of these in PostgreSQL with a script that takes the place column of an author into account when setting the cg.creator.id
    • I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching cg.creator.id fields: add-orcid-identifiers-csv.py
    • The CSV should have two columns: author name and ORCID identifier:
    -
    dc.contributor.author,cg.creator.id
    +
    dc.contributor.author,cg.creator.id
     "Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
     "Orth, A.",Alan S. Orth: 0000-0002-1735-7458
     
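• Then the script is run against that CSV in the usual way (a sketch; the CSV path and database credentials are placeholders):

```console
$ ./add-orcid-identifiers-csv.py -i /tmp/2018-03-orcids.csv -db dspace -u dspace -p 'fuuu'
```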
      @@ -236,7 +236,7 @@ UPDATE 2309
    • Peter also wrote to say he is having issues with the Atmire Listings and Reports module
    • When I logged in to try it I get a blank white page after continuing and I see this in dspace.log.2018-03-11:
    -
    2018-03-11 11:38:15,592 WARN  org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.or
    +
    2018-03-11 11:38:15,592 WARN  org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.or
     g/jspui/listings-and-reports
     -- Method: POST
     -- Parameters were:
    @@ -282,7 +282,7 @@ org.apache.jasper.JasperException: java.lang.NullPointerException
     
    • The error in the DSpace log is:
    -
    org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1
    +
    org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1
     
    • The full error is here: https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca
    • If I do a report for “Orth, Alan” with the same custom layout it works!
    • @@ -295,16 +295,16 @@ org.apache.jasper.JasperException: java.lang.NullPointerException
    • I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164
    • Looking at the CRP subjects on CGSpace I see there is one blank one so I’ll just fix it:
    -
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
    +
    dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
     
    • Copy all CRP subjects to a CSV to do the mass updates:
    -
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
    +
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
     COPY 21
     
    • Once I prepare the new input forms (#362) I will need to do the batch corrections:
    -
    $ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
    +
    $ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
     
    • Create a pull request to update the input forms for the new CRP subject style (#366)
    @@ -316,13 +316,13 @@ COPY 21
  • CGSpace crashed this morning for about seven minutes and Dani restarted Tomcat
  • Around that time there were an increase of SQL errors:
  • -
    2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
    +
    2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     ...
     2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
     
    • But these errors, I don’t even know what they mean, because a handful of them happen every day:
    -
    $ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
    +
    $ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
     dspace.log.2018-03-10:13
     dspace.log.2018-03-11:15
     dspace.log.2018-03-12:13
    @@ -336,7 +336,7 @@ dspace.log.2018-03-19:90
     
    • There wasn’t even a lot of traffic at the time (8–9 AM):
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          92 40.77.167.197
          92 83.103.94.48
          96 40.77.167.175
    @@ -350,7 +350,7 @@ dspace.log.2018-03-19:90
     
    • Well there is a hint in Tomcat’s catalina.out:
    -
    Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
    +
    Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
     Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
     
    • So someone was doing something heavy somehow… my guess is content and usage stats!
    • @@ -367,7 +367,7 @@ Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOf
      • DSpace Test has been down for a few hours with SQL and memory errors starting this morning:
      -
      2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
      +
      2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
       ...
       2018-03-20 08:53:11,624 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
       org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
      @@ -377,20 +377,20 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
       
    • Abenet told me that one of Lance Robinson’s ORCID iDs on CGSpace is incorrect
    • I will remove it from the controlled vocabulary (#367) and update any items using the old one:
    -
    dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
    +
    dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
     UPDATE 1
     
    • Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits
    • Merge the changes to CRP names to the 5_x-prod branch and deploy on CGSpace (#363)
    • Run corrections for CRP names in the database:
    -
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
     
    • Run all system updates on CGSpace (linode18) and reboot the server
    • I started a full Discovery re-index on CGSpace because of the updated CRPs
    • I see this error in the DSpace log:
    -
    2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for  field "dc_contributor_author".
    +
    2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for  field "dc_contributor_author".
     java.lang.IllegalArgumentException: No choices plugin was configured for  field "dc_contributor_author".
             at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
             at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
    @@ -427,28 +427,28 @@ java.lang.IllegalArgumentException: No choices plugin was configured for  field
     
  • Afterwards we’ll want to do some batch tagging of ORCID identifiers to these names
  • CGSpace crashed again this afternoon, I’m not sure of the cause but there are a lot of SQL errors in the DSpace log:
  • -
    2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
    +
    2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
     java.sql.SQLException: Connection has already been closed.
     
    • I have no idea why so many connections were abandoned this afternoon:
    -
    # grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
    +
    # grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
     268
     
    • DSpace Test crashed again due to Java heap space, this is from the DSpace log:
    -
    2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
    +
    2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
     
    • And this is from the Tomcat Catalina log:
    -
    Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
    +
    Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
     SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
     java.lang.OutOfMemoryError: Java heap space
     
    • But there are tons of heap space errors on DSpace Test actually:
    -
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
    +
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     319
     
    • I guess we need to give it more RAM because it now has CGSpace’s large Solr core
    • @@ -457,7 +457,7 @@ java.lang.OutOfMemoryError: Java heap space
    • Deploy the new JDBC driver on DSpace Test
    • I’m also curious to see how long the dspace index-discovery -b takes on DSpace Test where the DSpace installation directory is on one of Linode’s new block storage volumes
    -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    208m19.155s
     user    8m39.138s
    @@ -470,7 +470,7 @@ sys     2m45.135s
     
• For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields
  • I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:
  • -
    isNotNull(value.match(/.*\ufffd.*/))
    +
    isNotNull(value.match(/.*\ufffd.*/))
     
    • I need to be able to add many common characters though so that it is useful to copy and paste into a new project to find issues
    @@ -489,11 +489,11 @@ sys 2m45.135s
  • Looking at Peter’s author corrections and trying to work out a way to find errors in OpenRefine easily
  • I can find all names that have acceptable characters using a GREL expression like:
  • -
    isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
    +
    isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
     
    • But it’s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):
    -
    or(
    +
    or(
       isNotNull(value.match(/.*[(|)].*/)),
       isNotNull(value.match(/.*\uFFFD.*/)),
       isNotNull(value.match(/.*\u00A0.*/)),
    @@ -502,7 +502,7 @@ sys     2m45.135s
     
• And here’s one combined GREL expression to check for items marked as “delete” or “check” so I can flag them and export them to a separate CSV (though perhaps it’s time to add delete support to my fix-metadata-values.py script):
    -
    or(
    +
    or(
       isNotNull(value.match(/.*delete.*/i)),
       isNotNull(value.match(/.*remove.*/i)),
       isNotNull(value.match(/.*check.*/i))
    @@ -521,7 +521,7 @@ sys     2m45.135s
     

    Test the corrections and deletions locally, then run them on CGSpace:

    -
    $ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    +
    $ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
     $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
     
    • Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test
    • @@ -542,12 +542,12 @@ $ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.cont
    • DSpace Test crashed due to heap space so I’ve increased it from 4096m to 5120m
    • The error in Tomcat’s catalina.out was:
    -
    Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
    +
    Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
     
    • Add ISI Journal (cg.isijournal) as an option in Atmire’s Listing and Reports layout (#370) for Abenet
    • I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:
    -
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
     Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
     Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
     Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
    diff --git a/docs/2018-04/index.html b/docs/2018-04/index.html
    index 1fbf1e267..75045df5d 100644
    --- a/docs/2018-04/index.html
    +++ b/docs/2018-04/index.html
    @@ -26,7 +26,7 @@ Catalina logs at least show some memory errors yesterday:
     I tried to test something on DSpace Test but noticed that it’s down since god knows when
     Catalina logs at least show some memory errors yesterday:
     "/>
    -
    +
     
     
         
    @@ -117,7 +117,7 @@ Catalina logs at least show some memory errors yesterday:
     
  • I tried to test something on DSpace Test but noticed that it’s down since god knows when
  • Catalina logs at least show some memory errors yesterday:
  • -
    Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
    +
    Mar 31, 2018 10:26:42 PM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
     SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]] 
     java.lang.OutOfMemoryError: Java heap space
     
    @@ -134,12 +134,12 @@ Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]
     
  • Peter noticed that there were still some old CRP names on CGSpace, because I hadn’t forced the Discovery index to be updated after I fixed the others last week
  • For completeness I re-ran the CRP corrections on CGSpace:
  • -
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
     Fixed 1 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
     
    • Then started a full Discovery index:
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    76m13.841s
    @@ -149,18 +149,18 @@ sys     2m2.498s
     
  • Elizabeth from CIAT emailed to ask if I could help her by adding ORCID identifiers to all of Joseph Tohme’s items
  • I used my add-orcid-identifiers-csv.py script:
  • -
    $ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i /tmp/jtohme-2018-04-04.csv -db dspace -u dspace -p 'fuuu'
     
    • The CSV format of jtohme-2018-04-04.csv was:
    -
    dc.contributor.author,cg.creator.id
    +
    dc.contributor.author,cg.creator.id
     "Tohme, Joseph M.",Joe Tohme: 0000-0003-2765-7101
     
    • There was a quoting error in my CRP CSV and the replacements for Forests, Trees and Agroforestry got messed up
    • So I fixed them and had to re-index again!
• I started preparing the git branch for the DSpace 5.5→5.8 upgrade:
    -
    $ git checkout -b 5_x-dspace-5.8 5_x-prod
    +
    $ git checkout -b 5_x-dspace-5.8 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.8
     
      @@ -181,7 +181,7 @@ $ git rebase -i dspace-5.8
    • Fix Sisay’s sudo access on the new DSpace Test server (linode19)
    • The reindexing process on DSpace Test took forever yesterday:
    -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    599m32.961s
     user    9m3.947s
    @@ -193,7 +193,7 @@ sys     2m52.585s
     
  • Help Peter with the GDPR compliance / reporting form for CGSpace
  • DSpace Test crashed due to memory issues again:
  • -
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
    +
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     16
     
    • I ran all system updates on DSpace Test and rebooted it
    • @@ -205,7 +205,7 @@ sys 2m52.585s
    • I got a notice that CGSpace CPU usage was very high this morning
    • Looking at the nginx logs, here are the top users today so far:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                   
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                   
         282 207.46.13.112
         286 54.175.208.220
         287 207.46.13.113
    @@ -220,24 +220,24 @@ sys     2m52.585s
     
  • 45.5.186.2 is of course CIAT
  • 95.108.181.88 appears to be Yandex:
  • -
    95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
    +
    95.108.181.88 - - [09/Apr/2018:06:34:16 +0000] "GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1" 200 2638 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
     
    • And for some reason Yandex created a lot of Tomcat sessions today:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-04-10
     4363
     
    • 70.32.83.92 appears to be some harvester we’ve seen before, but on a new IP
    • They are not creating new Tomcat sessions so there is no problem there
    • 178.154.200.38 also appears to be Yandex, and is also creating many Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=178.154.200.38' dspace.log.2018-04-10
     3982
     
    • I’m not sure why Yandex creates so many Tomcat sessions, as its user agent should match the Crawler Session Manager valve
    • Let’s try a manual request with and without their user agent:
    -
    $ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
    +
    $ http --print Hh https://cgspace.cgiar.org/bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg 'User-Agent:Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)'
     GET /bitstream/handle/10568/21794/ILRI_logo_usage.jpg.jpg HTTP/1.1
     Accept: */*
     Accept-Encoding: gzip, deflate
    @@ -294,7 +294,7 @@ X-XSS-Protection: 1; mode=block
     
    • In other news, it looks like the number of total requests processed by nginx in March went down from the previous months:
    -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2018"
     2266594
     
     real    0m13.658s
    @@ -303,25 +303,25 @@ sys     0m1.087s
     
    • In other other news, the database cleanup script has an issue again:
    -
    $ dspace cleanup -v
    +
    $ dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(151626) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (151626);'
     UPDATE 1
     
    • Looking at abandoned connections in Tomcat:
    -
    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
    +
    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
     2115
     
    • Apparently from these stacktraces we should be able to see which code is not closing connections properly
    • Here’s a pretty good overview of days where we had database issues recently:
    -
    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
    +
    # zcat /var/log/tomcat7/catalina.out.[1-9].gz | grep 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' | awk '{print $1,$2, $3}' | sort | uniq -c | sort -n
           1 Feb 18, 2018
           1 Feb 19, 2018
           1 Feb 20, 2018
    @@ -356,7 +356,7 @@ UPDATE 1
     
    • DSpace Test (linode19) crashed again some time since yesterday:
    -
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
    +
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     168
     
    • I ran all system updates and rebooted the server
    • @@ -374,12 +374,12 @@ UPDATE 1
      • While testing an XMLUI patch for DS-3883 I noticed that there is still some remaining Authority / Solr configuration left that we need to remove:
      -
      2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check "solr.authority.server" property in the dspace.cfg
      +
      2018-04-14 18:55:25,841 ERROR org.dspace.authority.AuthoritySolrServiceImpl @ Authority solr is not correctly configured, check "solr.authority.server" property in the dspace.cfg
       java.lang.NullPointerException
       
      • I assume we need to remove authority from the consumers in dspace/config/dspace.cfg:
      -
      event.dispatcher.default.consumers = authority, versioning, discovery, eperson, harvester, statistics,batchedit, versioningmqm
      +
      event.dispatcher.default.consumers = authority, versioning, discovery, eperson, harvester, statistics,batchedit, versioningmqm
       
      • I see the same error on DSpace Test so this is definitely a problem
      • After disabling the authority consumer I no longer see the error
      • @@ -387,7 +387,7 @@ java.lang.NullPointerException
      • File a ticket on DSpace’s Jira for the target="_blank" security and performance issue (DS-3891)
      • I re-deployed DSpace Test (linode19) and was surprised by how long it took the ant update to complete:
      -
      BUILD SUCCESSFUL
      +
      BUILD SUCCESSFUL
       Total time: 4 minutes 12 seconds
       
      • The Linode block storage is much slower than the instance storage
      • @@ -404,7 +404,7 @@ Total time: 4 minutes 12 seconds
      • They will need to use OpenSearch, but I can’t remember all the parameters
      • Apparently search sort options for OpenSearch are in dspace.cfg:
      -
      webui.itemlist.sort-option.1 = title:dc.title:title
      +
      webui.itemlist.sort-option.1 = title:dc.title:title
       webui.itemlist.sort-option.2 = dateissued:dc.date.issued:date
       webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
       webui.itemlist.sort-option.4 = type:dc.type:text
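• For reference, an OpenSearch query with an explicit sort and order would look something like this (a sketch; I haven’t verified the exact parameter names, but sort_by should correspond to the sort option numbers above):

```console
$ http 'https://cgspace.cgiar.org/open-search/discover?query=livestock&sort_by=2&order=DESC'
```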
      @@ -422,27 +422,27 @@ webui.itemlist.sort-option.4 = type:dc.type:text
       
    • They are missing the order parameter (ASC vs DESC)
    • I notice that DSpace Test has crashed again, due to memory:
    -
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
    +
    # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
     178
     
    • I will increase the JVM heap size from 5120M to 6144M, though we don’t have much room left to grow as DSpace Test (linode19) is using a smaller instance size than CGSpace
    • Gabriela from CIP asked if I could send her a list of all CIP authors so she can do some replacements on the name formats
• I got a list of all the CIP collections manually and used the same query that I used in August, 2017:
    -
    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
    +
    dspace#= \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/89347', '10568/88229', '10568/53086', '10568/53085', '10568/69069', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/70150', '10568/53093', '10568/64874', '10568/53094'))) group by text_value order by count desc) to /tmp/cip-authors.csv with csv;
     

    2018-04-19

    • Run updates on DSpace Test (linode19) and reboot the server
    • Also try deploying updated GeoLite database during ant update while re-deploying code:
    -
    $ ant update update_geolite clean_backups
    +
    $ ant update update_geolite clean_backups
     
    • I also re-deployed CGSpace (linode18) to make the ORCID search, authority cleanup, CCAFS project tag PII-LAM_CSAGender live
    • When re-deploying I also updated the GeoLite databases so I hope the country stats become more accurate…
    • After re-deployment I ran all system updates on the server and rebooted it
    • After the reboot I forced a reïndexing of the Discovery to populate the new ORCID index:
    -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    73m42.635s
     user    8m15.885s
    @@ -456,21 +456,21 @@ sys     2m2.687s
     
  • I confirm that it’s just giving a white page around 4:16
  • The DSpace logs show that there are no database connections:
  • -
    org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
    +
    org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-715] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle:0; lastwait:5000].
     
• And there have been shit tons of errors in the logs (starting only 20 minutes ago, luckily):
    -
    # grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
    +
    # grep -c 'org.apache.tomcat.jdbc.pool.PoolExhaustedException' /home/cgspace.cgiar.org/log/dspace.log.2018-04-20
     32147
     
    • I can’t even log into PostgreSQL as the postgres user, WTF?
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c 
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c 
     ^C
     
    • Here are the most active IPs today:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         917 207.46.13.182
         935 213.55.99.121
         970 40.77.167.134
    @@ -484,7 +484,7 @@ sys     2m2.687s
     
    • It doesn’t even seem like there is a lot of traffic compared to the previous days:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | wc -l
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Apr/2018" | wc -l
     74931
     # zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz| grep -E "19/Apr/2018" | wc -l
     91073
    @@ -499,7 +499,7 @@ sys     2m2.687s
     
  • Everything is back but I have no idea what caused this—I suspect something with the hosting provider
  • Also super weird, the last entry in the DSpace log file is from 2018-04-20 16:35:09, and then immediately it goes to 2018-04-20 19:15:04 (three hours later!):
  • -
    2018-04-20 16:35:09,144 ERROR org.dspace.app.util.AbstractDSpaceWebapp @ Failed to record shutdown in Webapp table.
    +
    2018-04-20 16:35:09,144 ERROR org.dspace.app.util.AbstractDSpaceWebapp @ Failed to record shutdown in Webapp table.
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:18; idle
     :0; lastwait:5000].
             at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:685)
    @@ -543,12 +543,12 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [localhost-startStop-2] Time
     
  • One other new thing I notice is that PostgreSQL 9.6 no longer uses createuser and nocreateuser, as those have actually meant superuser and nosuperuser and have been deprecated for ten years
• So, a note to myself: when importing a CGSpace database dump I need to give the user superuser permission with ALTER USER, rather than relying on createuser:
  • -
    $ psql dspacetest -c 'alter user dspacetest superuser;'
    +
    $ psql dspacetest -c 'alter user dspacetest superuser;'
     $ pg_restore -O -U dspacetest -d dspacetest -W -h localhost /tmp/dspace_2018-04-18.backup
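• And presumably it is good practice to revoke the superuser bit again once the restore has finished (a sketch):

```console
$ psql dspacetest -c 'alter user dspacetest nosuperuser;'
```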
     
    • There’s another issue with Tomcat in Ubuntu 18.04:
    -
    25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
    +
    25-Apr-2018 13:26:21.493 SEVERE [http-nio-127.0.0.1-8443-exec-1] org.apache.coyote.AbstractProtocol$ConnectionHandler.process Error reading request, ignored
      java.lang.NoSuchMethodError: java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
             at org.apache.coyote.http11.Http11InputBuffer.init(Http11InputBuffer.java:688)
             at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:672)
    diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html
    index 49a17f5be..458abbc44 100644
    --- a/docs/2018-05/index.html
    +++ b/docs/2018-05/index.html
    @@ -38,7 +38,7 @@ http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
     Then I reduced the JVM heap size from 6144 back to 5120m
     Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked the Ansible infrastructure scripts to support hosts choosing which distribution they want to use
     "/>
    -
    +
     
     
         
    @@ -175,7 +175,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
     
  • There are lots of errors on language, CRP, and even some encoding errors on abstract fields
• I export them and include the hidden metadata fields like dc.date.accessioned so I can filter the ones from 2018-04 and correct them in OpenRefine:
  • -
    $ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
    +
    $ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
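• One way to filter the rows accessioned in 2018-04 without opening OpenRefine would be csvgrep from csvkit (a sketch; the exact dc.date.accessioned column name in the export may carry a language qualifier):

```console
$ csvgrep -c dc.date.accessioned -r '^2018-04' /tmp/iita.csv > /tmp/iita-2018-04.csv
```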
     
    • Abenet sent a list of 46 ORCID identifiers for ILRI authors so I need to get their names using my resolve-orcids.py script and merge them into our controlled vocabulary
    • On the messed up IITA records from 2018-04 I see sixty DOIs in incorrect format (cg.identifier.doi)
    • @@ -185,7 +185,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
    • Fixing the IITA records from Sisay, sixty DOIs have completely invalid format like http:dx.doi.org10.1016j.cropro.2008.07.003
    • I corrected all the DOIs and then checked them for validity with a quick bash loop:
    -
    $ for line in $(< /tmp/links.txt); do echo $line; http --print h $line; done
    +
    $ for line in $(< /tmp/links.txt); do echo $line; http --print h $line; done
     
• Most of the links are good, though one is a duplicate and one even seems to be incorrect on the publisher’s site, so…
    • Also, there are some duplicates: @@ -205,7 +205,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
• A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: ’ (0x2019), · (0x00b7), and € (0x20ac)
• A custom text facet in OpenRefine with this GREL expression could be good for finding invalid characters or encoding errors in authors, abstracts, etc:
    -
    or(
    +
    or(
       isNotNull(value.match(/.*[(|)].*/)),
       isNotNull(value.match(/.*\uFFFD.*/)),
       isNotNull(value.match(/.*\u00A0.*/)),
    @@ -218,7 +218,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
     
  • I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!
  • Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the resolve-orcids.py script:
  • -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-05-06-combined.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-05-06-combined.txt
     $ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    @@ -242,12 +242,12 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
  • I could use it with reconcile-csv or to populate a Solr instance for reconciliation
  • This XPath expression gets close, but outputs all items on one line:
  • -
    $ xmllint --xpath '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/node()' dspace/config/input-forms.xml        
    +
    $ xmllint --xpath '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/node()' dspace/config/input-forms.xml        
     Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish
     
    • Maybe xmlstarlet is better:
    -
    $ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/text()' dspace/config/input-forms.xml
    +
    $ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/text()' dspace/config/input-forms.xml
     Agriculture for Nutrition and Health
     Big Data
     Climate Change, Agriculture and Food Security
    @@ -275,7 +275,7 @@ Livestock and Fish
     
  • I told them to get all CIAT records via OAI
  • Just a note to myself, I figured out how to get reconcile-csv to run from source rather than running the old pre-compiled JAR file:
  • -
    $ lein run /tmp/crps.csv name id
    +
    $ lein run /tmp/crps.csv name id
     
    • I tried to reconcile against a CSV of our countries but reconcile-csv crashes
    @@ -310,7 +310,7 @@ Livestock and Fish
  • Also, I learned how to do something cool with Jython expressions in OpenRefine
  • This will fetch a URL and return its HTTP response code:
  • -
    import urllib2
    +
    import urllib2
     import re
     
     pattern = re.compile('.*10.1016.*')
    @@ -329,24 +329,24 @@ return "blank"
     
  • I was checking the CIFOR data for duplicates using Atmire’s Metadata Quality Module (and found some duplicates actually), but then DSpace died…
• I didn’t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in dmesg -T:
  • -
    [Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
    +
    [Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
     [Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
     [Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
     
    • So the Linux kernel killed Java…
    • Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:
    -
    Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
    +
    Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
     
    • Looking in the DSpace log I see something related:
    -
    2018-05-15 12:35:30,858 INFO  org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
    +
    2018-05-15 12:35:30,858 INFO  org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
     
    • So I’m not sure…
    • I finally figured out how to get OpenRefine to reconcile values from Solr via conciliator:
    • The trick was to use a more appropriate Solr fieldType text_en instead of text_general so that more terms match, for example uppercase and lower case:
    -
    $ ./bin/solr start
    +
    $ ./bin/solr start
     $ ./bin/solr create_core -c countries
     $ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
     $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
    @@ -357,7 +357,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
     
    • I should probably make a general copy field and set it to be the default search field, like DSpace’s search core does (see schema.xml):
    -
    <defaultSearchField>search_text</defaultSearchField>
    +
    <defaultSearchField>search_text</defaultSearchField>
     ...
     <copyField source="*" dest="search_text"/>
     
      @@ -381,7 +381,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
    • I created and merged a pull request to fix the sorting issue in Listings and Reports (#374)
• Regarding the IP Address Anonymization for GDPR, I amended the Google Analytics snippet in page-structure-alterations.xsl to:
    -
    ga('send', 'pageview', {
    +
    ga('send', 'pageview', {
       'anonymizeIp': true
     });
     
      @@ -439,7 +439,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
      • I’m investigating how many non-CGIAR users we have registered on CGSpace:
      -
      dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
      +
      dspace=# select email, netid from eperson where email not like '%cgiar.org%' and email like '%@%';
       
      • We might need to do something regarding these users for GDPR compliance because we have their names, emails, and potentially phone numbers
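• Just to get a number, the same filter can be wrapped in a count (a quick sketch):
dspace=# select count(*) from eperson where email not like '%cgiar.org%' and email like '%@%';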
      • I decided that I will just use the cookieconsent script as is, since it looks good and technically does set the cookie with “allow” or “dismiss”
      • @@ -460,7 +460,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
      • DSpace Test crashed last night, seems to be related to system memory (not JVM heap)
      • I see this in dmesg:
      -
      [Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
      +
      [Wed May 30 00:00:39 2018] Out of memory: Kill process 6082 (java) score 697 or sacrifice child
       [Wed May 30 00:00:39 2018] Killed process 6082 (java) total-vm:14876264kB, anon-rss:5683372kB, file-rss:0kB, shmem-rss:0kB
       [Wed May 30 00:00:40 2018] oom_reaper: reaped process 6082 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
       
        @@ -471,7 +471,7 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
      • I generated a list of CIFOR duplicates from the CIFOR_May_9 collection using the Atmire MQM module and then dumped the HTML source so I could process it for sending to Vika
      • I used grep to filter all relevant handle lines from the HTML source then used sed to insert a newline before each “Item1” line (as the duplicates are grouped like Item1, Item2, Item3 for each set of duplicates):
      -
      $ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
      +
      $ grep -E 'aspect.duplicatechecker.DuplicateResults.field.del_handle_[0-9]{1,3}_Item' ~/Desktop/https\ _dspacetest.cgiar.org_atmire_metadata-quality_duplicate-checker.html > ~/cifor-duplicates.txt
       $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cleaned.txt
       
      • I told Vika to look through the list manually and indicate which ones are indeed duplicates that we should delete, and which ones to map to CIFOR’s collection
      • @@ -482,18 +482,18 @@ $ sed 's/.*Item1.*/\n&/g' ~/cifor-duplicates.txt > ~/cifor-duplicates-cle
      • Oh shit, I literally already wrote a script to get all collections in a community hierarchy from the REST API: rest-find-collections.py
      • The output isn’t great, but all the handles and IDs are printed in debug mode:
      -
      $ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2> /tmp/ilri-collections.txt
      +
      $ ./rest-find-collections.py -u https://cgspace.cgiar.org/rest -d 10568/1 2> /tmp/ilri-collections.txt
       
      • Then I format the list of handles and put it into this SQL query to export authors from items ONLY in those collections (too many to list here):
      -
      dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
      +
      dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/67236','10568/67274',...))) group by text_value order by count desc) to /tmp/ilri-authors.csv with csv;
       

      2018-05-31

• Clarify CGSpace’s usage of Google Analytics and personally identifiable information during user registration for the Bioversity team, who had been asking about GDPR compliance
      • Testing running PostgreSQL in a Docker container on localhost because when I’m on Arch Linux there isn’t an easily installable package for particular PostgreSQL versions
      • Now I can just use Docker:
      -
      $ docker pull postgres:9.5-alpine
      +
      $ docker pull postgres:9.5-alpine
       $ docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.5-alpine
       $ createuser -h localhost -U postgres --pwprompt dspacetest
       $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
      diff --git a/docs/2018-06/index.html b/docs/2018-06/index.html
      index a8545a69a..01922dbb9 100644
      --- a/docs/2018-06/index.html
      +++ b/docs/2018-06/index.html
      @@ -58,7 +58,7 @@ real    74m42.646s
       user    8m5.056s
       sys     2m7.289s
       "/>
      -
      +
       
       
           
      @@ -154,12 +154,12 @@ sys     2m7.289s
       
    • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
    • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
    -
    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    +
    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
     
    • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
    • Time to index ~70,000 items on CGSpace:
    -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
    +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
    @@ -193,19 +193,19 @@ sys     2m7.289s
     
  • I uploaded fixes for all those now, but I will continue with the rest of the data later
  • Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:
  • -
    delete from schema_version where version = '5.6.2015.12.03.2';
    +
    delete from schema_version where version = '5.6.2015.12.03.2';
     update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
     update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015.12.03.3';
     
• And then I need to run the ignored migrations:
    -
    $ ~/dspace/bin/dspace database migrate ignored
    +
    $ ~/dspace/bin/dspace database migrate ignored
     
    • Now DSpace starts up properly!
    • Gabriela from CIP got back to me about the author names we were correcting on CGSpace
• I did a quick sanity check on them and then did a test import with my fix-metadata-values.py script:
    -
    $ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
    +
    $ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
     
    • I will apply them on CGSpace tomorrow I think…
    @@ -220,7 +220,7 @@ update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015
  • I spent some time removing the Atmire Metadata Quality Module (MQM) from the proposed DSpace 5.8 changes
  • After removing all code mentioning MQM, mqm, metadata-quality, batchedit, duplicatechecker, etc, I think I got most of it removed, but there is a Spring error during Tomcat startup:
  • -
     INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
    +
     INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
     Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
     
    • I can fix this by commenting out the ItemCollectionPlugin line of discovery.xml, but from looking at the git log I’m not actually sure if that is related to MQM or not
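• A quick way to locate that bean before commenting it out (a sketch; the path comes from the error above):
$ grep -n ItemCollectionPlugin ~/dspace/config/spring/api/discovery.xml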
    • @@ -335,7 +335,7 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
    -
    or(
    +
    or(
       value.contains('€'),
       value.contains('6g'),
       value.contains('6m'),
    @@ -357,24 +357,24 @@ Failed to startup the DSpace Service Manager: failure starting up spring service
     
  • Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara’s items
  • I used my add-orcid-identifiers-csv.py script:
  • -
    $ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
     
    • The contents of 2018-06-13-Robin-Buruchara.csv were:
    -
    dc.contributor.author,cg.creator.id
    +
    dc.contributor.author,cg.creator.id
     "Buruchara, Robin",Robin Buruchara: 0000-0003-0934-1218
     "Buruchara, Robin A.",Robin Buruchara: 0000-0003-0934-1218
     
    • On a hunch I checked to see if CGSpace’s bitstream cleanup was working properly and of course it’s broken:
    -
    $ dspace cleanup -v
    +
    $ dspace cleanup -v
     ...
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(152402) is still referenced from table "bundle".
     
    • As always, the solution is to delete that ID manually in PostgreSQL:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
     UPDATE 1
     

    2018-06-14

      @@ -387,7 +387,7 @@ UPDATE 1
      • I was restoring a PostgreSQL dump on my test machine and found a way to restore the CGSpace dump as the postgres user, but have the owner of the schema be the dspacetest user:
      -
      $ dropdb -h localhost -U postgres dspacetest
      +
      $ dropdb -h localhost -U postgres dspacetest
       $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
       $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
       $ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost /tmp/cgspace_2018-06-24.backup
      @@ -407,12 +407,12 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
       
    • There is already a search filter for this field defined in discovery.xml but we aren’t using it, so I quickly enabled and tested it, then merged it to the 5_x-prod branch (#380)
    • Back to testing the DSpace 5.8 changes from Atmire, I had another issue with SQL migrations:
    -
    Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
    +
    Caused by: org.flywaydb.core.api.FlywayException: Validate failed. Found differences between applied migrations and available migrations: Detected applied migration missing on the classpath: 5.8.2015.12.03.3
     
• It took me a while to figure out that this migration is for MQM, which I removed after Atmire’s original advice about the migrations, so we actually need to delete this migration instead of updating it
    • So I need to make sure to run the following during the DSpace 5.8 upgrade:
    -
    -- Delete existing CUA 4 migration if it exists
    +
    -- Delete existing CUA 4 migration if it exists
     delete from schema_version where version = '5.6.2015.12.03.2';
     
     -- Update version of CUA 4 migration
    @@ -423,18 +423,18 @@ delete from schema_version where version = '5.5.2015.12.03.3';
     
    • After that you can run the migrations manually and then DSpace should work fine:
    -
    $ ~/dspace/bin/dspace database migrate ignored
    +
    $ ~/dspace/bin/dspace database migrate ignored
     ...
     Done.
     
    • Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Andy Jarvis' items on CGSpace
    • I used my add-orcid-identifiers-csv.py script:
    -
    $ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2018-06-24-andy-jarvis-orcid.csv -db dspacetest -u dspacetest -p 'fuuu'
     
    • The contents of 2018-06-24-andy-jarvis-orcid.csv were:
    -
    dc.contributor.author,cg.creator.id
    +
    dc.contributor.author,cg.creator.id
     "Jarvis, A.",Andy Jarvis: 0000-0001-6543-0798
     "Jarvis, Andy",Andy Jarvis: 0000-0001-6543-0798
     "Jarvis, Andrew",Andy Jarvis: 0000-0001-6543-0798
    @@ -444,7 +444,7 @@ Done.
     
  • I removed both those beans and did some simple tests to check item submission, media-filter of PDFs, REST API, but got an error “No matches for the query” when listing records in OAI
  • This warning appears in the DSpace log:
  • -
    2018-06-26 16:58:12,052 WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
    +
    2018-06-26 16:58:12,052 WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
     
    • It’s actually only a warning and it also appears in the logs on DSpace Test (which is currently running DSpace 5.5), so I need to keep troubleshooting
    • Ah, I think I just need to run dspace oai import
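• That is, something like this (the -c flag, which clears the index first, is optional):
$ dspace oai import -c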
    • @@ -455,7 +455,7 @@ Done.
    • I’ll have to figure out how to separate those we’re keeping, deleting, and mapping into CIFOR’s archive collection
    • First, get the 62 deletes from Vika’s file and remove them from the collection:
    -
    $ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
    +
    $ grep delete 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-delete.txt
     $ wc -l cifor-handle-to-delete.txt
     62 cifor-handle-to-delete.txt
     $ wc -l 10568-92904.csv
    @@ -467,14 +467,14 @@ $ wc -l 10568-92904.csv
     
  • This iterates over the handles for deletion and uses sed with an alternative pattern delimiter of ‘#’ (which must be escaped), because the pattern itself contains a ‘/’
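• For reference, a minimal sketch of what that loop might look like (the exact sed invocation is my assumption; the CSV is the collection export from above):
$ while read -r handle; do sed -i "\#${handle}#d" 10568-92904.csv; done < cifor-handle-to-delete.txt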
  • The mapped ones will be difficult because we need their internal IDs in order to map them, and there are 50 of them:
  • -
    $ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
    +
    $ grep map 2018-06-22-cifor-duplicates.txt | grep -o -E '[0-9]{5}\/[0-9]{5}' > cifor-handle-to-map.txt
     $ wc -l cifor-handle-to-map.txt
     50 cifor-handle-to-map.txt
     
• I can either get them from the database, or programmatically export the metadata using dspace metadata-export -i 10568/xxxxx
    • Oooh, I can export the items one by one, concatenate them together, remove the headers, and extract the id and collection columns using csvkit:
    -
    $ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
    +
    $ while read line; do filename=${line/\//-}.csv; dspace metadata-export -i $line -f $filename; done < /tmp/cifor-handle-to-map.txt
     $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
     
    • Then I can use Open Refine to add the “CIFOR Archive” collection to the mappings
    • @@ -487,7 +487,7 @@ $ sed '/^id/d' 10568-*.csv | csvcut -c 1,2 > map-to-cifor-archive.csv
    • DSpace Test appears to have crashed last night
    • There is nothing in the Tomcat or DSpace logs, but I see the following in dmesg -T:
    -
    [Thu Jun 28 00:00:30 2018] Out of memory: Kill process 14501 (java) score 701 or sacrifice child
    +
    [Thu Jun 28 00:00:30 2018] Out of memory: Kill process 14501 (java) score 701 or sacrifice child
     [Thu Jun 28 00:00:30 2018] Killed process 14501 (java) total-vm:14926704kB, anon-rss:5693608kB, file-rss:0kB, shmem-rss:0kB
     [Thu Jun 28 00:00:30 2018] oom_reaper: reaped process 14501 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
     
diff --git a/docs/2018-07/index.html b/docs/2018-07/index.html
index 0246eb78f..bb299ca1d 100644
--- a/docs/2018-07/index.html
+++ b/docs/2018-07/index.html
@@ -36,7 +36,7 @@ During the mvn package stage on the 5.8 branch I kept getting issues with java r
There is insufficient memory for the Java Runtime Environment to continue.
"/>
-
+
@@ -126,20 +126,20 @@ There is insufficient memory for the Java Runtime Environment to continue.
      • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:
      -
      $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
      +
      $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
       
      • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:
      -
      There is insufficient memory for the Java Runtime Environment to continue.
      +
      There is insufficient memory for the Java Runtime Environment to continue.
       
      • As the machine only has 8GB of RAM, I reduced the Tomcat memory heap from 5120m to 4096m so I could try to allocate more to the build process:
      -
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
      +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
       $ mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -Denv=dspacetest.cgiar.org -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2 clean package
       
      • Then I stopped the Tomcat 7 service, ran the ant update, and manually ran the old and ignored SQL migrations:
      -
      $ sudo su - postgres
      +
      $ sudo su - postgres
       $ psql dspace
       ...
       dspace=# begin;
      @@ -171,13 +171,13 @@ $ dspace database migrate ignored
       
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive.csv
     $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/2018-06-27-New-CIFOR-Archive2.csv
     
    • I noticed there are many items that use HTTP instead of HTTPS for their Google Books URL, and some missing HTTP entirely:
    -
    dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
    +
    dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
      count
     -------
        785
    @@ -188,7 +188,7 @@ dspace=# select count(*) from metadatavalue where resource_type_id=2 and metadat
     
    • I think I should fix that as well as some other garbage values like “test” and “dspace.ilri.org” etc:
    -
    dspace=# begin;
    +
    dspace=# begin;
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value like 'http://books.google.%';
     UPDATE 785
     dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'books.google', 'https://books.google') where resource_type_id=2 and metadata_field_id=222 and text_value ~ '^books\.google\..*';
    @@ -201,7 +201,7 @@ dspace=# commit;
     
    • Testing DSpace 5.8 with PostgreSQL 9.6 and Tomcat 8.5.32 (instead of my usual 7.0.88) and for some reason I get autowire errors on Catalina startup with 8.5.32:
    -
    03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
    +
    03-Jul-2018 19:51:37.272 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
      java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/aorth/dspace/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#3f6c3e6a': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
     	at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
     	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4792)
    @@ -241,7 +241,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
     
  • It looks like I added Solr to the backup_to_s3.sh script, but that script is not even being used (s3cmd is run directly from root’s crontab)
  • For now I have just initiated a manual S3 backup of the Solr data:
  • -
    # s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
    +
    # s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/
     
    • But I need to add this to cron!
    • I wonder if I should convert some of the cron jobs to systemd services / timers…
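• For the cron option, a root crontab entry along these lines would cover it (the schedule is an assumption):
# crontab -e
0 5 * * * s3cmd sync --delete-removed /home/backup/solr/ s3://cgspace.cgiar.org/solr/ > /dev/null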
    • @@ -249,7 +249,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
    • Abenet wants to be able to search by journal title (dc.source) in the advanced Discovery search so I opened an issue for it (#384)
    • I regenerated the list of names for all our ORCID iDs using my resolve-orcids.py script:
    -
    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
    +
    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > /tmp/2018-07-08-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt -d
     
    • But after comparing to the existing list of names I didn’t see much change, so I just ignored it
    • @@ -259,22 +259,22 @@ $ ./resolve-orcids.py -i /tmp/2018-07-08-orcids.txt -o /tmp/2018-07-08-names.txt
    • Uptime Robot said that CGSpace was down for two minutes early this morning but I don’t see anything in Tomcat logs or dmesg
    • Uptime Robot said that CGSpace was down for two minutes again later in the day, and this time I saw a memory error in Tomcat’s catalina.out:
    -
    Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
    +
    Exception in thread "http-bio-127.0.0.1-8081-exec-557" java.lang.OutOfMemoryError: Java heap space
     
    • I’m not sure if it’s the same error, but I see this in DSpace’s solr.log:
    -
    2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    +
    2018-07-09 06:25:09,913 ERROR org.apache.solr.servlet.SolrDispatchFilter @ null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
     
    • I see a strange error around that time in dspace.log.2018-07-08:
    -
    2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
    +
    2018-07-09 06:23:43,510 ERROR com.atmire.statistics.SolrLogThread @ IOException occured when talking to server at: http://localhost:8081/solr/statistics
     org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr/statistics
     
    • But not sure what caused that…
    • I got a message from Linode tonight that CPU usage was high on CGSpace for the past few hours around 8PM GMT
    • Looking in the nginx logs I see the top ten IP addresses active today:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "09/Jul/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "09/Jul/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1691 40.77.167.84
        1701 40.77.167.69
        1718 50.116.102.77
    @@ -288,7 +288,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
     
    • Of those, all except 70.32.83.92 and 50.116.102.77 are NOT re-using their Tomcat sessions, for example from the XMLUI logs:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2018-07-09
     4435
     
    • 95.108.181.88 appears to be Yandex, so I dunno why it’s creating so many sessions, as its user agent should match Tomcat’s Crawler Session Manager Valve
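• One way to double-check which user agents that IP is actually sending (a sketch, assuming the standard combined log format):
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 95.108.181.88 | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn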
    • @@ -314,7 +314,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
    • Linode sent an alert about CPU usage on CGSpace again, about 13:00UTC
    • These are the top ten users in the last two hours:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Jul/2018:(11|12|13)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Jul/2018:(11|12|13)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          81 193.95.22.113
          82 50.116.102.77
         112 40.77.167.90
    @@ -328,7 +328,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
     
• Looks like 213.139.52.250 is Moayad testing his new CGSpace visualization thing:
    -
    213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
    +
    213.139.52.250 - - [10/Jul/2018:13:39:41 +0000] "GET /bitstream/handle/10568/75668/dryad.png HTTP/2.0" 200 53750 "http://localhost:4200/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"
     
    • He said there was a bug that caused his app to request a bunch of invalid URLs
• I’ll have to keep an eye on this and see how their platform evolves
    • @@ -349,7 +349,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
    • Uptime Robot said that CGSpace went down a few times last night, around 10:45 PM and 12:30 AM
    • Here are the top ten IPs from last night and this morning:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "11/Jul/2018:22" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          48 66.249.64.91
          50 35.227.26.162
          57 157.55.39.234
    @@ -377,7 +377,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
     
• A brief Google search doesn’t turn up any information about what this bot is, but lots of users are complaining about it
  • This bot does make a lot of requests all through the day, although it seems to re-use its Tomcat session:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
       17098 208.110.72.10
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=208.110.72.10' dspace.log.2018-07-11
     1161
    @@ -386,7 +386,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
     
• I think the problem is that, despite the bot requesting robots.txt, it almost exclusively requests dynamic pages from /discover:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "Pcore-HTTP" | grep -o -E "GET /(browse|discover|search-filter)" | sort -n | uniq -c | sort -rn
       13364 GET /discover
         993 GET /search-filter
         804 GET /browse
    @@ -397,7 +397,7 @@ org.apache.solr.client.solrj.SolrServerException: IOException occured when talki
     
• I’ll also add it to Tomcat’s Crawler Session Manager Valve to force the re-use of a common Tomcat session for all crawlers just in case
  • Generate a list of all affiliations in CGSpace to send to Mohamed Salem to compare with the list on MEL (sorting the list by most occurrences):
  • -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where resource_type_id=2 and metadata_field_id=211 group by text_value order by count desc) to /tmp/affiliations.csv with csv header
     COPY 4518
     dspace=# \q
     $ csvcut -c 1 < /tmp/affiliations.csv > /tmp/affiliations-1.csv
    @@ -408,7 +408,7 @@ $ csvcut -c 1 < /tmp/affiliations.csv > /tmp/affiliations-1.csv
     
    • Generate a list of affiliations for Peter and Abenet to go over so we can batch correct them before we deploy the new data visualization dashboard:
    -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv header;
     COPY 4518
     

    2018-07-15

    -
    $ dspace oai import -c
    +
    $ dspace oai import -c
     OAI 2.0 manager action started
     Clearing index
     Index cleared
    @@ -438,19 +438,19 @@ OAI 2.0 manager action ended. It took 697 seconds.
     
  • I need to ask the dspace-tech mailing list if the nightly OAI import catches the case of old items that have had metadata or mappings change
  • ICARDA sent me a list of the ORCID iDs they have in the MEL system and it looks like almost 150 are new and unique to us!
  • -
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
    +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1020
     $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
     1158
     
• I combined the two lists and regenerated the names for all of our ORCID iDs using my resolve-orcids.py script:
    -
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-07-15-orcid-ids.txt
    +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-07-15-orcid-ids.txt
     $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolved-orcids.txt -d
     
    • Then I added the XML formatting for controlled vocabularies, sorted the list with GNU sort in vim via % !sort and then checked the formatting with tidy:
    -
    $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    +
    $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
     
    • I will check with the CGSpace team to see if they want me to add these to CGSpace
    • Help Udana from WLE understand some Altmetrics concepts
    • @@ -465,7 +465,7 @@ $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolv
• For every day in the past week I only see about 50 to 100 requests, but then about nine days ago I see 1,500 requests
    • In there I see two bots making about 750 requests each, and this one is probably Altmetric:
    -
    178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1" 200 58653 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
    +
    178.33.237.157 - - [09/Jul/2018:17:00:46 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1" 200 58653 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
     178.33.237.157 - - [09/Jul/2018:17:01:11 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////200 HTTP/1.1" 200 67950 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
     ...
     178.33.237.157 - - [09/Jul/2018:22:10:39 +0000] "GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////73900 HTTP/1.1" 20 0 25049 "-" "Apache-HttpClient/4.5.2 (Java/1.8.0_121)"
    @@ -474,7 +474,7 @@ $ ./resolve-orcids.py -i /tmp/2018-07-15-orcid-ids.txt -o /tmp/2018-07-15-resolv
     
  • I wonder if I should add this user agent to the Tomcat Crawler Session Manager valve… does OAI use Tomcat sessions?
  • Appears not:
  • -
    $ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&resumptionToken=oai_dc////100'
    +
    $ http --print Hh 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&resumptionToken=oai_dc////100'
     GET /oai/request?verb=ListRecords&resumptionToken=oai_dc////100 HTTP/1.1
     Accept: */*
     Accept-Encoding: gzip, deflate
    @@ -511,7 +511,7 @@ X-XSS-Protection: 1; mode=block
     
• They say that it is a burden for them to capture the issue dates, so I cautioned them that this is for their own benefit and future posterity, and that everyone else on CGSpace manages to capture the issue dates!
  • For future reference, as I had previously noted in 2018-04, sort options are configured in dspace.cfg, for example:
  • -
    webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
    +
    webui.itemlist.sort-option.3 = dateaccessioned:dc.date.accessioned:date
     
    • Just because I was curious I made sure that these options are working as expected in DSpace 5.8 on DSpace Test (they are)
    • I tested the Atmire Listings and Reports (L&R) module one last time on my local test environment with a new snapshot of CGSpace’s database and re-generated Discovery index and it worked fine
    • @@ -523,7 +523,7 @@ X-XSS-Protection: 1; mode=block
    • Still discussing dates with IWMI
• I looked in the database to see the breakdown of date formats used in dc.date.issued, i.e. YYYY, YYYY-MM, or YYYY-MM-DD:
    -
    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
    +
    dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}$';
      count
     -------
      53292
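• The other two formats can be counted the same way with different regular expressions (a sketch):
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}$';
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=15 and text_value ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$';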
    diff --git a/docs/2018-08/index.html b/docs/2018-08/index.html
    index d2014c884..4e75676c5 100644
    --- a/docs/2018-08/index.html
    +++ b/docs/2018-08/index.html
    @@ -46,7 +46,7 @@ Anyways, perhaps I should increase the JVM heap from 5120m to 6144m like we did
     The server only has 8GB of RAM so we’ll eventually need to upgrade to a larger one because we’ll start starving the OS, PostgreSQL, and command line batch processes
     I ran all system updates on DSpace Test and rebooted it
     "/>
    -
    +
     
     
         
    @@ -136,7 +136,7 @@ I ran all system updates on DSpace Test and rebooted it
     
    • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:
    -
    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
    +
    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
     
      @@ -161,7 +161,7 @@ I ran all system updates on DSpace Test and rebooted it
• DSpace Test crashed again and I don’t see anything in the logs; the only error I see is this in dmesg:
      -
      [Thu Aug  2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
      +
      [Thu Aug  2 00:00:12 2018] Out of memory: Kill process 1407 (java) score 787 or sacrifice child
       [Thu Aug  2 00:00:12 2018] Killed process 1407 (java) total-vm:18876328kB, anon-rss:6323836kB, file-rss:0kB, shmem-rss:0kB
       
      • I am still assuming that this is the Tomcat process that is dying, so maybe actually we need to reduce its memory instead of increasing it?
      • @@ -179,13 +179,13 @@ I ran all system updates on DSpace Test and rebooted it
      • I did some quick sanity checks and small cleanups in Open Refine, checking for spaces, weird accents, and encoding errors
• Finally I did a test run with the fix-metadata-values.py script:
      -
      $ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
      +
      $ ./fix-metadata-values.py -i 2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
       $ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
       

      2018-08-16

      • Generate a list of the top 1,500 authors on CGSpace for Sisay so he can create the controlled vocabulary:
      -
      dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv; 
      +
      dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc limit 1500) to /tmp/2018-08-16-top-1500-authors.csv with csv; 
       
      • Start working on adding the ORCID metadata to a handful of CIAT authors as requested by Elizabeth earlier this month
      • I might need to overhaul the add-orcid-identifiers-csv.py script to be a little more robust about author order and ORCID metadata that might have been altered manually by editors after submission, as this script was written without that consideration
      • @@ -195,7 +195,7 @@ $ ./delete-metadata-values.py -i 2018-08-15-Remove-11-Affiliations.csv -db dspac
      • I will have to update my script to extract the ORCID identifier and search for that
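• It would help to first peek at how the existing cg.creator.id values are stored, something like this sketch (reusing the metadatafieldregistry lookup pattern from the other queries; it may need a schema filter if the subquery matches more than one field):
dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'creator' and qualifier = 'id') limit 10;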
      • Re-create my local DSpace database using the latest PostgreSQL 9.6 Docker image and re-import the latest CGSpace dump:
      -
      $ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
      +
      $ sudo docker run --name dspacedb -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
       $ createuser -h localhost -U postgres --pwprompt dspacetest
       $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
       $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
      @@ -209,7 +209,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
       
    • This is less obvious and more error prone with names like “Peters” where there are many more authors
    • I see some errors in the variations of names as well, for example:
    -
    Verchot, Louis
    +
    Verchot, Louis
     Verchot, L
     Verchot, L. V.
     Verchot, L.V
    @@ -220,7 +220,7 @@ Verchot, Louis V.
     
  • I’ll just tag them all with Louis Verchot’s ORCID identifier…
  • In the end, I’ll run the following CSV with my add-orcid-identifiers-csv.py script:
  • -
    dc.contributor.author,cg.creator.id
    +
    dc.contributor.author,cg.creator.id
     "Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
     "Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
     "Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
    @@ -251,13 +251,13 @@ Verchot, Louis V.
     
    • The invocation would be:
    -
    $ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
     
    • I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
• Looking at the list of author affiliations from Peter one last time
• I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, making the latest version of the GREL expression:
    -
    or(
    +
    or(
       isNotNull(value.match(/.*\uFFFD.*/)),
       isNotNull(value.match(/.*\u00A0.*/)),
       isNotNull(value.match(/.*\u200A.*/)),
    @@ -268,12 +268,12 @@ Verchot, Louis V.
     
  • This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
  • I will run the following on DSpace Test and CGSpace:
  • -
    $ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
    +
    $ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
     $ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
     
    • Then force an update of the Discovery index on DSpace Test:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    72m12.570s
    @@ -282,7 +282,7 @@ sys     2m2.461s
     
    • And then on CGSpace:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    79m44.392s
    @@ -292,7 +292,7 @@ sys     2m20.248s
     
  • Run system updates on DSpace Test and reboot the server
  • In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:
  • -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
     1553
     # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
     1724
    @@ -300,7 +300,7 @@ sys     2m20.248s
     
• I don’t even know how it’s possible for the bot to use MORE sessions than total requests…
  • The user agent is:
  • -
    Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
    +
    Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
     
    • So I’m thinking we should add “crawl” to the Tomcat Crawler Session Manager valve, as we already have “bot” that catches Googlebot, Bingbot, etc.
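• Before changing it, something like this should show the valve and its current crawlerUserAgents regex (a sketch; the server.xml path is an assumption for a packaged Tomcat 7):
# grep -A1 CrawlerSessionManagerValve /etc/tomcat7/server.xml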
    @@ -325,7 +325,7 @@ sys 2m20.248s
    • Something must have happened, as the mvn package always takes about two hours now, stopping for a very long time near the end at this step:
    -
    [INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
    +
    [INFO] Processing overlay [ id org.dspace.modules:xmlui-mirage2]
     
    • It’s the same on DSpace Test, my local laptop, and CGSpace…
    • It wasn’t this way before when I was constantly building the previous 5.8 branch with Atmire patches…
    • @@ -335,7 +335,7 @@ sys 2m20.248s
    • That one only took 13 minutes! So there is definitely something wrong with our 5.8 branch, now I should try vanilla DSpace 5.8
    • I notice that the step this pauses at is:
    -
    [INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
    +
    [INFO] --- maven-war-plugin:2.4:war (default-war) @ xmlui ---
     
    • And I notice that Atmire changed something in the XMLUI module’s pom.xml as part of the DSpace 5.8 changes, specifically to remove the exclude for node_modules in the maven-war-plugin step
    • This exclude is present in vanilla DSpace, and if I add it back the build time goes from 1 hour 23 minutes to 12 minutes!
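• A quick way to see exactly what changed there relative to vanilla DSpace 5.8 would be something like this (a sketch; the tag name and pom.xml paths are assumptions):
$ git diff dspace-5.8 -- dspace-xmlui/pom.xml dspace/modules/xmlui/pom.xml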
    • @@ -352,23 +352,23 @@ sys 2m20.248s
    • It appears that the web UI’s upload interface requires you to specify the collection, whereas the CLI interface allows you to omit the collection command line flag and defer to the collections file inside each item in the bundle
    • I imported the CTA items on CGSpace for Sisay:
    -
    $ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
    +
    $ dspace import -a -e s.webshet@cgiar.org -s /home/swebshet/ictupdates_uploads_August_21 -m /tmp/2018-08-23-cta-ictupdates.map
     

    2018-08-26

    • Doing the DSpace 5.8 upgrade on CGSpace (linode18)
    • I already finished the Maven build, now I’ll take a backup of the PostgreSQL database and do a database cleanup just in case:
    -
    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
    +
    $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-08-26-before-dspace-58.backup dspace
     $ dspace cleanup -v
     
    • Now I can stop Tomcat and do the install:
    -
    $ cd dspace/target/dspace-installer
    +
    $ cd dspace/target/dspace-installer
     $ ant update clean_backups update_geolite
     
    • After the successful Ant update I can run the database migrations:
    -
    $ psql dspace dspace
    +
    $ psql dspace dspace
     
     dspace=> \i /tmp/Atmire-DSpace-5.8-Schema-Migration.sql 
     DELETE 0
    @@ -380,7 +380,7 @@ $ dspace database migrate ignored
     
    • Then I’ll run all system updates and reboot the server:
    -
    $ sudo su -
    +
    $ sudo su -
     # apt update && apt full-upgrade
     # apt clean && apt autoclean && apt autoremove
     # reboot
    @@ -391,11 +391,11 @@ $ dspace database migrate ignored
     
  • I exported a list of items from Listings and Reports with the following criteria: from year 2013 until now, have WLE subject GENDER or GENDER POVERTY AND INSTITUTIONS, and CRP Water, Land and Ecosystems
  • Then I extracted the Handle links from the report so I could export each item’s metadata as CSV
  • -
    $ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
    +
    $ grep -o -E "[0-9]{5}/[0-9]{0,5}" listings-export.txt > /tmp/iwmi-gender-items.txt
     
    • Then on the DSpace server I exported the metadata for each item one by one:
    -
    $ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
    +
    $ while read -r line; do dspace metadata-export -f "/tmp/${line/\//-}.csv" -i $line; sleep 2; done < /tmp/iwmi-gender-items.txt
     
    • But from here I realized that each of the fifty-nine items will have different columns in their CSVs, making it difficult to combine them
    • I’m not sure how to proceed without writing some script to parse and join the CSVs, and I don’t think it’s worth my time
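
  • If it ever does become worth the time, one rough approach (assuming pandas is available; the combined output file name is just an example) would be to union the per-item CSVs and let missing columns come through as blanks:

```console
$ python3 -c "import glob; import pandas as pd; pd.concat((pd.read_csv(f) for f in glob.glob('/tmp/10568-*.csv')), ignore_index=True, sort=False).to_csv('/tmp/iwmi-gender-combined.csv', index=False)"
```
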
    diff --git a/docs/2018-09/index.html b/docs/2018-09/index.html
    index d62103690..fe535998d 100644
    --- a/docs/2018-09/index.html
    +++ b/docs/2018-09/index.html
    @@ -30,7 +30,7 @@ I’ll update the DSpace role in our Ansible infrastructure playbooks and ru
     Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
     I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
     "/>
    -
    +
    @@ -123,7 +123,7 @@ I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I
    • Also, I’ll re-run the postgresql tasks because the custom PostgreSQL variables are dynamic according to the system’s RAM, and we never re-ran them after migrating to larger Linodes last month
    • I’m testing the new DSpace 5.8 branch in my Ubuntu 18.04 environment and I’m getting those autowire errors in Tomcat 8.5.30 again:
    -
    02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
    +
    02-Sep-2018 11:18:52.678 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.listenerStart Exception sending context initialized event to listener instance of class [org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener]
      java.lang.RuntimeException: Failure during filter init: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'conversionService' defined in file [/home/dspacetest.cgiar.org/config/spring/xmlui/spring-dspace-addon-cua-services.xml]: Cannot create inner bean 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2' of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter] while setting bean property 'converters' with key [1]; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter#4c5d5a2': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$ColumnsConverter.filterConverter; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No matching bean of type [com.atmire.app.xmlui.aspect.statistics.mostpopular.MostPopularConfig$FilterConverter] found for dependency: expected at least 1 bean which qualifies as autowire candidate for this dependency. Dependency annotations: {@org.springframework.beans.factory.annotation.Autowired(required=true)}
         at org.dspace.servicemanager.servlet.DSpaceKernelServletContextListener.contextInitialized(DSpaceKernelServletContextListener.java:92)
         at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4776)
    @@ -184,7 +184,7 @@ Caused by: java.lang.RuntimeException: Failed to startup the DSpace Service Mana
     
  • Playing with strest to test the DSpace REST API programmatically
  • For example, given this test.yaml:
  • -
    version: 1
    +
    version: 1
     
     requests:
       test:
    @@ -217,19 +217,19 @@ requests:
     
  • We could eventually use this to test the sanity of the API for creating collections, etc.
  • A user is getting an error in her workflow:
  • -
    2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step: 
    +
    2018-09-10 07:26:35,551 ERROR org.dspace.submit.step.CompleteStep @ Caught exception in submission step: 
     org.dspace.authorize.AuthorizeException: Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:2 by user 3819
     
    -
    $ dspace community-filiator --set -p 10568/97114 -c 10568/51670
    +
    $ dspace community-filiator --set -p 10568/97114 -c 10568/51670
     $ dspace community-filiator --set -p 10568/97114 -c 10568/35409
     $ dspace community-filiator --set -p 10568/97114 -c 10568/3112
     
    • Valerio contacted me to point out some issues with metadata on CGSpace, which I corrected in PostgreSQL:
    -
    update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
    +
    update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI Juornal';
     UPDATE 1
     update metadatavalue set text_value='ISI Journal' where resource_type_id=2 and metadata_field_id=226 and text_value='ISI journal';
     UPDATE 23
    @@ -246,7 +246,7 @@ UPDATE 15
     
  • Linode said that CGSpace (linode18) had a high CPU load earlier today
  • When I looked, I see it’s the same Russian IP that I noticed last month:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "10/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1459 157.55.39.202
        1579 95.108.181.88
        1615 157.55.39.147
    @@ -260,17 +260,17 @@ UPDATE 15
     
    • And this bot is still creating more Tomcat sessions than Nginx requests (WTF?):
    -
    # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10 
    +
    # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-09-10 
     14133
     
    • The user agent is still the same:
    -
    Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
    +
    Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
     
    • I added .*crawl.* to the Tomcat Session Crawler Manager Valve, so I’m not sure why the bot is creating so many sessions…
    • I just tested that user agent on CGSpace and it does not create a new session:
    -
    $ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
    +
    $ http --print Hh https://cgspace.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)'
     GET / HTTP/1.1
     Accept: */*
     Accept-Encoding: gzip, deflate
    @@ -300,7 +300,7 @@ X-XSS-Protection: 1; mode=block
     
  • Merge AReS explorer changes to nginx config and deploy on CGSpace so CodeObia can start testing more
  • Re-create my local Docker container for PostgreSQL data, but using a volume for the database data:
  • -
    $ sudo docker volume create --name dspacetest_data
    +
    $ sudo docker volume create --name dspacetest_data
     $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     
    • Sisay is still having problems with the controlled vocabulary for top authors
    • @@ -319,7 +319,7 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
    • Linode says that CGSpace (linode18) has had high CPU for the past two hours
    • The top IP addresses today are:
    -
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "13/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                
    +
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "13/Sep/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10                                                                                                
          32 46.229.161.131
          38 104.198.9.108
          39 66.249.64.91
    @@ -333,7 +333,7 @@ $ sudo docker run --name dspacedb -v dspacetest_data:/var/lib/postgresql/data -e
     
    • And the top two addresses seem to be re-using their Tomcat sessions properly:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=70.32.83.92' dspace.log.2018-09-13 | sort | uniq
     7
     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-13 | sort | uniq
     2
    @@ -343,7 +343,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
     
  • I said no, but that we might be able to piggyback on the Atmire statlet REST API
  • For example, when you expand the “statlet” at the bottom of an item like 10568/97103 you can see the following request in the browser console:
  • -
    https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
    +
    https://cgspace.cgiar.org/rest/statlets?handle=10568/97103
     
    • That JSON file has the total page views and item downloads for the item…
    • Abenet forwarded a request by CIP that item thumbnails be included in RSS feeds
    • @@ -397,12 +397,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
    • There are some example queries on the DSpace Solr wiki
    • For example, this query returns 1655 rows for item 10568/10630:
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false'
     
    • The id in the Solr query is the item’s database id (get it from the REST API or something)
    • Next, I adopted a query to get the downloads and it shows 889, which is similar to the number Atmire’s statlet shows, though the query logic here is confusing:
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-(bundleName:[*+TO+*]-bundleName:ORIGINAL)&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
     
    • According to the SolrQuerySyntax page on the Apache wiki, the [* TO *] syntax just selects a range (in this case all values for a field)
    • So it seems to be: @@ -413,15 +413,15 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
    • What the shit, I think I’m right: the simplified logic in this query returns the same 889:
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=-(statistics_type:[*+TO+*]+-statistics_type:view)'
     
    • And if I simplify the statistics_type logic the same way, it still returns the same 889!
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=bundleName:ORIGINAL&fq=statistics_type:view'
     
    • As for item views, I suppose that’s just the same query, minus the bundleName:ORIGINAL:
    -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-bundleName:ORIGINAL&fq=statistics_type:view'
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:0+owningItem:11576&fq=isBot:false&fq=-bundleName:ORIGINAL&fq=statistics_type:view'
     
    • That one returns 766, which is exactly 1655 minus 889…
    • Also, Solr’s fq is similar to the regular q query parameter, but it is considered for the Solr query cache so it should be faster for multiple queries
    • @@ -432,7 +432,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
    • It uses the Python-based Falcon web framework and talks to Solr directly using the SolrClient library (which seems to have issues in Python 3.7 currently)
    • After deploying on DSpace Test I can then get the stats for an item using its ID:
    -
    $ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
    +
    $ http -b 'https://dspacetest.cgiar.org/rest/statistics/item?id=110988'
     {
         "downloads": 2,
         "id": 110988,
    @@ -443,7 +443,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
     
  • Moayad from CodeObia asked if I could make the API be able to paginate over all items, for example: /statistics?limit=100&page=1
  • Getting all the item IDs from PostgreSQL is certainly easy:
  • -
    dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
    +
    dspace=# select item_id from item where in_archive is True and withdrawn is False and discoverable is True;
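
    • A page parameter would then presumably just become LIMIT and OFFSET on that query (with an ORDER BY added for stable paging), for example page 1 with a limit of 100 and zero-based pages:

```console
dspace=# SELECT item_id FROM item WHERE in_archive IS True AND withdrawn IS False AND discoverable IS True ORDER BY item_id LIMIT 100 OFFSET 100;
```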
     
    • The rest of the Falcon tooling will be more difficult…
    @@ -457,11 +457,11 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
  • Contact Atmire to ask how we can buy more credits for future development (#644)
  • I researched the Solr filterCache size and I found out that the formula for calculating the potential memory use of each entry in the cache is:
  • -
    ((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)
    +
    ((maxDoc/8) + 128) * (size_defined_in_solrconfig.xml)
     
    • Which means that, for our statistics core with 149 million documents, each entry in our filterCache would use 8.9 GB!
    -
    ((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)
    +
    ((149374568/8) + 128) * 512 = 9560037888 bytes (8.9 GB)
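
    • A quick sanity check of that arithmetic with bc gives the same figure:

```console
$ echo '((149374568/8) + 128) * 512' | bc
9560037888
```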
     
    • So I think we can forget about tuning this for now!
    • Discussion on the mailing list about filterCache size
    • @@ -495,7 +495,7 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=50.116.102.77' dspace.log.2018-09-
    • Trying to figure out how to get item views and downloads from SQLite in a join
    • It appears SQLite doesn’t support FULL OUTER JOIN so some people on StackOverflow have emulated it with LEFT JOIN and UNION:
    -
    > SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
    +
    > SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemviews views
     LEFT JOIN itemdownloads downloads USING(id)
     UNION ALL
     SELECT views.views, views.id, downloads.downloads, downloads.id FROM itemdownloads downloads
    @@ -505,7 +505,7 @@ WHERE views.id IS NULL;
     
  • This “works” but the resulting rows are kinda messy so I’d have to do extra logic in Python
  • Maybe we can use one “items” table with default values and UPSERT (aka insert… on conflict … do update):
  • -
    sqlite> CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
    +
    sqlite> CREATE TABLE items(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0);
     sqlite> INSERT INTO items(id, views) VALUES(0, 52);
     sqlite> INSERT INTO items(id, downloads) VALUES(1, 171);
     sqlite> INSERT INTO items(id, downloads) VALUES(1, 176) ON CONFLICT(id) DO UPDATE SET downloads=176;
    @@ -521,7 +521,7 @@ sqlite> INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE S
     
  • Ok this is hilarious, I manually downloaded the libsqlite3 3.24.0 deb from Ubuntu 18.10 “cosmic” and installed it in Ubuntu 16.04 and now the Python indexer.py works
  • This is definitely a dirty hack, but the list of packages we use that depend on libsqlite3-0 in Ubuntu 16.04 are actually pretty few:
  • -
    # apt-cache rdepends --installed libsqlite3-0 | sort | uniq
    +
    # apt-cache rdepends --installed libsqlite3-0 | sort | uniq
       gnupg2
       libkrb5-26-heimdal
       libnss3
    @@ -530,7 +530,7 @@ sqlite> INSERT INTO items(id, views) VALUES(0, 7) ON CONFLICT(id) DO UPDATE S
     
    • I wonder if I could work around this by detecting the SQLite library version, for example on Ubuntu 16.04 after I replaced the library:
    -
    # python3
    +
    # python3
     Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
     [GCC 5.4.0 20160609] on linux
     Type "help", "copyright", "credits" or "license" for more information.
    @@ -542,7 +542,7 @@ Type "help", "copyright", "credits" or "licen
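
    • A less interactive way to check which SQLite library Python is actually linked against is a one-liner with the standard sqlite3 module, which should report 3.24.0 once the newer deb is in place:

```console
$ python3 -c 'import sqlite3; print(sqlite3.sqlite_version)'
3.24.0
```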
     
  • I changed the syntax of the SQLite stuff and PostgreSQL is working flawlessly with psycopg2… hmmm.
  • For reference, creating a PostgreSQL database for testing this locally (though indexer.py will create the table):
  • -
    $ createdb -h localhost -U postgres -O dspacestatistics --encoding=UNICODE dspacestatistics
    +
    $ createdb -h localhost -U postgres -O dspacestatistics --encoding=UNICODE dspacestatistics
     $ createuser -h localhost -U postgres --pwprompt dspacestatistics
     $ psql -h localhost -U postgres dspacestatistics
     dspacestatistics=> CREATE TABLE IF NOT EXISTS items
    @@ -558,7 +558,7 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
     
  • DSpace Test currently has about 2,000,000 documents with isBot:true in its Solr statistics core, and the size on disk is 2GB (it’s not much, but I have to test this somewhere!)
  • According to the DSpace 5.x Solr documentation I can use dspace stats-util -f, so let’s try it:
  • -
    $ dspace stats-util -f
    +
    $ dspace stats-util -f
     
    • The command comes back after a few seconds and I still see 2,000,000 documents in the statistics core with isBot:true
    • I was just writing a message to the dspace-tech mailing list and then I decided to check the number of bot view events on DSpace Test again, and now it’s 201 instead of 2,000,000, and the statistics core is only 30MB now!
    • @@ -576,11 +576,11 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
    • According to the Googlebot FAQ the domain name in the reverse DNS lookup should contain either googlebot.com or google.com
    • In Solr this appears to be an appropriate query that I can maybe use later (returns 81,000 documents):
    -
    *:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
    +
    *:* AND (dns:*googlebot.com. OR dns:*google.com.) AND isBot:false
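
    • For reference, the reverse-then-forward DNS check described in the Googlebot FAQ looks roughly like this (66.249.66.1 is just an example crawler IP, not one from our logs):

```console
$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
```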
     
    • I translate that into a delete command using the /update handler:
    -
    http://localhost:8081/solr/statistics/update?commit=true&stream.body=<delete><query>*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false</query></delete>
    +
    http://localhost:8081/solr/statistics/update?commit=true&stream.body=<delete><query>*:*+AND+(dns:*googlebot.com.+OR+dns:*google.com.)+AND+isBot:false</query></delete>
     
    • And magically all those 81,000 documents are gone!
    • After a few hours the Solr statistics core is down to 44GB on CGSpace!
    • @@ -588,7 +588,7 @@ dspacestatistics-> (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DE
    • Basically, it turns out that using facet.mincount=1 is really beneficial for me because it reduces the size of the Solr result set, reduces the amount of data we need to ingest into PostgreSQL, and the API returns HTTP 404 Not Found for items without views or downloads anyways
    • I deployed the new version on CGSpace and now it looks pretty good!
    -
    Indexing item views (page 28 of 753)
    +
    Indexing item views (page 28 of 753)
     ...
     Indexing item downloads (page 260 of 260)
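
    • For reference, the facet.mincount=1 trick corresponds to a Solr facet query along these lines (the filters and paging parameters here are illustrative, not copied from indexer.py):

```console
$ http 'http://localhost:8081/solr/statistics/select?q=type:2&fq=isBot:false&fq=statistics_type:view&facet=true&facet.field=id&facet.mincount=1&facet.limit=100&facet.offset=0&rows=0&wt=json'
```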
     
      @@ -606,12 +606,12 @@ Indexing item downloads (page 260 of 260)
    • I will have to keep an eye on that over the next few weeks to see if things stay as they are
    • I did a batch replacement of the access rights with my fix-metadata-values.py script on DSpace Test:
    -
    $ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
    +
    $ ./fix-metadata-values.py -i /tmp/fix-access-status.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.status -t correct -m 206
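
    • The input CSV for that run presumably looks something like this, with one column named after the -f flag and one named after the -t flag (hypothetical contents):

```console
$ cat /tmp/fix-access-status.csv
cg.identifier.status,correct
Open Access,Unrestricted Access
Limited Access,Restricted Access
```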
     
    • This changes “Open Access” to “Unrestricted Access” and “Limited Access” to “Restricted Access”
    • After that I did a full Discovery reindex:
    -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    77m3.755s
     user    7m39.785s
    @@ -629,7 +629,7 @@ sys     2m18.485s
     
  • Linode emailed to say that CGSpace’s (linode19) CPU load was high for a few hours last night
  • Looking in the nginx logs around that time I see some new IPs that look like they are harvesting things:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "26/Sep/2018:(19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         295 34.218.226.147
         296 66.249.64.95
         350 157.55.39.185
    @@ -645,7 +645,7 @@ sys     2m18.485s
     
  • 68.6.87.12 is on Cox Communications in the US (?)
  • These hosts are not using proper user agents and are not re-using their Tomcat sessions:
  • -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-09-26 | sort | uniq
     5423
     $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26 | sort | uniq
     758
    @@ -659,12 +659,12 @@ $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=68.6.87.12' dspace.log.2018-09-26
     
  • Peter sent me a list of 43 author names to fix, but it had the usual encoding errors in names like Belalcázar, John (I will tell him to stop trying to export as UTF-8 because it never seems to work)
  • I did batch replaces for both on CGSpace with my fix-metadata-values.py script:
  • -
    $ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
    +
    $ ./fix-metadata-values.py -i 2018-09-29-fix-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
     $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
     
    • Afterwards I started a full Discovery re-index:
    -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
    • Linode sent an alert that both CGSpace and DSpace Test were using high CPU for the last two hours
    • It seems to be Moayad trying to do the AReS explorer indexing
    • @@ -675,18 +675,18 @@ $ ./fix-metadata-values.py -i 2018-09-29-fix-authors.csv -db dspace -u dspace -p
    • Valerio keeps sending items on CGSpace that have weird or incorrect languages, authors, etc
    • I think I should just batch export and update all languages…
    -
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2018-09-30-languages.csv with csv;
     
    • Then I can simply delete the “Other” and “other” ones because that’s not useful at all:
    -
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
    +
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Other';
     DELETE 6
     dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='other';
     DELETE 79
     
    • Looking through the list I see some weird language codes like gh, so I checked out those items:
    -
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
    +
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
      resource_id
     -------------
            94530
    @@ -699,12 +699,12 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
     
    • Those items are from Ghana, so the submitter apparently thought gh was a language… I can safely delete them:
    -
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
    +
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
     DELETE 2
     
    • The next issue would be jn:
    -
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
    +
    dspace=# SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='jn';
      resource_id
     -------------
            94001
    @@ -718,7 +718,7 @@ dspace=# SELECT handle,item_id FROM item, handle WHERE handle.resource_type_id=2
     
  • Those items are about Japan, so I will update them to be ja
  • Other replacements:
  • -
    DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
    +
    DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='gh';
     UPDATE metadatavalue SET text_value='fr' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='fn';
     UPDATE metadatavalue SET text_value='hi' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='in';
     UPDATE metadatavalue SET text_value='ja' WHERE resource_type_id=2 AND metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'language' and qualifier = 'iso') AND text_value='Ja';
    diff --git a/docs/2018-10/index.html b/docs/2018-10/index.html
    index 61378afe2..ca734de48 100644
    --- a/docs/2018-10/index.html
    +++ b/docs/2018-10/index.html
    @@ -26,7 +26,7 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
     Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
     I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now
     "/>
    -
    +
     
     
         
    @@ -121,7 +121,7 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
     
    • I see Moayad was busy collecting item views and downloads from CGSpace yesterday:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         933 40.77.167.90
         971 95.108.181.88
        1043 41.204.190.40
    @@ -135,18 +135,18 @@ I created a GitHub issue to track this #389, because I’m super busy in Nai
     
    • Of those, about 20% were HTTP 500 responses (!):
    -
    $ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
    +
    $ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
      118927 200
       31435 500
     
    • I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for cg.creator.orcid and then re-generated the names using my resolve-orcids.py script:
    -
    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
    +
    $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
     $ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
     
    • I found a new corner case error that I need to check, given and family names deactivated:
    -
    Looking up the names associated with ORCID iD: 0000-0001-7930-5752
    +
    Looking up the names associated with ORCID iD: 0000-0001-7930-5752
     Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
     
    • It appears to be Jim Lorenzen… I need to check that later!
    • @@ -154,7 +154,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    • Linode sent another alert about CPU usage on CGSpace (linode18) this evening
    • It seems that Moayad is making quite a lot of requests today:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1594 157.55.39.160
        1627 157.55.39.173
        1774 136.243.6.84
    @@ -169,29 +169,29 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
     
  • But in super positive news, he says they are using my new dspace-statistics-api and it’s MUCH faster than using Atmire CUA’s internal “restlet” API
  • I don’t recognize the 138.201.49.199 IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:
  • -
    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
    +
    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
        8324 GET /bitstream
        4193 GET /handle
     
    • Suspiciously, it’s only grabbing the CGIAR System Office community (handle prefix 10947):
    -
    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
    +
    # grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
           7 GET /handle/10568
        4186 GET /handle/10947
     
    • The user agent is suspicious too:
    -
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
    +
    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
     
    • It’s clearly a bot and it’s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list
    • I looked in Solr’s statistics core and these hits were actually all counted as isBot:false (of course)… hmmm
    • I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my add-orcid-identifiers-csv.py script:
    -
    $ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
     
    • Where 2018-10-03-add-orcids.csv contained:
    -
    dc.contributor.author,cg.creator.id
    +
    dc.contributor.author,cg.creator.id
     "Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
     "Henson, S.",Sonal Henson: 0000-0002-2002-5462
     "Thornton, P.K.",Philip Thornton: 0000-0002-1854-0182
    @@ -214,7 +214,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
     
  • So it’s fixed, but I’m not sure why!
  • Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):
  • -
    # zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
    +
    # zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
     251226
     
    • I found a logic error in the dspace-statistics-api indexer.py script that was causing item views to be inserted into downloads
    • @@ -242,7 +242,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    • Peter noticed that some recently added PDFs don’t have thumbnails
    • When I tried to force them to be generated I got an error that I’ve never seen before:
    -
    $ dspace filter-media -v -f -i 10568/97613
    +
    $ dspace filter-media -v -f -i 10568/97613
     org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
     
    • I see there was an update to Ubuntu’s ImageMagick on 2018-10-05, so maybe something changed or broke?
    • @@ -251,7 +251,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
    • Wow, someone on Twitter posted about this breaking his web application (and it was retweeted by the ImageMagick account!)
    • I commented out the line that disables PDF thumbnails in /etc/ImageMagick-6/policy.xml:
    -
      <!--<policy domain="coder" rights="none" pattern="PDF" />-->
    +
      <!--<policy domain="coder" rights="none" pattern="PDF" />-->
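
    • A quick way to confirm the change took effect is to list ImageMagick's active policies; once the PDF coder policy is commented out this should print nothing:

```console
$ identify -list policy | grep -i -A3 pdf
```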
     
    • This works, but I’m not sure what ImageMagick’s long-term plan is if they are going to disable ALL image formats…
    • I suppose I need to enable a workaround for this in Ansible?
    • @@ -261,7 +261,7 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
    • I emailed DuraSpace to update our entry in their DSpace registry (the data was still on DSpace 3, JSPUI, etc)
    • Generate a list of the top 1500 values for dc.subject so Sisay can start making a controlled vocabulary for it:
    -
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
     COPY 1500
     
    • Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!
    • @@ -269,7 +269,7 @@ COPY 1500
    • Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using <meta> tags in their page header, and using “dct:identifier” property instead of “dc:identifier”
    • I re-created my local DSpace database container using podman instead of Docker:
    -
    $ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
    +
    $ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
     $ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ sudo podman start dspacedb
     $ createuser -h localhost -U postgres --pwprompt dspacetest
    @@ -283,13 +283,13 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
     
  • I can pull the docker.bintray.io/jfrog/artifactory-oss:latest image, but not start it
  • I decided to use a Sonatype Nexus repository instead:
  • -
    $ mkdir -p ~/.local/lib/containers/volumes/nexus_data
    +
    $ mkdir -p ~/.local/lib/containers/volumes/nexus_data
     $ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
     
    • With a few changes to my local Maven settings.xml it is working well
    • Generate a list of the top 10,000 authors for Peter Ballantyne to look through:
    -
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
     COPY 10000
     
    • CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections
    • @@ -301,7 +301,7 @@ COPY 10000
    • Look through Peter’s list of 746 author corrections in OpenRefine
    • I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:
    -
    or(
    +
    or(
       isNotNull(value.match(/.*\uFFFD.*/)),
       isNotNull(value.match(/.*\u00A0.*/)),
       isNotNull(value.match(/.*\u200A.*/)),
    @@ -311,7 +311,7 @@ COPY 10000
     
    • Then I exported and applied them on my local test server:
    -
    $ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
    +
    $ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
     
    • I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay’s author controlled vocabulary
    @@ -321,7 +321,7 @@ COPY 10000
  • Switch to the new CGIAR LDAP server on CGSpace, as it’s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)
  • Apply Peter’s 746 author corrections on CGSpace and DSpace Test using my fix-metadata-values.py script:
  • -
    $ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
    +
    $ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
     
    • Run all system updates on CGSpace (linode19) and reboot the server
    • After rebooting the server I noticed that Handles are not resolving, and the dspace-handle-server systemd service is not running (or rather, it exited with success)
    • @@ -352,7 +352,7 @@ COPY 10000
    • I ended up having some issues with podman and went back to Docker, so I had to re-create my containers:
    -
    $ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
    +
    $ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
     $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ createuser -h localhost -U postgres --pwprompt dspacetest
     $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
    @@ -364,12 +364,12 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
     
    • Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:
    -
    dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
    +
    dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
     
    • Talking to the CodeObia guys about the REST API, I started to wonder why it’s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it
    • Interestingly, the speed doesn’t get better after you request the same thing multiple times–it’s consistently bad on both CGSpace and DSpace Test!
    -
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
    +
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.35s user 0.06s system 1% cpu 25.133 total
     0.31s user 0.04s system 1% cpu 25.223 total
    @@ -389,7 +389,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     
  • I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?
  • I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!
  • -
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
    +
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.20s user 0.03s system 0% cpu 25.017 total
     0.23s user 0.02s system 1% cpu 23.299 total
    @@ -399,7 +399,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
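
    • For reference, the G1GC switch just means swapping the collector flag wherever Tomcat's JAVA_OPTS are set, roughly like this (the heap sizes here are placeholders, not the real values from DSpace Test):

```console
JAVA_OPTS="-Djava.awt.headless=true -Xms3072m -Xmx3072m -XX:+UseG1GC -Dfile.encoding=UTF-8"
```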
     
    • If I make a request without the expands it is ten times faster:
    -
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&offset=0'
    +
    $ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&offset=0'
     ...
     0.20s user 0.03s system 7% cpu 3.098 total
     0.22s user 0.03s system 8% cpu 2.896 total
    @@ -414,7 +414,7 @@ $ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,b
     
  • Most of them are from Bioversity, and I asked Maria for permission before updating them
  • I manually went through and looked at the existing values and updated them in several batches:
  • -
    UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
    +
    UPDATE metadatavalue SET text_value='CC-BY-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%CC BY %';
     UPDATE metadatavalue SET text_value='CC-BY-NC-ND-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-ND%' AND text_value LIKE '%by-nc-nd%';
     UPDATE metadatavalue SET text_value='CC-BY-NC-SA-4.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%4.0%' AND text_value LIKE '%BY-NC-SA%' AND text_value LIKE '%by-nc-sa%';
     UPDATE metadatavalue SET text_value='CC-BY-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%3.0%' AND text_value LIKE '%/by/%';
    @@ -436,7 +436,7 @@ UPDATE metadatavalue SET text_value='CC-BY-NC-4.0' WHERE resource_type_id=2 AND
     
  • Ask Jane if we can use some of the BDP money to host AReS explorer on a more powerful server
  • IWMI sent me a list of new ORCID identifiers for their staff so I combined them with our list, updated the names with my resolve-orcids.py script, and regenerated the controlled vocabulary:
  • -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq >
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml MEL\ ORCID.json MEL\ ORCID_V2.json 2018-10-17-IWMI-ORCIDs.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq >
     2018-10-17-orcids.txt
     $ ./resolve-orcids.py -i 2018-10-17-orcids.txt -o 2018-10-17-names.txt -d
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    @@ -444,7 +444,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
  • I also decided to add the ORCID identifiers that MEL had sent us a few months ago…
  • One problem I had with the resolve-orcids.py script is that one user seems to have disabled their profile data since we last updated:
  • -
    Looking up the names associated with ORCID iD: 0000-0001-7930-5752
    +
    Looking up the names associated with ORCID iD: 0000-0001-7930-5752
     Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
     
    • So I need to handle that situation in the script for sure, but I’m not sure what to do organizationally or ethically, since that user disabled their name! Do we remove him from the list?
    • @@ -457,7 +457,7 @@ Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
    • After they do some tests and we check the values, Enrico will send a formal email to Peter et al. to ask that they start depositing officially
    • I upgraded PostgreSQL to 9.6 on DSpace Test using Ansible, then had to manually migrate from 9.5 to 9.6:
    -
    # su - postgres
    +
    # su - postgres
     $ /usr/lib/postgresql/9.6/bin/pg_upgrade -b /usr/lib/postgresql/9.5/bin -B /usr/lib/postgresql/9.6/bin -d /var/lib/postgresql/9.5/main -D /var/lib/postgresql/9.6/main -o ' -c config_file=/etc/postgresql/9.5/main/postgresql.conf' -O ' -c config_file=/etc/postgresql/9.6/main/postgresql.conf'
     $ exit
     # systemctl start postgresql
    @@ -468,7 +468,7 @@ $ exit
     
  • Linode emailed me to say that CGSpace (linode18) had high CPU usage for a few hours this afternoon
  • Looking at the nginx logs around that time I see the following IPs making the most requests:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Oct/2018:(12|13|14|15)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Oct/2018:(12|13|14|15)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         361 207.46.13.179
         395 181.115.248.74
         485 66.249.64.93
    @@ -487,7 +487,7 @@ $ exit
     
  • I was going to try to run Solr in Docker because I learned I can run Docker on Travis-CI (for testing my dspace-statistics-api), but the oldest official Solr images are for 5.5, and DSpace’s Solr configuration is for 4.9
  • This means our existing Solr configuration doesn’t run in Solr 5.5:
  • -
    $ sudo docker pull solr:5
    +
    $ sudo docker pull solr:5
     $ sudo docker run --name my_solr -v ~/dspace/solr/statistics/conf:/tmp/conf -d -p 8983:8983 -t solr:5
     $ sudo docker logs my_solr
     ...
    @@ -498,7 +498,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
     
  • Linode sent a message that the CPU usage was high on CGSpace (linode18) last night
  • According to the nginx logs around that time it was 5.9.6.51 (MegaIndex) again:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Oct/2018:(14|15|16)" | awk '{print $1}' | sort
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "20/Oct/2018:(14|15|16)" | awk '{print $1}' | sort
      | uniq -c | sort -n | tail -n 10
         249 207.46.13.179
         250 157.55.39.173
    @@ -513,7 +513,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
     
    • This bot is only using the XMLUI and it does not seem to be re-using its sessions:
    -
    # grep -c 5.9.6.51 /var/log/nginx/*.log
    +
    # grep -c 5.9.6.51 /var/log/nginx/*.log
     /var/log/nginx/access.log:9323
     /var/log/nginx/error.log:0
     /var/log/nginx/library-access.log:0
    @@ -525,7 +525,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
     
    • Last month I added “crawl” to the Tomcat Crawler Session Manager Valve’s regular expression matching, and it seems to be working for MegaIndex’s user agent:
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'"Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"'
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1' User-Agent:'"Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"'
     
    • So I’m not sure why this bot uses so many sessions — is it because it requests very slowly?
    @@ -539,7 +539,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
  • Change build.properties to use HTTPS for Handles in our Ansible infrastructure playbooks
  • We will still need to do a batch update of the dc.identifier.uri and other fields in the database:
  • -
    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
    +
    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
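
    • A query along these lines (a sketch reusing the metadatafieldregistry subquery pattern from elsewhere in these notes, not a command from my history) would catch identifier URIs that do not point at hdl.handle.net:

```console
dspace=# SELECT resource_id, text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'identifier' AND qualifier = 'uri') AND text_value NOT LIKE '%hdl.handle.net%';
```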
     
    • While I was doing that I found two items using CGSpace URLs instead of handles in their dc.identifier.uri so I corrected those
    • I also found several items that had invalid characters or multiple Handles in some related URL field like cg.link.reference so I corrected those too
    • @@ -547,7 +547,7 @@ ERROR: Error CREATEing SolrCore 'statistics': Unable to create core [statistics]
    • I deployed the changes on CGSpace, ran all system updates, and rebooted the server
    • Also, I updated all Handles in the database to use HTTPS:
    -
    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
    +
    dspace=# UPDATE metadatavalue SET text_value=replace(text_value, 'http://', 'https://') WHERE resource_type_id=2 AND text_value LIKE 'http://hdl.handle.net%';
     UPDATE 76608
     
    • Skype with Peter about ToRs for the AReS open source work and future plans to develop tools around the DSpace ecosystem
    • @@ -560,14 +560,14 @@ UPDATE 76608
    • I emailed the MARLO guys to ask if they can send us a dump of rights data and Handles from their system so we can tag our older items on CGSpace
    • Testing REST login and logout via httpie because Felix from Earlham says he’s having issues:
    -
    $ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
    +
    $ http --print b POST 'https://dspacetest.cgiar.org/rest/login' email='testdeposit@cgiar.org' password=deposit
     acef8a4a-41f3-4392-b870-e873790f696b
     
     $ http POST 'https://dspacetest.cgiar.org/rest/logout' rest-dspace-token:acef8a4a-41f3-4392-b870-e873790f696b
     
    • Also works via curl (login, check status, logout, check status):
    -
    $ curl -H "Content-Type: application/json" --data '{"email":"testdeposit@cgiar.org", "password":"deposit"}' https://dspacetest.cgiar.org/rest/login
    +
    $ curl -H "Content-Type: application/json" --data '{"email":"testdeposit@cgiar.org", "password":"deposit"}' https://dspacetest.cgiar.org/rest/login
     e09fb5e1-72b0-4811-a2e5-5c1cd78293cc
     $ curl -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "rest-dspace-token: e09fb5e1-72b0-4811-a2e5-5c1cd78293cc" https://dspacetest.cgiar.org/rest/status
     {"okay":true,"authenticated":true,"email":"testdeposit@cgiar.org","fullname":"Test deposit","token":"e09fb5e1-72b0-4811-a2e5-5c1cd78293cc"}
    diff --git a/docs/2018-11/index.html b/docs/2018-11/index.html
    index 7d27411d7..985a82331 100644
    --- a/docs/2018-11/index.html
    +++ b/docs/2018-11/index.html
    @@ -36,7 +36,7 @@ Send a note about my dspace-statistics-api to the dspace-tech mailing list
     Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
     Today these are the top 10 IPs:
     "/>
    -
    +
     
     
         
    @@ -132,7 +132,7 @@ Today these are the top 10 IPs:
     
  • Linode has been sending mails a few times a day recently that CGSpace (linode18) has had high CPU usage
  • Today these are the top 10 IPs:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1300 66.249.64.63
        1384 35.237.175.180
        1430 138.201.52.218
    @@ -148,22 +148,22 @@ Today these are the top 10 IPs:
     
  • 70.32.83.92 is well known, probably CCAFS or something, as it’s only a few thousand requests and always to the REST API
  • 84.38.130.177 is some new IP in Latvia that is only hitting the XMLUI, using the following user agent:
  • -
    Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
    +
    Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.792.0 Safari/535.1
     
    • They at least seem to be re-using their Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
     342
     
    • 50.116.102.77 is also a regular REST API user
    • 40.77.167.175 and 207.46.13.156 seem to be Bing
    • 138.201.52.218 seems to be on Hetzner in Germany, but is using this user agent:
    -
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
    +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
     
    • And they don’t seem to be re-using their Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
     1243
     
    • Ah, we’ve apparently seen this server exactly a year ago in 2017-11, making 40,000 requests in one day…
    • @@ -171,7 +171,7 @@ Today these are the top 10 IPs:
    • Linode sent a mail that CGSpace (linode18) is using high outgoing bandwidth
    • Looking at the nginx logs again I see the following top ten IPs:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1979 50.116.102.77
        1980 35.237.175.180
        2186 207.46.13.156
    @@ -185,11 +185,11 @@ Today these are the top 10 IPs:
     
    • 78.46.89.18 is new since I last checked a few hours ago, and it’s from Hetzner with the following user agent:
    -
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
    +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
     
    • It’s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
     8449
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
     1
    @@ -200,7 +200,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
     
  • I think it’s reasonable for a human to click one of those links five or ten times a minute…
  • To contrast, 78.46.89.18 made about 300 requests per minute for a few hours today:
  • -
    # grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
    +
    # grep 78.46.89.18 /var/log/nginx/access.log | grep -o -E '03/Nov/2018:[0-9][0-9]:[0-9][0-9]' | sort | uniq -c | sort -n | tail -n 20
         286 03/Nov/2018:18:02
         287 03/Nov/2018:18:21
         289 03/Nov/2018:18:23
    @@ -232,7 +232,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
     
  • Linode emailed about the CPU load and outgoing bandwidth on CGSpace (linode18) again
  • Here are the top ten IPs active so far this morning:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1083 2a03:2880:11ff:2::face:b00c
        1105 2a03:2880:11ff:d::face:b00c
        1111 2a03:2880:11ff:f::face:b00c
    @@ -246,7 +246,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
     
    • 78.46.89.18 is back… and it is still actually re-using its Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
     8765
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
     1
    @@ -254,7 +254,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
     
  • Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly
  • Also, now we have a ton of Facebook crawlers:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
         905 2a03:2880:11ff:b::face:b00c
         955 2a03:2880:11ff:5::face:b00c
         965 2a03:2880:11ff:e::face:b00c
    @@ -275,18 +275,18 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
     
    • They are really making shit tons of requests:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
     37721
     
    • Updated on 2018-12-04 to correct the grep command to accurately show the number of requests
    • Their user agent is:
    -
    facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
    +
    facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
     
    • I will add it to the Tomcat Crawler Session Manager valve
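  • For reference, that valve is configured in Tomcat's server.xml; a minimal sketch of what the entry could look like with facebookexternalhit appended to the default crawlerUserAgents regex (the existing regex shown here is an assumption, not our literal config):
    <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
           crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*facebookexternalhit.*" />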
    • Later in the evening… ok, this Facebook bot is getting super annoying:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Nov/2018" | grep "2a03:2880:11ff:" | awk '{print $1}' | sort | uniq -c | sort -n
        1871 2a03:2880:11ff:3::face:b00c
        1885 2a03:2880:11ff:b::face:b00c
        1941 2a03:2880:11ff:8::face:b00c
    @@ -307,7 +307,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
     
    • Now at least the Tomcat Crawler Session Manager Valve seems to be forcing it to re-use some Tomcat sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
     37721
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
     15206
    @@ -315,7 +315,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
     
  • I think we still need to limit more of the dynamic pages, like the “most popular” country, item, and author pages
  • It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!
  • -
    # grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
    +
    # grep 'face:b00c' /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c 'most-popular/'
     7033
     
    • I added the “most-popular” pages to the list that return X-Robots-Tag: none to try to inform bots not to index or follow those pages
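  • In nginx that is just an extra header on the matching locations; roughly something like this (a sketch of the idea, not our exact configuration):
    location ~ /most-popular/ {
        add_header X-Robots-Tag "none";
    }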
    • @@ -325,14 +325,14 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
      • I wrote a small Python script add-dc-rights.py to add usage rights (dc.rights) to CGSpace items based on the CSV Hector gave me from MARLO:
      -
      $ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
      +
      $ ./add-dc-rights.py -i /tmp/marlo.csv -db dspace -u dspace -p 'fuuu'
       
      • The file marlo.csv was cleaned up and formatted in Open Refine
      • 165 of the items in their 2017 data are from CGSpace!
      • I will add the data to CGSpace this week (done!)
      • Jesus, is Facebook trying to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep -c "2a03:2880:11ff:"
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep -c "2a03:2880:11ff:"
       29889
       # grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
       29763
      @@ -350,7 +350,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
       
    • While I was updating the rest-find-collections.py script I noticed it was using expand=all to get the collection and community IDs
    • I realized I actually only need expand=collections,subCommunities, and I wanted to see how much overhead the extra expands created so I did three runs of each:
    -
    $ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
    +
    $ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
     
    • Average time with all expands was 14.3 seconds, and 12.8 seconds with collections,subCommunities, so 1.5 seconds difference!
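  • The same comparison can be made without the script by timing the REST requests directly, something like this (a sketch; the endpoint and limit are just examples):
    $ time http 'https://dspacetest.cgiar.org/rest/communities?expand=all&limit=100' > /dev/null
    $ time http 'https://dspacetest.cgiar.org/rest/communities?expand=collections,subCommunities&limit=100' > /dev/null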
    @@ -403,22 +403,22 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11
    • Testing corrections and deletions for AGROVOC (dc.subject) that Sisay and Peter were working on earlier this month:
    -
    $ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
    +
    $ ./fix-metadata-values.py -i 2018-11-19-correct-agrovoc.csv -f dc.subject -t correct -m 57 -db dspace -u dspace -p 'fuu' -d
     $ ./delete-metadata-values.py -i 2018-11-19-delete-agrovoc.csv -f dc.subject -m 57 -db dspace -u dspace -p 'fuu' -d
     
    • Then I ran them on both CGSpace and DSpace Test, and started a full Discovery re-index on CGSpace:
    -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
    • Generate a new list of the top 1500 AGROVOC subjects on CGSpace to send to Peter and Sisay:
    -
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
     

    2018-11-20

    • The Discovery re-indexing on CGSpace never finished yesterday… the command died after six minutes
    • The dspace.log.2018-11-19 shows this at the time:
    -
    2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
    +
    2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
     java.lang.IllegalStateException: DSpace kernel cannot be null
             at org.dspace.utils.DSpace.getServiceManager(DSpace.java:63)
             at org.dspace.utils.DSpace.getSingletonService(DSpace.java:87)
    @@ -479,13 +479,13 @@ java.lang.IllegalStateException: DSpace kernel cannot be null
     
  • This WLE item is issued on 2018-10 and accessioned on 2018-10-22 but does not show up in the WLE R4D Learning Series collection on CGSpace for some reason, and therefore does not show up on the WLE publication website
  • I tried to remove that collection from Discovery and do a simple re-index:
  • -
    $ dspace index-discovery -r 10568/41888
    +
    $ dspace index-discovery -r 10568/41888
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
     
    • … but the item still doesn’t appear in the collection
    • Now I will try a full Discovery re-index:
    -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
    • Ah, Marianne had set the item as private when she uploaded it, so it was still private
    • I made it public and now it shows up in the collection list
    • @@ -497,7 +497,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
    • Linode alerted me that the outbound traffic rate on CGSpace (linode19) was very high
    • The top users this morning are:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Nov/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         229 46.101.86.248
         261 66.249.64.61
         447 66.249.64.59
    @@ -512,7 +512,7 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery
     
  • We know 70.32.83.92 is the CCAFS harvester on MediaTemple, but 205.186.128.185 appears to be a new CCAFS harvester
  • I think we might want to prune some old accounts from CGSpace; users who haven’t logged in during the last two years would be a conservative place to start:
  • -
    $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
    +
    $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
     409
     $ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
     
diff --git a/docs/2018-12/index.html b/docs/2018-12/index.html
index f0b11f550..c98fd3399 100644
--- a/docs/2018-12/index.html
+++ b/docs/2018-12/index.html
@@ -36,7 +36,7 @@ Then I ran all system updates and restarted the server
I noticed that there is another issue with PDF thumbnails on CGSpace, and I see there was another Ghostscript vulnerability last week
"/>
-
+
@@ -135,7 +135,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
      • The error when I try to manually run the media filter for one item from the command line:
      -
      org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
      +
      org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
       org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-12989PcFN0DnJOej7%d" "-f/tmp/magick-129895Bmp44lvUfxo" "-f/tmp/magick-12989C0QFG51fktLF"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
               at org.im4java.core.Info.getBaseInfo(Info.java:360)
               at org.im4java.core.Info.<init>(Info.java:151)
      @@ -157,13 +157,13 @@ org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.c
       
    • I think we need to wait for a fix from Ubuntu
    • For what it’s worth, I get the same error on my local Arch Linux environment with Ghostscript 9.26:
    -
    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
    +
    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
     DEBUG: FC_WEIGHT didn't match
     zsh: segmentation fault (core dumped)  gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
     
    • When I replace the pngalpha device with png16m as suggested in the StackOverflow comments it works:
    -
    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
    +
    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Food\ safety\ Kenya\ fruits.pdf
     DEBUG: FC_WEIGHT didn't match
     
  • Start proofing the latest round of 226 IITA archive records that Bosede sent last week and Sisay uploaded to DSpace Test this weekend (IITA_Dec_1_1997 aka Daniel1807)
@@ -182,7 +182,7 @@ DEBUG: FC_WEIGHT didn't match
    • Expand my “encoding error” detection GREL to include ~ as I saw a lot of that in some copy pasted French text recently:
    -
    or(
    +
    or(
       isNotNull(value.match(/.*\uFFFD.*/)),
       isNotNull(value.match(/.*\u00A0.*/)),
       isNotNull(value.match(/.*\u200A.*/)),
    @@ -196,29 +196,29 @@ DEBUG: FC_WEIGHT didn't match
     
  • I can successfully generate a thumbnail for another recent item (10568/98394), but not for 10568/98930
  • Even manually on my Arch Linux desktop with ghostscript 9.26-1 and the pngalpha device, I can generate a thumbnail for the first one (10568/98394):
  • -
    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
    +
    $ gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 -sDEVICE=pngalpha -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r72x72 -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/out%d -f/home/aorth/Desktop/Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf
     
    • So it seems to be something about the PDFs themselves, perhaps related to alpha support?
    • The first item (10568/98394) has the following information:
    -
    $ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
    +
    $ identify Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf\[0\]
     Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf[0]=>Info Note Mainstreaming gender and social differentiation into CCAFS research activities in West Africa-converted.pdf PDF 595x841 595x841+0+0 16-bit sRGB 107443B 0.000u 0:00.000
     identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1746.
     
    • And wow, I can’t even run ImageMagick’s identify on the first page of the second item (10568/98930):
    -
    $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
    +
    $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     
    • But with GraphicsMagick’s identify it works:
    -
    $ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
    +
    $ gm identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     DEBUG: FC_WEIGHT didn't match
     Food safety Kenya fruits.pdf PDF 612x792+0+0 DirectClass 8-bit 1.4Mi 0.000u 0m:0.000002s
     
    -
    $ identify Food\ safety\ Kenya\ fruits.pdf
    +
    $ identify Food\ safety\ Kenya\ fruits.pdf
     Food safety Kenya fruits.pdf[0] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
     Food safety Kenya fruits.pdf[1] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
     Food safety Kenya fruits.pdf[2] PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.010u 0:00.009
    @@ -228,7 +228,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
    • As I expected, ImageMagick cannot generate a thumbnail, but GraphicsMagick can (though it looks like crap):
    -
    $ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
    +
    $ convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
     zsh: abort (core dumped)  convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten
     $ gm convert Food\ safety\ Kenya\ fruits.pdf\[0\] -thumbnail 600x600 -flatten Food\ safety\ Kenya\ fruits.pdf.jpg
     DEBUG: FC_WEIGHT didn't match
    @@ -236,7 +236,7 @@ DEBUG: FC_WEIGHT didn't match
     
  • I inspected the troublesome PDF using jhove and noticed that it is using ISO PDF/A-1, Level B, while the other one doesn’t list a profile, though I don’t think this is relevant
  • I found another item that fails when generating a thumbnail (10568/98391); DSpace complains:
  • -
    org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
    +
    org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
     org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: FailedToExecuteCommand `"gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-142966vQs5Di64ntH%d" "-f/tmp/magick-14296Q0rJjfCeIj3w" "-f/tmp/magick-14296k_K6MWqwvpDm"' (-1) @ error/delegate.c/ExternalDelegateCommand/461.
             at org.im4java.core.Info.getBaseInfo(Info.java:360)
             at org.im4java.core.Info.<init>(Info.java:151)
    @@ -265,16 +265,16 @@ Caused by: org.im4java.core.CommandException: identify: FailedToExecuteCommand `
     
    • And on my Arch Linux environment ImageMagick’s convert also segfaults:
    -
    $ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
    +
    $ convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
     zsh: abort (core dumped)  convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\]  x60
     
    • But GraphicsMagick’s convert works:
    -
    $ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
    +
    $ gm convert bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf\[0\] -thumbnail x600 -flatten bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf.jpg
     
    • So far the only thing that stands out is that the two files that don’t work were created with Microsoft Office 2016:
    -
    $ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
    +
    $ pdfinfo bnfb_biofortification\ Module_Participants\ Guide\ 2018.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word 2016
     Producer:       Microsoft® Word 2016
     $ pdfinfo Food\ safety\ Kenya\ fruits.pdf | grep -E '^(Creator|Producer)'
    @@ -283,13 +283,13 @@ Producer:       Microsoft® Word 2016
     
    • And the one that works was created with Office 365:
    -
    $ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
    +
    $ pdfinfo Info\ Note\ Mainstreaming\ gender\ and\ social\ differentiation\ into\ CCAFS\ research\ activities\ in\ West\ Africa-converted.pdf | grep -E '^(Creator|Producer)'
     Creator:        Microsoft® Word for Office 365
     Producer:       Microsoft® Word for Office 365
     
    • I remembered an old technique I was using to generate thumbnails in 2015 using Inkscape followed by ImageMagick or GraphicsMagick:
    -
    $ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
    +
    $ inkscape Food\ safety\ Kenya\ fruits.pdf -z --export-dpi=72 --export-area-drawing --export-png='cover.png'
     $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
     
    • I’ve tried a few times this week to register for the Ethiopian eVisa website, but it is never successful
    • @@ -304,7 +304,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
    -
    2018-12-03 15:44:00,030 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
    +
    2018-12-03 15:44:00,030 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
     2018-12-03 15:44:03,390 ERROR com.atmire.app.webui.servlet.ExportServlet @ Error converter plugin not found: interface org.infoCon.ConverterPlugin
     ...
     2018-12-03 15:45:01,667 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-listing-and-reports not found
    @@ -312,7 +312,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
     
  • I tested it on my local environment with Tomcat 8.5.34 and the JSPUI application still has an error (again, the logs show something about the tag cloud, so it may be unrelated), and Listings and Reports still asks you to log in again, despite already being logged in to XMLUI, but it does appear to work (I generated a report and exported a PDF)
  • I think the errors about missing Atmire components must be important, as I see them here on my local machine as well (though not the one about atmire-listings-and-reports):
  • -
    2018-12-03 16:44:00,009 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
    +
    2018-12-03 16:44:00,009 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
     
  • This has got to be partly Ubuntu’s Tomcat packaging, and partly DSpace 5.x’s readiness for Tomcat 8.5…?
    @@ -320,7 +320,7 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
    • Last night Linode sent a message that the load on CGSpace (linode18) was too high, here’s a list of the top users at the time and throughout the day:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         225 40.77.167.142
         226 66.249.64.63
         232 46.101.86.248
    @@ -345,30 +345,30 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
     
    • 35.237.175.180 is known to us (CCAFS?), and I’ve already added it to the list of bot IPs in nginx, which appears to be working:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
     4772
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
     630
     
    • I haven’t seen 2a01:4f8:140:3192::2 before. Its user agent is some new bot:
    -
    Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
    +
    Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
     
    • At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
     5111
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
     419
     
    • 78.46.79.71 is another host on Hetzner with the following user agent:
    -
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
    +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
     
    • This is not the first time a host on Hetzner has used a “normal” user agent to make thousands of requests
    • At least it is re-using its Tomcat sessions somehow:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
     2044
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
     1
    @@ -385,7 +385,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
     
  • Linode sent a message that the CPU usage of CGSpace (linode18) is too high last night
  • I looked in the logs and there’s nothing particular going on:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1225 157.55.39.177
        1240 207.46.13.12
        1261 207.46.13.101
    @@ -399,11 +399,11 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
     
    • 54.70.40.11 is some new bot with the following user agent:
    -
    Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)
    +
    Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)
     
    • But Tomcat is forcing them to re-use their Tomcat sessions with the Crawler Session Manager valve:
    -
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
    +
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
     6980
     $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05 | sort | uniq | wc -l
     1156
    @@ -446,7 +446,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
     
  • Linode alerted me twice today that the load on CGSpace (linode18) was very high
  • Looking at the nginx logs I see a few new IPs in the top 10:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "17/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "17/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         927 157.55.39.81
         975 54.70.40.11
        2090 50.116.102.77
    @@ -460,7 +460,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
     
    • 94.71.244.172 and 143.233.227.216 are both in Greece and use the following user agent:
    -
    Mozilla/3.0 (compatible; Indy Library)
    +
    Mozilla/3.0 (compatible; Indy Library)
     
    • I see that I added this bot to the Tomcat Crawler Session Manager valve in 2017-12 so its XMLUI sessions are getting re-used
  • 2a01:4f8:173:1e85::2 is some new bot called BLEXBot/1.0, which should match the existing “bot” pattern in the Tomcat Crawler Session Manager regex
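  • A quick way to confirm would be the same session check as above (a sketch; the log file name depends on the day being checked):
    $ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:173:1e85::2' dspace.log.2018-12-17
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:173:1e85::2' dspace.log.2018-12-17 | sort | uniq | wc -l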
    • @@ -477,7 +477,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=54.70.40.11' dspace.log.2018-12-05
      • Testing compression of PostgreSQL backups with xz and gzip:
      -
      $ time xz -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.xz
      +
      $ time xz -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.xz
       xz -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.xz  48.29s user 0.19s system 99% cpu 48.579 total
       $ time gzip -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.gz
       gzip -c cgspace_2018-12-19.backup > cgspace_2018-12-19.backup.gz  2.78s user 0.09s system 99% cpu 2.899 total
      @@ -492,7 +492,7 @@ $ ls -lh cgspace_2018-12-19.backup*
       
    • Peter asked if we could create a controlled vocabulary for publishers (dc.publisher)
    • I see we have about 3500 distinct publishers:
    -
    # SELECT COUNT(DISTINCT(text_value)) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39;
    +
    # SELECT COUNT(DISTINCT(text_value)) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39;
      count
     -------
       3522
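  • If we go ahead with the controlled vocabulary, the distinct values could be dumped to CSV the same way as the other lists, something like this (a sketch, assuming metadata_field_id 39 is still dc.publisher):
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=39 GROUP BY text_value ORDER BY count DESC) TO /tmp/2018-12-publishers.csv WITH CSV HEADER;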
    @@ -501,17 +501,17 @@ $ ls -lh cgspace_2018-12-19.backup*
     
  • I reverted the metadata changes related to “Unrestricted Access” and “Restricted Access” on DSpace Test because we’re not pushing forward with the new status terms for now
  • Purge remaining Oracle Java 8 stuff from CGSpace (linode18) since we migrated to OpenJDK a few months ago:
  • -
    # dpkg -P oracle-java8-installer oracle-java8-set-default
    +
    # dpkg -P oracle-java8-installer oracle-java8-set-default
     
    • Update usage rights on CGSpace as we agreed with Maria Garruccio and Peter last month:
    -
    $ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
    +
    $ ./fix-metadata-values.py -i /tmp/2018-11-27-update-rights.csv -f dc.rights -t correct -m 53 -db dspace -u dspace -p 'fuu' -d
     Connected to database.
     Fixed 466 occurences of: Copyrighted; Any re-use allowed
     
    • Upgrade PostgreSQL on CGSpace (linode18) from 9.5 to 9.6:
    -
    # apt install postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 postgresql-server-dev-9.6
    +
    # apt install postgresql-9.6 postgresql-client-9.6 postgresql-contrib-9.6 postgresql-server-dev-9.6
     # pg_ctlcluster 9.5 main stop
     # tar -cvzpf var-lib-postgresql-9.5.tar.gz /var/lib/postgresql/9.5
     # tar -cvzpf etc-postgresql-9.5.tar.gz /etc/postgresql/9.5
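  • The usual Ubuntu cluster upgrade then continues roughly like this (a sketch, not necessarily the literal commands I ran):
    # pg_dropcluster --stop 9.6 main
    # pg_upgradecluster 9.5 main
    # pg_dropcluster 9.5 main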
    @@ -525,7 +525,7 @@ Fixed 466 occurences of: Copyrighted; Any re-use allowed
     
  • Run all system updates on CGSpace (linode18) and restart the server
  • Try to run the DSpace cleanup script on CGSpace (linode18), but I get some errors about foreign key constraints:
  • -
    $ dspace cleanup -v
    +
    $ dspace cleanup -v
      - Deleting bitstream information (ID: 158227)
      - Deleting bitstream record from database (ID: 158227)
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    @@ -534,7 +534,7 @@ Error: ERROR: update or delete on table "bitstream" violates foreign k
     
    • As always, the solution is to delete those IDs manually in PostgreSQL:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (158227, 158251);'
     UPDATE 1
     
    • After all that I started a full Discovery reindex to get the index name changes and rights updates
    • @@ -544,7 +544,7 @@ UPDATE 1
    • CGSpace went down today for a few minutes while I was at dinner and I quickly restarted Tomcat
    • The top IP addresses as of this evening are:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         963 40.77.167.152
         987 35.237.175.180
        1062 40.77.167.55
    @@ -558,7 +558,7 @@ UPDATE 1
     
    • And just around the time of the alert:
    -
    # zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E "29/Dec/2018:1(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log.1 /var/log/nginx/*.log.2.gz | grep -E "29/Dec/2018:1(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         115 66.249.66.223
         118 207.46.13.14
         123 34.218.226.147
    diff --git a/docs/2019-01/index.html b/docs/2019-01/index.html
    index 01a08e0cb..a48833203 100644
    --- a/docs/2019-01/index.html
    +++ b/docs/2019-01/index.html
    @@ -50,7 +50,7 @@ I don’t see anything interesting in the web server logs around that time t
         357 207.46.13.1
         903 54.70.40.11
     "/>
    -
    +
     
     
         
    @@ -141,7 +141,7 @@ I don’t see anything interesting in the web server logs around that time t
     
  • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
  • I don’t see anything interesting in the web server logs around that time though:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          92 40.77.167.4
          99 210.7.29.100
         120 38.126.157.45
    @@ -155,7 +155,7 @@ I don’t see anything interesting in the web server logs around that time t
     
    • Analyzing the types of requests made by the top few IPs during that time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 54.70.40.11 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | grep 54.70.40.11 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c
          30 bitstream
         534 discover
         352 handle
    @@ -168,7 +168,7 @@ I don’t see anything interesting in the web server logs around that time t
     
  • It’s not clear to me what was causing the outbound traffic spike
  • Oh nice! The once-per-year cron job for rotating the Solr statistics actually worked now (for the first time ever!):
  • -
    Moving: 81742 into core statistics-2010
    +
    Moving: 81742 into core statistics-2010
     Moving: 1837285 into core statistics-2011
     Moving: 3764612 into core statistics-2012
     Moving: 4557946 into core statistics-2013
    @@ -185,7 +185,7 @@ Moving: 18497180 into core statistics-2018
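  • For reference, the sharding itself is done by DSpace’s stats-util, something like this (a sketch; I’d double check the option before running it by hand):
    $ dspace stats-util -s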
     
    • Update local Docker image for DSpace PostgreSQL, re-using the existing data volume:
    -
    $ sudo docker pull postgres:9.6-alpine
    +
    $ sudo docker pull postgres:9.6-alpine
     $ sudo docker rm dspacedb
     $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
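  • A quick sanity check that the new container came up with the old data volume (a sketch):
    $ sudo docker ps --filter name=dspacedb
    $ psql -h localhost -U postgres -l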
     
      @@ -197,7 +197,7 @@ $ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/d
    • The JSPUI application—which Listings and Reports depends upon—also does not load, though the error is perhaps unrelated:
    -
    2019-01-03 14:45:21,727 INFO  org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
    +
    2019-01-03 14:45:21,727 INFO  org.dspace.browse.BrowseEngine @ anonymous:session_id=9471D72242DAA05BCC87734FE3C66EA6:ip_addr=127.0.0.1:browse_mini:
     2019-01-03 14:45:21,971 INFO  org.dspace.app.webui.discovery.DiscoverUtility @ facets for scope, null: 23
     2019-01-03 14:45:22,115 WARN  org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=9471D72242DAA05BCC87734FE3C66EA6:internal_error:-- URL Was: http://localhost:8080/jspui/internal-error
     -- Method: GET
    @@ -283,7 +283,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
     
    • Linode sent a message last night that CGSpace (linode18) had high CPU usage, but I don’t see anything around that time in the web server logs:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Jan/2019:1(7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         189 207.46.13.192
         217 31.6.77.23
         340 66.249.70.29
    @@ -298,7 +298,7 @@ org.apache.jasper.JasperException: /home.jsp (line: [214], column: [1]) /discove
     
  • I’m thinking about trying to validate our dc.subject terms against AGROVOC webservices
  • There seem to be a few APIs and the documentation is kinda confusing, but I found this REST endpoint that does work well, for example searching for SOIL:
  • -
    $ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en
    +
    $ http http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOIL&lang=en
     HTTP/1.1 200 OK
     Access-Control-Allow-Origin: *
     Connection: Keep-Alive
    @@ -345,7 +345,7 @@ X-Frame-Options: ALLOW-FROM http://aims.fao.org
     
  • The API does not appear to be case sensitive (searches for SOIL and soil return the same thing)
  • I’m a bit confused that there’s no obvious return code or status when a term is not found, for example SOILS:
  • -
    HTTP/1.1 200 OK
    +
    HTTP/1.1 200 OK
     Access-Control-Allow-Origin: *
     Connection: Keep-Alive
     Content-Length: 367
    @@ -381,7 +381,7 @@ X-Frame-Options: ALLOW-FROM http://aims.fao.org
     
  • I guess the results object will just be empty…
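  • One way to detect “not found” is simply to count the results array in the JSON response (a sketch using jq; the response shape is from memory):
    $ http 'http://agrovoc.uniroma2.it/agrovoc/rest/v1/search?query=SOILS&lang=en' | jq '.results | length'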
  • Another way would be to try with SPARQL, perhaps using the Python 2.7 sparql-client:
  • -
    $ python2.7 -m virtualenv /tmp/sparql
    +
    $ python2.7 -m virtualenv /tmp/sparql
     $ . /tmp/sparql/bin/activate
     $ pip install sparql-client ipython
     $ ipython
    @@ -466,7 +466,7 @@ In [14]: for row in result.fetchone():
     
     
  • I am testing the speed of the WorldFish DSpace repository’s REST API and it’s five to ten times faster than CGSpace as I tested in 2018-10:
  • -
    $ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
    +
    $ time http --print h 'https://digitalarchive.worldfishcenter.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     
     0.16s user 0.03s system 3% cpu 5.185 total
     0.17s user 0.02s system 2% cpu 7.123 total
    @@ -474,7 +474,7 @@ In [14]: for row in result.fetchone():
     
    • In other news, Linode sent a mail last night that the CPU load on CGSpace (linode18) was high, here are the top IPs in the logs around those few hours:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "14/Jan/2019:(17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         157 31.6.77.23
         192 54.70.40.11
         202 66.249.64.157
    @@ -599,11 +599,11 @@ In [14]: for row in result.fetchone():
     
    • In the Solr admin UI I see the following error:
    -
    statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
    +
    statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
     
    • Looking in the Solr log I see this:
    -
    2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
    +
    2019-01-16 13:37:55,395 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
     org.apache.solr.common.SolrException: Error opening new searcher
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
         at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    @@ -721,7 +721,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
     
  • For 2019-01 alone the Usage Stats are already around 1.2 million
  • I tried to look in the nginx logs to see how many raw requests there are so far this month and it’s about 1.4 million:
  • -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     1442874
     
     real    0m17.161s
    @@ -786,7 +786,7 @@ sys     0m2.396s
     
    • That’s weird, I logged into DSpace Test (linode19) and it says it has been up for 213 days:
    -
    # w
    +
    # w
      04:46:14 up 213 days,  7:25,  4 users,  load average: 1.94, 1.50, 1.35
     
    • I’ve definitely rebooted it several times in the past few months… according to journalctl -b it was a few weeks ago on 2019-01-02
    • @@ -803,7 +803,7 @@ sys 0m2.396s
    • Investigating running Tomcat 7 on Ubuntu 18.04 with the tarball and a custom systemd package instead of waiting for our DSpace to get compatible with Ubuntu 18.04’s Tomcat 8.5
    • I could either run with a simple tomcat7.service like this:
    -
    [Unit]
    +
    [Unit]
     Description=Apache Tomcat 7 Web Application Container
     After=network.target
     [Service]
    @@ -817,7 +817,7 @@ WantedBy=multi-user.target
     
  • Or try to adapt a real systemd service like Arch Linux’s:
    -
    [Unit]
    +
    [Unit]
     Description=Tomcat 7 servlet container
     After=network.target
     
    @@ -859,7 +859,7 @@ WantedBy=multi-user.target
     
  • I think I might manage this the same way I do the restic releases in the Ansible infrastructure scripts, where I download a specific version and symlink to some generic location without the version number
  • I verified that there is indeed an issue with sharded Solr statistics cores on DSpace, which will cause inaccurate results in the dspace-statistics-api:
  • -
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
    +
    $ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
     <result name="response" numFound="33" start="0">
     $ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
     <result name="response" numFound="241" start="0">
    @@ -868,7 +868,7 @@ $ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&a
     
  • I don’t think the SolrClient library we are currently using supports these types of queries, so we might have to just do raw queries with requests
  • The pysolr library says it supports multicore indexes, but I am not sure it does (or at least not with our setup):
  • -
    import pysolr
    +
    import pysolr
     solr = pysolr.Solr('http://localhost:3000/solr/statistics')
     results = solr.search('type:2', **{'fq': 'isBot:false AND statistics_type:view', 'facet': 'true', 'facet.field': 'id', 'facet.mincount': 1, 'facet.limit': 10, 'facet.offset': 0, 'rows': 0})
     print(results.facets['facet_fields'])
    @@ -876,7 +876,7 @@ print(results.facets['facet_fields'])
     
    • If I double check one item from above, for example 77572, it appears this is only working on the current statistics core and not the shards:
    -
    import pysolr
    +
    import pysolr
     solr = pysolr.Solr('http://localhost:3000/solr/statistics')
     results = solr.search('type:2 id:77572', **{'fq': 'isBot:false AND statistics_type:view'})
     print(results.hits)
    @@ -889,12 +889,12 @@ print(results.hits)
     
  • So I guess I need to figure out how to use join queries and maybe even switch to using raw Python requests with JSON
  • This enumerates the list of Solr cores and returns JSON format:
  • -
    http://localhost:3000/solr/admin/cores?action=STATUS&wt=json
    +
    http://localhost:3000/solr/admin/cores?action=STATUS&wt=json
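  • Piping that through jq would give just the core names (a sketch, assuming the standard Solr STATUS response shape):
    $ http 'http://localhost:3000/solr/admin/cores?action=STATUS&wt=json' | jq -r '.status | keys[]'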
     
  • I think I figured out how to search across shards: I needed to give the whole URL of each other core
    • Now I get more results when I start adding the other statistics cores:
    -
    $ http 'http://localhost:3000/solr/statistics/select?&indent=on&rows=0&q=*:*' | grep numFound<result name="response" numFound="2061320" start="0">
    +
    $ http 'http://localhost:3000/solr/statistics/select?&indent=on&rows=0&q=*:*' | grep numFound<result name="response" numFound="2061320" start="0">
     $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018&indent=on&rows=0&q=*:*' | grep numFound
     <result name="response" numFound="16280292" start="0" maxScore="1.0">
     $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/solr/statistics-2018,localhost:8081/solr/statistics-2017&indent=on&rows=0&q=*:*' | grep numFound
    @@ -913,7 +913,7 @@ $ http 'http://localhost:3000/solr/statistics/select?&shards=localhost:8081/
     
     
     
    -
    $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
    +
    $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics,localhost:8081/solr/statistics-2018' | grep numFound
     <result name="response" numFound="275" start="0" maxScore="12.205825">
     $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view&shards=localhost:8081/solr/statistics-2018' | grep numFound
     <result name="response" numFound="241" start="0" maxScore="12.205825">
    @@ -924,7 +924,7 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=
     
  • I deployed it on CGSpace (linode18) and restarted the indexer as well
  • Linode sent an alert that CGSpace (linode18) was using high CPU this afternoon, the top ten IPs during that time were:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Jan/2019:1(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         155 40.77.167.106
         176 2003:d5:fbda:1c00:1106:c7a0:4b17:3af8
         189 107.21.16.70
    @@ -939,12 +939,12 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=
     
  • 35.237.175.180 is known to us
  • I don’t think we’ve seen 196.191.127.37 before. Its user agent is:
  • -
    Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
    +
    Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36
     
    • Interestingly this IP is located in Addis Ababa…
    • Another interesting one is 154.113.73.30, which is apparently at IITA Nigeria and uses the user agent:
    -
    Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
    +
    Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
     

    2019-01-23

    • Peter noticed that some goo.gl links in our tweets from Feedburner are broken, for example this one from last week:
    • @@ -979,13 +979,13 @@ $ http 'http://localhost:8081/solr/statistics/select?indent=on&rows=0&q=

      I got a list of their collections from the CGSpace XMLUI and then used an SQL query to dump the unique values to CSV:

    -
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/35501', '10568/41728', '10568/49622', '10568/56589', '10568/56592', '10568/65064', '10568/65718', '10568/65719', '10568/67373', '10568/67731', '10568/68235', '10568/68546', '10568/69089', '10568/69160', '10568/69419', '10568/69556', '10568/70131', '10568/70252', '10568/70978'))) group by text_value order by count desc) to /tmp/bioversity-affiliations.csv with csv;
     COPY 1109
     
    • Send a mail to the dspace-tech mailing list about the OpenSearch issue we had with the Livestock CRP
    • Linode sent an alert that CGSpace (linode18) had a high load this morning, here are the top ten IPs during that time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         222 54.226.25.74
         241 40.77.167.13
         272 46.101.86.248
    @@ -1019,7 +1019,7 @@ COPY 1109
     

    Just to make sure these were not uploaded by the user or something, I manually forced the regeneration of these with DSpace’s filter-media:

    -
    $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
    +
    $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98390
     $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace filter-media -v -f -i 10568/98391
     
    • Both of these were successful, so there must have been an update to ImageMagick or Ghostscript in Ubuntu since early 2018-12
    • @@ -1034,7 +1034,7 @@ $ schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace fi
    • I re-compiled Arch’s ghostscript with the patch and then I was able to generate a thumbnail from one of the troublesome PDFs
    • Before and after:
    -
    $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
    +
    $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     zsh: abort (core dumped)  identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     $ identify Food\ safety\ Kenya\ fruits.pdf\[0\]
     Food safety Kenya fruits.pdf[0]=>Food safety Kenya fruits.pdf PDF 612x792 612x792+0+0 16-bit sRGB 64626B 0.000u 0:00.000
    @@ -1044,7 +1044,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
  • I told Atmire to go ahead with the Metadata Quality Module addition based on our 5_x-dev branch (657)
  • Linode sent alerts last night to say that CGSpace (linode18) was using high CPU last night, here are the top ten IPs from the nginx logs around that time:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         305 3.81.136.184
         306 3.83.14.11
         306 52.54.252.47
    @@ -1059,7 +1059,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
  • 45.5.186.2 is CIAT and 66.249.64.155 is Google… hmmm.
  • Linode sent another alert this morning, here are the top ten IPs active during that time:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:0(4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         360 3.89.134.93
         362 34.230.15.139
         366 100.24.48.177
    @@ -1073,7 +1073,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
    • Just double checking what CIAT is doing, they are mainly hitting the REST API:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "24/Jan/2019:" | grep 45.5.186.2 | grep -Eo "GET /(handle|bitstream|rest|oai)/" | sort | uniq -c | sort -n
     
    • CIAT’s community currently has 12,000 items in it so this is normal
    • The issue with goo.gl links that we saw yesterday appears to be resolved, as links are working again…
    • @@ -1102,7 +1102,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
      • Linode sent an email that the server was using a lot of CPU this morning, and these were the top IPs in the web server logs at the time:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "27/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           189 40.77.167.108
           191 157.55.39.2
           263 34.218.226.147
      @@ -1132,7 +1132,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
       
       
    • Linode alerted that CGSpace (linode18) was using too much CPU again this morning, here are the active IPs from the web server log at the time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:0(6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          67 207.46.13.50
         105 41.204.190.40
         117 34.218.226.147
    @@ -1153,7 +1153,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
     
  • Last night Linode sent an alert that CGSpace (linode18) was using high CPU, here are the most active IPs in the hours just before, during, and after the alert:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Jan/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         310 45.5.184.2
         425 5.143.231.39
         526 54.70.40.11
    @@ -1168,12 +1168,12 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
  • Of course there is CIAT’s 45.5.186.2, but also 45.5.184.2 appears to be CIAT… I wonder why they have two harvesters?
  • 199.47.87.140 and 199.47.87.141 is TurnItIn with the following user agent:
  • -
    TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
    +
    TurnitinBot (https://turnitin.com/robot/crawlerinfo.html)
     

    2019-01-29

    • Linode sent an alert about CGSpace (linode18) CPU usage this morning, here are the top IPs in the web server logs just before, during, and after the alert:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Jan/2019:0(3|4|5|6|7)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "29/Jan/2019:0(3|4|5|6|7)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         334 45.5.184.72
         429 66.249.66.223
         522 35.237.175.180
    @@ -1198,7 +1198,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
    • Got another alert from Linode about CGSpace (linode18) this morning, here are the top IPs before, during, and after the alert:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         273 46.101.86.248
         301 35.237.175.180
         334 45.5.184.72
    @@ -1216,7 +1216,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
    • Linode sent alerts about CGSpace (linode18) last night and this morning, here are the top IPs before, during, and after those times:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:(16|17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "30/Jan/2019:(16|17|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         436 18.196.196.108
         460 157.55.39.168
         460 207.46.13.96
    @@ -1242,7 +1242,7 @@ identify: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/
     
  • 45.5.186.2 and 45.5.184.2 are CIAT as always
  • 85.25.237.71 is some new server in Germany that I’ve never seen before with the user agent:
  • -
    Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
    +
    Linguee Bot (http://www.linguee.com/bot; bot@linguee.com)
     
    diff --git a/docs/2019-02/index.html b/docs/2019-02/index.html
  • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
  • The top IPs before, during, and after this latest alert tonight were:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         245 207.46.13.5
         332 54.70.40.11
         385 5.143.231.38
    @@ -179,7 +179,7 @@ sys     0m1.979s
     
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
  • There were just over 3 million accesses in the nginx logs last month:
  • -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
    @@ -198,7 +198,7 @@ sys     0m1.979s
     
    • Another alert from Linode about CGSpace (linode18) this morning, here are the top IPs in the web server logs before, during, and after that time:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Feb/2019:0(1|2|3|4|5)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         284 18.195.78.144
         329 207.46.13.32
         417 35.237.175.180
    @@ -219,7 +219,7 @@ sys     0m1.979s
     
  • This is seriously getting annoying, Linode sent another alert this morning that CGSpace (linode18) load was 377%!
  • Here are the top IPs before, during, and after that time:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         325 85.25.237.71
         340 45.5.184.72
         431 5.143.231.8
    @@ -234,11 +234,11 @@ sys     0m1.979s
     
  • 45.5.184.2 is CIAT, 70.32.83.92 and 205.186.128.185 are Macaroni Bros harvesters for CCAFS I think
  • 195.201.104.240 is a new IP address in Germany with the following user agent:
  • -
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
    +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
     
    • This user was making 20–60 requests per minute this morning… seems like I should try to block this type of behavior heuristically, regardless of user agent!
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Feb/2019" | grep 195.201.104.240 | grep -o -E '03/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 20
          19 03/Feb/2019:07:42
          20 03/Feb/2019:07:12
          21 03/Feb/2019:07:27
    @@ -262,7 +262,7 @@ sys     0m1.979s
     
    • At least they re-used their Tomcat session!
    -
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=195.201.104.240' dspace.log.2019-02-03 | sort | uniq | wc -l
     1
     
    • This user was making requests to /browse, which is not currently under the existing rate limiting of dynamic pages in our nginx config
    @@ -280,14 +280,14 @@ sys     0m1.979s
      • Generate a list of CTA subjects from CGSpace for Peter:
      -
      dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
      +
      dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=124 GROUP BY text_value ORDER BY COUNT DESC) to /tmp/cta-subjects.csv with csv header;
       COPY 321
       
      • Skype with Michael Victor about CKM and CGSpace
      • Discuss the new IITA research theme field with Abenet and decide that we should use cg.identifier.iitatheme
      • This morning there was another alert from Linode about the high load on CGSpace (linode18), here are the top IPs in the web server logs before, during, and after that time:
      -
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      +
      # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "04/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           589 2a01:4f8:140:3192::2
           762 66.249.66.219
           889 35.237.175.180
      @@ -307,7 +307,7 @@ COPY 321
       
    • Peter sent me corrections and deletions for the CTA subjects and, as usual, there were encoding errors with some accented characters in his file
    • In other news, it seems that the GREL syntax regarding booleans changed in OpenRefine recently, so I need to update some expressions like the one I use to detect encoding errors to use toString():
    -
    or(
    +
    or(
       isNotNull(value.match(/.*\uFFFD.*/)),
       isNotNull(value.match(/.*\u00A0.*/)),
       isNotNull(value.match(/.*\u200A.*/)),
    @@ -318,17 +318,17 @@ COPY 321
     
    -
    $ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
    +
    $ ./fix-metadata-values.py -i 2019-02-04-Correct-65-CTA-Subjects.csv -f cg.subject.cta -t CORRECT -m 124 -db dspace -u dspace -p 'fuu' -d
     $ ./delete-metadata-values.py -i 2019-02-04-Delete-16-CTA-Subjects.csv -f cg.subject.cta -m 124 -db dspace -u dspace -p 'fuu' -d
     
    • I applied them on DSpace Test and CGSpace and started a full Discovery re-index:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
    • Peter had marked several terms with || to indicate multiple values in his corrections so I will have to go back and do those manually:
    -
    EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
    +
    EMPODERAMENTO DE JOVENS,EMPODERAMENTO||JOVENS
     ENVIRONMENTAL PROTECTION AND NATURAL RESOURCES MANAGEMENT,NATURAL RESOURCES MANAGEMENT||ENVIRONMENT
     FISHERIES AND AQUACULTURE,FISHERIES||AQUACULTURE
     MARKETING AND TRADE,MARKETING||TRADE
    @@ -340,21 +340,21 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
    • I dumped the CTA community so I can try to fix the subjects with multiple subjects that Peter indicated in his corrections:
    -
    $ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
    +
    $ dspace metadata-export -i 10568/42211 -f /tmp/cta.csv
     
    • Then I used csvcut to get only the CTA subject columns:
    -
    $ csvcut -c "id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]" /tmp/cta.csv > /tmp/cta-subjects.csv
    +
    $ csvcut -c "id,collection,cg.subject.cta,cg.subject.cta[],cg.subject.cta[en_US]" /tmp/cta.csv > /tmp/cta-subjects.csv
     
    • After that I imported the CSV into OpenRefine where I could properly identify and edit the subjects as multiple values
    • Then I imported it back into CGSpace:
    -
    $ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
    +
    $ dspace metadata-import -f /tmp/2019-02-06-CTA-multiple-subjects.csv
     
    • Another day, another alert about high load on CGSpace (linode18) from Linode
    • This time the load average was 370% and the top ten IPs before, during, and after that time were:
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         689 35.237.175.180
        1236 5.9.6.51
        1305 34.218.226.147
    @@ -368,7 +368,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
    • Looking closer at the top users, I see 45.5.186.2 is in Brazil and was making over 100 requests per minute to the REST API:
    -
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep 45.5.186.2 | grep -o -E '06/Feb/2019:0[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
         118 06/Feb/2019:05:46
         119 06/Feb/2019:05:37
         119 06/Feb/2019:05:47
    @@ -382,7 +382,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
    • I was thinking of rate limiting those because I assumed most of them would be errors, but actually most are HTTP 200 OK!
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '06/Feb/2019' | grep 45.5.186.2 | awk '{print $9}' | sort | uniq -c
       10411 200
           1 301
           7 302
    @@ -392,7 +392,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
    • I should probably start looking at the top IPs for web (XMLUI) and for API (REST and OAI) separately:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         328 220.247.212.35
         372 66.249.66.221
         380 207.46.13.2
    @@ -419,7 +419,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
  • Linode sent an alert last night that the load on CGSpace (linode18) was over 300%
  • Here are the top IPs in the web server and API logs before, during, and after that time, respectively:
  • -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "06/Feb/2019:(17|18|19|20|23)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           5 66.249.66.209
           6 2a01:4f8:210:51ef::2
           6 40.77.167.75
    @@ -444,7 +444,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
    • Then again this morning another alert:
    -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "07/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           5 66.249.66.223
           8 104.198.9.108
          13 110.54.160.222
    @@ -471,7 +471,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
     
  • I could probably rate limit the REST API, or maybe just keep increasing the alert threshold so I don’t get alert spam (this is probably the correct approach because it seems like the REST API can keep up with the requests and is returning HTTP 200 status as far as I can tell)
  • Bosede from IITA sent a message that a colleague is having problems submitting to some collections in their community:
  • -
    Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
    +
    Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1056 by user 1759
     
    @@ -482,7 +482,7 @@ PESCAS E AQUACULTURE,PISCICULTURA||AQUACULTURE
  • Bizuwork asked about the “DSpace Submission Approved and Archived” emails that stopped working last month
  • I tried the test-email command on DSpace and it indeed is not working:
  • -
    $ dspace test-email
    +
    $ dspace test-email
     
     About to send test email:
      - To: aorth@mjanja.ch
    @@ -503,7 +503,7 @@ Please see the DSpace documentation for assistance.
     
    • I re-configured CGSpace to use the email/password for cgspace-support, but I get this error when I try the test-email script:
    -
    Error sending email:
    +
    Error sending email:
      - Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR10CA0028.EURPRD10.PROD.OUTLOOK.COM]
     
    • I tried to log into Outlook 365 with the credentials but I think the ones I have must be wrong, so I will ask ICT to reset the password
    • @@ -513,7 +513,7 @@ Please see the DSpace documentation for assistance.
    • Linode sent alerts about CPU load yesterday morning, yesterday night, and this morning! All over 300% CPU load!
    • This is just for this morning:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "09/Feb/2019:(07|08|09|10|11)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         289 35.237.175.180
         290 66.249.66.221
         296 18.195.78.144
    @@ -539,7 +539,7 @@ Please see the DSpace documentation for assistance.
     
  • I know 66.249.66.219 is Google, 5.9.6.51 is MegaIndex, and 5.143.231.38 is SputnikBot
  • Ooh, but 151.80.203.180 is some malicious bot making requests for /etc/passwd like this:
  • -
    /bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;isAllowed=../etc/passwd
    +
    /bitstream/handle/10568/68981/Identifying%20benefit%20flows%20studies%20on%20the%20potential%20monetary%20and%20non%20monetary%20benefits%20arising%20from%20the%20International%20Treaty%20on%20Plant%20Genetic_1671.pdf?sequence=1&amp;isAllowed=../etc/passwd
     
    • 151.80.203.180 is on OVH so I sent a message to their abuse email…
    @@ -547,7 +547,7 @@ Please see the DSpace documentation for assistance.
    • Linode sent another alert about CGSpace (linode18) CPU load this morning, here are the top IPs in the web server XMLUI and API logs before, during, and after that time:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         232 18.195.78.144
         238 35.237.175.180
         281 66.249.66.221
    @@ -572,14 +572,14 @@ Please see the DSpace documentation for assistance.
     
    • Another interesting thing might be the total number of requests for web and API services during that time:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
     16333
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -cE "10/Feb/2019:0(5|6|7|8|9)"
     15964
     
    • Also, the number of unique IPs served during that time:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
     1622
     # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "10/Feb/2019:0(5|6|7|8|9)" | awk '{print $1}' | sort | uniq | wc -l
     95
    @@ -610,7 +610,7 @@ Please see the DSpace documentation for assistance.
     
     
     
    -
    Error sending email:
    +
    Error sending email:
      - Error: cannot test email because mail.server.disabled is set to true
     
    • I’m not sure why I didn’t know about this configuration option before, and always maintained multiple configurations for development and production @@ -620,7 +620,7 @@ Please see the DSpace documentation for assistance.
    • I updated my local Sonatype nexus Docker image and had an issue with the volume for some reason so I decided to just start from scratch:
    -
    # docker rm nexus
    +
    # docker rm nexus
     # docker pull sonatype/nexus3
     # mkdir -p /home/aorth/.local/lib/containers/volumes/nexus_data
     # chown 200:200 /home/aorth/.local/lib/containers/volumes/nexus_data
    @@ -628,7 +628,7 @@ Please see the DSpace documentation for assistance.
     
    -
    # docker pull docker.bintray.io/jfrog/artifactory-oss:latest
    +
    # docker pull docker.bintray.io/jfrog/artifactory-oss:latest
     # mkdir -p /home/aorth/.local/lib/containers/volumes/artifactory5_data
     # chown 1030 /home/aorth/.local/lib/containers/volumes/artifactory5_data
     # docker run --name artifactory --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
    @@ -643,13 +643,13 @@ Please see the DSpace documentation for assistance.
     
  • On a similar note, I wonder if we could use the performance-focused libvips and the third-party jlibvips Java library in DSpace
  • Testing the vipsthumbnail command line tool with this CGSpace item that uses CMYK:
  • -
    $ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
    +
    $ vipsthumbnail alc_contrastes_desafios.pdf -s 300 -o '%s.jpg[Q=92,optimize_coding,strip]'
     
    • (DSpace 5 appears to use JPEG 92 quality so I do the same)
    • Thinking about making “top items” endpoints in my dspace-statistics-api
    • I could use the following SQL queries very easily to get the top items by views or downloads:
    dspacestatistics=# SELECT * FROM items WHERE views > 0 ORDER BY views DESC LIMIT 10;
     dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads DESC LIMIT 10;
     
    • I’d have to think about what to make the REST API endpoints, perhaps: /statistics/top/items?limit=10
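    • For reference, a minimal sketch of what such a “top items” lookup could look like, assuming a psycopg2 connection to the dspacestatistics database and the items table shown above (the connection string and the endpoint shape are only illustrative, not the real configuration):
    # Hypothetical sketch of a "top items" query for the dspace-statistics-api.
    # Assumes the dspacestatistics database and items table shown above; the
    # connection string and endpoint naming are illustrative, not the real config.
    import psycopg2

    def top_items(metric="views", limit=10):
        # Whitelist the column name because identifiers cannot be bind parameters
        if metric not in ("views", "downloads"):
            raise ValueError("metric must be 'views' or 'downloads'")

        connection = psycopg2.connect("dbname=dspacestatistics")
        with connection, connection.cursor() as cursor:
            cursor.execute(
                f"SELECT * FROM items WHERE {metric} > 0 ORDER BY {metric} DESC LIMIT %s",
                (limit,),
            )
            return cursor.fetchall()

    # Would back something like GET /statistics/top/items?limit=10
    print(top_items("downloads", limit=10))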
    • @@ -660,7 +660,7 @@ dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads
    -
    $ identify -verbose alc_contrastes_desafios.pdf.jpg
    +
    $ identify -verbose alc_contrastes_desafios.pdf.jpg
     ...
       Colorspace: sRGB
     
      @@ -671,35 +671,35 @@ dspacestatistics=# SELECT * FROM items WHERE downloads > 0 ORDER BY downloads
    • ILRI ICT reset the password for the CGSpace mail account, but I still can’t get it to send mail from DSpace’s test-email utility
    • I even added extra mail properties to dspace.cfg as suggested by someone on the dspace-tech mailing list:
    -
    mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
    +
    mail.extraproperties = mail.smtp.starttls.required = true, mail.smtp.auth=true
     
    • But the result is still:
    -
    Error sending email:
    +
    Error sending email:
      - Error: com.sun.mail.smtp.SMTPSendFailedException: 530 5.7.57 SMTP; Client was not authenticated to send anonymous mail during MAIL FROM [AM6PR06CA0001.eurprd06.prod.outlook.com]
     
    • I tried to log into the Outlook 365 web mail and it doesn’t work so I’ve emailed ILRI ICT again
    • After reading the common mistakes in the JavaMail FAQ I reconfigured the extra properties in DSpace’s mail configuration to be simply:
    -
    mail.extraproperties = mail.smtp.starttls.enable=true
    +
    mail.extraproperties = mail.smtp.starttls.enable=true
     
    • … and then I was able to send a mail using my personal account where I know the credentials work
    • The CGSpace account still gets this error message:
    -
    Error sending email:
    +
    Error sending email:
      - Error: javax.mail.AuthenticationFailedException
     
    -
    $ dspace user --delete --email blah@cta.int
    +
    $ dspace user --delete --email blah@cta.int
     $ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int --password 'blah'
     
    • On this note, I saw a thread on the dspace-tech mailing list that says this functionality exists if you enable webui.user.assumelogin = true
    • I will enable this on CGSpace (#411)
    • Test re-creating my local PostgreSQL and Artifactory containers with podman instead of Docker (using the volumes from my old Docker containers though):
    -
    # podman pull postgres:9.6-alpine
    +
    # podman pull postgres:9.6-alpine
     # podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     # podman pull docker.bintray.io/jfrog/artifactory-oss
     # podman run --name artifactory -d -v /home/aorth/.local/lib/containers/volumes/artifactory5_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
    @@ -707,7 +707,7 @@ $ dspace user --add --givenname Thierry --surname Lewyllie --email blah@cta.int
     
  • Totally works… awesome!
  • Then I tried with rootless containers by creating the subuid and subgid mappings for aorth:
  • -
    $ sudo touch /etc/subuid /etc/subgid
    +
    $ sudo touch /etc/subuid /etc/subgid
     $ usermod --add-subuids 10000-75535 aorth
     $ usermod --add-subgids 10000-75535 aorth
     $ sudo sysctl kernel.unprivileged_userns_clone=1
    @@ -717,7 +717,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
     
  • Which totally works, but Podman’s rootless support doesn’t work with port mappings yet…
  • Deploy the Tomcat-7-from-tarball branch on CGSpace (linode18), but first stop the Ubuntu Tomcat 7 and do some basic prep before running the Ansible playbook:
  • -
    # systemctl stop tomcat7
    +
    # systemctl stop tomcat7
     # apt remove tomcat7 tomcat7-admin
     # useradd -m -r -s /bin/bash dspace
     # mv /usr/share/tomcat7/.m2 /home/dspace
    @@ -728,14 +728,14 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
     
    • After running the playbook CGSpace came back up, but I had an issue with some Solr cores not being loaded (similar to last month) and this was in the Solr log:
    -
    2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    +
    2019-02-14 18:17:31,304 ERROR org.apache.solr.core.SolrCore @ org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2018': Unable to create core [statistics-2018] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
     
    • The issue last month was address space, which is now set as LimitAS=infinity in tomcat7.service
    • I re-ran the Ansible playbook to make sure all configs etc. were the same, then rebooted the server
    • Still the error persists after reboot
    • I will try to stop Tomcat and then remove the locks manually:
    -
    # find /home/cgspace.cgiar.org/solr/ -iname "write.lock" -delete
    +
    # find /home/cgspace.cgiar.org/solr/ -iname "write.lock" -delete
     
    • After restarting Tomcat the usage statistics are back
    • Interestingly, many of the locks were from last month, last year, and even 2015! I’m pretty sure that’s not supposed to be how locks work…
    • @@ -747,19 +747,19 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
      • Tomcat was killed around 3AM by the kernel’s OOM killer according to dmesg:
      -
      [Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
      +
      [Fri Feb 15 03:10:42 2019] Out of memory: Kill process 12027 (java) score 670 or sacrifice child
       [Fri Feb 15 03:10:42 2019] Killed process 12027 (java) total-vm:14108048kB, anon-rss:5450284kB, file-rss:0kB, shmem-rss:0kB
       [Fri Feb 15 03:10:43 2019] oom_reaper: reaped process 12027 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
       
      • The tomcat7 service shows:
      -
      Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
      +
      Feb 15 03:10:44 linode19 systemd[1]: tomcat7.service: Main process exited, code=killed, status=9/KILL
       
      • I suspect it was related to the media-filter cron job that runs at 3AM but I don’t see anything particular in the log files
      • I want to try to normalize the text_lang values to make working with metadata easier
      • We currently have a bunch of weird values that DSpace uses like NULL, en_US, and en and others that have been entered manually by editors:
      -
      dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
      +
      dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
        text_lang |  count
       -----------+---------
                  | 1069539
      @@ -778,7 +778,7 @@ $ podman run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspace
       
    • Theoretically this field could help if you wanted to search for Spanish-language fields in the API or something, but even for the English fields there are two different values (and those are from DSpace itself)!
    • I’m going to normalize these to NULL, at least on DSpace Test for now:
    -
    dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
    +
    dspace=# UPDATE metadatavalue SET text_lang = NULL WHERE resource_type_id=2 AND text_lang IS NOT NULL;
     UPDATE 1045410
     
    • I started proofing IITA’s 2019-01 records that Sisay uploaded this week @@ -790,7 +790,7 @@ UPDATE 1045410
    • ILRI ICT fixed the password for the CGSpace support email account and I tested it on Outlook 365 web and DSpace and it works
    • Re-create my local PostgreSQL container to use the new PostgreSQL version and podman’s volumes:
    -
    $ podman pull postgres:9.6-alpine
    +
    $ podman pull postgres:9.6-alpine
     $ podman volume create dspacedb_data
     $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ createuser -h localhost -U postgres --pwprompt dspacetest
    @@ -803,7 +803,7 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
     
  • And it’s all running without root!
  • Then re-create my Artifactory container as well, taking into account ulimit open file requirements by Artifactory as well as the user limitations caused by rootless subuid mappings:
  • -
    $ podman volume create artifactory_data
    +
    $ podman volume create artifactory_data
     artifactory_data
     $ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081:8081 docker.bintray.io/jfrog/artifactory-oss
     $ buildah unshare
    @@ -817,13 +817,13 @@ $ podman start artifactory
     
    • I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
    -
    $ dspace cleanup -v
    +
    $ dspace cleanup -v
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(162844) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
    +
    $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (162844);'
     UPDATE 1
     
    • I merged the Atmire Metadata Quality Module (MQM) changes to the 5_x-prod branch and deployed it on CGSpace (#407)
    • @@ -834,7 +834,7 @@ UPDATE 1
    • Jesus fucking Christ, Linode sent an alert that CGSpace (linode18) was using 421% CPU for a few hours this afternoon (server time):
    • There seems to have been a lot of activity in XMLUI:
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
        1236 18.212.208.240
        1276 54.164.83.99
        1277 3.83.14.11
    @@ -864,7 +864,7 @@ UPDATE 1
     
  • 94.71.244.172 is in Greece and uses the user agent “Indy Library”
  • At least they are re-using their Tomcat session:
  • -
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=94.71.244.172' dspace.log.2019-02-18 | sort | uniq | wc -l
     
    • The following IPs were all hitting the server hard simultaneously and are located on Amazon and use the user agent “Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0”:

      @@ -886,7 +886,7 @@ UPDATE 1

      For reference most of these IPs hitting the XMLUI this afternoon are on Amazon:

    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "18/Feb/2019:1(2|3|4|5|6)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 30
        1173 52.91.249.23
        1176 107.22.118.106
        1178 3.88.173.152
    @@ -920,7 +920,7 @@ UPDATE 1
     
    • In the case of 52.54.252.47 they are only making about 10 requests per minute during this time (albeit from dozens of concurrent IPs):
    -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep 52.54.252.47 | grep -o -E '18/Feb/2019:1[0-9]:[0-9][0-9]' | uniq -c | sort -n | tail -n 10
          10 18/Feb/2019:17:20
          10 18/Feb/2019:17:22
          10 18/Feb/2019:17:31
    @@ -935,7 +935,7 @@ UPDATE 1
     
  • As this user agent is not recognized as a bot by DSpace this will definitely fuck up the usage statistics
  • There were 92,000 requests from these IPs alone today!
  • -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -c 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
     92756
     
    • I will add this user agent to the “badbots” rate limiting in our nginx configuration
    • @@ -943,7 +943,7 @@ UPDATE 1
    • IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary
  • I will merge them with our existing list and then resolve their names using my resolve-orcids.py script (a rough sketch of the lookup follows the commands below):
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml 2019-02-18-IWMI-ORCID-IDs.txt  | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-02-18-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2019-02-18-combined-orcids.txt -o /tmp/2019-02-18-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
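    • A minimal sketch of the name lookup that resolve-orcids.py performs, assuming the ORCID public API’s v2.1 person endpoint (the JSON field names are taken from that API and may need checking):
    # Rough sketch of resolving an ORCID identifier to a display name.
    # Assumes the ORCID public API v2.1 "person" endpoint and its JSON layout.
    import requests

    def resolve_orcid(orcid_id):
        response = requests.get(
            f"https://pub.orcid.org/v2.1/{orcid_id}/person",
            headers={"Accept": "application/json"},
        )
        response.raise_for_status()
        name = response.json().get("name") or {}
        given = (name.get("given-names") or {}).get("value", "")
        family = (name.get("family-name") or {}).get("value", "")
        return f"{given} {family}: {orcid_id}"

    with open("/tmp/2019-02-18-combined-orcids.txt") as orcids:
        for line in orcids:
            print(resolve_orcid(line.strip()))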
    @@ -956,7 +956,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
  • Unfortunately, I don’t see any strange activity in the web server API or XMLUI logs at that time in particular
  • So far today the top ten IPs in the XMLUI logs are:
  • -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
       11541 18.212.208.240
       11560 3.81.136.184
       11562 3.88.237.84
    @@ -978,7 +978,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
     
  • The top requests in the API logs today are:
  • -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "19/Feb/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          42 66.249.66.221
          44 156.156.81.215
          55 3.85.54.129
    @@ -999,17 +999,17 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
  • I need to follow up with the DSpace developers and Atmire to see how they classify which requests are bots so we can try to estimate the impact caused by these users and perhaps try to update the list to make the stats more accurate
  • I found one IP address in Nigeria that has an Android user agent and has requested a bitstream from 10568/96140 almost 200 times:
  • -
    # grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
    +
    # grep 41.190.30.105 /var/log/nginx/access.log | grep -c 'acgg_progress_report.pdf'
     185
     
    • Wow, and another IP in Nigeria made a bunch more yesterday from the same user agent:
    -
    # grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
    +
    # grep 41.190.3.229 /var/log/nginx/access.log.1 | grep -c 'acgg_progress_report.pdf'
     346
     
    • In the last two days alone there were 1,000 requests for this PDF, mostly from Nigeria!
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep acgg_progress_report.pdf | grep -v 'upstream response is buffered' | awk '{print $1}' | sort | uniq -c | sort -n
           1 139.162.146.60
           1 157.55.39.159
           1 196.188.127.94
    @@ -1032,7 +1032,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
    • That is so weird, they are all using this Android user agent:
    -
    Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
    +
    Mozilla/5.0 (Linux; Android 7.0; TECNO Camon CX Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0.0 Mobile Safari/537.36
     
    • I wrote a quick and dirty Python script called resolve-addresses.py to resolve IP addresses to their owning organization’s name, ASN, and country using the IPAPI.co API
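    • A minimal sketch of that idea, assuming ipapi.co’s JSON endpoint and its org/asn/country_name response fields (this is not the actual resolve-addresses.py):
    # Sketch: resolve IP addresses to organization, ASN, and country via ipapi.co.
    # The field names are assumptions based on that API's JSON response.
    import csv
    import sys

    import requests

    writer = csv.writer(sys.stdout)
    writer.writerow(["ip", "org", "asn", "country"])

    for ip in sys.argv[1:]:
        data = requests.get(f"https://ipapi.co/{ip}/json/").json()
        writer.writerow([ip, data.get("org"), data.get("asn"), data.get("country_name")])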
    @@ -1042,7 +1042,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
  • I told him that they should probably try to use the REST API’s find-by-metadata-field endpoint
  • The annoying thing is that you have to match the text language attribute of the field exactly, but it does work:
  • -
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": ""}'
    +
    $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": ""}'
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": null}'
     $ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://cgspace.cgiar.org/rest/items/find-by-metadata-field" -d '{"key": "cg.creator.id","value": "Alan S. Orth: 0000-0002-1735-7458", "language": "en_US"}'
     
      @@ -1063,23 +1063,23 @@ $ curl -s -H "accept: application/json" -H "Content-Type: applica
    • It allows specifying the language the term should be queried in as well as output files to save the matched and unmatched terms to
    • I ran our top 1500 subjects through English, Spanish, and French and saved the matched and unmatched terms to separate files (a sketch of the lookup itself follows these commands):
    $ ./agrovoc-lookup.py -l en -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-en.txt -or /tmp/rejected-subjects-en.txt
     $ ./agrovoc-lookup.py -l es -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-es.txt -or /tmp/rejected-subjects-es.txt
     $ ./agrovoc-lookup.py -l fr -i /tmp/top-1500-subjects.txt -om /tmp/matched-subjects-fr.txt -or /tmp/rejected-subjects-fr.txt
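    • For reference, a sketch of the lookup that agrovoc-lookup.py performs, assuming a Skosmos-style AGROVOC REST endpoint (the base URL is an assumption, since AGROVOC has moved hosts over the years; the results key matches the JSON shown further below):
    # Sketch: check whether a term matches anything in AGROVOC and write it to a
    # matched or rejected file accordingly. The API base URL is an assumption.
    import requests

    API = "https://agrovoc.fao.org/browse/rest/v1/search"

    def in_agrovoc(term, lang="en"):
        response = requests.get(API, params={"query": term, "lang": lang})
        response.raise_for_status()
        return len(response.json().get("results", [])) > 0

    with open("/tmp/top-1500-subjects.txt") as subjects, \
         open("/tmp/matched-subjects-en.txt", "w") as matched, \
         open("/tmp/rejected-subjects-en.txt", "w") as rejected:
        for subject in subjects:
            subject = subject.strip()
            (matched if in_agrovoc(subject) else rejected).write(subject + "\n")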
     
    • Then I generated a list of all the unique matched terms:
    -
    $ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
    +
    $ cat /tmp/matched-subjects-* | sort | uniq > /tmp/2019-02-21-matched-subjects.txt
     
    • And then a list of all the unique unmatched terms, using a utility I’d never heard of before called comm, or alternatively with diff:
    -
    $ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
    +
    $ sort /tmp/top-1500-subjects.txt > /tmp/subjects-sorted.txt
     $ comm -13 /tmp/2019-02-21-matched-subjects.txt /tmp/subjects-sorted.txt > /tmp/2019-02-21-unmatched-subjects.txt
     $ diff --new-line-format="" --unchanged-line-format="" /tmp/subjects-sorted.txt /tmp/2019-02-21-matched-subjects.txt > /tmp/2019-02-21-unmatched-subjects.txt
     
    • Generate a list of countries and regions from CGSpace for Sisay to look through:
    -
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-countries.csv WITH CSV HEADER;
     COPY 202
     dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 227 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/2019-02-21-regions.csv WITH CSV HEADER;
     COPY 33
    @@ -1124,7 +1124,7 @@ COPY 33
     

    I figured out how to query AGROVOC from OpenRefine using Jython by creating a custom text facet:

    -
    import json
    +
    import json
     import re
     import urllib
     import urllib2
    @@ -1148,7 +1148,7 @@ return "unmatched"
     
  • I’m not sure how to deal with terms like “CORN” that are alternative labels (altLabel) in AGROVOC where the preferred label (prefLabel) would be “MAIZE”
  • For example, a query for CORN* returns:
  • -
        "results": [
    +
        "results": [
             {
                 "altLabel": "corn (maize)",
                 "lang": "en",
    @@ -1176,7 +1176,7 @@ return "unmatched"
     
  • There seems to be something going on with Solr on CGSpace (linode18) because statistics on communities and collections are blank for January and February this year
  • I see some errors started recently in Solr (yesterday):
  • -
    $ grep -c ERROR /home/cgspace.cgiar.org/log/solr.log.2019-02-*
    +
    $ grep -c ERROR /home/cgspace.cgiar.org/log/solr.log.2019-02-*
     /home/cgspace.cgiar.org/log/solr.log.2019-02-11.xz:0
     /home/cgspace.cgiar.org/log/solr.log.2019-02-12.xz:0
     /home/cgspace.cgiar.org/log/solr.log.2019-02-13.xz:0
    @@ -1195,7 +1195,7 @@ return "unmatched"
     
  • But I don’t see anything interesting in yesterday’s Solr log…
  • I see this in the Tomcat 7 logs yesterday:
  • -
    Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
    +
    Feb 25 21:09:29 linode18 tomcat7[1015]: Error while updating
     Feb 25 21:09:29 linode18 tomcat7[1015]: java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
     Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.SolrLogger$9.visit(SourceFile:1241)
     Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.SolrLogger.visitEachStatisticShard(SourceFile:268)
    @@ -1207,7 +1207,7 @@ Feb 25 21:09:29 linode18 tomcat7[1015]:         at org.dspace.statistics.Statist
     
  • In the Solr admin GUI I see we have the following error: “statistics-2011: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher”
  • I restarted Tomcat and upon startup I see lots of errors in the systemd journal, like:
  • -
    Feb 25 21:37:49 linode18 tomcat7[28363]: SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: invalid type code: 00
    +
    Feb 25 21:37:49 linode18 tomcat7[28363]: SEVERE: IOException while loading persisted sessions: java.io.StreamCorruptedException: invalid type code: 00
     Feb 25 21:37:49 linode18 tomcat7[28363]: java.io.StreamCorruptedException: invalid type code: 00
     Feb 25 21:37:49 linode18 tomcat7[28363]:         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1601)
     Feb 25 21:37:49 linode18 tomcat7[28363]:         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    @@ -1220,7 +1220,7 @@ Feb 25 21:37:49 linode18 tomcat7[28363]:         at sun.reflect.NativeMethodAcce
     
  • Also, now the Solr admin UI says “statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher”
  • In the Solr log I see:
  • -
    2019-02-25 21:38:14,246 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2015]: Error opening new searcher
    +
    2019-02-25 21:38:14,246 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2015]: Error opening new searcher
     org.apache.solr.common.SolrException: Error opening new searcher
             at org.apache.solr.core.SolrCore.<init>(SolrCore.java:873)
             at org.apache.solr.core.SolrCore.<init>(SolrCore.java:646)
    @@ -1243,7 +1243,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
     
    • I tried to shutdown Tomcat and remove the locks:
    -
    # systemctl stop tomcat7
    +
    # systemctl stop tomcat7
     # find /home/cgspace.cgiar.org/solr -iname "*.lock" -delete
     # systemctl start tomcat7
     
      @@ -1254,7 +1254,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
    • I did some tests with calling a shell script from systemd on DSpace Test (linode19) and the LimitAS setting does work, and the infinity setting in systemd does get translated to “unlimited” on the service
    • I thought it might be open file limit, but it seems we’re nowhere near the current limit of 16384:
    -
    # lsof -u dspace | wc -l
    +
    # lsof -u dspace | wc -l
     3016
     
    • For what it’s worth I see the same errors about solr_update_time_stamp on DSpace Test (linode19)
    • @@ -1270,7 +1270,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
    • I sent a mail to the dspace-tech mailing list about the “solr_update_time_stamp” error
    • A CCAFS user sent a message saying they got this error when submitting to CGSpace:
    -
    Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1021 by user 3049
    +
    Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1021 by user 3049
     
    • According to the REST API collection 1021 appears to be CCAFS Tools, Maps, Datasets and Models
    • I looked at the WORKFLOW_STEP_1 (Accept/Reject) and the group is of course empty
    • @@ -1287,7 +1287,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
    • He asked me to upload the files for him via the command line, but the file he referenced (Thumbnails_feb_2019.zip) doesn’t exist
    • I noticed that the command line batch import functionality is a bit weird when using zip files, because you have to specify the directory where the zip file is located as well as the zip file’s name:
    -
    $ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
    +
    $ ~/dspace/bin/dspace import -a -e aorth@stfu.com -m mapfile -s /home/aorth/Downloads/2019-02-27-test/ -z SimpleArchiveFormat.zip
     
    • Why don’t they just derive the directory from the path to the zip file?
    • Working on Udana’s Restoring Degraded Landscapes (RDL) WLE records that we originally started in 2018-11 and fixing many of the same problems that I originally did then @@ -1303,12 +1303,12 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
      • I helped Sisay upload the nineteen CTA records from last week via the command line because they required mappings (which is not possible to do via the batch upload web interface)
      -
      $ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
      +
      $ dspace import -a -e swebshet@stfu.org -s /home/swebshet/Thumbnails_feb_2019 -m 2019-02-28-CTA-Thumbnails.map
       
      • Mails from CGSpace stopped working, looks like ICT changed the password again or we got locked out sigh
      • Now I’m getting this message when trying to use DSpace’s test-email script:
      -
      $ dspace test-email
      +
      $ dspace test-email
       
       About to send test email:
        - To: stfu@google.com
      diff --git a/docs/2019-03/index.html b/docs/2019-03/index.html
      index 7ddb5f11a..774c4bf7a 100644
      --- a/docs/2019-03/index.html
      +++ b/docs/2019-03/index.html
      @@ -46,7 +46,7 @@ Most worryingly, there are encoding errors in the abstracts for eleven items, fo
       
       I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
       "/>
      -
      +
       
       
           
      @@ -151,7 +151,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
       
      • Trying to finally upload IITA’s 259 Feb 14 items to CGSpace so I exported them from DSpace Test:
      -
      $ mkdir 2019-03-03-IITA-Feb14
      +
      $ mkdir 2019-03-03-IITA-Feb14
       $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
       
• As I was inspecting the archive I noticed that there were some problems with the bitstreams: @@ -163,7 +163,7 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
      • After adding the missing bitstreams and descriptions manually I tested them again locally, then imported them to a temporary collection on CGSpace:
      -
      $ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
      +
      $ dspace import -a -c 10568/99832 -e aorth@stfu.com -m 2019-03-03-IITA-Feb14.map -s /tmp/2019-03-03-IITA-Feb14
       
      • DSpace’s export function doesn’t include the collections for some reason, so you need to import them somewhere first, then export the collection metadata and re-map the items to proper owning collections based on their types using OpenRefine or something
      • After re-importing to CGSpace to apply the mappings, I deleted the collection on DSpace Test and ran the dspace cleanup script
      • @@ -180,7 +180,7 @@ $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-03-03-IITA-Feb14
      • I suspect it’s related to the email issue that ICT hasn’t responded about since last week
      • As I thought, I still cannot send emails from CGSpace:
      -
      $ dspace test-email
      +
      $ dspace test-email
       
       About to send test email:
        - To: blah@stfu.com
      @@ -197,7 +197,7 @@ Error sending email:
       
    • ICT reset the email password and I confirmed that it is working now
    • Generate a controlled vocabulary of 1187 AGROVOC subjects from the top 1500 that I checked last month, dumping the terms themselves using csvcut and then applying XML controlled vocabulary format in vim and then checking with tidy for good measure:
    -
    $ csvcut -c name 2019-02-22-subjects.csv > dspace/config/controlled-vocabularies/dc-contributor-author.xml
    +
    $ csvcut -c name 2019-02-22-subjects.csv > dspace/config/controlled-vocabularies/dc-contributor-author.xml
     $ # apply formatting in XML file
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.xml
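Since the file gets hand edited in vim, a quick well-formedness check with xmllint doesn’t hurt either (a sketch, using the same path as the tidy command above):

```console
$ # prints nothing if the file is well-formed XML, otherwise it lists the parse errors
$ xmllint --noout dspace/config/controlled-vocabularies/dc-subject.xml
```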
     
      @@ -217,7 +217,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
    -
    # journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
    +
    # journalctl -u tomcat7 | grep -c 'Multiple update components target the same field:solr_update_time_stamp'
     1076
     
    • I restarted Tomcat and it’s OK now…
    • @@ -238,11 +238,11 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
    • The FireOak report highlights the fact that several CGSpace collections have mixed-content errors due to the use of HTTP links in the Feedburner forms
    • I see 46 occurrences of these with this query:
    -
    dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
    +
    dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
     
    • I can replace these globally using the following SQL:
    -
    dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https//feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
    +
    dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feedburner.','https//feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feedburner.%';
     UPDATE 43
     dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://feeds.feedburner.','https//feeds.feedburner.', 'g') WHERE resource_type_id in (3,4) AND text_value LIKE '%http://feeds.feedburner.%';
     UPDATE 44
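A follow-up count with the same LIKE patterns should come back as zero if the replacements caught everything (a sketch of the kind of check I mean):

```console
dspace=# SELECT count(*) FROM metadatavalue WHERE resource_type_id in (3,4) AND (text_value LIKE '%http://feedburner.%' OR text_value LIKE '%http://feeds.feedburner.%');
```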
    @@ -254,7 +254,7 @@ UPDATE 44
     
  • Working on tagging IITA’s items with their new research theme (cg.identifier.iitatheme) based on their existing IITA subjects (see notes from 2019-02)
  • I exported the entire IITA community from CGSpace and then used csvcut to extract only the needed fields:
  • -
    $ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv > /tmp/iita.csv
    +
    $ csvcut -c 'id,cg.subject.iita,cg.subject.iita[],cg.subject.iita[en],cg.subject.iita[en_US]' ~/Downloads/10568-68616.csv > /tmp/iita.csv
     
    • After importing to OpenRefine I realized that tagging items based on their subjects is tricky because of the row/record mode of OpenRefine when you split the multi-value cells as well as the fact that some items might need to be tagged twice (thus needing a ||)

      @@ -263,7 +263,7 @@ UPDATE 44

      I think it might actually be easier to filter by IITA subject, then by IITA theme (if needed), and then do transformations with some conditional values in GREL expressions like:

    -
    if(isBlank(value), 'PLANT PRODUCTION & HEALTH', value + '||PLANT PRODUCTION & HEALTH')
    +
    if(isBlank(value), 'PLANT PRODUCTION & HEALTH', value + '||PLANT PRODUCTION & HEALTH')
     
    • Then it’s more annoying because there are four IITA subject columns…
    • In total this would add research themes to 1,755 items
    • @@ -288,7 +288,7 @@ UPDATE 44
    • This is a bit ugly, but it works (using the DSpace 5 SQL helper function to resolve ID to handle):
    -
    for id in $(psql -U postgres -d dspacetest -h localhost -c "SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'" | grep -oE '[0-9]{3,}'); do
    +
    for id in $(psql -U postgres -d dspacetest -h localhost -c "SELECT resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228 AND text_value LIKE '%SWAZILAND%'" | grep -oE '[0-9]{3,}'); do
     
         echo "Getting handle for id: ${id}"
     
    @@ -300,7 +300,7 @@ done
     
    • Then I couldn’t figure out a clever way to join all the CSVs, so I just grepped them to find the IDs with dates from 2018 and 2019 and there are apparently only three:
    -
    $ grep -oE '201[89]' /tmp/*.csv | sort -u
    +
    $ grep -oE '201[89]' /tmp/*.csv | sort -u
     /tmp/94834.csv:2018
     /tmp/95615.csv:2018
     /tmp/96747.csv:2018
    @@ -314,7 +314,7 @@ done
     
  • CGSpace (linode18) has the blank page error again
  • I’m not sure if it’s related, but I see the following error in DSpace’s log:
  • -
    2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
    +
    2019-03-15 14:09:32,685 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is closed.
             at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
             at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
    @@ -326,7 +326,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
     
    • Interestingly, I see a pattern of these errors increasing, with single and double digit numbers over the past month, but spikes of over 1,000 today, yesterday, and on 2019-03-08, which was exactly the first time we saw this blank page error recently
    -
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
    +
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-0* | awk -F: '{print $1}' | sort | uniq -c | tail -n 25
           5 dspace.log.2019-02-27
          11 dspace.log.2019-02-28
          29 dspace.log.2019-03-01
    @@ -356,14 +356,14 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@55ba10b5 is c
     
  • (Update on 2019-03-23 to use correct grep query)
  • There are not too many connections currently in PostgreSQL:
  • -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           6 dspaceApi
          10 dspaceCli
          15 dspaceWeb
     
    • I didn’t see anything interesting in the PostgreSQL logs, though this stack trace from the Tomcat logs (in the systemd journal) from earlier today might be related?
    -
    SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
    +
    SEVERE: Servlet.service() for servlet [spring] in context with path [] threw exception [org.springframework.web.util.NestedServletException: Request processing failed; nested exception is java.util.EmptyStackException] with root cause
     java.util.EmptyStackException
             at java.util.Stack.peek(Stack.java:102)
             at java.util.Stack.pop(Stack.java:84)
    @@ -436,13 +436,13 @@ java.util.EmptyStackException
     
  • I copied the 2019 Solr statistics core from CGSpace to DSpace Test and it works (and is only 5.5GB currently), so now we have some useful stats on DSpace Test for the CUA module and the dspace-statistics-api
  • I ran DSpace’s cleanup task on CGSpace (linode18) and there were errors:
  • -
    $ dspace cleanup -v
    +
    $ dspace cleanup -v
     Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(164496) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    # su - postgres
    +
    # su - postgres
     $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (164496);'
     UPDATE 1
     

    2019-03-18

    @@ -455,7 +455,7 @@ UPDATE 1
  • Dump top 1500 subjects from CGSpace to try one more time to generate a list of invalid terms using my agrovoc-lookup.py script:
  • -
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-03-18-top-1500-subject.csv WITH CSV HEADER;
     COPY 1500
     dspace=# \q
     $ csvcut -c text_value /tmp/2019-03-18-top-1500-subject.csv > 2019-03-18-top-1500-subject.csv
    @@ -474,7 +474,7 @@ $ wc -l 2019-03-18-subjects-unmatched.txt
     
  • Create and merge a pull request to update the controlled vocabulary for AGROVOC terms (#416)
  • We are getting the blank page issue on CGSpace again today and I see a large number of the “SQL QueryTable Error” in the DSpace log again (last time was 2019-03-15):
  • -
    $ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
    +
    $ grep -c 'SQL QueryTable Error' dspace.log.2019-03-1[5678]
     dspace.log.2019-03-15:929
     dspace.log.2019-03-16:67
     dspace.log.2019-03-17:72
    @@ -482,7 +482,7 @@ dspace.log.2019-03-18:1038
     
    • Though WTF, this grep seems to be giving weird inaccurate results actually, and the real number of errors is much lower if I exclude the “binary file matches” result with -I:
    -
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
    +
    $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-18 | wc -l
     8
     $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F: '{print $1}' | sort | uniq -c
           9 dspace.log.2019-03-08
    @@ -495,7 +495,7 @@ $ grep -I 'SQL QueryTable Error' dspace.log.2019-03-{08,14,15,16,17,18} | awk -F
     
  • It seems to be something with grep doing binary matching on some log files for some reason, so I guess I need to always use -I to say binary files don’t match
  • Anyways, the full error in DSpace’s log is:
  • -
    2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
    +
    2019-03-18 12:26:23,331 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
     java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is closed.
             at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.checkOpen(DelegatingConnection.java:398)
             at org.apache.tomcat.dbcp.dbcp.DelegatingConnection.prepareStatement(DelegatingConnection.java:279)
    @@ -504,7 +504,7 @@ java.sql.SQLException: Connection org.postgresql.jdbc.PgConnection@75eaa668 is c
     
    • There is a low number of connections to PostgreSQL currently:
    -
    $ psql -c 'select * from pg_stat_activity' | wc -l
    +
    $ psql -c 'select * from pg_stat_activity' | wc -l
     33
     $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           6 dspaceApi
    @@ -513,13 +513,13 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
     
    • I looked in the PostgreSQL logs, but all I see are a bunch of these errors going back two months to January:
    -
    2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR:  column "waiting" does not exist at character 217
    +
    2019-01-13 06:25:13.062 CET [9157] postgres@template1 ERROR:  column "waiting" does not exist at character 217
     
    • This is unrelated and apparently due to Munin checking a column that was changed in PostgreSQL 9.6
    • I suspect that this issue with the blank pages might not be PostgreSQL after all, perhaps it’s a Cocoon thing?
    • Looking in the cocoon logs I see a large number of warnings about “Can not load requested doc” around 11AM and 12PM:
    -
    $ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
    +
    $ grep 'Can not load requested doc' cocoon.log.2019-03-18 | grep -oE '2019-03-18 [0-9]{2}:' | sort | uniq -c
           2 2019-03-18 00:
           6 2019-03-18 02:
           3 2019-03-18 04:
    @@ -535,7 +535,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
     
• And a few days ago on 2019-03-15, when this last happened, it was in the afternoon, and the same pattern occurs around 1–2PM:
    -
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
    +
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-15.xz | grep -oE '2019-03-15 [0-9]{2}:' | sort | uniq -c
           4 2019-03-15 01:
           3 2019-03-15 02:
           1 2019-03-15 03:
    @@ -561,7 +561,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
     
    • And again on 2019-03-08, surprise surprise, it happened in the morning:
    -
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
    +
    $ xzgrep 'Can not load requested doc' cocoon.log.2019-03-08.xz | grep -oE '2019-03-08 [0-9]{2}:' | sort | uniq -c
          11 2019-03-08 01:
           3 2019-03-08 02:
           1 2019-03-08 03:
    @@ -581,7 +581,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
     
  • I found a handful of AGROVOC subjects that use a non-breaking space (0x00a0) instead of a regular space, which makes for some pretty confusing debugging…
  • I will replace these in the database immediately to save myself the headache later:
  • -
    dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ '.+\u00a0.+';
    +
    dspace=# SELECT count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id = 57 AND text_value ~ '.+\u00a0.+';
      count 
     -------
         84
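The actual replacement would look something like this (a sketch using the same field and pattern as the SELECT above; I would test it with a SELECT of the new values first):

```console
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '\u00a0', ' ', 'g') WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '.+\u00a0.+';
```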
    @@ -591,7 +591,7 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
     
  • CGSpace (linode18) is having problems with Solr again, I’m seeing “Error opening new searcher” in the Solr logs and there are no stats for previous years
  • Apparently the Solr statistics shards didn’t load properly when we restarted Tomcat yesterday:
  • -
    2019-03-18 12:32:39,799 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
    +
    2019-03-18 12:32:39,799 ERROR org.apache.solr.core.CoreContainer @ Error creating core [statistics-2018]: Error opening new searcher
     ...
     Caused by: org.apache.solr.common.SolrException: Error opening new searcher
             at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
    @@ -603,7 +603,7 @@ Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
     
  • For reference, I don’t see the ulimit -v unlimited in the catalina.sh script, though the tomcat7 systemd service has LimitAS=infinity
  • The limits of the current Tomcat java process are:
  • -
    # cat /proc/27182/limits 
    +
    # cat /proc/27182/limits 
     Limit                     Soft Limit           Hard Limit           Units     
     Max cpu time              unlimited            unlimited            seconds   
     Max file size             unlimited            unlimited            bytes     
    @@ -629,7 +629,7 @@ Max realtime timeout      unlimited            unlimited            us
     
     
  • For now I will just stop Tomcat, delete Solr locks, then start Tomcat again:
  • -
    # systemctl stop tomcat7
    +
    # systemctl stop tomcat7
     # find /home/cgspace.cgiar.org/solr/ -iname "*.lock" -delete
     # systemctl start tomcat7
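After Tomcat comes back up it is worth verifying that all the yearly statistics cores actually loaded; Solr’s core admin status endpoint is one way to check (a sketch, assuming Solr is still listening on port 8081 as elsewhere in these notes):

```console
$ curl -s 'http://localhost:8081/solr/admin/cores?action=STATUS&wt=json&indent=true' | grep '"name"'
```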
     
      @@ -660,7 +660,7 @@ Max realtime timeout unlimited unlimited us
      • It’s been two days since we had the blank page issue on CGSpace, and looking in the Cocoon logs I see very low numbers of the errors that we were seeing the last time the issue occurred:
      -
      $ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
      +
      $ grep 'Can not load requested doc' cocoon.log.2019-03-20 | grep -oE '2019-03-20 [0-9]{2}:' | sort | uniq -c
             3 2019-03-20 00:
            12 2019-03-20 02:
       $ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21 [0-9]{2}:' | sort | uniq -c
      @@ -704,7 +704,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-21 | grep -oE '2019-03-21
       
      • CGSpace (linode18) is having the blank page issue again and it seems to have started last night around 21:00:
      -
      $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
      +
      $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 [0-9]{2}:' | sort | uniq -c
             2 2019-03-22 00:
            69 2019-03-22 01:
             1 2019-03-22 02:
      @@ -742,7 +742,7 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23
       
    • I was curious to see if clearing the Cocoon cache in the XMLUI control panel would fix it, but it didn’t
• Trying to drill down more, I see that the bulk of the errors started around 21:20:
    -
    $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
    +
    $ grep 'Can not load requested doc' cocoon.log.2019-03-22 | grep -oE '2019-03-22 21:[0-9]' | sort | uniq -c
           1 2019-03-22 21:0
           1 2019-03-22 21:1
          59 2019-03-22 21:2
    @@ -752,11 +752,11 @@ $ grep 'Can not load requested doc' cocoon.log.2019-03-23 | grep -oE '2019-03-23
     
    • Looking at the Cocoon log around that time I see the full error is:
    -
    2019-03-22 21:21:34,378 WARN  org.apache.cocoon.components.xslt.TraxErrorListener  - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
    +
    2019-03-22 21:21:34,378 WARN  org.apache.cocoon.components.xslt.TraxErrorListener  - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
     
    • A few milliseconds before that time I see this in the DSpace log:
    -
    2019-03-22 21:21:34,356 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
    +
    2019-03-22 21:21:34,356 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
     org.postgresql.util.PSQLException: This statement has been closed.
             at org.postgresql.jdbc.PgStatement.checkClosed(PgStatement.java:694)
             at org.postgresql.jdbc.PgStatement.getMaxRows(PgStatement.java:501)
    @@ -824,7 +824,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
     
  • I did some more tests with the TomcatJdbcConnectionTest thing, and while monitoring the number of active connections in jconsole after adjusting the limits quite low, I eventually saw some connections get abandoned
  • I forgot that to connect to a remote JMX session with jconsole you need to use a dynamic SSH SOCKS proxy (as I originally discovered in 2017-11):
  • -
    $ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
    +
    $ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=3000 service:jmx:rmi:///jndi/rmi://localhost:5400/jmxrmi -J-DsocksNonProxyHosts=
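The SOCKS proxy itself is just a dynamic SSH port forward to the server, opened beforehand with something like this (a sketch; the user name is a placeholder):

```console
$ # -D 3000 opens a local SOCKS proxy, -N skips running a remote command, -f backgrounds the session
$ ssh -D 3000 -N -f user@cgspace.cgiar.org
```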
     
    • I need to remember to check the active connections next time we have issues with blank item pages on CGSpace
    • In other news, I’ve been running G1GC on DSpace Test (linode19) since 2018-11-08 without realizing it, which is probably a good thing
    • @@ -855,7 +855,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
    • Also, CGSpace doesn’t have many Cocoon errors yet this morning:
    -
    $ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
    +
    $ grep 'Can not load requested doc' cocoon.log.2019-03-25 | grep -oE '2019-03-25 [0-9]{2}:' | sort | uniq -c
           4 2019-03-25 00:
           1 2019-03-25 01:
     
      @@ -869,7 +869,7 @@ org.postgresql.util.PSQLException: This statement has been closed.
    • Uptime Robot reported that CGSpace went down and I see the load is very high
    • The top IPs around the time in the nginx API and web logs were:
    -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "25/Mar/2019:(18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
           9 190.252.43.162
          12 157.55.39.140
          18 157.55.39.54
    @@ -894,16 +894,16 @@ org.postgresql.util.PSQLException: This statement has been closed.
     
    • The IPs look pretty normal except we’ve never seen 93.179.69.74 before, and it uses the following user agent:
    -
    Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1
    +
    Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.20 Safari/535.1
     
    • Surprisingly they are re-using their Tomcat session:
    -
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=93.179.69.74' dspace.log.2019-03-25 | sort | uniq | wc -l
     1
     
    • That’s weird because the total number of sessions today seems low compared to recent days:
    -
    $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-25 | sort -u | wc -l
     5657
     $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-24 | sort -u | wc -l
     17710
    @@ -914,7 +914,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     
    • PostgreSQL seems to be pretty busy:
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
          11 dspaceApi
          10 dspaceCli
          67 dspaceWeb
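To see whether those dspaceWeb connections are doing real work or just sitting idle, it helps to group pg_stat_activity by state as well (a sketch; these columns exist in PostgreSQL 9.6):

```console
$ psql -c "SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY count(*) DESC;"
```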
    @@ -931,7 +931,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     
  • UptimeRobot says CGSpace went down again and I see the load is again at 14.0!
  • Here are the top IPs in nginx logs in the last hour:
  • -
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
    +
    # zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(06|07)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
           3 35.174.184.209
           3 66.249.66.81
           4 104.198.9.108
    @@ -960,7 +960,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     
  • I will add these three to the “bad bot” rate limiting that I originally used for Baidu
  • Going further, these are the IPs making requests to Discovery and Browse pages so far today:
  • -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "(discover|browse)" | grep -E "26/Mar/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "(discover|browse)" | grep -E "26/Mar/2019:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         120 34.207.146.166
         128 3.91.79.74
         132 108.179.57.67
    @@ -978,7 +978,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     
  • I can only hope that this helps the load go down because all this traffic is disrupting the service for normal users and well-behaved bots (and interrupting my dinner and breakfast)
  • Looking at the database usage I’m wondering why there are so many connections from the DSpace CLI:
  • -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
            5 dspaceApi
          10 dspaceCli
          13 dspaceWeb
    @@ -987,19 +987,19 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
     
  • Make a minor edit to my agrovoc-lookup.py script to match subject terms with parentheses like COCOA (PLANT)
  • Test 89 corrections and 79 deletions for AGROVOC subject terms from the ones I cleaned up in the last week
  • -
    $ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
    +
    $ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d -n
     $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
     
    • UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0
    • Looking at the nginx logs I don’t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:
    -
    # grep SemrushBot /var/log/nginx/access.log | grep -E "26/Mar/2019" | grep -E '(discover|browse)' | wc -l
    +
    # grep SemrushBot /var/log/nginx/access.log | grep -E "26/Mar/2019" | grep -E '(discover|browse)' | wc -l
     2931
     
    • So I’m adding it to the badbot rate limiting in nginx, and actually, I kinda feel like just blocking all user agents with “bot” in the name for a few days to see if things calm down… maybe not just yet
    • Otherwise, these are the top users in the web and API logs the last hour (18–19):
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10 
          54 41.216.228.158
          65 199.47.87.140
          75 157.55.39.238
    @@ -1025,7 +1025,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
     
  • For the XMLUI I see 18.195.78.144 and 18.196.196.108 requesting only CTA items and with no user agent
  • They are responsible for almost 1,000 XMLUI sessions today:
  • -
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
    +
    $ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18.195.78.144|18.196.196.108)' dspace.log.2019-03-26 | sort | uniq | wc -l
     937
     
• I will add their IPs to the list of bot IPs in nginx so I can tag them as bots and let Tomcat’s Crawler Session Manager Valve force them to re-use their session
    • @@ -1033,19 +1033,19 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
• I will add curl to the Tomcat Crawler Session Manager because anyone using curl is most likely making automated read-only requests
    • I will add GuzzleHttp to the nginx badbots rate limiting, because it is making requests to dynamic Discovery pages
    -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E "26/Mar/2019:" | grep -E '(discover|browse)' | wc -l                                        
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E "26/Mar/2019:" | grep -E '(discover|browse)' | wc -l                                        
     119
     
    • What’s strange is that I can’t see any of their requests in the DSpace log…
    -
    $ grep -I -c 45.5.184.72 dspace.log.2019-03-26 
    +
    $ grep -I -c 45.5.184.72 dspace.log.2019-03-26 
     0
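One possible explanation is that nginx answered those requests itself (for example via the rate limiting) so they never reached Tomcat; checking the status codes nginx returned for that IP would confirm it (a sketch in the same style as the other log queries here):

```console
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep 45.5.184.72 | grep -E "26/Mar/2019:" | awk '{print $9}' | sort | uniq -c | sort -n
```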
     

    2019-03-28

    • Run the corrections and deletions to AGROVOC (dc.subject) on DSpace Test and CGSpace, and then start a full re-index of Discovery
    • What the hell is going on with this CTA publication?
    -
    # grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
    +
    # grep Spore-192-EN-web.pdf /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
           1 37.48.65.147
           1 80.113.172.162
           2 108.174.5.117
    @@ -1077,7 +1077,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
     
     
  • In other news, I see that it’s not even the end of the month yet and we have 3.6 million hits already:
  • -
    # zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
    +
    # zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
     3654911
     
    • In other other news I see that DSpace has no statistics for years before 2019 currently, yet when I connect to Solr I see all the cores up
    • @@ -1105,7 +1105,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db ds
• It is frustrating to see that the load spikes from our own legitimate load on the server were greatly aggravated and drawn out by the contention for CPU on this host
    • We had 4.2 million hits this month according to the web server logs:
    -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Mar/2019"
     4218841
     
     real    0m26.609s
    @@ -1114,7 +1114,7 @@ sys     0m2.551s
     
    • Interestingly, now that the CPU steal is not an issue the REST API is ten seconds faster than it was in 2018-10:
    -
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
    +
    $ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
     ...
     0.33s user 0.07s system 2% cpu 17.167 total
     0.27s user 0.04s system 1% cpu 16.643 total
    @@ -1137,7 +1137,7 @@ sys     0m2.551s
     
  • Looking at the weird issue with shitloads of downloads on the CTA item again
  • The item was added on 2019-03-13 and these three IPs have attempted to download the item’s bitstream 43,000 times since it was added eighteen days ago:
  • -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 5
          42 196.43.180.134
         621 185.247.144.227
        8102 18.194.46.84
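It would also be worth checking what user agents those three IPs are sending, since that decides whether DSpace counts them as spiders (a sketch; the awk split on double quotes assumes nginx’s combined log format, where the user agent is the last quoted string):

```console
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.{2..17}.gz | grep 'Spore-192-EN-web.pdf' | grep -E '(196.43.180.134|185.247.144.227|18.194.46.84)' | awk -F'"' '{print $6}' | sort | uniq -c | sort -n
```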
    @@ -1152,7 +1152,7 @@ sys     0m2.551s
     
     
     
    -
    2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
    +
    2019-03-29 09:10:07,311 ERROR org.dspace.rest.Resource @ Could not delete collection(id=1451), AuthorizeException. Message: org.dspace.authorize.AuthorizeException: Authorization denied for action ADMIN on COLLECTION:1451 by user 9492
     
    • IWMI people emailed to ask why two items with the same DOI don’t have the same Altmetric score:
        @@ -1168,15 +1168,15 @@ sys 0m2.551s
    -
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
    +
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
     
• The response payload for the second one is the same:
    -
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
    +
    _altmetric.embed_callback({"title":"Distilling the role of ecosystem services in the Sustainable Development Goals","doi":"10.1016/j.ecoser.2017.10.010","tq":["Progress on 12 of 17 #SDGs rely on #ecosystemservices - new paper co-authored by a number of","Distilling the role of ecosystem services in the Sustainable Development Goals - new paper by @SNAPPartnership researchers","How do #ecosystemservices underpin the #SDGs? Our new paper starts counting the ways. Check it out in the link below!","Excellent paper about the contribution of #ecosystemservices to SDGs","So great to work with amazing collaborators"],"altmetric_jid":"521611533cf058827c00000a","issns":["2212-0416"],"journal":"Ecosystem Services","cohorts":{"sci":58,"pub":239,"doc":3,"com":2},"context":{"all":{"count":12732768,"mean":7.8220956572788,"rank":56146,"pct":99,"higher_than":12676701},"journal":{"count":549,"mean":7.7567299270073,"rank":2,"pct":99,"higher_than":547},"similar_age_3m":{"count":386919,"mean":11.573702536454,"rank":3299,"pct":99,"higher_than":383619},"similar_age_journal_3m":{"count":28,"mean":9.5648148148148,"rank":1,"pct":96,"higher_than":27}},"authors":["Sylvia L.R. Wood","Sarah K. Jones","Justin A. Johnson","Kate A. Brauman","Rebecca Chaplin-Kramer","Alexander Fremier","Evan Girvetz","Line J. Gordon","Carrie V. Kappel","Lisa Mandle","Mark Mulligan","Patrick O'Farrell","William K. Smith","Louise Willemen","Wei Zhang","Fabrice A. DeClerck"],"type":"article","handles":["10568/89975","10568/89846"],"handle":"10568/89975","altmetric_id":29816439,"schema":"1.5.4","is_oa":false,"cited_by_posts_count":377,"cited_by_tweeters_count":302,"cited_by_fbwalls_count":1,"cited_by_gplus_count":1,"cited_by_policies_count":2,"cited_by_accounts_count":306,"last_updated":1554039125,"score":208.65,"history":{"1y":54.75,"6m":10.35,"3m":5.5,"1m":5.5,"1w":1.5,"6d":1.5,"5d":1.5,"4d":1.5,"3d":1.5,"2d":1,"1d":1,"at":208.65},"url":"http://dx.doi.org/10.1016/j.ecoser.2017.10.010","added_on":1512153726,"published_on":1517443200,"readers":{"citeulike":0,"mendeley":248,"connotea":0},"readers_count":248,"images":{"small":"https://badges.altmetric.com/?size=64&score=209&types=tttttfdg","medium":"https://badges.altmetric.com/?size=100&score=209&types=tttttfdg","large":"https://badges.altmetric.com/?size=180&score=209&types=tttttfdg"},"details_url":"http://www.altmetric.com/details.php?citation_id=29816439"})
     
    • Very interesting to see this in the response:
    -
    "handles":["10568/89975","10568/89846"],
    +
    "handles":["10568/89975","10568/89846"],
     "handle":"10568/89975"
     
    • On further inspection I see that the Altmetric explorer pages for each of these Handles is actually doing the right thing: diff --git a/docs/2019-04/index.html b/docs/2019-04/index.html index 55d6c5b11..9bc7ce607 100644 --- a/docs/2019-04/index.html +++ b/docs/2019-04/index.html @@ -64,7 +64,7 @@ $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u ds $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d "/> - + @@ -163,13 +163,13 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
        4432 200
     
    • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
    • Apply country and region corrections and deletions on DSpace Test and CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    +
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    @@ -191,26 +191,26 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
     
     
    -
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
    +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-04-03-orcid-ids.txt
     
    • We currently have 1177 unique ORCID identifiers, and this brings our total to 1237!
    • Next I will resolve all their names using my resolve-orcids.py script:
    -
    $ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
    +
    $ ./resolve-orcids.py -i /tmp/2019-04-03-orcid-ids.txt -o 2019-04-03-orcid-ids.txt -d
     
    • After that I added the XML formatting, formatted the file with tidy, and sorted the names in vim
    • One user’s name has changed so I will update those using my fix-metadata-values.py script:
    -
    $ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
    +
    $ ./fix-metadata-values.py -i 2019-04-03-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
     
    • I created a pull request and merged the changes to the 5_x-prod branch (#417)
    • A few days ago I noticed some weird update process for the statistics-2018 Solr core and I see it’s still going:
    -
    2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
    +
    2019-04-03 16:34:02,262 INFO  org.dspace.statistics.SolrLogger @ Updating : 1754500/21701 docs in http://localhost:8081/solr//statistics-2018
     
• Interestingly, there are 5666 occurrences, and they are mostly for the 2018 core:
    -
    $ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
    +
    $ grep 'org.dspace.statistics.SolrLogger @ Updating' /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | awk '{print $11}' | sort | uniq -c
           1 
           3 http://localhost:8081/solr//statistics-2017
        5662 http://localhost:8081/solr//statistics-2018
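One way to keep an eye on whether that update process is still running is to follow the log live (a sketch, using the same log file as above):

```console
$ tail -f /home/cgspace.cgiar.org/log/dspace.log.2019-04-03 | grep --line-buffered 'SolrLogger @ Updating'
```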
    @@ -222,14 +222,14 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
  • Uptime Robot reported that CGSpace (linode18) went down tonight
  • I see there are lots of PostgreSQL connections:
  • -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
          10 dspaceCli
         250 dspaceWeb
     
    • I still see those weird messages about updating the statistics-2018 Solr core:
    -
    2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
    +
    2019-04-05 21:06:53,770 INFO  org.dspace.statistics.SolrLogger @ Updating : 2444600/21697 docs in http://localhost:8081/solr//statistics-2018
     
    • Looking at iostat 1 10 I also see some CPU steal has come back, and I can confirm it by looking at the Munin graphs:
@@ -242,7 +242,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
-
    statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
    +
    statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
     
    • I restarted it again and all the Solr cores came up properly…
    @@ -257,7 +257,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
  • Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:
  • -
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         222 18.195.78.144
         245 207.46.13.58
         303 207.46.13.194
    @@ -282,17 +282,17 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
• 45.5.184.72 is in Colombia so it’s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT’s datasets collection:
    -
    GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
    +
    GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
     
    • Their user agent is the one I added to the badbots list in nginx last week: “GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1”
    • They made 22,000 requests to Discover on this collection today alone (and it’s only 11AM):
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
       22077 /handle/10568/72970/discover
     
    • Yesterday they made 43,000 requests and we actually blocked most of them:
    -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c 
       43631 /handle/10568/72970/discover
     # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep "05/Apr/2019" | grep 45.5.184.72 | grep -E '/handle/[0-9]+/[0-9]+/discover' | awk '{print $9}' | sort | uniq -c 
         142 200
    @@ -315,7 +315,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
     
     
    -
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
    +
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-03&rows=0&wt=json&indent=true'
     {
         "response": {
             "docs": [],
    @@ -341,7 +341,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
    • Strangely I don’t see many hits in 2019-04:
    -
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
    +
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=type%3A0+AND+(ip%3A18.196.196.108+OR+ip%3A18.195.78.144+OR+ip%3A18.195.218.6)&fq=statistics_type%3Aview&fq=bundleName%3AORIGINAL&fq=dateYearMonth%3A2019-04&rows=0&wt=json&indent=true'
     {
         "response": {
             "docs": [],
    @@ -367,7 +367,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
    -
    $ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
    +
    $ http --print Hh GET https://dspacetest.cgiar.org/bitstream/handle/10568/100289/Spore-192-EN-web.pdf
     GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1
     Accept: */*
     Accept-Encoding: gzip, deflate
    @@ -419,7 +419,7 @@ X-XSS-Protection: 1; mode=block
     
    • And from the server side, the nginx logs show:
    -
    78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
    +
    78.x.x.x - - [07/Apr/2019:01:38:35 -0700] "GET /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 68078 "-" "HTTPie/1.0.2"
     78.x.x.x - - [07/Apr/2019:01:39:01 -0700] "HEAD /bitstream/handle/10568/100289/Spore-192-EN-web.pdf HTTP/1.1" 200 0 "-" "HTTPie/1.0.2"
     
• So the transfer is definitely more efficient with a HEAD, but I need to wait to see if these requests show up in Solr @@ -428,7 +428,7 @@ X-XSS-Protection: 1; mode=block
    -
    2019-04-07 02:05:30,966 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
    +
    2019-04-07 02:05:30,966 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=EF2DB6E4F69926C5555B3492BB0071A8:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
     2019-04-07 02:05:39,265 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous:session_id=B6381FC590A5160D84930102D068C7A3:ip_addr=78.x.x.x:view_bitstream:bitstream_id=165818
     
    • So my inclination is that both HEAD and GET requests are registered as views as far as Solr and DSpace are concerned @@ -437,7 +437,7 @@ X-XSS-Protection: 1; mode=block
    -
    2019-04-07 02:08:44,186 INFO  org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
    +
    2019-04-07 02:08:44,186 INFO  org.apache.solr.update.UpdateHandler @ start commit{,optimize=true,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
     
    • Ugh, even after optimizing there are no Solr results for requests from my IP, and actually I only see 18 results from 2019-04 so far and none of them are statistics_type:view… very weird
        @@ -448,7 +448,7 @@ X-XSS-Protection: 1; mode=block
      • According to the DSpace 5.x Solr documentation the default commit time is after 15 minutes or 10,000 documents (see solrconfig.xml)
      • I looped some GET and HEAD requests to a bitstream on my local instance and after some time I see that they do register as downloads (even though they are internal):
      -
      $ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
      +
      $ http --print b 'http://localhost:8080/solr/statistics/select?q=type%3A0+AND+time%3A2019-04-07*&fq=statistics_type%3Aview&fq=isInternal%3Atrue&rows=0&wt=json&indent=true'
       {
           "response": {
               "docs": [],
      @@ -496,12 +496,12 @@ X-XSS-Protection: 1; mode=block
       
• UptimeRobot said CGSpace went down and up a few times tonight, and my first instinct was to check iostat 1 10 and I saw that CPU steal is around 10–30 percent right now…
    • The load average is super high right now, as I’ve noticed the last few times UptimeRobot said that CGSpace went down:
    -
    $ cat /proc/loadavg 
    +
    $ cat /proc/loadavg 
     10.70 9.17 8.85 18/633 4198
     
    • According to the server logs there is actually not much going on right now:
    -
    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "07/Apr/2019:(18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         118 18.195.78.144
         128 207.46.13.219
         129 167.114.64.100
    @@ -529,7 +529,7 @@ X-XSS-Protection: 1; mode=block
     
  • 2408:8214:7a00:868f:7c1e:e0f3:20c6:c142 is some stupid Chinese bot making malicious POST requests
  • There are free database connections in the pool:
  • -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           7 dspaceCli
          23 dspaceWeb
    @@ -546,7 +546,7 @@ X-XSS-Protection: 1; mode=block
     
     
     
    -
    $ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
    +
    $ lein run ~/src/git/DSpace/2019-02-22-affiliations.csv name id
     
    • After matching the values and creating some new matches I had trouble remembering how to copy the reconciled values to a new column
        @@ -555,34 +555,34 @@ X-XSS-Protection: 1; mode=block
    -
    if(cell.recon.matched, cell.recon.match.name, value)
    +
    if(cell.recon.matched, cell.recon.match.name, value)
     
    • See the OpenRefine variables documentation for more notes about the recon object
    • I also noticed a handful of errors in our current list of affiliations so I corrected them:
    -
    $ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
    +
    $ ./fix-metadata-values.py -i 2019-04-08-fix-13-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
     
    • We should create a new list of affiliations to update our controlled vocabulary again
    • I dumped a list of the top 1500 affiliations:
    -
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 211 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-04-08-top-1500-affiliations.csv WITH CSV HEADER;
     COPY 1500
     
    • Fix a few more messed up affiliations that have return characters in them (use Ctrl-V Ctrl-M to re-create control character):
    -
    dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
    +
    dspace=# UPDATE metadatavalue SET text_value='International Institute for Environment and Development' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'International Institute^M%';
     dspace=# UPDATE metadatavalue SET text_value='Kenya Agriculture and Livestock Research Organization' WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE 'Kenya Agricultural  and Livestock  Research^M%';
     
    • I noticed a bunch of subjects and affiliations that use stylized apostrophes so I will export those and then batch update them:
    -
    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 AND text_value LIKE '%’%') to /tmp/2019-04-08-affiliations-apostrophes.csv WITH CSV HEADER;
     COPY 60
     dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 57 AND text_value LIKE '%’%') to /tmp/2019-04-08-subject-apostrophes.csv WITH CSV HEADER;
     COPY 20
     
    • I cleaned them up in OpenRefine and then applied the fixes on CGSpace and DSpace Test:
    -
    $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
    +
    $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-60-affiliations-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -d
     $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
     
• UptimeRobot said that CGSpace (linode18) went down tonight
    @@ -592,14 +592,14 @@ $ ./fix-metadata-values.py -i /tmp/2019-04-08-fix-20-subject-apostrophes.csv -db
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           7 dspaceCli
         250 dspaceWeb
     
    • On a related note I see connection pool errors in the DSpace log:
    -
    2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
    +
    2019-04-08 19:01:10,472 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL connection Error - 
     org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-319] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
     
    • But still I see 10 to 30% CPU steal in iostat that is also reflected in the Munin graphs:
    • @@ -609,7 +609,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
    • Linode Support still didn’t respond to my ticket from yesterday, so I attached a new output of iostat 1 10 and asked them to move the VM to a less busy host
    • The web server logs are not very busy:
    -
    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E "08/Apr/2019:(17|18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         124 40.77.167.135
         135 95.108.181.88
         139 157.55.39.206
    @@ -636,7 +636,7 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
     
  • Linode sent an alert that CGSpace (linode18) was 440% CPU for the last two hours this morning
  • Here are the top IPs in the web server logs around that time:
  • -
    # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "09/Apr/2019:(06|07|08)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          18 66.249.79.139
          21 157.55.39.160
          29 66.249.79.137
    @@ -661,11 +661,11 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
     
    • 45.5.186.2 is at CIAT in Colombia and I see they are mostly making requests to the REST API, but also to XMLUI with the following user agent:
    -
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
    +
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
     
    • Database connection usage looks fine:
    -
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           7 dspaceCli
          11 dspaceWeb
    @@ -683,13 +683,13 @@ org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exe
     
  • Abenet pointed out a possibility of validating funders against the CrossRef API
  • Note that if you use HTTPS and specify a contact address in the API request you have less likelihood of being blocked
  • -
    $ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
    +
    $ http 'https://api.crossref.org/funders?query=mercator&mailto=me@cgiar.org'
     
    • Otherwise, they provide the funder data in CSV and RDF format
• I did a quick test with the recent IITA records against reconcile-csv in OpenRefine and it matched a few, but the ones that didn’t match will need a human to check them and make an informed decision…
    • If I want to write a script for this I could use the Python habanero library:
    -
    from habanero import Crossref
    +
    from habanero import Crossref
     cr = Crossref(mailto="me@cgiar.org")
     x = cr.funders(query = "mercator")
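• If I remember the CrossRef response format correctly, the matching funder names end up under message and items, so a short follow-up like this (just a sketch) would print them:
# x is the parsed JSON response from cr.funders() above
for funder in x['message']['items']:
    print(funder['name'], funder.get('uri', ''))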
     

    2019-04-11

    @@ -720,7 +720,7 @@ x = cr.funders(query = "mercator")
  • I captured a few general corrections and deletions for AGROVOC subjects while looking at IITA’s records, so I applied them to DSpace Test and CGSpace:
  • -
    $ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
    +
    $ ./fix-metadata-values.py -i /tmp/2019-04-11-fix-14-subjects.csv -db dspace -u dspace -p 'fuuu' -f dc.subject -m 57 -t correct -d
     $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d
     
    • Answer more questions about DOIs and Altmetric scores from WLE
    • @@ -753,7 +753,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
      • Change DSpace Test (linode19) to use the Java GC tuning from the Solr 4.10.4 startup script:
      -
      GC_TUNE="-XX:NewRatio=3 \
      +
      GC_TUNE="-XX:NewRatio=3 \
           -XX:SurvivorRatio=4 \
           -XX:TargetSurvivorRatio=90 \
           -XX:MaxTenuringThreshold=8 \
      @@ -786,7 +786,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-04-11-delete-6-subjects.csv -db dspac
       
    -
    import json
    +
    import json
     import re
     import urllib
     import urllib2
    @@ -809,7 +809,7 @@ return item_id
     
     
  • I ran a full Discovery indexing on CGSpace because I didn’t do it after all the metadata updates last week:
  • -
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    82m45.324s
     user    7m33.446s
    @@ -1001,7 +1001,7 @@ sys     2m13.463s
     
  • For future reference, Linode mentioned that they consider CPU steal above 8% to be significant
  • Regarding the other Linode issue about speed, I did a test with iperf between linode18 and linode19:
  • -
    # iperf -s
    +
    # iperf -s
     ------------------------------------------------------------
     Server listening on TCP port 5001
     TCP window size: 85.3 KByte (default)
    @@ -1049,11 +1049,11 @@ TCP window size: 85.0 KByte (default)
     
     
  • I want to get rid of this annoying warning that is constantly in our DSpace logs:
  • -
    2019-04-08 19:02:31,770 WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
    +
    2019-04-08 19:02:31,770 WARN  org.dspace.xoai.services.impl.xoai.DSpaceRepositoryConfiguration @ { OAI 2.0 :: DSpace } Not able to retrieve the dspace.oai.url property from oai.cfg. Falling back to request address
     
    • Apparently it happens once per request, which can be at least 1,500 times per day according to the DSpace logs on CGSpace (linode18):
    -
    $ grep -c 'Falling back to request address' dspace.log.2019-04-20
    +
    $ grep -c 'Falling back to request address' dspace.log.2019-04-20
     dspace.log.2019-04-20:1515
     
    • I will fix it in dspace/config/modules/oai.cfg
    • @@ -1098,7 +1098,7 @@ dspace.log.2019-04-20:1515
    -
    $ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv > /tmp/iita.csv
    +
    $ csvcut -c id,dc.identifier.uri,'dc.identifier.uri[]' ~/Downloads/2019-04-24-IITA.csv > /tmp/iita.csv
     
    • Carlos Tejo from the Land Portal had been emailing me this week to ask about the old REST API that Tsega was building in 2017
        @@ -1108,7 +1108,7 @@ dspace.log.2019-04-20:1515
    -
    $ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
    +
    $ curl -f -H "accept: application/json" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
     curl: (22) The requested URL returned error: 401
     
• Note that curl only shows the HTTP 401 error if you use -f (fail), and only then if you don’t include -s
    @@ -1118,7 +1118,7 @@ curl: (22) The requested URL returned error: 401
    -
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
    +
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT' AND text_lang='en_US';
      count 
     -------
        376
    @@ -1138,7 +1138,7 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
     
    • I see that the HTTP 401 issue seems to be a bug due to an item that the user doesn’t have permission to access… from the DSpace log:
    -
    2019-04-24 08:11:51,129 INFO  org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
    +
    2019-04-24 08:11:51,129 INFO  org.dspace.rest.ItemsResource @ Looking for item with metadata(key=cg.subject.cpwf,value=WATER MANAGEMENT, language=en_US).
     2019-04-24 08:11:51,231 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72448
     2019-04-24 08:11:51,238 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/72491
     2019-04-24 08:11:51,243 INFO  org.dspace.usage.LoggerUsageEventListener @ anonymous::view_item:handle=10568/75703
    @@ -1146,14 +1146,14 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
     
    • Nevertheless, if I request using the null language I get 1020 results, plus 179 for a blank language attribute:
    -
    $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": null}' | jq length
    +
    $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": null}' | jq length
     1020
     $ curl -s -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": ""}' | jq length
     179
     
    • This is weird because I see 942–1156 items with “WATER MANAGEMENT” (depending on wildcard matching for errors in subject spelling):
    -
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
    +
    dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=208 AND text_value='WATER MANAGEMENT';
      count 
     -------
        942
    @@ -1177,13 +1177,13 @@ dspace=# SELECT COUNT(text_value) FROM metadatavalue WHERE resource_type_id=2 AN
     
     
  • I tested the REST API after logging in with my super admin account and I was able to get results for the problematic query:
  • -
    $ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/login" -d '{"email":"example@me.com","password":"fuuuuu"}'
    +
    $ curl -f -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/login" -d '{"email":"example@me.com","password":"fuuuuu"}'
     $ curl -f -H "Content-Type: application/json" -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -X GET "https://dspacetest.cgiar.org/rest/status"
     $ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
     
    • I created a normal user for Carlos to try as an unprivileged user:
    -
    $ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
    +
    $ dspace user --add --givenname Carlos --surname Tejo --email blah@blah.com --password 'ddmmdd'
     
    • But still I get the HTTP 401 and I have no idea which item is causing it
• I enabled more verbose logging in ItemsResource.java and now I can at least see the item ID that causes the failure…
    @@ -1192,7 +1192,7 @@ $ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b"
    -
    dspace=# SELECT * FROM item WHERE item_id=74648;
    +
    dspace=# SELECT * FROM item WHERE item_id=74648;
      item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection | discoverable
     ---------+--------------+------------+-----------+----------------------------+-------------------+--------------
        74648 |          113 | f          | f         | 2016-03-30 09:00:52.131+00 |                   | t
    @@ -1212,7 +1212,7 @@ $ curl -f -H "rest-dspace-token: b43d41a6-5ac1-455d-b49a-616b8debc25b"
     
    • Export a list of authors for Peter to look through:
    -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-04-26-all-authors.csv with csv header;
     COPY 65752
     

    2019-04-28

      @@ -1222,7 +1222,7 @@ COPY 65752
    -
    dspace=# SELECT * FROM item WHERE item_id=74648;
    +
    dspace=# SELECT * FROM item WHERE item_id=74648;
      item_id | submitter_id | in_archive | withdrawn |       last_modified        | owning_collection | discoverable 
     ---------+--------------+------------+-----------+----------------------------+-------------------+--------------
        74648 |          113 | f          | f         | 2019-04-28 08:48:52.114-07 |                   | f
    @@ -1230,7 +1230,7 @@ COPY 65752
     
• And I tried the curl command from above again, but I still get the HTTP 401 and the same error in the DSpace log:
    -
    2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
    +
    2019-04-28 08:53:07,170 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=74648)!
     
    • I even tried to “expunge” the item using an action in CSV, and it said “EXPUNGED!” but the item is still there…
    @@ -1239,7 +1239,7 @@ COPY 65752
  • Send mail to the dspace-tech mailing list to ask about the item expunge issue
  • Delete and re-create Podman container for dspacedb after pulling a new PostgreSQL container:
  • -
    $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
    +
    $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     
    • Carlos from LandPortal asked if I could export CGSpace in a machine-readable format so I think I’ll try to do a CSV
        @@ -1247,7 +1247,7 @@ COPY 65752
    -
    dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
    +
    dspace=# SELECT DISTINCT text_lang, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id != 28 GROUP BY text_lang;
      text_lang |  count
     -----------+---------
                |  358647
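• One way to give Carlos a CSV would be to page through the REST API items endpoint with expand=metadata (the same style of request as the collection items URL that appears later) and flatten the key/value pairs; this is a rough sketch of the idea, not something I have run:
import csv
import requests

base = 'https://cgspace.cgiar.org/rest'
with open('/tmp/cgspace-metadata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['handle', 'key', 'value'])
    offset = 0
    while True:
        items = requests.get(base + '/items',
                             params={'expand': 'metadata', 'limit': 100, 'offset': offset},
                             headers={'Accept': 'application/json'}).json()
        if not items:
            break
        for item in items:
            for m in item.get('metadata', []):
                writer.writerow([item.get('handle'), m.get('key'), m.get('value')])
        offset += 100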
    diff --git a/docs/2019-05/index.html b/docs/2019-05/index.html
    index 231a776e4..cfb3a09e0 100644
    --- a/docs/2019-05/index.html
    +++ b/docs/2019-05/index.html
    @@ -48,7 +48,7 @@ DELETE 1
     
     But after this I tried to delete the item from the XMLUI and it is still present…
     "/>
    -
    +
     
     
         
    @@ -145,7 +145,7 @@ But after this I tried to delete the item from the XMLUI and it is still present
     
     
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:
  • -
    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
    +
    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
     
    • But after this I tried to delete the item from the XMLUI and it is still present…
    • @@ -158,7 +158,7 @@ DELETE 1
    -
    dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
    +
    dspace=# DELETE FROM metadatavalue WHERE resource_id=74648;
     dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     dspace=# DELETE FROM item WHERE item_id=74648;
     
      @@ -168,12 +168,12 @@ dspace=# DELETE FROM item WHERE item_id=74648;
    -
    $ curl -f -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
    +
    $ curl -f -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key":"cg.subject.cpwf", "value":"WATER MANAGEMENT","language": "en_US"}'
     curl: (22) The requested URL returned error: 401 Unauthorized
     
    • The DSpace log shows the item ID (because I modified the error text):
    -
    2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
    +
    2019-05-01 11:41:11,069 ERROR org.dspace.rest.ItemsResource @ User(anonymous) has not permission to read item(id=77708)!
     
    • If I delete that one I get another, making the list of item IDs so far:
        @@ -202,7 +202,7 @@ curl: (22) The requested URL returned error: 401 Unauthorized
    -
    https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&expand=metadata
    +
    https://cgspace.cgiar.org/rest/collections/1179/items?limit=812&expand=metadata
     

    2019-05-03

• A user from CIAT emailed to say that CGSpace submission emails have not been working the last few weeks
    @@ -211,7 +211,7 @@ curl: (22) The requested URL returned error: 401 Unauthorized
    -
    $ dspace test-email
    +
    $ dspace test-email
     
     About to send test email:
      - To: woohoo@cgiar.org
    @@ -255,11 +255,11 @@ Please see the DSpace documentation for assistance.
     
     
     
    -
    statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
    +
    statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
     
    • As well as this error in the logs:
    -
    Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
    +
    Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2018/data/index/write.lock
     
    • Strangely enough, I do see the statistics-2018, statistics-2017, etc cores in the Admin UI…
• I restarted Tomcat a few times (and even deleted all the Solr write locks) and at least five times there were issues loading one statistics core, causing the Atmire stats to be incomplete
    @@ -282,7 +282,7 @@ Please see the DSpace documentation for assistance.
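• A quick way to check which statistics cores actually came up after a restart is Solr’s cores admin API (a sketch; the port matches the queries above, and I think cores that failed to load simply won’t appear in the response):
import requests

r = requests.get('http://localhost:8081/solr/admin/cores',
                 params={'action': 'STATUS', 'wt': 'json'})
status = r.json()['status']
# list the statistics cores that are currently loaded
for name in sorted(status):
    if name.startswith('statistics'):
        print(name, status[name].get('startTime'))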
      • The number of unique sessions today is ridiculously high compared to the last few days considering it’s only 12:30PM right now:
      -
      $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
      +
      $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-06 | sort | uniq | wc -l
       101108
       $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-05 | sort | uniq | wc -l
       14618
      @@ -297,7 +297,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc
       
      • The number of unique IP addresses from 2 to 6 AM this morning is already several times higher than the average for that time of the morning this past week:
      -
      # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
      +
      # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
       7127
       # zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '05/May/2019:(02|03|04|05|06)' | awk '{print $1}' | sort | uniq | wc -l
       1231
      @@ -312,7 +312,7 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-05-01 | sort | uniq | wc
       
      • Just this morning between the hours of 2 and 6 the number of unique sessions was very high compared to previous mornings:
      -
      $ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +
      $ cat dspace.log.2019-05-06 | grep -E '2019-05-06 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       83650
       $ cat dspace.log.2019-05-05 | grep -E '2019-05-05 (02|03|04|05|06):' | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       2547
      @@ -327,7 +327,7 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
       
      • Most of the requests were GETs:
      -
      # cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E "(GET|HEAD|POST|PUT)" | sort | uniq -c | sort -n
      +
      # cat /var/log/nginx/{access,library-access}.log /var/log/nginx/{access,library-access}.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E "(GET|HEAD|POST|PUT)" | sort | uniq -c | sort -n
             1 PUT
            98 POST
          2845 HEAD
      @@ -336,19 +336,19 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
       
    • I’m not exactly sure what happened this morning, but it looks like some legitimate user traffic—perhaps someone launched a new publication and it got a bunch of hits?
    • Looking again, I see 84,000 requests to /handle this morning (not including logs for library.cgiar.org because those get HTTP 301 redirect to CGSpace and appear here in access.log):
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E " /handle/[0-9]+/[0-9]+"
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -c -o -E " /handle/[0-9]+/[0-9]+"
     84350
     
    • But it would be difficult to find a pattern for those requests because they cover 78,000 unique Handles (ie direct browsing of items, collections, or communities) and only 2,492 discover/browse (total, not unique):
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E " /handle/[0-9]+/[0-9]+ HTTP" | sort | uniq | wc -l
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E " /handle/[0-9]+/[0-9]+ HTTP" | sort | uniq | wc -l
     78104
     # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E '06/May/2019:(02|03|04|05|06)' | grep -o -E " /handle/[0-9]+/[0-9]+/(discover|browse)" | wc -l
     2492
     
    • In other news, I see some IP is making several requests per second to the exact same REST API endpoints, for example:
    -
    # grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
    +
    # grep /rest/handle/10568/3703?expand=all rest.log | awk '{print $1}' | sort | uniq -c
           3 2a01:7e00::f03c:91ff:fe0a:d645
         113 63.32.242.35
     
      @@ -363,7 +363,7 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
      • The total number of unique IPs on CGSpace yesterday was almost 14,000, which is several thousand higher than previous day totals:
      -
      # zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
      +
      # zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep -E '06/May/2019' | awk '{print $1}' | sort | uniq | wc -l
       13969
       # zcat --force /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '05/May/2019' | awk '{print $1}' | sort | uniq | wc -l
       5936
      @@ -374,7 +374,7 @@ $ cat dspace.log.2019-05-01 | grep -E '2019-05-01 (02|03|04|05|06):' | grep -o -
       
      • Total number of sessions yesterday was much higher compared to days last week:
      -
      $ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      +
      $ cat dspace.log.2019-05-06 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       144160
       $ cat dspace.log.2019-05-05 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
       57269
      @@ -407,7 +407,7 @@ $ cat dspace.log.2019-05-01 | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq |
       
    -
    $ dspace test-email
    +
    $ dspace test-email
     
     About to send test email:
      - To: wooooo@cgiar.org
    @@ -423,7 +423,7 @@ Please see the DSpace documentation for assistance.
     
  • Help Moayad with certbot-auto for Let’s Encrypt scripts on the new AReS server (linode20)
  • Normalize all text_lang values for metadata on CGSpace and DSpace Test (as I had tested last month):
  • -
    UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
    +
    UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
     UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
     UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('es', 'spa');
     
      @@ -454,7 +454,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
• All of the IPs from these networks are using generic user agents like this one, but there are MANY more, and they change frequently:
    -
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36"
    +
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2703.0 Safari/537.36"
     
    • I found a blog post from 2018 detailing an attack from a DDoS service that matches our pattern exactly
    • They specifically mention:
    • @@ -473,7 +473,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
• I see that the Unpaywall bot is responsible for a few thousand XMLUI sessions every day (IP addresses come from nginx access.log):
      -
      $ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l   
      +
      $ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l   
       2206
       
      • I added “Unpaywall” to the list of bots in the Tomcat Crawler Session Manager Valve
      • @@ -505,7 +505,7 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
        • Export a list of all investors (dc.description.sponsorship) for Peter to look through and correct:
        -
        dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-05-16-investors.csv WITH CSV HEADER;
        +
        dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-05-16-investors.csv WITH CSV HEADER;
         COPY 995
         
• Fork the ICARDA AReS v1 repository to ILRI’s GitHub and give access to CodeObia guys
    @@ -519,19 +519,19 @@ COPY 995
        • Peter sent me a bunch of fixes for investors from yesterday
        • I did a quick check in Open Refine (trim and collapse whitespace, clean smart quotes, etc) and then applied them on CGSpace:
        -
        $ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace-u dspace-p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
        +
        $ ./fix-metadata-values.py -i /tmp/2019-05-16-fix-306-Investors.csv -db dspace-u dspace-p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
         $ ./delete-metadata-values.py -i /tmp/2019-05-16-delete-297-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
         
        • Then I started a full Discovery re-indexing:
        -
        $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
        +
        $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
         $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
         
        • I was going to make a new controlled vocabulary of the top 100 terms after these corrections, but I noticed a bunch of duplicates and variations when I sorted them alphabetically
        • Instead, I exported a new list and asked Peter to look at it again
        • Apply Peter’s new corrections on DSpace Test and CGSpace:
        -
        $ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
        +
        $ ./fix-metadata-values.py -i /tmp/2019-05-17-fix-25-Investors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
         $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
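• For reference, the kind of duplicates and variations I noticed can be flagged by normalizing case, whitespace, and punctuation over the exported list (a sketch using the investors export from above):
import csv
import re
from collections import defaultdict

groups = defaultdict(set)
with open('/tmp/2019-05-16-investors.csv') as f:
    for row in csv.DictReader(f):
        name = row['text_value']
        key = re.sub(r'[^a-z0-9]+', ' ', name.lower()).strip()
        groups[key].add(name)

# print groups of names that collapse to the same normalized form
for names in groups.values():
    if len(names) > 1:
        print(' | '.join(sorted(names)))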
         
• Then I re-exported the sponsors and took the top 100 to update the existing controlled vocabulary (#423)
    @@ -564,7 +564,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dsp
        • Generate Simple Archive Format bundle with SAFBuilder and import into the AfricaRice Articles in Journals collection on CGSpace:
        -
        $ dspace import -a -e me@cgiar.org -m 2019-05-25-AfricaRice.map -s /tmp/SimpleArchiveFormat
        +
        $ dspace import -a -e me@cgiar.org -m 2019-05-25-AfricaRice.map -s /tmp/SimpleArchiveFormat
         

        2019-05-27

• Peter sent me over two thousand corrections for the authors on CGSpace that I had dumped last month
    @@ -573,16 +573,16 @@ $ ./delete-metadata-values.py -i /tmp/2019-05-17-delete-14-Investors.csv -db dsp
      -
      $ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t corrections -d
      +
      $ ./fix-metadata-values.py -i /tmp/2019-05-27-fix-2472-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t corrections -d
       
      • Then start a full Discovery re-indexing on each server:
      -
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"                                   
      +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"                                   
       $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
       
      • Export new list of all authors from CGSpace database to send to Peter:
      -
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
      +
      dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-05-27-all-authors.csv with csv header;
       COPY 64871
       
      • Run all system updates on DSpace Test (linode19) and reboot it
      • @@ -605,11 +605,11 @@ COPY 64871
        • I see the following error in the DSpace log when the user tries to log in with her CGIAR email and password on the LDAP login:
        -
        2019-05-30 07:19:35,166 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A5E0C836AF8F3ABB769FE47107AE1CFF:ip_addr=185.71.4.34:failed_login:no DN found for user sa.saini@cgiar.org
        +
        2019-05-30 07:19:35,166 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A5E0C836AF8F3ABB769FE47107AE1CFF:ip_addr=185.71.4.34:failed_login:no DN found for user sa.saini@cgiar.org
         
        • For now I just created an eperson with her personal email address until I have time to check LDAP to see what’s up with her CGIAR account:
        -
        $ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
        +
        $ dspace user -a -m blah@blah.com -g Sakshi -s Saini -p 'sknflksnfksnfdls'
         
diff --git a/docs/2019-06/index.html b/docs/2019-06/index.html
index d35ae96c5..6ca1ddd6c 100644
--- a/docs/2019-06/index.html
+++ b/docs/2019-06/index.html
@@ -34,7 +34,7 @@ Run system updates on CGSpace (linode18) and reboot it
 Skype with Marie-Angélique and Abenet about CG Core v2
 "/>
-
+
@@ -169,7 +169,7 @@ Skype with Marie-Angélique and Abenet about CG Core v2
        • Thierry noticed that the CUA statistics were missing previous years again, and I see that the Solr admin UI has the following message:
        -
        statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
        +
        statistics-2018: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher 
         
        • I had to restart Tomcat a few times for all the stats cores to get loaded with no issue
        @@ -197,13 +197,13 @@ Skype with Marie-Angélique and Abenet about CG Core v2
    -
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/countries.csv WITH CSV HEADER
    +
    dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 228 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC) to /tmp/countries.csv WITH CSV HEADER
     COPY 192
     $ csvcut -l -c 0 /tmp/countries.csv > 2019-06-10-countries.csv
     
    • Get a list of all the unique AGROVOC subject terms in IITA’s data and export it to a text file so I can validate them with my agrovoc-lookup.py script:
    -
    $ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u > iita-agrovoc.txt
    +
    $ csvcut -c dc.subject ~/Downloads/2019-06-10-IITA-20194th-Round-2.csv| sed 's/||/\n/g' | grep -v dc.subject | sort -u > iita-agrovoc.txt
     $ ./agrovoc-lookup.py -i iita-agrovoc.txt -om iita-agrovoc-matches.txt -or iita-agrovoc-rejects.txt
     $ wc -l iita-agrovoc*
       402 iita-agrovoc-matches.txt
    @@ -212,11 +212,11 @@ $ wc -l iita-agrovoc*
     
    • Combine these IITA matches with the subjects I matched a few months ago:
    -
    $ csvcut -c name 2019-03-18-subjects-matched.csv | grep -v name | cat - iita-agrovoc-matches.txt | sort -u > 2019-06-10-subjects-matched.txt
    +
    $ csvcut -c name 2019-03-18-subjects-matched.csv | grep -v name | cat - iita-agrovoc-matches.txt | sort -u > 2019-06-10-subjects-matched.txt
     
    • Then make a new list to use with reconcile-csv by adding line numbers with csvcut and changing the line number header to id:
    -
    $ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' > 2019-06-10-subjects-matched.csv
    +
    $ csvcut -c name -l 2019-06-10-subjects-matched.txt | sed 's/line_number/id/' > 2019-06-10-subjects-matched.csv
     

    2019-06-20

    • Share some feedback about AReS v2 with the colleagues and encourage them to do the same
    • @@ -231,14 +231,14 @@ $ wc -l iita-agrovoc*
    • Update my local PostgreSQL container:
    -
    $ podman pull docker.io/library/postgres:9.6-alpine
    +
    $ podman pull docker.io/library/postgres:9.6-alpine
     $ podman rm dspacedb
     $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     

    2019-06-25

    • Normalize text_lang values for metadata on DSpace Test and CGSpace:
    -
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
    +
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IN ('ethnob', 'en', '*', 'E.', '');
     UPDATE 1551
     dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2 AND metadata_field_id != 28 AND text_lang IS NULL;
     UPDATE 2070
    @@ -291,7 +291,7 @@ UPDATE 2
     
     
     
    -
    $ dspace import -a -e me@cgiar.org -m 2019-06-30-AfricaRice-11to73.map -s /tmp/2019-06-30-AfricaRice-11to73
    +
    $ dspace import -a -e me@cgiar.org -m 2019-06-30-AfricaRice-11to73.map -s /tmp/2019-06-30-AfricaRice-11to73
     
    • I sent feedback about a few missing PDFs and one duplicate to Ibnou to check
    • Run all system updates on DSpace Test (linode19) and reboot it
diff --git a/docs/2019-07/index.html b/docs/2019-07/index.html
index f70af4048..1c1c4649c 100644
--- a/docs/2019-07/index.html
+++ b/docs/2019-07/index.html
@@ -38,7 +38,7 @@ CGSpace
 Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community
 "/>
-
+
@@ -153,12 +153,12 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
    -
    org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
    +
    org.apache.solr.common.SolrException: Error CREATEing SolrCore 'statistics-2010': Unable to create core [statistics-2010] Caused by: Lock obtain timed out: NativeFSLock@/home/cgspace.cgiar.org/solr/statistics-2010/data/index/write.lock
     
    • I restarted Tomcat ten times and it never worked…
    • I tried to stop Tomcat and delete the write locks:
    -
    # systemctl stop tomcat7
    +
    # systemctl stop tomcat7
     # find /dspace/solr/statistics* -iname "*.lock" -print -delete
     /dspace/solr/statistics/data/index/write.lock
     /dspace/solr/statistics-2010/data/index/write.lock
    @@ -176,23 +176,23 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
     
  • But it still didn’t work!
  • I stopped Tomcat, deleted the old locks, and will try to use the “simple” lock file type in solr/statistics/conf/solrconfig.xml:
  • -
    <lockType>${solr.lock.type:simple}</lockType>
    +
    <lockType>${solr.lock.type:simple}</lockType>
     
    • And after restarting Tomcat it still doesn’t work
    • Now I’ll try going back to “native” locking with unlockAtStartup:
    -
    <unlockOnStartup>true</unlockOnStartup>
    +
    <unlockOnStartup>true</unlockOnStartup>
     
    • Now the cores seem to load, but I still see an error in the Solr Admin UI and I still can’t access any stats before 2018
    • I filed an issue with Atmire, so let’s see if they can help
    • And since I’m annoyed and it’s been a few months, I’m going to move the JVM heap settings that I’ve been testing on DSpace Test to CGSpace
    • The old ones were:
    -
    -Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
    +
    -Djava.awt.headless=true -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5400 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
     
    • And the new ones come from Solr 4.10.x’s startup scripts:
    -
        -Djava.awt.headless=true
    +
        -Djava.awt.headless=true
         -Xms8192m -Xmx8192m
         -Dfile.encoding=UTF-8
         -XX:NewRatio=3
    @@ -221,14 +221,14 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
     
     
     
    -
    $ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
    +
    $ sed -i 's/CC-BY 4.0/CC-BY-4.0/' item_*/dublin_core.xml
     $ echo "10568/101992" >> item_*/collections
     $ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair_mapped
     
    • I noticed that all twenty-seven items had double dates like “2019-05||2019-05” so I fixed those, but the rest of the metadata looked good so I unmapped them from the temporary collection
    • Finish looking at the fifty-six AfricaRice items and upload them to CGSpace:
    -
    $ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat
    +
    $ dspace import -a -e me@cgiar.org -m 2019-07-02-AfricaRice-11to73.map -s /tmp/SimpleArchiveFormat
     
    • Peter pointed out that the Sharefair dates I fixed were not actually fixed
        @@ -249,20 +249,20 @@ $ dspace import -a -e me@cgiar.org -m 2019-07-02-Sharefair.map -s /tmp/Sharefair
    -
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-07-04-orcid-ids.txt
    +
    $ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/new-bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2019-07-04-orcid-ids.txt
     $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names.txt -d
     
    • Send and merge a pull request for the new ORCID identifiers (#428)
• I created a CSV with some ORCID identifiers that I had seen change so I could update any existing ones in the database:
    -
    cg.creator.id,correct
    +
    cg.creator.id,correct
     "Marius Ekué: 0000-0002-5829-6321","Marius R.M. Ekué: 0000-0002-5829-6321"
     "Mwungu: 0000-0001-6181-8445","Chris Miyinzi Mwungu: 0000-0001-6181-8445"
     "Mwungu: 0000-0003-1658-287X","Chris Miyinzi Mwungu: 0000-0003-1658-287X"
     
    • But when I ran fix-metadata-values.py I didn’t see any changes:
    -
    $ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
    +
    $ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -m 240 -t correct -d
     

    2019-07-06

    • Send a reminder to Marie about my notes on the CG Core v2 issue I created two weeks ago
    • @@ -282,7 +282,7 @@ $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names
    • Playing with the idea of using xsv to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:
    -
    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
    +
    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
     field,value,count
     cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
     $ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'         
    @@ -291,13 +291,13 @@ dc.title,Reference evapotranspiration prediction using hybridized fuzzy model wi
     
    • Or perhaps if DOIs are valid or not (having doi.org in the URL):
    -
    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
    +
    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
     field,value,count
     cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1
     
    -
    $ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
    +
    $ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
     dc.identifier.issn
     978-3-319-71997-9
     978-3-319-71997-9
    @@ -333,7 +333,7 @@ dc.identifier.issn
     
• Yesterday Thierry from CTA asked me about an error he was getting while submitting an item on CGSpace: “Unable to load Submission Information, since WorkspaceID (ID:S106658) is not a valid in-process submission.”
  • I looked in the DSpace logs and found this right around the time of the screenshot he sent me:
  • -
    2019-07-10 11:50:27,433 INFO  org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
    +
    2019-07-10 11:50:27,433 INFO  org.dspace.submit.step.CompleteStep @ lewyllie@cta.int:session_id=A920730003BCAECE8A3B31DCDE11A97E:submission_complete:Completed submission with id=106658
     
    • I’m assuming something happened in his browser (like a refresh) after the item was submitted…
    @@ -350,24 +350,24 @@ dc.identifier.issn
  • Run all system updates on DSpace Test (linode19) and reboot it
• Tried to run dspace cleanup -v on CGSpace and ran into an error:
  • -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(167394) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    # su - postgres
    +
    # su - postgres
     $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (167394);'
     UPDATE 1
     

    2019-07-16

    • Completely reset the Podman configuration on my laptop because there were some layers that I couldn’t delete and it had been some time since I did a cleanup:
    -
    $ podman system prune -a -f --volumes
    +
    $ podman system prune -a -f --volumes
     $ sudo rm -rf ~/.local/share/containers
     
    • Then pull a new PostgreSQL 9.6 image and load a CGSpace database dump into a new local test container:
    -
    $ podman pull postgres:9.6-alpine
    +
    $ podman pull postgres:9.6-alpine
     $ podman run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
     $ createuser -h localhost -U postgres --pwprompt dspacetest
     $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
    @@ -388,7 +388,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
     
     
  • Sisay said a user was having problems registering on CGSpace and it looks like the email account expired again:
  • -
    $ dspace test-email
    +
    $ dspace test-email
     
     About to send test email:
      - To: blahh@cgiar.org
    @@ -414,7 +414,7 @@ Please see the DSpace documentation for assistance.
     
    • Create an account for Lionelle Samnick on CGSpace because the registration isn’t working for some reason:
    -
    $ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
    +
    $ dspace user --add --givenname Lionelle --surname Samnick --email blah@blah.com --password 'blah'
     
    • I added her as a submitter to CTA ISF Pro-Agro series
    • Start looking at 1429 records for the Bioversity batch import @@ -442,7 +442,7 @@ Please see the DSpace documentation for assistance.
    -
        <dct:coverage>
    +
        <dct:coverage>
             <dct:spatial>
                 <type>Country</type>
                 <dct:identifier>http://sws.geonames.org/192950</dct:identifier>
    @@ -484,14 +484,14 @@ Please see the DSpace documentation for assistance.
     

    I might be able to use isbnlib to validate ISBNs in Python:

    -
    if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
    +
    if isbnlib.is_isbn10('9966-955-07-0') or isbnlib.is_isbn13('9966-955-07-0'):
         print("Yes")
     else:
         print("No")
     
    -
    from stdnum import isbn
    +
    from stdnum import isbn
     from stdnum import issn
     
     isbn.validate('978-92-9043-389-7')
    @@ -510,7 +510,7 @@ issn.validate('1020-3362')
     

    I figured out a GREL to trim spaces in multi-value cells without splitting them:

    -
    value.replace(/\s+\|\|/,"||").replace(/\|\|\s+/,"||")
    +
    value.replace(/\s+\|\|/,"||").replace(/\|\|\s+/,"||")
     
    • I whipped up a quick script using Python Pandas to do whitespace cleanup
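• A minimal sketch of what such a Pandas whitespace cleanup could look like (file names are hypothetical, and it assumes all columns can be treated as strings):

```console
$ python3 <<'EOF'
import pandas as pd

# hypothetical input and output paths
df = pd.read_csv('/tmp/input.csv', dtype=str)
# strip leading/trailing whitespace and collapse internal runs of whitespace
df = df.apply(lambda col: col.str.strip().str.replace(r'\s+', ' ', regex=True))
df.to_csv('/tmp/output-cleaned.csv', index=False)
EOF
```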
    diff --git a/docs/2019-08/index.html b/docs/2019-08/index.html index 77f7b357a..50a5e6238 100644 --- a/docs/2019-08/index.html +++ b/docs/2019-08/index.html @@ -46,7 +46,7 @@ After rebooting, all statistics cores were loaded… wow, that’s luck Run system updates on DSpace Test (linode19) and reboot it "/> - + @@ -194,7 +194,7 @@ Run system updates on DSpace Test (linode19) and reboot it -
    or(
    +
    or(
       isNotNull(value.match(/^.*’.*$/)),
       isNotNull(value.match(/^.*é.*$/)),
       isNotNull(value.match(/^.*á.*$/)),
    @@ -235,14 +235,14 @@ Run system updates on DSpace Test (linode19) and reboot it
     
     
     
    -
    # /opt/certbot-auto renew --standalone --pre-hook "/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld" --post-hook "/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx"
    +
    # /opt/certbot-auto renew --standalone --pre-hook "/usr/bin/docker stop angular_nginx; /bin/systemctl stop firewalld" --post-hook "/bin/systemctl start firewalld; /usr/bin/docker start angular_nginx"
     
    • It is important that the firewall starts back up before the Docker container or else Docker will complain about missing iptables chains
    • Also, I updated to the latest TLS Intermediate settings as appropriate for Ubuntu 18.04’s OpenSSL 1.1.0g with nginx 1.16.0
    • Run all system updates on AReS dev server (linode20) and reboot it
    • Get a list of all PDFs from the Bioversity migration that fail to download and save them so I can try again with a different path in the URL:
    -
    $ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
    +
    $ ./generate-thumbnails.py -i /tmp/2019-08-05-Bioversity-Migration.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs.txt
     $ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload.csv
     $ ./generate-thumbnails.py -i /tmp/user-upload.csv -w --url-field-name url -d | tee /tmp/2019-08-08-download-pdfs2.txt
     $ grep -B1 "Download failed" /tmp/2019-08-08-download-pdfs2.txt | grep "Downloading" | sed -e 's/> Downloading //' -e 's/\.\.\.//' | sed -r 's/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g' | csvcut -H -c 1,1 > /tmp/user-upload2.csv
    @@ -277,7 +277,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
     
     
     
    -
    proxy_set_header Host dev.ares.codeobia.com;
    +
    proxy_set_header Host dev.ares.codeobia.com;
     
    • Though I am really wondering why this happened now, because the configuration has been working for months…
    • Improve the output of the suspicious characters check in csv-metadata-quality script and tag version 0.2.0
    • @@ -329,7 +329,7 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
      • Create a test user on DSpace Test for Mohammad Salem to attempt depositing:
      -
      $ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
      +
      $ dspace user -a -m blah@blah.com -g Mohammad -s Salem -p 'domoamaaa'
       
      • Create and merge a pull request (#429) to add eleven new CCAFS Phase II Project Tags to CGSpace
      • Atmire responded to the Solr cores issue last week, but they could not reproduce the issue @@ -339,13 +339,13 @@ $ ./generate-thumbnails.py -i /tmp/user-upload2.csv -w --url-field-name url -d |
      • Testing an import of 1,429 Bioversity items (metadata only) on my local development machine and got an error with Java memory after about 1,000 items:
      -
      $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
      +
      $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
       ...
       java.lang.OutOfMemoryError: GC overhead limit exceeded
       
      • I increased the heap size to 1536m and tried again:
      -
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1536m"
      +
      $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1536m"
       $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
       
      • This time it succeeded, and using VisualVM I noticed that the import process used a maximum of 620MB of RAM
      • @@ -361,7 +361,7 @@ $ ~/dspace/bin/dspace metadata-import -f /tmp/bioversity.csv -e blah@blah.com
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
     $ dspace metadata-import -f /tmp/bioversity1.csv -e blah@blah.com
     $ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
     
      @@ -377,7 +377,7 @@ $ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
• Deploy Tomcat 7.0.96 and PostgreSQL JDBC 42.2.6 driver on CGSpace (linode18)
    • After restarting Tomcat one of the Solr statistics cores failed to start up:
    -
    statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
    +
    statistics-2015: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
     
    • I decided to run all system updates on the server and reboot it
    • After reboot the statistics-2018 core failed to load so I restarted tomcat7 again
    • @@ -393,7 +393,7 @@ $ dspace metadata-import -f /tmp/bioversity2.csv -e blah@blah.com
    -
    import os
    +
    import os
     
     return os.path.basename(value)
     
      @@ -429,7 +429,7 @@ return os.path.basename(value)
    -
    $ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
    +
    $ ./fix-metadata-values.py -i ~/Downloads/2019-08-26-Peter-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
     
    • Apply the corrections on CGSpace and DSpace Test
        @@ -437,7 +437,7 @@ return os.path.basename(value)
    -
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    81m47.057s 
     user    8m5.265s 
    @@ -478,21 +478,21 @@ sys     2m24.715s
     
     
     
    -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/2019-08-28-all-authors.csv with csv header;
     COPY 65597
     
    • Then I created a new CSV with two author columns (edit title of second column after):
    -
    $ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv > /tmp/all-authors.csv
    +
    $ csvcut -c dc.contributor.author,dc.contributor.author /tmp/2019-08-28-all-authors.csv > /tmp/all-authors.csv
     
    • Then I ran my script on the new CSV, skipping one of the author columns:
    -
    $ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
    +
    $ csv-metadata-quality -u -i /tmp/all-authors.csv -o /tmp/authors.csv -x dc.contributor.author
     
• This fixed a bunch of issues with spaces, commas, unnecessary Unicode characters, etc
    • Then I ran the corrections on my test server and there were 185 of them!
    -
    $ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
    +
    $ ./fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correctauthor
     
    • I very well might run these on CGSpace soon…
    @@ -506,7 +506,7 @@ COPY 65597 -
    $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec ./cgcore-xsl-replacements.sed {} \;
    +
    $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec ./cgcore-xsl-replacements.sed {} \;
     
    • I think I got everything in the XMLUI themes, but there may be some things I should check once I get a deployment up and running:
        @@ -526,7 +526,7 @@ COPY 65597
    -
    "handles":["10986/30568","10568/97825"],"handle":"10986/30568"
    +
    "handles":["10986/30568","10568/97825"],"handle":"10986/30568"
     
• So this is the same issue we had before, where Altmetric knows this Handle is associated with a DOI that has a score, but the client-side JavaScript code doesn’t show it because it seems to be a secondary handle or something
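• A quick way to compare what Altmetric has for each Handle is to query their v1 API directly (assuming the public handle endpoint is still available and rate limits allow):

```console
$ http 'https://api.altmetric.com/v1/handle/10986/30568' | jq .score
$ http 'https://api.altmetric.com/v1/handle/10568/97825' | jq .score
```

• If only one of the two Handles returns a score, that would confirm the badge data is attached to the other repository’s Handle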
    @@ -535,7 +535,7 @@ COPY 65597
  • Run system updates on DSpace Test (linode19) and reboot the server
  • Run the author fixes on DSpace Test and CGSpace and start a full Discovery re-index:
  • -
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
      
     real    90m47.967s
     user    8m12.826s
    diff --git a/docs/2019-09/index.html b/docs/2019-09/index.html
    index 5803bdd2c..116692387 100644
    --- a/docs/2019-09/index.html
    +++ b/docs/2019-09/index.html
    @@ -72,7 +72,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
        7249 2a01:7e00::f03c:91ff:fe18:7396
        9124 45.5.186.2
     "/>
    -
    +
     
     
         
    @@ -163,7 +163,7 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
     
  • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
  • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
  • -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         440 17.58.101.255
         441 157.55.39.101
         485 207.46.13.43
    @@ -189,18 +189,18 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
     
  • 3.94.211.189 is MauiBot, and most of its requests are to Discovery and get rate limited with HTTP 503
  • 163.172.71.23 is some IP on Online SAS in France and its user agent is:
  • -
    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
    +
    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
     
    • It actually got mostly HTTP 200 responses:
    -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
        1775 200
         703 499
          72 503
     
    • And it was mostly requesting Discover pages:
    -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c 
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c 
        2350 discover
          71 handle
     
      @@ -279,16 +279,16 @@ Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
    -
    2019-09-15 15:32:18,137 WARN  org.apache.cocoon.components.xslt.TraxErrorListener  - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
    +
    2019-09-15 15:32:18,137 WARN  org.apache.cocoon.components.xslt.TraxErrorListener  - Can not load requested doc: unknown protocol: cocoon at jndi:/localhost/themes/CIAT/xsl/../../0_CGIAR/xsl//aspect/artifactbrowser/common.xsl:141:90
     
    • Around the same time I see the following in the DSpace log:
    -
    2019-09-15 15:32:18,079 INFO  org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644 
    +
    2019-09-15 15:32:18,079 INFO  org.dspace.usage.LoggerUsageEventListener @ aorth@blah:session_id=A11C362A7127004C24E77198AF9E4418:ip_addr=x.x.x.x:view_item:handle=10568/103644 
     2019-09-15 15:32:18,135 WARN  org.dspace.core.PluginManager @ Cannot find named plugin for interface=org.dspace.content.crosswalk.DisseminationCrosswalk, name="METSRIGHTS"
     
    • I see a lot of these errors today, but not earlier this month:
    -
    # grep -c 'Cannot find named plugin' dspace.log.2019-09-*
    +
    # grep -c 'Cannot find named plugin' dspace.log.2019-09-*
     dspace.log.2019-09-01:0
     dspace.log.2019-09-02:0
     dspace.log.2019-09-03:0
    @@ -307,7 +307,7 @@ dspace.log.2019-09-15:808
     
    • Something must have happened when I restarted Tomcat a few hours ago, because earlier in the DSpace log I see a bunch of errors like this:
    -
    2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.METSRightsCrosswalk", name="METSRIGHTS"
    +
    2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.METSRightsCrosswalk", name="METSRIGHTS"
     2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.OREDisseminationCrosswalk", name="ore"
     2019-09-15 13:59:24,136 ERROR org.dspace.core.PluginManager @ Name collision in named plugin, implementation class="org.dspace.content.crosswalk.DIMDisseminationCrosswalk", name="dim"
     
      @@ -321,7 +321,7 @@ dspace.log.2019-09-15:808
      • For some reason my podman PostgreSQL container isn’t working so I had to use Docker to re-create it for my testing work today:
      -
      # docker pull docker.io/library/postgres:9.6-alpine
      +
      # docker pull docker.io/library/postgres:9.6-alpine
 # docker volume create dspacedb_data
       # docker run --name dspacedb -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
       $ createuser -h localhost -U postgres --pwprompt dspacetest
      @@ -338,7 +338,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
       
    -
    dc.contributor.author,cg.creator.id
    +
    dc.contributor.author,cg.creator.id
     "Kihara, Job","Job Kihara: 0000-0002-4394-9553"
     "Twyman, Jennifer","Jennifer Twyman: 0000-0002-8581-5668"
     "Ishitani, Manabu","Manabu Ishitani: 0000-0002-6950-4018"
    @@ -358,7 +358,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
     
    • I tested the file on my local development machine with the following invocation:
    -
    $ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i 2019-09-19-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
     
• In my test environment this added 390 ORCID identifiers
    • I ran the same updates on CGSpace and DSpace Test and then started a Discovery re-index to force the search index to update
    • @@ -386,15 +386,15 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
    • Follow up with Marissa again about the CCAFS phase II project tags
    • Generate a list of the top 1500 authors on CGSpace:
    -
    dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
    +
    dspace=# \copy (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-09-19-top-1500-authors.csv WITH CSV HEADER;
     
    • Then I used csvcut to select the column of author names, strip the header and quote characters, and saved the sorted file:
    -
    $ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/"//g' | sort > dspace/config/controlled-vocabularies/dc-contributor-author.xml
    +
    $ csvcut -c text_value /tmp/2019-09-19-top-1500-authors.csv | grep -v text_value | sed 's/"//g' | sort > dspace/config/controlled-vocabularies/dc-contributor-author.xml
     
    • After adding the XML formatting back to the file I formatted it using XML tidy:
    -
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml
    +
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-contributor-author.xml
     
    • I created and merged a pull request for the updates
        @@ -416,7 +416,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
    -
    $ perl-rename -n 's/_{2,3}/_/g' *.pdf
    +
    $ perl-rename -n 's/_{2,3}/_/g' *.pdf
     
• I was preparing to run SAFBuilder for the Bioversity migration and decided to check the list of PDFs on my local machine versus on DSpace Test (where I had downloaded them last month)
        @@ -426,7 +426,7 @@ $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-s
    -
    $ rename -v 's/___/_/g'  *.pdf
    +
    $ rename -v 's/___/_/g'  *.pdf
     $ rename -v 's/__/_/g'  *.pdf
     
    • I’m still waiting to hear what Carol and Francesca want to do with the 1195.pdf.LCK file (for now I’ve removed it from the CSV, but for future reference it has the number 630 in its permalink)
    • @@ -436,15 +436,15 @@ $ rename -v 's/__/_/g' *.pdf
    -
    value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
    +
    value.replace(/,? ?\((ANDES|APAFRI|APFORGEN|Canada|CFC|CGRFA|China|CacaoNet|CATAS|CDU|CIAT|CIRF|CIP|CIRNMA|COSUDE|Colombia|COA|COGENT|CTDT|Denmark|DfLP|DSE|ECPGR|ECOWAS|ECP\/GR|England|EUFORGEN|FAO|France|Francia|FFTC|Germany|GEF|GFU|GGCO|GRPI|italy|Italy|Italia|India|ICCO|ICAR|ICGR|ICRISAT|IDRC|INFOODS|IPGRI|IBPGR|ICARDA|ILRI|INIBAP|INBAR|IPK|ISG|IT|Japan|JIRCAS|Kenya|LI\-BIRD|Malaysia|NARC|NBPGR|Nepal|OOAS|RDA|RISBAP|Rome|ROPPA|SEARICE|Senegal|SGRP|Sweden|Syrian Arab Republic|The Netherlands|UNDP|UK|UNEP|UoB|UoM|United Kingdom|WAHO)\)/,"")
     
• The second targets cities and countries after names like “International Livestock Research Institute, Kenya”:
    -
    replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
    +
    replace(/,? ?(ali|Aleppo|Amsterdam|Beijing|Bonn|Burkina Faso|CN|Dakar|Gatersleben|London|Montpellier|Nairobi|New Delhi|Kaski|Kepong|Malaysia|Khumaltar|Lima|Ltpur|Ottawa|Patancheru|Peru|Pokhara|Rome|Uppsala|University of Mauritius|Tsukuba)/,"")
     
    • I imported the 1,427 Bioversity records with bitstreams to a new collection called 2019-09-20 Bioversity Migration Test on DSpace Test (after splitting them in two batches of about 700 each):
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx768m'
     $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity1.map -s /home/aorth/Bioversity/bioversity1
     $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bioversity/bioversity2
     
      @@ -513,7 +513,7 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
    • Get a list of institutions from CCAFS’s Clarisa API and try to parse it with jq, do some small cleanups and add a header in sed, and then pass it through csvcut to add line numbers:
    -
    $ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
    +
    $ cat ~/Downloads/institutions.json| jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
     $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv -u
     
    • The csv-metadata-quality tool caught a few records with excessive spacing and unnecessary Unicode
    • diff --git a/docs/2019-10/index.html b/docs/2019-10/index.html index 63242266a..5885b8da4 100644 --- a/docs/2019-10/index.html +++ b/docs/2019-10/index.html @@ -18,7 +18,7 @@ - + @@ -113,7 +113,7 @@
    -
    $ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
    +
    $ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv > /tmp/iwmi-title-region-subregion-river.csv
     
• Then I replace them in vim with :% s/\%u00a0/ /g because I can’t figure out the correct sed syntax to do it directly from the pipe above (a possible sed approach is sketched below)
    • I uploaded those to CGSpace and then re-exported the metadata
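• For reference, GNU sed can match the UTF-8 non-breaking space by its byte sequence (0xC2 0xA0), so something like this should work directly in the pipe (an untested sketch):

```console
$ csvcut -c 'id,dc.title[en_US],cg.coverage.region[en_US],cg.coverage.subregion[en_US],cg.river.basin[en_US]' ~/Downloads/10568-16814.csv | sed 's/\xc2\xa0/ /g' > /tmp/iwmi-title-region-subregion-river.csv
```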
    • @@ -121,7 +121,7 @@
    • I modified the script so it replaces the non-breaking spaces instead of removing them
    • Then I ran the csv-metadata-quality script to do some general cleanups (though I temporarily commented out the whitespace fixes because it was too many thousands of rows):
    -
    $ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
    +
    $ csv-metadata-quality -i ~/Downloads/10568-16814.csv -o /tmp/iwmi.csv -x 'dc.date.issued,dc.date.issued[],dc.date.issued[en_US]' -u
     
    • That fixed 153 items (unnecessary Unicode, duplicates, comma–space fixes, etc)
    • Release version 0.3.1 of the csv-metadata-quality script with the non-breaking spaces change
    • @@ -134,7 +134,7 @@
      • Create an account for Bioversity’s ICT consultant Francesco on DSpace Test:
      -
      $ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
      +
      $ dspace user -a -m blah@mail.it -g Francesco -s Vernocchi -p 'fffff'
       
      • Email Francesca and Carol to ask for follow up about the test upload I did on 2019-09-21
          @@ -193,19 +193,19 @@
      -
      $ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
      +
      $ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
       

      2019-10-11

      • I ran the DSpace cleanup function on CGSpace and it found some errors:
      -
      $ dspace cleanup -v
      +
      $ dspace cleanup -v
       ...
       Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
         Detail: Key (bitstream_id)=(171221) is still referenced from table "bundle".
       
      • The solution, as always, is (repeat as many times as needed):
      -
      # su - postgres
      +
      # su - postgres
       $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (171221);'
       UPDATE 1
       

      2019-10-12

      @@ -223,7 +223,7 @@ UPDATE 1
    -
    from,to
    +
    from,to
     CIAT,International Center for Tropical Agriculture
     International Centre for Tropical Agriculture,International Center for Tropical Agriculture
     International Maize and Wheat Improvement Center (CIMMYT),International Maize and Wheat Improvement Center
    @@ -234,7 +234,7 @@ International Maize and Wheat Improvement Centre,International Maize and Wheat I
     
    • Then I applied it with my fix-metadata-values.py script on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
    +
    $ ./fix-metadata-values.py -i /tmp/affiliations.csv -db dspace -u dspace -p 'fuuu' -f from -m 211 -t to
     
    • I did some manual curation of about 300 authors in OpenRefine in preparation for telling Peter and Abenet that the migration is almost ready
        @@ -260,17 +260,17 @@ International Maize and Wheat Improvement Centre,International Maize and Wheat I
    -
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time schedtool -B -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    82m35.993s
     
    • After the re-indexing the top authors still list the following:
    -
    Jagwe, J.|Ouma, E.A.|Brandes-van Dorresteijn, D.|Kawuma, Brian|Smith, J.
    +
    Jagwe, J.|Ouma, E.A.|Brandes-van Dorresteijn, D.|Kawuma, Brian|Smith, J.
     
    • I looked in the database to find authors that had “|” in them:
    -
    dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
    +
    dspace=# SELECT text_value, resource_id FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value LIKE '%|%';
                 text_value            | resource_id 
     ----------------------------------+-------------
      Anandajayasekeram, P.|Puskur, R. |         157
    @@ -280,7 +280,7 @@ real    82m35.993s
     
    • Then I found their handles and corrected them, for example:
    -
    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
    +
    dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '157' and handle.resource_type_id=2;
       handle   
     -----------
      10568/129
    @@ -304,7 +304,7 @@ real    82m35.993s
     
     
     
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx512m'
     $ mkdir 2019-10-15-Bioversity
     $ dspace export -i 10568/108684 -t COLLECTION -m -n 0 -d 2019-10-15-Bioversity
     $ sed -i '/<dcvalue element="identifier" qualifier="uri">/d' 2019-10-15-Bioversity/*/dublin_core.xml
    @@ -312,12 +312,12 @@ $ sed -i '/<dcvalue element="identifier" qualifier="uri"&
     
  • It’s really stupid, but for some reason the handles are included even though I specified the -m option, so after the export I removed the dc.identifier.uri metadata values from the items
  • Then I imported a test subset of them in my local test environment:
  • -
    $ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
    +
    $ ~/dspace/bin/dspace import -a -c 10568/104049 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s /tmp/2019-10-15-Bioversity
     
    • I had forgotten (again) that the dspace export command doesn’t preserve collection ownership or mappings, so I will have to create a temporary collection on CGSpace to import these to, then do the mappings again after import…
    • On CGSpace I will increase the RAM of the command line Java process for good luck before import…
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ dspace import -a -c 10568/104057 -e fuu@cgiar.org -m 2019-10-15-Bioversity.map -s 2019-10-15-Bioversity
     
    • After importing the 1,367 items I re-exported the metadata, changed the owning collections to those based on their type, then re-imported them
    • diff --git a/docs/2019-11/index.html b/docs/2019-11/index.html index 1040b2d1b..bb159b9d0 100644 --- a/docs/2019-11/index.html +++ b/docs/2019-11/index.html @@ -58,7 +58,7 @@ Let’s see how many of the REST API requests were for bitstreams (because t # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams" 106781 "/> - + @@ -152,7 +152,7 @@ Let’s see how many of the REST API requests were for bitstreams (because t
    -
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    +
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     4671942
     # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     1277694
    @@ -160,14 +160,14 @@ Let’s see how many of the REST API requests were for bitstreams (because t
     
  • So 4.6 million from XMLUI and another 1.2 million from API requests
  • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
  • -
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    +
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
     1183456 
     # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
     106781
     
    • The types of requests in the access logs are (by lazily extracting the sixth field in the nginx log)
    -
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
    +
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
           1 PUT
           8 PROPFIND
         283 OPTIONS
    @@ -177,16 +177,16 @@ Let’s see how many of the REST API requests were for bitstreams (because t
     
    • Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:
    -
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
    +
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
     365288
     
    • Their user agent is one I’ve never seen before:
    -
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
    +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
     
    • Most of them seem to be to community or collection discover and browse results pages like /handle/10568/103/discover:
    -
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
    +
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
        6566 GET /bitstream
      351928 GET /handle
     # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
    @@ -196,12 +196,12 @@ Let’s see how many of the REST API requests were for bitstreams (because t
     
    • As far as I can tell, none of their requests are counted in the Solr statistics:
    -
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
    +
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'
     
    • Still, those requests are CPU intensive so I will add their user agent to the “badbots” rate limiting in nginx to reduce the impact on server load
    • After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"
     
    • On the topic of spiders, I have been wanting to update DSpace’s default list of spiders in config/spiders/agents, perhaps by dropping a new list in from Atmire’s COUNTER-Robots project
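• A rough sketch of how that could work, assuming the project still publishes COUNTER_Robots_list.json with a pattern field for each entry (the output file name is hypothetical):

```console
$ wget https://raw.githubusercontent.com/atmire/COUNTER-Robots/master/COUNTER_Robots_list.json
$ jq -r '.[].pattern' COUNTER_Robots_list.json > dspace/config/spiders/agents/counter-robots.txt
```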
        @@ -210,13 +210,13 @@ Let’s see how many of the REST API requests were for bitstreams (because t
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
     $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"iskanie"
     $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"iskanie"
     
    • A bit later I checked Solr and found three requests from my IP with that user agent this month:
    -
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
    +
    $ http --print b 'http://localhost:8081/solr/statistics/select?q=ip:73.178.9.24+AND+userAgent:iskanie&fq=dateYearMonth%3A2019-11&rows=0'
     <?xml version="1.0" encoding="UTF-8"?>
     <response>
     <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">ip:73.178.9.24 AND userAgent:iskanie</str><str name="fq">dateYearMonth:2019-11</str><str name="rows">0</str></lst></lst><result name="response" numFound="3" start="0"></result>
    @@ -224,7 +224,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
     
    • Now I want to make similar requests with a user agent that is included in DSpace’s current user agent list:
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
     $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"celestial"
     $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/csl_Crane_oct2019.pptx?sequence=1&isAllowed=y' User-Agent:"celestial"
     
      @@ -234,7 +234,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
    -
    spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
    +
    spider.agentregex.regexfile = ${dspace.dir}/config/spiders/Bots-2013-03.txt
     
    • Apparently that is part of Atmire’s CUA, despite being in a standard DSpace configuration file…
    • I tried with some other garbage user agents like “fuuuualan” and they were visible in Solr @@ -247,7 +247,7 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
    -
    else if (line.hasOption('m'))
    +
    else if (line.hasOption('m'))
     {
         SolrLogger.markRobotsByIP();
     }
    @@ -263,12 +263,12 @@ $ http --print Hh 'https://dspacetest.cgiar.org/bitstream/handle/10568/105487/cs
     
    • I added “alanfuu2” to the example spiders file, restarted Tomcat, then made two requests to DSpace Test:
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu1"
     $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"alanfuuu2"
     
    • After committing the changes in Solr I saw one request for “alanfuu1” and no requests for “alanfuu2”:
    -
    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
    +
    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
     $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu1&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
       <result name="response" numFound="1" start="0">
     $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanfuuu2&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    @@ -281,12 +281,12 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:alanf
     
     
• I’m curious how special character matching works in Solr, so I will test two requests: one with “www.gnip.com” which is in the spider list, and one with “www.gnyp.com” which isn’t:
  • -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnip.com"
     $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"www.gnyp.com"
     
    • Then commit changes to Solr so we don’t have to wait:
    -
    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
    +
    $ http --print b 'http://localhost:8081/solr/statistics/update?commit=true'
     $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnip.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound 
       <result name="response" numFound="0" start="0"/>
     $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.gnyp.com&fq=dateYearMonth%3A2019-11' | xmllint --format - | grep numFound
    @@ -314,12 +314,12 @@ $ http --print b 'http://localhost:8081/solr/statistics/select?q=userAgent:www.g
     
     
     
    -
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
    +
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:BUbiNG*' | xmllint --format - | grep numFound
       <result name="response" numFound="62944" start="0">
     
    • Similar for com.plumanalytics, Grammarly, and ltx71!
    -
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
    +
    $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:
     *com.plumanalytics*' | xmllint --format - | grep numFound
       <result name="response" numFound="28256" start="0">
     $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*Grammarly*' | xmllint --format - | grep numFound
    @@ -329,7 +329,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
     
    • Deleting these seems to work, for example the 105,000 ltx71 records from 2018:
    -
    $ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
    +
    $ http --print b 'http://localhost:8081/solr/statistics-2018/update?stream.body=<delete><query>userAgent:*ltx71*</query><query>type:0</query></delete>&commit=true'
     $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&facet.field=ip&facet.mincount=1&type:0&q=userAgent:*ltx71*' | xmllint --format - | grep numFound
       <result name="response" numFound="0" start="0"/>
     
      @@ -341,7 +341,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
    -
    $ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select?facet=true&facet.field=dateYearMonth&facet.mincount=1&facet.offset=0&facet.limit=
     12&q=userAgent:*Unpaywall*' | xmllint --format - | less
     ...
       <lst name="facet_counts">
    @@ -394,7 +394,7 @@ $ http --print b 'http://localhost:8081/solr/statistics-2018/select?facet=true&a
     
     
     
    -
    $ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
    +
    $ for shard in statistics statistics-2018 statistics-2017 statistics-2016 statistics-2015 stat
     istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do ./check-spider-hits.sh -s $shard -p yes; done
     
    • Open a pull request against COUNTER-Robots to remove unnecessary escaping of dashes
    • @@ -423,7 +423,7 @@ istics-2014 statistics-2013 statistics-2012 statistics-2011 statistics-2010; do
• Testing modifying some of the COUNTER-Robots patterns to use [0-9] instead of the \d digit character class, as Solr’s regex search can’t use those (one way to convert them in bulk is sketched below)
    -
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
    +
    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/105487' User-Agent:"Scrapoo/1"
     $ http "http://localhost:8081/solr/statistics/update?commit=true"
     $ http "http://localhost:8081/solr/statistics/select?q=userAgent:Scrapoo*" | xmllint --format - | grep numFound
       <result name="response" numFound="1" start="0">
    @@ -433,7 +433,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
     
  • Nice, so searching with regex in Solr with // syntax works for those digits!
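• One way to convert the \d character classes mentioned above in bulk would be a simple sed substitution over a plain-text copy of the patterns (file names are hypothetical):

```console
$ sed -e 's/\\d/[0-9]/g' /tmp/counter-robots-patterns.txt > /tmp/counter-robots-patterns-solr.txt
```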
  • I realized that it’s easier to search Solr from curl via POST using this syntax:
  • -
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0")
    +
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:*Scrapoo*&rows=0")
     
    • If the parameters include something like “[0-9]” then curl interprets it as a range and will make ten requests
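• curl’s URL globbing can be disabled with -g (--globoff) so that bracket expressions are passed through literally, for example (an untested sketch with an arbitrary pattern):

```console
$ curl -g -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Scrapoo[0-9]/&rows=0'
```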
        @@ -441,7 +441,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
    -
    $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select' -d 'q=userAgent:/Postgenomic(\s|\+)v2/&rows=2'
     
• I updated the check-spider-hits.sh script to use the POST syntax, and I’m evaluating the feasibility of including the regex search patterns from the spider agent file, as I had been filtering them out due to differences in PCRE and Solr regex syntax and issues with shell handling
    @@ -450,7 +450,7 @@ $ http "http://localhost:8081/solr/statistics/select?q=userAgent:/Scrapoo\/
  • IWMI sent a few new ORCID identifiers for us to add to our controlled vocabulary
  • I will merge them with our existing list and then resolve their names using my resolve-orcids.py script:
  • -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2019-11-14-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2019-11-14-combined-orcids.txt -o /tmp/2019-11-14-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    @@ -513,7 +513,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
     
  • Most of the curl hits were from CIAT in mid-2019, where they were using GuzzleHttp from PHP, which uses something like this for its user agent:
  • -
    Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION>
    +
    Guzzle/<Guzzle_Version> curl/<curl_version> PHP/<PHP_VERSION>
     
    • Run system updates on DSpace Test and reboot the server
    @@ -564,7 +564,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
  • Buck is one I’ve never heard of before, its user agent is:
  • -
    Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
    +
    Buck/2.2; (+https://app.hypefactors.com/media-monitoring/about.html)
     
    • All in all that’s about 85,000 more hits purged, in addition to the 3.4 million I purged last week
    diff --git a/docs/2019-12/index.html b/docs/2019-12/index.html index ee393347e..59f8640c8 100644 --- a/docs/2019-12/index.html +++ b/docs/2019-12/index.html @@ -46,7 +46,7 @@ Make sure all packages are up to date and the package manager is up to date, the # dpkg -C # reboot "/> - + @@ -142,14 +142,14 @@ Make sure all packages are up to date and the package manager is up to date, the -
    # apt update && apt full-upgrade
    +
    # apt update && apt full-upgrade
     # apt-get autoremove && apt-get autoclean
     # dpkg -C
     # reboot
     
    • Take some backups:
    -
    # dpkg -l > 2019-12-01-linode18-dpkg.txt
    +
    # dpkg -l > 2019-12-01-linode18-dpkg.txt
     # tar czf 2019-12-01-linode18-etc.tar.gz /etc
     
    • Then check all third-party repositories in /etc/apt to see if everything using “xenial” has packages available for “bionic” and then update the sources:
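• A rough sketch of checking and updating them (any third-party repository without bionic packages still needs to be handled manually):

```console
# grep -r xenial /etc/apt/sources.list.d/
# sed -i 's/xenial/bionic/g' /etc/apt/sources.list.d/*.list
```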
    • @@ -157,18 +157,18 @@ Make sure all packages are up to date and the package manager is up to date, the
    • Pause the Uptime Robot monitoring for CGSpace
    • Make sure the update manager is installed and do the upgrade:
    -
    # apt install update-manager-core
    +
    # apt install update-manager-core
     # do-release-upgrade
     
    • After the upgrade finishes, remove Java 11, force the installation of bionic nginx, and reboot the server:
    -
    # apt purge openjdk-11-jre-headless
    +
    # apt purge openjdk-11-jre-headless
     # apt install 'nginx=1.16.1-1~bionic'
     # reboot
     
    • After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it’s working:
    -
    # rm -rf /opt/eff.org/certbot/venv/bin/letsencrypt
    +
    # rm -rf /opt/eff.org/certbot/venv/bin/letsencrypt
     # rm -rf /opt/ilri/dspace-statistics-api/venv
     # /opt/certbot-auto
     
      @@ -195,7 +195,7 @@ Make sure all packages are up to date and the package manager is up to date, the
    -
    $ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/cgspace-104030.xml
    +
    $ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/cgspace-104030.xml
     $ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/dspacetest-104030.xml
     
    • The DSpace Test ones actually now capture the DOI, where the CGSpace doesn’t…
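• An easy way to see exactly which fields differ is to pretty-print both responses with xmllint and diff them:

```console
$ xmllint --format /tmp/cgspace-104030.xml > /tmp/cgspace-104030-pretty.xml
$ xmllint --format /tmp/dspacetest-104030.xml > /tmp/dspacetest-104030-pretty.xml
$ diff /tmp/cgspace-104030-pretty.xml /tmp/dspacetest-104030-pretty.xml
```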
    • @@ -209,7 +209,7 @@ $ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&metadataPref
    -
    dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable='f' AND item.in_archive='t' AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable='f' AND item.in_archive='t' AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
     COPY 48
     

    2019-12-05

      @@ -288,13 +288,13 @@ COPY 48
    • I looked into creating RTF documents from HTML in Node.js and there is a library called html-to-rtf that works well, but doesn’t support images
    • Export a list of all investors (dc.description.sponsorship) for Peter to look through and correct:
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.sponsor", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.sponsor", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
     COPY 643
     

    2019-12-18

    • Apply the investor corrections and deletions from Peter on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
    +
    $ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
     $ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
     
    • Peter asked about the “Open Government Licence 3.0” that is used by some items @@ -304,7 +304,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dsp
    -
    dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open%';
    +
    dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open%';
              text_value          
     -----------------------------
      Open Government License 3.0
    @@ -321,7 +321,7 @@ UPDATE 2
     
     
     
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c MegaIndex.ru 
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c MegaIndex.ru 
     27320
     
    • I see they did check robots.txt and their requests are only going to XMLUI item pages… so I guess I just leave them alone
    • @@ -338,12 +338,12 @@ UPDATE 2
      • I ran the dspace cleanup process on CGSpace (linode18) and had an error:
      -
      Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
      +
      Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
         Detail: Key (bitstream_id)=(179441) is still referenced from table "bundle".
       
      • The solution is to delete that bitstream manually:
      -
      $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);'
      +
      $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);'
       UPDATE 1
       
• Adjust CG Core v2 migration notes to use cg.review-status instead of cg.peer-reviewed diff --git a/docs/2020-01/index.html index 052ff7cbb..56361b7e3 100644 --- a/docs/2020-01/index.html +++ b/docs/2020-01/index.html @@ -56,7 +56,7 @@ I tweeted the CGSpace repository link "/> - + @@ -166,17 +166,17 @@ I tweeted the CGSpace repository link
        • Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:
        -
        dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
        +
        dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
         COPY 68790
         
        • As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:
        -
        $ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
        +
        $ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
         iconv: illegal input sequence at position 104779
         
        • According to this trick the troublesome character is on line 5227:
        -
        $ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv                                   
        +
        $ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv                                   
         5227: "Oue
         $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
         00000000: 22  "
        @@ -190,7 +190,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
         
• According to the blog post linked above the troublesome character is probably the “High Octet Preset” (81), which vim identifies (using ga on the character) as:
        -
        <e>  101,  Hex 65,  Octal 145 < ́> 769, Hex 0301, Octal 1401
        +
        <e>  101,  Hex 65,  Octal 145 < ́> 769, Hex 0301, Octal 1401
         
        • If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it’s stored incorrectly in the database…
        • Other encodings like windows-1251 and windows-1257 also fail on different characters like “ž” and “é” that are legitimate UTF-8 characters
        • @@ -207,7 +207,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
      -
      Exception: Read timed out
      +
      Exception: Read timed out
       java.net.SocketTimeoutException: Read timed out
       
      • I am not sure how I will fix that shard…
      • @@ -225,7 +225,7 @@ java.net.SocketTimeoutException: Read timed out
    -
    In [7]: unicodedata.is_normalized('NFC', 'é')
    +
    In [7]: unicodedata.is_normalized('NFC', 'é')
     Out[7]: False
     
     In [8]: unicodedata.is_normalized('NFC', 'é')
    @@ -235,7 +235,7 @@ Out[8]: True
     
  • I added support for Unicode normalization to my csv-metadata-quality tool in v0.4.0
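• For reference, a minimal sketch of what NFC normalization does, using a decomposed “é” (an “e” followed by a combining acute accent) as an example rather than the actual CGSpace data:
$ python3 -c "import unicodedata; s = 'e\u0301'; print(len(s), len(unicodedata.normalize('NFC', s)))"
2 1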
  • Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:
  • -
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
     COPY 144
     dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
     COPY 1325
    @@ -243,12 +243,12 @@ COPY 1325
     
  • She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC
  • I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my fix-metadata.py script:
  • -
    $ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
    +
    $ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
     

    2020-01-16

    • Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
     COPY 35
     
    • Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls) @@ -301,7 +301,7 @@ COPY 35
      • I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:
      -
      Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
      +
      Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
       
      -
      $ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
      +
      $ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
       
      • Then I decided to export them again (with two author columns) so I can perform the new Unicode normalization mode I added to csv-metadata-quality:
      -
      dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
      +
      dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
       COPY 67314
       dspace=# \q
       $ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
      @@ -331,7 +331,7 @@ $ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -
       
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
     COPY 6170
     dspace=# \q
     $ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
    @@ -339,11 +339,11 @@ $ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dsp
     
    • I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:
    -
    $ sleep 4h && time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ sleep 4h && time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
    • Then I generated a new list for Peter:
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
     COPY 6162
     
    • Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author “Hung, Nguyen” @@ -352,7 +352,7 @@ COPY 6162
    -
    $ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
    +
    $ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
     $ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
     $ wc -l hung-nguyen-a*handles.txt
       46 hung-nguyen-ares-handles.txt
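• To see exactly which handles differ, the two sorted lists can be compared with comm; for example, this would print the handles that are in the AReS export but not in the Atmire report:
$ comm -13 hung-nguyen-atmire-handles.txt hung-nguyen-ares-handles.txt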
    @@ -374,7 +374,7 @@ $ wc -l hung-nguyen-a*handles.txt
     
     
     
    -
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2020:0[12345678]" | goaccess --log-format=COMBINED -
    +
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2020:0[12345678]" | goaccess --log-format=COMBINED -
     
    • The top two hosts according to the amount of data transferred are:
        @@ -388,12 +388,12 @@ $ wc -l hung-nguyen-a*handles.txt
      • They are apparently using this Drupal module to generate the thumbnails: sites/all/modules/contrib/pdf_to_imagefield
• I see some excellent suggestions in this ImageMagick thread from 2012 that led me to some nice thumbnails (default PDF density is 72, so supersample to 4X and then resize back to 25%) as well as this blog post:
      -
      $ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
      +
      $ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
       
      • Here I’m also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using -flatten like DSpace already does
• I did some tests with a modified version of the above that uses -flatten and drops the sampling-factor and colorspace, but bumps up the image size to 600px (the default on CGSpace is currently 300):
      -
      $ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
      +
      $ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
       $ convert -flatten 10568-97925.pdf\[0\] 10568-97925.pdf.jpg
       $ convert -thumbnail x600 10568-97925-d288-lagrange.pdf.jpg 10568-97925-d288-lagrange-thumbnail.pdf.jpg
       $ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
      @@ -404,7 +404,7 @@ $ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
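• To compare the candidates, identify can print the dimensions and file size of each thumbnail, for example:
$ identify -format '%f: %wx%h, %B bytes\n' 10568-97925-d288-lagrange-thumbnail.pdf.jpg 10568-97925-thumbnail.pdf.jpg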
       
• The file size is about double that of the old ones, but the quality is very good and still nowhere near ilri.org’s 400KiB PNG!
    • Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:
    -
    $ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
    +
    $ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
     $ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
     $ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
     

    2020-01-26

    @@ -422,11 +422,11 @@ $ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db -
    $ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
    +
    $ convert -density 288 10568-97925.pdf\[0\] -density 72 -filter lagrange -flatten 10568-97925-density.jpg
     
    • One thing worth mentioning was this syntax for extracting bits from JSON in bash using jq:
    -
    $ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
    +
    $ RESPONSE=$(curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/103447?expand=bitstreams')
     $ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName=="ORIGINAL") | .retrieveLink'
     "/bitstreams/172559/retrieve"
     

    2020-01-27

    @@ -438,7 +438,7 @@ $ echo $RESPONSE | jq '.bitstreams[] | select(.bundleName=="ORIGINAL") -
    2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
    +
    2020-01-27 06:02:23,681 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'read:(g0 OR e610 OR g0 OR g3 OR g5 OR g4102 OR g9 OR g4105 OR g10 OR g4107 OR g4108 OR g13 OR g4109 OR g14 OR g15 OR g16 OR g18 OR g20 OR g23 OR g24 OR g2072 OR g2074 OR g28 OR g2076 OR g29 OR g2078 OR g2080 OR g34 OR g2082 OR g2084 OR g38 OR g2086 OR g2088 OR g43 OR g2093 OR g2095 OR g2097 OR g50 OR g51 OR g2101 OR g2103 OR g62 OR g65 OR g77 OR g78 OR g2127 OR g2142 OR g2151 OR g2152 OR g2153 OR g2154 OR g2156 OR g2165 OR g2171 OR g2174 OR g2175 OR g129 OR g2178 OR g2182 OR g2186 OR g153 OR g155 OR g158 OR g166 OR g167 OR g168 OR g169 OR g2225 OR g179 OR g2227 OR g2229 OR g183 OR g2231 OR g184 OR g2233 OR g186 OR g2235 OR g2237 OR g191 OR g192 OR g193 OR g2242 OR g2244 OR g2246 OR g2250 OR g204 OR g205 OR g207 OR g208 OR g2262 OR g2265 OR g218 OR g2268 OR g222 OR g223 OR g2271 OR g2274 OR g2277 OR g230 OR g231 OR g2280 OR g2283 OR g238 OR g2286 OR g241 OR g2289 OR g244 OR g2292 OR g2295 OR g2298 OR g2301 OR g254 OR g255 OR g2305 OR g2308 OR g262 OR g2311 OR g265 OR g268 OR g269 OR g273 OR g276 OR g277 OR g279 OR g282 OR g292 OR g293 OR g296 OR g297 OR g301 OR g303 OR g305 OR g2353 OR g310 OR g311 OR g313 OR g321 OR g325 OR g328 OR g333 OR g334 OR g342 OR g343 OR g345 OR g348 OR g2409 [...] ': too many boolean clauses
     
    • Now this appears to be a Solr limit of some kind (“too many boolean clauses”) @@ -453,7 +453,7 @@ org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError:
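• This is presumably Solr’s maxBooleanClauses setting (1024 by default), which lives in each core’s solrconfig.xml; a quick way to check the current value in the DSpace source tree (the path is an assumption based on the stock layout):
$ grep maxBooleanClauses dspace/solr/search/conf/solrconfig.xml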
      • Generate a list of CIP subjects for Abenet:
      -
      dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.cip", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
      +
      dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.cip", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 127 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-28-cip-subjects.csv WITH CSV HEADER;
       COPY 77
       
      • Start looking over the IITA records from earlier this month (IITA_201907_Jan13) @@ -483,7 +483,7 @@ COPY 77
        • Normalize about 4,500 DOI, YouTube, and SlideShare links on CGSpace that are missing HTTPS or using old format:
        -
        UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
        +
        UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://www.doi.org%';
         UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://doi.org%';
         UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'http://dx.doi.org%';
         UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE resource_type_id = 2 AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
        @@ -492,24 +492,24 @@ UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'http://www.sli
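• Afterwards, a quick count should confirm that no http:// DOI variants remain, for example:
$ psql dspace -c "SELECT COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value LIKE 'http://%doi.org%';"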
         
        • I exported a list of all of our ISSNs with item IDs so that I could fix them in OpenRefine and submit them with multi-value separators to DSpace metadata import:
        -
        dspace=# \COPY (SELECT resource_id as "id", text_value as "dc.identifier.issn" FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
        +
        dspace=# \COPY (SELECT resource_id as "id", text_value as "dc.identifier.issn" FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21) to /tmp/2020-01-29-issn.csv WITH CSV HEADER;
         COPY 23339
         
• Then, after spending two hours correcting 1,000 ISSNs, I realized that I need to normalize the text_lang fields in the database first, or else these will all look like changes due to the difference between “en_US” and NULL, etc. (for both ISSN and ISBN):
        -
        dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
        +
        dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id = 2 AND metadata_field_id IN (20,21);
         UPDATE 30454
         
        • Then I realized that my initial PostgreSQL query wasn’t so genius because if a field already has multiple values it will appear on separate lines with the same ID, so when dspace metadata-import sees it, the change will be removed and added, or added and removed, depending on the order it is seen!
        • A better course of action is to select the distinct ones and then correct them using fix-metadata-values.py
        -
        dspace=# \COPY (SELECT DISTINCT text_value as "dc.identifier.issn[en_US]", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
        +
        dspace=# \COPY (SELECT DISTINCT text_value as "dc.identifier.issn[en_US]", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 21 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-29-issn-distinct.csv WITH CSV HEADER;
         COPY 2900
         
        • I re-applied all my corrections, filtering out things like multi-value separators and values that are actually ISBNs so I can fix them later
        • Then I applied 181 fixes for ISSNs using fix-metadata-values.py on DSpace Test and CGSpace (after testing locally):
        -
        $ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
        +
        $ ./fix-metadata-values.py -i /tmp/2020-01-29-ISSNs-Distinct.csv -db dspace -u dspace -p 'fuuu' -f 'dc.identifier.issn[en_US]' -m 21 -t correct -d
         

        2020-01-30

        • About to start working on the DSpace 6 port and I’m looking at commits that are in the not-yet-tagged DSpace 6.4: diff --git a/docs/2020-02/index.html b/docs/2020-02/index.html index 1047775e6..df5696464 100644 --- a/docs/2020-02/index.html +++ b/docs/2020-02/index.html @@ -38,7 +38,7 @@ The code finally builds and runs with a fresh install "/> - + @@ -138,11 +138,11 @@ The code finally builds and runs with a fresh install
• Now we don’t specify the build environment because site modifications are in local.cfg, so we just build like this:
          -
          $ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
          +
          $ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
           
• And it seems that we need to enable the pgcrypto extension now (used for UUIDs):
          -
          $ psql -h localhost -U postgres dspace63
          +
          $ psql -h localhost -U postgres dspace63
           dspace63=# CREATE EXTENSION pgcrypto;
           CREATE EXTENSION pgcrypto;
           
            @@ -153,11 +153,11 @@ CREATE EXTENSION pgcrypto;
        -
        dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
        +
        dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
         
        • Then I ran dspace database migrate and got an error:
        -
        $ ~/dspace63/bin/dspace database migrate
        +
        $ ~/dspace63/bin/dspace database migrate
         
         Database URL: jdbc:postgresql://localhost:5432/dspace63?ApplicationName=dspaceCli
         Migrating database to latest version... (Check dspace logs for details)
        @@ -225,7 +225,7 @@ Caused by: org.postgresql.util.PSQLException: ERROR: cannot drop table metadatav
         
• A thread on the dspace-tech mailing list regarding this migration mentioned that the poster’s database had some views created that were using the resource_id column
      • Our database had the same issue, where the eperson_metadata view was created by something (Atmire module?) but has no references in the vanilla DSpace code, so I dropped it and tried the migration again:
      -
      dspace63=# DROP VIEW eperson_metadata;
      +
      dspace63=# DROP VIEW eperson_metadata;
       DROP VIEW
       
• After that the migration was successful, and DSpace starts up and begins indexing @@ -252,7 +252,7 @@ DROP VIEW
      • There are lots of errors in the DSpace log, which might explain some of the issues with recent submissions / Solr:
      -
      2020-02-03 10:27:14,485 ERROR org.dspace.browse.ItemCountDAOSolr @ caught exception: 
      +
      2020-02-03 10:27:14,485 ERROR org.dspace.browse.ItemCountDAOSolr @ caught exception: 
       org.dspace.discovery.SearchServiceException: Invalid UUID string: 1
       2020-02-03 13:20:20,475 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
       org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
      @@ -260,11 +260,11 @@ org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
       
    • If I look in Solr’s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now…
    • I dropped all the documents in the search core:
    -
    $ http --print b 'http://localhost:8080/solr/search/update?stream.body=<delete><query>*:*</query></delete>&commit=true'
    +
    $ http --print b 'http://localhost:8080/solr/search/update?stream.body=<delete><query>*:*</query></delete>&commit=true'
     
    • Still didn’t work, so I’m going to try a clean database import and migration:
    -
    $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
    +
    $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
     $ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
     $ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost dspace_2020-01-27.backup
     $ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
    @@ -301,7 +301,7 @@ $ ~/dspace63/bin/dspace database migrate
     
     
     
    -
    $ git checkout -b 6_x-dev64 6_x-dev
    +
    $ git checkout -b 6_x-dev64 6_x-dev
     $ git rebase -i upstream/dspace-6_x
     
    • I finally understand why our themes show all the “Browse by” buttons on community and collection pages in DSpace 6.x @@ -321,7 +321,7 @@ $ git rebase -i upstream/dspace-6_x
    • UptimeRobot told me that AReS Explorer crashed last night, so I logged into it, ran all updates, and rebooted it
    • Testing Discovery indexing speed on my local DSpace 6.3:
    -
    $ time schedtool -D -e ~/dspace63/bin/dspace index-discovery -b
    +
    $ time schedtool -D -e ~/dspace63/bin/dspace index-discovery -b
     schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  3771.78s user 93.63s system 41% cpu 2:34:19.53 total
     schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  3360.28s user 82.63s system 38% cpu 2:30:22.07 total
     schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  4678.72s user 138.87s system 42% cpu 3:08:35.72 total
    @@ -329,7 +329,7 @@ schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  3334.19s user 86.54s s
     
    • DSpace 5.8 was taking about 1 hour (or less on this laptop), so this is 2-3 times longer!
    -
    $ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b
    +
    $ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b
     schedtool -D -e ~/dspace/bin/dspace index-discovery -b  299.53s user 69.67s system 20% cpu 30:34.47 total
     schedtool -D -e ~/dspace/bin/dspace index-discovery -b  270.31s user 69.88s system 19% cpu 29:01.38 total
     
      @@ -360,7 +360,7 @@ schedtool -D -e ~/dspace/bin/dspace index-discovery -b 270.31s user 69.88s syst
    • I sent a mail to the dspace-tech mailing list asking about slow Discovery indexing speed in DSpace 6
    • I destroyed my PostgreSQL 9.6 containers and re-created them using PostgreSQL 10 to see if there are any speedups with DSpace 6.x:
    -
    $ podman pull postgres:10-alpine
    +
    $ podman pull postgres:10-alpine
     $ podman run --name dspacedb10 -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:10-alpine
     $ createuser -h localhost -U postgres --pwprompt dspacetest
     $ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
    @@ -379,29 +379,29 @@ dspace63=# \q
     
    • I purged ~33,000 hits from the “Jersey/2.6” bot in CGSpace’s statistics using my check-spider-hits.sh script:
    -
    $ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
    +
    $ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
     $ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s "statistics-${year}" -u http://localhost:8081/solr; done
     
• I noticed another user agent in the logs that we should add to the list:
    -
    ReactorNetty/0.9.2.RELEASE
    +
    ReactorNetty/0.9.2.RELEASE
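• Before adding it, a rough count of its existing hits can be had with a regex query on the userAgent field, something like:
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/ReactorNetty\/[0-9].*/&rows=0"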
     
    -
    $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
    +
    $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
     $ ls -lh /tmp/statistics-2019-01.json
     -rw-rw-r-- 1 aorth aorth 3.7G Feb  6 09:26 /tmp/statistics-2019-01.json
     
    • Then I tested importing this by creating a new core in my development environment:
    -
    $ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace/solr/statistics&dataDir=/home/aorth/dspace/solr/statistics-2019/data'
    +
    $ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace/solr/statistics&dataDir=/home/aorth/dspace/solr/statistics-2019/data'
     $ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Downloads/statistics-2019-01.json -k uid
     
    • This imports the records into the core, but DSpace can’t see them, and when I restart Tomcat the core is not seen by Solr…
    • I got the core to load by adding it to dspace/solr/solr.xml manually, ie:
    -
      <cores adminPath="/admin/cores">
    +
      <cores adminPath="/admin/cores">
       ...
         <core name="statistics" instanceDir="statistics" />
         <core name="statistics-2019" instanceDir="statistics">
    @@ -415,11 +415,11 @@ $ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Download
     
  • Just for fun I tried to load these stats into a Solr 7.7.2 instance using the DSpace 7 solr config:
  • First, create a Solr statistics core using the DSpace 7 config:
  • -
    $ ./bin/solr create_core -c statistics -d ~/src/git/DSpace/dspace/solr/statistics/conf -p 8983
    +
    $ ./bin/solr create_core -c statistics -d ~/src/git/DSpace/dspace/solr/statistics/conf -p 8983
     
    • Then try to import the stats, skipping a shitload of fields that are apparently added to our Solr statistics by Atmire modules:
    -
    $ ./run.sh -s http://localhost:8983/solr/statistics -a import -o ~/Downloads/statistics-2019-01.json -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
    +
    $ ./run.sh -s http://localhost:8983/solr/statistics -a import -o ~/Downloads/statistics-2019-01.json -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
     
    • OK that imported! I wonder if it works… maybe I’ll try another day
    @@ -433,7 +433,7 @@ $ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Download -
    $ cd ~/src/git/perf-map-agent
    +
    $ cd ~/src/git/perf-map-agent
     $ cmake     .
     $ make
     $ ./bin/create-links-in ~/.local/bin
    @@ -467,7 +467,7 @@ $ perf-java-flames 11359
     
    • This weekend I did a lot more testing of indexing performance with our DSpace 5.8 branch, vanilla DSpace 5.10, and vanilla DSpace 6.4-SNAPSHOT:
    -
    # CGSpace 5.8
    +
    # CGSpace 5.8
     schedtool -D -e ~/dspace/bin/dspace index-discovery -b  385.72s user 131.16s system 19% cpu 43:21.18 total
     schedtool -D -e ~/dspace/bin/dspace index-discovery -b  382.95s user 127.31s system 20% cpu 42:10.07 total
     schedtool -D -e ~/dspace/bin/dspace index-discovery -b  368.56s user 143.97s system 20% cpu 42:22.66 total
    @@ -483,7 +483,7 @@ schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  5112.96s user 127.80s
     
    • I generated better flame graphs for the DSpace indexing process by using perf-record-stack and filtering out the java process:
    -
    $ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
    +
    $ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
     $ export PERF_RECORD_SECONDS=60
     $ export JAVA_OPTS="-XX:+PreserveFramePointer"
     $ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &
    @@ -525,14 +525,14 @@ $ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' |
     
    • Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-02-11-combined-orcids.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-02-11-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2020-02-11-combined-orcids.txt -o /tmp/2020-02-11-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
     
    • Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using fix-metadata-values.py:
    -
    $ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
    +
    $ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
     
    • On a hunch I decided to try to add these ORCID iDs to existing items that might not have them yet
        @@ -540,7 +540,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
    -
    dc.contributor.author,cg.creator.id
    +
    dc.contributor.author,cg.creator.id
     "Staver, Charles",charles staver: 0000-0002-4532-6077
     "Staver, C.",charles staver: 0000-0002-4532-6077
     "Fungo, R.",Robert Fungo: 0000-0002-4264-6905
    @@ -556,7 +556,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
     
    • Running the add-orcid-identifiers-csv.py script I added 144 ORCID iDs to items on CGSpace!
    -
    $ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
    +
    $ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
     
    • Minor updates to all Python utility scripts in the CGSpace git repository
    • Update the spider agent patterns in CGSpace 5_x-prod branch from the latest COUNTER-Robots project @@ -575,7 +575,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
    • Peter asked me to update John McIntire’s name format on CGSpace so I ran the following PostgreSQL query:
    -
    dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
    +
    dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
     UPDATE 26
     

    2020-02-17

      @@ -607,12 +607,12 @@ UPDATE 26
      • I see a new spider in the nginx logs on CGSpace:
      -
      Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)
      +
      Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)
       
      • I think this should be covered by the COUNTER-Robots patterns for the statistics at least…
      • I see some IP (186.32.217.255) in Costa Rica making requests like a bot with the following user agent:
      -
      Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
      +
      Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
       
      • Another IP address (31.6.77.23) in the UK making a few hundred requests without a user agent
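• A rough way to count those UA-less requests in the current nginx log (assuming the combined log format, where the user agent is the sixth double-quoted field):
# grep 31.6.77.23 /var/log/nginx/access.log | awk -F'"' '$6 == "-"' | wc -l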
      • I will add the IP addresses to the nginx badbots list
      • @@ -622,7 +622,7 @@ UPDATE 26
    -
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=dns:/squeeze3.bronco.co.uk./&rows=0"
    +
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=dns:/squeeze3.bronco.co.uk./&rows=0"
     <?xml version="1.0" encoding="UTF-8"?>
     <response>
     <lst name="responseHeader"><int name="status">0</int><int name="QTime">4</int><lst name="params"><str name="q">dns:/squeeze3.bronco.co.uk./</str><str name="rows">0</str></lst></lst><result name="response" numFound="86044" start="0"></result>
    @@ -641,7 +641,7 @@ UPDATE 26
     
     
  • I will purge them from each core one by one, ie:
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
     $ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
     
    • Deploy latest Tomcat and PostgreSQL JDBC driver changes on CGSpace (linode18)
    • @@ -654,12 +654,12 @@ $ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=tru
    • I ran the dspace cleanup -v process on CGSpace and got an error:
    -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(183996) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    # su - postgres
    +
    # su - postgres
     $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
     UPDATE 1
     
      @@ -671,7 +671,7 @@ UPDATE 1
    -
    $ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
    +
    $ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
     
    • For some reason the Atmire Content and Usage Analysis (CUA) module’s Usage Statistics is drawing blank graphs
        @@ -679,7 +679,7 @@ UPDATE 1
    -
    2020-02-23 11:28:13,696 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
    +
    2020-02-23 11:28:13,696 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
     org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoClassDefFoundError: Could not
      initialize class org.jfree.chart.JFreeChart
     
      @@ -694,11 +694,11 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
    • I copied the jfreechart-1.0.5.jar file to the Tomcat lib folder and then there was a different error when I loaded Atmire CUA:
    -
    2020-02-23 16:25:10,841 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!  org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
    +
    2020-02-23 16:25:10,841 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!  org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
     
    • Some search results suggested commenting out the following line in /etc/java-8-openjdk/accessibility.properties:
    -
    assistive_technologies=org.GNOME.Accessibility.AtkWrapper
    +
    assistive_technologies=org.GNOME.Accessibility.AtkWrapper
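• One non-interactive way to comment that line out (a sketch; worth backing up the file first):
# sed -i 's/^assistive_technologies=/#assistive_technologies=/' /etc/java-8-openjdk/accessibility.properties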
     
    • And removing the extra jfreechart library and restarting Tomcat I was able to load the usage statistics graph on DSpace Test…
        @@ -708,7 +708,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
    -
    # grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
    +
    # grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
     dspace.log.2020-01-12:4
     dspace.log.2020-01-13:66
     dspace.log.2020-01-14:4
    @@ -724,7 +724,7 @@ dspace.log.2020-01-21:4
     
  • I deployed the fix on CGSpace (linode18) and I was able to see the graphs in the Atmire CUA Usage Statistics…
• On an unrelated note, there is something weird going on: I see millions of hits from IP 34.218.226.147 in Solr statistics, but if I remember correctly that IP belongs to CodeObia’s AReS explorer, which should only be using REST and therefore should not generate any Solr statistics…?
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2018/select" -d "q=ip:34.218.226.147&rows=0"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2018/select" -d "q=ip:34.218.226.147&rows=0"
     <?xml version="1.0" encoding="UTF-8"?>
     <response>
     <lst name="responseHeader"><int name="status">0</int><int name="QTime">811</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="rows">0</str></lst></lst><result name="response" numFound="5536097" start="0"></result>
    @@ -732,7 +732,7 @@ dspace.log.2020-01-21:4
     
    • And there are apparently two million from last month (2020-01):
    -
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=ip:34.218.226.147&fq=dateYearMonth:2020-01&rows=0"
    +
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=ip:34.218.226.147&fq=dateYearMonth:2020-01&rows=0"
     <?xml version="1.0" encoding="UTF-8"?>
     <response>
     <lst name="responseHeader"><int name="status">0</int><int name="QTime">248</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="fq">dateYearMonth:2020-01</str><str name="rows">0</str></lst></lst><result name="response" numFound="2173455" start="0"></result>
    @@ -740,7 +740,7 @@ dspace.log.2020-01-21:4
     
    • But when I look at the nginx access logs for the past month or so I only see 84,000, all of which are on /rest and none of which are to XMLUI:
    -
    # zcat /var/log/nginx/*.log.*.gz | grep -c 34.218.226.147
    +
    # zcat /var/log/nginx/*.log.*.gz | grep -c 34.218.226.147
     84322
     # zcat /var/log/nginx/*.log.*.gz | grep 34.218.226.147 | grep -c '/rest'
     84322
    @@ -758,7 +758,7 @@ dspace.log.2020-01-21:4
     
     
  • Anyways, I faceted by IP in 2020-01 and see:
  • -
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-01&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-01&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
     ...
             "172.104.229.92",2686876,
             "34.218.226.147",2173455,
    @@ -769,19 +769,19 @@ dspace.log.2020-01-21:4
     
  • Surprise surprise, the top two IPs are from AReS servers… wtf.
• The next three are from the Online.net ISP in France, and they are all using this weird user agent and making tens of thousands of requests to Discovery:
  • -
    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
    +
    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
     
    • And all the same three are already inflating the statistics for 2020-02… hmmm.
    • I need to see why AReS harvesting is inflating the stats, as it should only be making REST requests…
    • Shiiiiit, I see 84,000 requests from the AReS IP today alone:
    -
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&rows=0&wt=json&indent=true'
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&rows=0&wt=json&indent=true'
     ...
       "response":{"numFound":84594,"start":0,"docs":[]
     
    • Fuck! And of course the ILRI websites doing their daily REST harvesting are causing issues too, from today alone:
    -
            "2a01:7e00::f03c:91ff:fe9a:3a37",35512,
    +
            "2a01:7e00::f03c:91ff:fe9a:3a37",35512,
             "2a01:7e00::f03c:91ff:fe18:7396",26155,
     
    • I need to try to make some requests for these URLs and observe if they make a statistics hit: @@ -793,7 +793,7 @@ dspace.log.2020-01-21:4
    • Those are the requests AReS and ILRI servers are making… nearly 150,000 per day!
    • Well that settles it!
    -
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
    +
    $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
       "response":{"numFound":12,"start":0,"docs":[
     $ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450'
     $ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
    @@ -817,12 +817,12 @@ $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+s
     
  • I tried to add the IPs to our nginx IP bot mapping but it doesn’t seem to work… WTF, why is everything broken?!
  • Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far by themselves:
  • -
    $ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&rows=0&wt=json&indent=true' | grep numFound
    +
    $ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&rows=0&wt=json&indent=true' | grep numFound
       "response":{"numFound":42395486,"start":0,"docs":[]
     
    • I modified my check-spider-hits.sh script to create a version that works with IPs and purged 47 million stats from Solr on CGSpace:
    -
    $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
    +
    $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
     Purging 22809216 hits from 34.218.226.147 in statistics
     Purging 19586270 hits from 172.104.229.92 in statistics
     Purging 111137 hits from 2a01:7e00::f03c:91ff:fe9a:3a37 in statistics
    @@ -856,11 +856,11 @@ Total number of bot hits purged: 5535399
     
     
     
    -
    add_header X-debug-message "ua is $ua" always;
    +
    add_header X-debug-message "ua is $ua" always;
     
    • Then in the HTTP response you see:
    -
    X-debug-message: ua is bot
    +
    X-debug-message: ua is bot
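• That header is easy to check with curl from a client whose IP (or user agent) is in the mapping, for example:
$ curl -s -o /dev/null -D - https://dspacetest.cgiar.org/ | grep -i x-debug-message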
     
    • So the IP to bot mapping is working, phew.
    • More bad news, I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them! @@ -880,7 +880,7 @@ Total number of bot hits purged: 5535399
    • These IPs are all active in the REST API logs over the last few months and they account for thirty-four million more hits in the statistics!
    • I purged them from CGSpace:
    -
    $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
    +
    $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
     Purging 15 hits from 104.196.152.243 in statistics
     Purging 61064 hits from 35.237.175.180 in statistics
     Purging 1378 hits from 70.32.90.172 in statistics
    @@ -910,7 +910,7 @@ Total number of bot hits purged: 1752548
     
     
  • The client at 3.225.28.105 is using the following user agent:
  • -
    Apache-HttpClient/4.3.4 (java 1.5)
    +
    Apache-HttpClient/4.3.4 (java 1.5)
     
    • But I don’t see any hits for it in the statistics core for some reason
    • Looking more into the 2015 statistics I see some questionable IPs: @@ -925,7 +925,7 @@ Total number of bot hits purged: 1752548
• As for the IPs, I purged them using check-spider-ip-hits.sh:
    -
    $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
    +
    $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
     Purging 11478 hits from 95.110.154.135 in statistics
     Purging 1208 hits from 34.209.213.122 in statistics
     Purging 10 hits from 54.184.39.242 in statistics
    @@ -966,7 +966,7 @@ Total number of bot hits purged: 2228
     
    • Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn’t have a proper user agent and the only way to identify them was via DNS:
    -
    $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
     
    • Jesus, the more I keep looking, the more I see ridiculous stuff…
    • In 2019 there were a few hundred thousand requests from CodeObia on Orange Jordan network… @@ -982,7 +982,7 @@ Total number of bot hits purged: 2228
    • Also I see some IP in Greece making 130,000 requests with weird user agents: 143.233.242.130
    • I purged a bunch more from all cores:
    -
    $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p     
    +
    $ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p     
     Purging 109965 hits from 45.5.186.2 in statistics
     Purging 78648 hits from 79.173.222.114 in statistics
     Purging 49032 hits from 149.200.141.57 in statistics
    @@ -1024,7 +1024,7 @@ Total number of bot hits purged: 14110
     
  • Though looking in my REST logs for the last month I am second guessing my judgement on 45.5.186.2 because I see user agents like “Microsoft Office Word 2014”
  • Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:
  • -
    # zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
    +
    # zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
           1 Microsoft Office Word 2014
           1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
           1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
    @@ -1038,7 +1038,7 @@ Total number of bot hits purged: 14110
     
    • I see lots of requests coming from the following user agents:
    -
    "Apache-HttpClient/4.5.7 (Java/11.0.3)"
    +
    "Apache-HttpClient/4.5.7 (Java/11.0.3)"
     "Apache-HttpClient/4.5.7 (Java/11.0.2)"
     "LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
     "EventMachine HttpClient"
    @@ -1054,7 +1054,7 @@ Total number of bot hits purged: 14110
     
     
  • More weird user agents in 2019:
  • -
    ecolink (+https://search.ecointernet.org/)
    +
    ecolink (+https://search.ecointernet.org/)
     ecoweb (+https://search.ecointernet.org/)
     EcoInternet http://www.ecointernet.org/
     EcoInternet http://ecointernet.org/
    @@ -1062,12 +1062,12 @@ EcoInternet http://ecointernet.org/
     
    • And what’s the 950,000 hits from Online.net IPs with the following user agent:
    -
    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
    +
    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
     
    • Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I’m purging them all
    • I looked deeper in the Solr statistics and found a bunch more weird user agents:
    -
    LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
    +
    LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
     EventMachine HttpClient
     ecolink (+https://search.ecointernet.org/)
     ecoweb (+https://search.ecointernet.org/)
    @@ -1098,13 +1098,13 @@ HTTPie/1.0.2
     
     
     
    -
    Link.?Check
    +
    Link.?Check
     Http.?Client
     ecointernet
     
    • That removes another 500,000 or so:
    -
    $ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
    +
    $ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
     Purging 253 hits from Jersey\/[0-9] in statistics
     Purging 7302 hits from Link.?Check in statistics
     Purging 85574 hits from Http.?Client in statistics
    @@ -1171,12 +1171,12 @@ Total number of bot hits purged: 159
     
     
  • I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:
  • -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
     $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601)
     
    • Interestingly I saw this in the Solr log:
    -
    2020-02-26 08:55:47,433 INFO  org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
    +
    2020-02-26 08:55:47,433 INFO  org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
     2020-02-26 08:55:47,511 INFO  org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&name=statistics-2019&action=CREATE&instanceDir=statistics&wt=javabin&version=2} status=0 QTime=590
     
    • The process has been going for several hours now and I suspect it will fail eventually @@ -1186,7 +1186,7 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-ut
    • Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:
    -
    $ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
    +
    $ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
     
    • After that the statistics-2019 core was immediately available in the Solr UI, but after restarting Tomcat it was gone
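• A quick way to see which cores Solr actually has loaded after a restart is the cores admin API, for example:
$ curl -s 'http://localhost:8080/solr/admin/cores?action=STATUS&wt=json' | grep -o '"name":"[^"]*"'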
        @@ -1195,11 +1195,11 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-ut
      • First export a small slice of 2019 stats from the main CGSpace statistics core, skipping Atmire schema additions:
      -
      $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
      +
      $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
       
      • Then import into my local statistics core:
      -
      $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
      +
      $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
       $ ~/dspace63/bin/dspace stats-util -s
       Moving: 21993 into core statistics-2019
       
    <meta content="Thu hoạch v&agrave; bảo quản c&agrave; ph&ecirc; ch&egrave; đ&uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
     <meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />
     
    • DS-4397 controlled vocabulary loading speedup
    • I added some debugging to the Solr core loading in DSpace 6.4-SNAPSHOT (SolrLoggerServiceImpl.java) and I see this when DSpace starts up now:
    2020-02-27 12:26:35,695 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ Alan Ping of Solr Core [statistics-2019] Failed with [org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException].  New Core Will be Created
     
    • When I check Solr I see the statistics-2019 core loaded (from stats-util -s yesterday, not manually created)

2020-03

    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
     $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
     

    2020-03-03

    $ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
     
    • But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it…
    • Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs
    • I want to try to consolidate our yearly Solr statistics cores back into one statistics core using the solr-import-export-json tool
    • I will try it on DSpace test, doing one year at a time:
    $ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
     $ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid
     $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2010*</query></delete>"
     $ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid
     
    $ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
     
    • Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)
    # apt install postgresql-10 postgresql-contrib-10
     # systemctl stop tomcat7
     # pg_ctlcluster 9.6 main stop
     # tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
     
    Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
     
    • It seems to only be a problem in the last week:
    # zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
     /var/log/nginx/rest.log.1:0
     /var/log/nginx/rest.log.2:0
     /var/log/nginx/rest.log.3:0
     
  • In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean
  • I will purge them from Solr statistics:
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
     
    • Another user agent that seems to be a bot is:
    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
     
    • In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx’s logs I see it belongs to three IPs on Online.net in France:
    # zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
       63090 163.172.68.99
      183428 163.172.70.248
      147608 163.172.71.24
     
    • It is making 10,000 to 40,000 requests to XMLUI per day…
    # zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
     /var/log/nginx/access.log.30.gz:18687
     /var/log/nginx/access.log.31.gz:28936
     /var/log/nginx/access.log.32.gz:36402
     
    • I will purge those hits too!
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
     
    • Shit, and something happened and a few thousand hits from user agents with “Bot” in their user agent got through
    $ ./check-spider-hits.sh -f /tmp/bots -d -p
     (DEBUG) Using spiders pattern file: /tmp/bots
     (DEBUG) Checking for hits from spider: Citoid
     Purging 11 hits from Citoid in statistics
     
    dspace=# SELECT DISTINCT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND resource_id in (111295,111294,111293,111292,111291,111290,111288,111286,111285,111284,111283,111282,111281,111280,111279,111278,111277,111276,111275,111274,111273,111272,111271,111270,111269,111268,111267,111266,111265,111264,111263,111262,111261,111260,111259,111258,111257,111256,111255,111254,111253,111252,111251,111250,111249,111248,111247,111246,111245,111244,111243,111242,111241,111240,111238,111237,111236,111235,111234,111233,111232,111231,111230,111229,111228,111227,111226,111225,111224,111223,111222,111221,111220,111219,111218,111217,111216,111215,111214,111213,111212,111211,111209,111208,111207,111206,111205,111204,111203,111202,111201,111200,111199,111198,111197,111196,111195,111194,111193,111192,111191,111190,111189,111188,111187,111186,111185,111184,111183,111182,111181,111180,111179,111178,111177,111176,111175,111174,111173,111172,111171,111170,111169,111168,111299,111298,111297,111296,111167,111166,111165,111164,111163,111162,111161,111160,111159,111158,111157,111156,111155,111154,111153,111152,111151,111150,111149,111148,111147,111146,111145,111144,111143,111142,111141,111140,111139,111138,111137,111136,111135,111134,111133,111132,111131,111129,111128,111127,111126,111125) GROUP BY text_lang ORDER BY count;
     
    • Then I exported the metadata from DSpace Test and imported it into OpenRefine
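• Roughly, that round trip looks like this (a sketch; the handle, file names, and eperson email are placeholders, and the import only happens after cleaning the CSV in OpenRefine):

```console
$ dspace metadata-export -i 10568/1 -f /tmp/dspace-test-metadata.csv
$ dspace metadata-import -f /tmp/dspace-test-metadata-cleaned.csv -e blah@blah.com
```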
      • I exported a new list of affiliations from the database, added line numbers with csvcut, and then validated them in OpenRefine using reconcile-csv:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;
       dspace=# \q
       $ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e 's/^line_number/id/' -e 's/text_value/name/' > /tmp/affiliations.csv
       $ lein run /tmp/affiliations.csv name id
       
• Update Tomcat to version 7.0.103 in the Ansible infrastructure playbooks and deploy on DSpace Test (linode26)
    • Maria sent me a few new ORCID identifiers from Bioversity so I combined them with our existing ones, filtered the unique ones, and then resolved their names using my resolve-orcids.py script:
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-03-26-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2020-03-26-combined-orcids.txt -o /tmp/2020-03-26-combined-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
     
    • I checked the database for likely matches to the author name and then created a CSV with the author names and ORCID iDs:
    dc.contributor.author,cg.creator.id
     "King, Brian","Brian King: 0000-0002-7056-9214"
     "Ortiz-Crespo, Berta","Berta Ortiz-Crespo: 0000-0002-6664-0815"
     "Ekesa, Beatrice","Beatrice Ekesa: 0000-0002-2630-258X"
     
    • Running the add-orcid-identifiers-csv.py script I added 32 ORCID iDs to items on CGSpace!
    $ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
     
    • Udana from IWMI asked about some items that are missing Altmetric donuts on CGSpace
        • Add two more Bioversity ORCID iDs to CGSpace and then tag ~70 of the authors' existing publications in the database using this CSV with my add-orcid-identifiers-csv.py script:
        dc.contributor.author,cg.creator.id
         "Snook, L.K.","Laura Snook: 0000-0002-9168-1301"
         "Snook, L.","Laura Snook: 0000-0002-9168-1301"
         "Zheng, S.J.","Sijun Zheng: 0000-0003-1550-3738"

2020-04

• The third item now has a donut with score 1 since I tweeted it last week
• On the same note, the one item Abenet pointed out last week now has a donut with score of 104 after I tweeted it last week

    $ psql -h localhost -U postgres dspace -c "DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value LIKE '%Ballantyne%';"
     DELETE 97
     $ ./add-orcid-identifiers-csv.py -i 2020-04-07-peter-orcids.csv -db dspace -u dspace -p 'fuuu' -d
     
    • I used this CSV with the script (all records with his name have the name standardized like this):
    dc.contributor.author,cg.creator.id
     "Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
     
    • Then I tried another way, to identify all duplicate ORCID identifiers for a given resource ID and group them so I can see if count is greater than 1:
    dspace=# \COPY (SELECT DISTINCT(resource_id, text_value) as distinct_orcid, COUNT(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 240 GROUP BY distinct_orcid ORDER BY count DESC) TO /tmp/2020-04-07-duplicate-orcids.csv WITH CSV HEADER;
     COPY 15209
     
    • Of those, about nine authors had duplicate ORCID identifiers over about thirty records, so I created a CSV with all their name variations and ORCID identifiers:
    dc.contributor.author,cg.creator.id
     "Ballantyne, Peter G.","Peter G. Ballantyne: 0000-0001-9346-2893"
     "Ramirez-Villegas, Julian","Julian Ramirez-Villegas: 0000-0002-8044-583X"
     "Villegas-Ramirez, J","Julian Ramirez-Villegas: 0000-0002-8044-583X"
     
    • Then I deleted all their existing ORCID identifier records:
    dspace=# DELETE FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240 AND text_value SIMILAR TO '%(0000-0001-6543-0798|0000-0001-9346-2893|0000-0002-6950-4018|0000-0002-7583-3811|0000-0002-8044-583X|0000-0002-8599-7895|0000-0003-0934-1218|0000-0003-2765-7101)%';
     DELETE 994
     
• And then I added them again using the add-orcid-identifiers-csv.py script:
    $ ./add-orcid-identifiers-csv.py -i 2020-04-07-fix-duplicate-orcids.csv -db dspace -u dspace -p 'fuuu' -d
     
    • I ran the fixes on DSpace Test and CGSpace as well
• I started testing the pull request sent by Atmire yesterday
    dspace63=# DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');
     dspace63=# CREATE EXTENSION pgcrypto;
     
• Then DSpace 6.3 started up OK and I was able to see some statistics in the Content and Usage Analysis (CUA) module, but not on community, collection, or item pages
    2020-04-12 16:34:33,363 ERROR com.atmire.dspace.app.xmlui.aspect.statistics.editorparts.DataTableTransformer @ java.lang.IllegalArgumentException: Invalid UUID string: 1
     
    • And I remembered I actually need to run the DSpace 6.4 Solr UUID migrations:
    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
     $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
     
    • Run system updates on DSpace Test (linode26) and reboot it
    • I realized that solr-upgrade-statistics-6x only processes 100,000 records by default so I think we actually need to finish running it for all legacy Solr records before asking Atmire why CUA statlets and detailed statistics aren’t working
    • For now I am just doing 250,000 records at a time on my local environment:
    $ export JAVA_OPTS="-Xmx2000m -Dfile.encoding=UTF-8"
     $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x -n 250000
     
• Despite running the migration for all of my local 1.5 million Solr records, I still see a few hundred thousand like -1 and 0-unmigrated
    /** DSpace site type */
     public static final int SITE = 5;
     
    • Even after deleting those documents and re-running solr-upgrade-statistics-6x I still get the UUID errors when using CUA and the statlets
• I have sent some feedback and questions to Atmire (including about the issue with glyphicons in the header trail)
    • In other news, my local Artifactory container stopped working for some reason so I re-created it and it seems some things have changed upstream (port 8082 for web UI?):
    $ podman rm artifactory
     $ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
     $ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
     $ podman start artifactory
     
    • A few days ago Peter asked me to update an author’s name on CGSpace and in the controlled vocabularies:
    dspace=# UPDATE metadatavalue SET text_value='Knight-Jones, Theodore J.D.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='Knight-Jones, T.J.D.';
     
    • I updated his existing records on CGSpace, changed the controlled lists, added his ORCID identifier to the controlled list, and tagged his thirty-nine items with the ORCID iD
• The new DSpace 6 stuff that Atmire sent modifies the Mirage 2’s pom.xml to copy each theme’s resulting node_modules to each theme after building and installing with ant update because they moved some packages from bower to npm and now reference them in page-structure.xsl
      • Looking into a high rate of outgoing bandwidth from yesterday on CGSpace (linode18):
      # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Apr/2020:0[6789]" | goaccess --log-format=COMBINED -
       
• One host in Russia (91.241.19.70) downloaded 23GiB over those few hours in the morning
      # grep -c 91.241.19.70 /var/log/nginx/access.log.1
       8900
       # grep 91.241.19.70 /var/log/nginx/access.log.1 | grep -c '10568/35187'
       8900
       
      • I thought the host might have been Yandex misbehaving, but its user agent is:
      Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_3; nl-nl) AppleWebKit/527  (KHTML, like Gecko) Version/3.1.1 Safari/525.20
       
      • I will purge that IP from the Solr statistics using my check-spider-ip-hits.sh script:
      $ ./check-spider-ip-hits.sh -d -f /tmp/ip -p
       (DEBUG) Using spider IPs file: /tmp/ip
       (DEBUG) Checking for hits from spider IP: 91.241.19.70
       Purging 8909 hits from 91.241.19.70 in statistics
Total number of bot hits purged: 8909
       
• While investigating that I noticed ORCID identifiers missing from a few authors’ names, so I added them with my add-orcid-identifiers-csv.py script:
      $ ./add-orcid-identifiers-csv.py -i 2020-04-20-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
       
      • The contents of 2020-04-20-add-orcids.csv was:
      dc.contributor.author,cg.creator.id
       "Schut, Marc","Marc Schut: 0000-0002-3361-4581"
       "Schut, M.","Marc Schut: 0000-0002-3361-4581"
       "Kamau, G.","Geoffrey Kamau: 0000-0002-6995-4801"
       
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
     $ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
    • I ran the dspace cleanup -v process on CGSpace and got an error:
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(184980) is still referenced from table "bundle".
     
    • The solution is, as always:
    $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
     UPDATE 1
     
• I spent some time working on the XMLUI themes in DSpace 6
    .breadcrumb > li + li:before {
       content: "/\00a0";
     }
     

    2020-04-27

  • My changes to DSpace XMLUI Mirage 2 build process mean that we don’t need Ruby gems at all anymore! We can completely build without them!
  • Trying to test the com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script but there is an error:
    Exception: org.apache.solr.search.SyntaxError: Cannot parse 'cua_version:${cua.version.number}': Encountered " "}" "} "" at line 1, column 32.
     Was expecting one of:
         "TO" ...
         <RANGE_QUOTED> ...
     
    • Seems something is wrong with the variable interpolation, and I see two configurations in the atmire-cua.cfg file:
    atmire-cua.cua.version.number=${cua.version.number}
     atmire-cua.version.number=${cua.version.number}
     
    • I sent a message to Atmire to check
    Record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f couldn't be processed
     com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: ee085cc0-0110-42c5-80b9-0fad4015ed9f, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)
Caused by: java.lang.NullPointerException
     
    $ grep ERROR dspace.log.2020-04-29 | cut -f 3- -d' ' | sort | uniq -c | sort -n
           1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL findByUnique Error -
           1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL find Error -
           1 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
     
    • Database connections do seem high:
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           6 dspaceCli
          88 dspaceWeb
     
    • Most of those are idle in transaction:
    $ psql -c 'select * from pg_stat_activity' | grep 'dspaceWeb' | grep -c "idle in transaction"
     67
     
    • I don’t see anything in the PostgreSQL or Tomcat logs suggesting anything is wrong… I think the solution to clear these idle connections is probably to just restart Tomcat
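• A sketch of what that would look like (tomcat7 is the unit we use on this host):

```console
# systemctl restart tomcat7
# psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
```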
    • I looked at the Solr stats for this month and see lots of suspicious IPs:
$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-04&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
     
             "88.99.115.53",23621, # Hetzner, using XMLUI and REST API with no user agent
             "104.154.216.0",11865,# Google cloud, scraping XMLUI with no user agent
     
• I need to start blocking requests without a user agent… (see the sketch after the purge commands below for estimating how common they are)
• I purged these IPs using my check-spider-ip-hits.sh script:
    $ for year in {2010..2019}; do ./check-spider-ip-hits.sh -f /tmp/ips -s statistics-$year -p; done
     $ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
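
• To get an idea of how many requests actually arrive without a user agent before I start blocking them in nginx, something like this should work (a sketch, assuming the default combined log format where the user agent is the sixth quoted field):

```console
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | awk -F'"' '$6 == "-"' | awk '{print $1}' | sort | uniq -c | sort -rn | head
```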
     
    • Then I added a few of them to the bot mapping in the nginx config because it appears they are regular harvesters since 2018
    • Looking through the Solr stats faceted by the userAgent field I see some interesting ones:
    $ curl 'http://localhost:8081/solr/statistics/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=userAgent'
     ...
     "Delphi 2009",50725,
     "OgScrper/1.0.0",12421,
     
  • I don’t know why, but my check-spider-hits.sh script doesn’t seem to be handling the user agents with spaces properly so I will delete those manually after
  • First delete the ones without spaces, creating a temp file in /tmp/agents containing the patterns:
    $ for year in {2010..2019}; do ./check-spider-hits.sh -f /tmp/agents -s statistics-$year -p; done
     $ ./check-spider-hits.sh -f /tmp/agents -s statistics -p
     
    • That’s about 300,000 hits purged…
    • Then remove the ones with spaces manually, checking the query syntax first, then deleting in yearly cores and the statistics core:
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Delphi 2009/&rows=0"
     ...
     <lst name="responseHeader"><int name="status">0</int><int name="QTime">52</int><lst name="params"><str name="q">userAgent:/Delphi 2009/</str><str name="rows">0</str></lst></lst><result name="response" numFound="38760" start="0"></result>
     $ for year in {2010..2019}; do curl -s "http://localhost:8081/solr/statistics-$year/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Delphi 2009"</query></delete>'; done
     
    # mv /etc/letsencrypt /etc/letsencrypt.bak
     # /opt/certbot-auto certonly --standalone --email fu@m.com -d dspacetest.cgiar.org --standalone --pre-hook "/bin/systemctl stop nginx" --post-hook "/bin/systemctl start nginx"
     # /opt/certbot-auto revoke --cert-path /etc/letsencrypt.bak/live/dspacetest.cgiar.org/cert.pem
     # rm -rf /etc/letsencrypt.bak
     
    • But I don’t see a lot of connections in PostgreSQL itself:
    $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
           5 dspaceApi
           6 dspaceCli
          14 dspaceWeb
     
    • The PostgreSQL log shows a lot of errors about deadlocks and queries waiting on other processes…
    ERROR:  deadlock detected
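
• A rough way to quantify the deadlocks and lock waits (a sketch; the exact log path depends on the PostgreSQL version and Ubuntu packaging):

```console
# grep -c 'deadlock detected' /var/log/postgresql/postgresql-9.6-main.log
# grep -c 'still waiting for' /var/log/postgresql/postgresql-9.6-main.log
```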
     

2020-05

• I see that CGSpace (linode18) is still using PostgreSQL JDBC driver version 42.2

    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "07/May/2020:(01|03|04)" | goaccess --log-format=COMBINED -
     
    • The two main IPs making requests around then are 188.134.31.88 and 212.34.8.188
    $ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
     Purging 171641 hits from 212.34.8.188 in statistics
     Purging 20691 hits from 188.134.31.88 in statistics
     
Total number of bot hits purged: 192332
     
    $ cat 2020-05-11-add-orcids.csv
     dc.contributor.author,cg.creator.id
     "Lutakome, P.","Pius Lutakome: 0000-0002-0804-2649"
     "Lutakome, Pius","Pius Lutakome: 0000-0002-0804-2649"
     
    $ cat 2020-05-19-add-orcids.csv
     dc.contributor.author,cg.creator.id
     "Bahta, Sirak T.","Sirak Bahta: 0000-0002-5728-2489"
     $ ./add-orcid-identifiers-csv.py -i 2020-05-19-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d
     
    $ cat 2020-05-25-add-orcids.csv
     dc.contributor.author,cg.creator.id
     "Díaz, Manuel F.","Manuel Francisco Diaz Baca: 0000-0001-8996-5092"
     "Díaz, Manuel Francisco","Manuel Francisco Diaz Baca: 0000-0001-8996-5092"
     
    # cat /var/log/nginx/*.log.1 | grep -E "29/May/2020:(02|03|04|05)" | goaccess --log-format=COMBINED -
     
    • The top is 172.104.229.92, which is the AReS harvester (still not using a user agent, but it’s tagged as a bot in the nginx mapping)
    • Second is 188.134.31.88, which is a Russian host that we also saw in the last few weeks, using a browser user agent and hitting the XMLUI (but it is tagged as a bot in nginx as well)
    $ sudo su - postgres
     $ dropdb dspacetest
     $ createdb -O dspacetest --encoding=UNICODE dspacetest
     $ psql dspacetest -c 'alter user dspacetest superuser;'
     
    • Now switch to the DSpace 6.x branch and start a build:
    $ chrt -i 0 ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false package
     ...
     [ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:6.3: Failed to collect dependencies at com.atmire:atmire-listings-and-reports-api:jar:6.x-2.10.8-0-SNAPSHOT: Failed to read artifact descriptor for com.atmire:atmire-listings-and-reports-api:jar:6.x-2.10.8-0-SNAPSHOT: Could not transfer artifact com.atmire:atmire-listings-and-reports-api:pom:6.x-2.10.8-0-SNAPSHOT from/to atmire.com-snapshots (https://atmire.com/artifactory/atmire.com-snapshots): Not authorized , ReasonPhrase:Unauthorized. -> [Help 1]
     
    • Great! I will have to send Atmire a note about this… but for now I can sync over my local ~/.m2 directory and the build completes
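• Syncing the Atmire artifacts over is basically this (a sketch; the destination host and path are examples):

```console
$ rsync -av ~/.m2/repository/com/atmire/ dspacetest.cgiar.org:~/.m2/repository/com/atmire/
```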
    • After the Maven build completed successfully I installed the updated code with Ant (make sure to delete the old spring directory):
    $ cd dspace/target/dspace-installer
     $ rm -rf /blah/dspacetest/config/spring
     $ ant update
     
    • I had a mistake in my Solr internal URL parameter so DSpace couldn’t find it, but once I fixed that DSpace starts up OK!
    • Once the initial Discovery reindexing was completed (after three hours or so!) I started the Solr statistics UUID migration:
    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
     $ dspace solr-upgrade-statistics-6x -i statistics -n 250000
     $ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
     $ dspace solr-upgrade-statistics-6x -i statistics -n 1000000
     
  • It’s taking about 35 minutes for 1,000,000 records…
  • Some issues towards the end of this core:
    Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
     org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
             at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
     
    $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o statistics-unmigrated.json -k uid -f '(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)'
     $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
     
    • Now the UUID conversion script says there is nothing left to convert, so I can try to run the Atmire CUA conversion utility:
    $ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
     $ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 1
     
    • The processing is very slow and there are lots of errors like this:
    Record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221 couldn't be processed
     com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 7b5b3900-28e8-417f-9c1c-e7d88a753221, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(AtomicStatisticsUpdater.java:304)
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(AtomicStatisticsUpdater.java:176)

2020-06

• I sent Atmire the dspace.log from today and told them to log into the server to…

  • In other news, I checked the statistics API on DSpace 6 and it’s working
  • I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:
    $ dspace oai import -c
     OAI 2.0 manager action started
     Loading @mire database changes for module MQM
     Changes have been processed
java.lang.NullPointerException
     
    $ curl http://localhost:8080/solr/oai/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
     $ curl http://localhost:8080/solr/oai/update -H "Content-type: text/xml" --data-binary '<commit />'
     $ ~/dspace63/bin/dspace oai import
     OAI 2.0 manager action started
     
    $ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    125m37.423s
     user    11m20.312s
sys     3m19.965s
     
    $ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    101m41.195s
     user    10m9.569s
sys     3m13.929s
     
  • Peter said he was annoyed with a CSV export from CGSpace because of the different text_lang attributes and asked if we can fix it
  • The last time I normalized these was in 2019-06, and currently it looks like this:
    dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
       text_lang  |  count
     -------------+---------
      en_US       | 2158377
     
  • In theory we can have different languages for metadata fields but in practice we don’t do that, so we might as well normalize everything to “en_US” (and perhaps I should make a curation task to do this)
  • For now I will do it manually on CGSpace and DSpace Test:
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2;
     UPDATE 2414738
     
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
     
    • Peter asked if it was possible to find all ILRI items that have “zoonoses” or “zoonotic” in their titles and check if they have the ILRI subject “ZOONOTIC DISEASES” (and add it if not)
    $ dspace metadata-export -i 10568/1 -f /tmp/2020-06-08-ILRI.csv
     $ csvcut -c 'id,cg.subject.ilri[en_US],dc.title[en_US]' ~/Downloads/2020-06-08-ILRI.csv > /tmp/ilri.csv
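
• From there the title check could be done with csvkit too, something like this (a sketch; the regex and output path are illustrative):

```console
$ csvgrep -c 'dc.title[en_US]' -r '(?i)zoono' /tmp/ilri.csv | csvcut -c 'id,cg.subject.ilri[en_US]' > /tmp/ilri-zoonoses.csv
```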
     
• Moayad asked why he’s getting HTTP 500 errors on CGSpace
    # journalctl --since=today -u tomcat7  | grep -c 'Internal Server Error'
     482
     
    • They are all related to the REST API, like:
    Jun 07 02:00:27 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
     Jun 07 02:00:27 linode18 tomcat7[6286]: javax.ws.rs.WebApplicationException
     Jun 07 02:00:27 linode18 tomcat7[6286]:         at org.dspace.rest.Resource.processException(Resource.java:151)
     Jun 07 02:00:27 linode18 tomcat7[6286]:         at org.dspace.rest.ItemsResource.getItems(ItemsResource.java:195)
    @@ -346,7 +346,7 @@ Jun 07 02:00:27 linode18 tomcat7[6286]:         at com.sun.jersey.spi.container.
     
    • And:
    Jun 08 09:28:29 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
     Jun 08 09:28:29 linode18 tomcat7[6286]: javax.ws.rs.WebApplicationException
     Jun 08 09:28:29 linode18 tomcat7[6286]:         at org.dspace.rest.Resource.processFinally(Resource.java:169)
     Jun 08 09:28:29 linode18 tomcat7[6286]:         at org.dspace.rest.HandleResource.getObject(HandleResource.java:81)
    @@ -356,7 +356,7 @@ Jun 08 09:28:29 linode18 tomcat7[6286]:         at java.lang.reflect.Method.invo
     
    • And:
    Jun 06 08:19:54 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
     Jun 06 08:19:54 linode18 tomcat7[6286]: javax.ws.rs.WebApplicationException
     Jun 06 08:19:54 linode18 tomcat7[6286]:         at org.dspace.rest.Resource.processException(Resource.java:151)
     Jun 06 08:19:54 linode18 tomcat7[6286]:         at org.dspace.rest.CollectionsResource.getCollectionItems(CollectionsResource.java:289)
    @@ -366,12 +366,12 @@ Jun 06 08:19:54 linode18 tomcat7[6286]:         at java.lang.reflect.Method.invo
     
    • Looking back, I see ~800 of these errors since I changed the database configuration last week:
    # journalctl --since=2020-06-04 --until=today -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
     795
     
    • And only ~280 in the entire month before that…
    # journalctl --since=2020-05-01 --until=2020-06-04 -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
     286
     
• So it seems to be related to the database, perhaps because there are fewer connections in the pool?
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
     
• Looking at the nginx access logs I see that, other than something that seems like Google Feedburner, all hosts using this user agent are in Sweden!
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.*.gz | grep 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36' | grep -v '/feed' | awk '{print $1}' | sort | uniq -c | sort -n
        1624 192.36.136.246
        1627 192.36.241.95
        1629 192.165.45.204
     
  • The earliest I see any of these hosts is 2020-06-05 (three days ago)
  • I will purge them from the Solr statistics and add them to abusive IPs ipset in the Ansible deployment scripts
    $ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
     Purging 1423 hits from 192.36.136.246 in statistics
     Purging 1387 hits from 192.36.241.95 in statistics
     Purging 1398 hits from 192.165.45.204 in statistics
Total number of bot hits purged: 29025
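
• On the server side the ipset part is conceptually just this (a sketch; the set name comes from the Ansible playbooks and may differ):

```console
# ipset create abusive-ips hash:net
# ipset add abusive-ips 192.36.136.246
# iptables -I INPUT -m set --match-set abusive-ips src -j DROP
```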
     
     
     
    172.104.229.92 - - [13/Jun/2020:02:00:00 +0200] "GET /rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=0 HTTP/1.1" 403 260 "-" "-"
     
    • I created an nginx map based on the host’s IP address that sets a temporary user agent (ua) and then changed the conditional in the REST API location block so that it checks this mapped ua instead of the default one
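• Conceptually the change looks something like this (a sketch, not the exact config; the file name and the bot user agent value are illustrative):

```console
$ cat /etc/nginx/conf.d/mapped-user-agent.conf
map $remote_addr $ua {
    default          $http_user_agent;
    172.104.229.92   'AReSHarvesterBot';
}
```

• The bot check in the REST API location block then tests $ua instead of $http_user_agent, so requests from that IP get treated as a bot even though they send no user agent at all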
    $ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/51671?expand=collections' 'https://cgspace.cgiar.org/rest/handle/10568/89346?expand=collections' | grep -oE '10568/[0-9]+' | sort | uniq > /tmp/cip-collections.txt
     
    • Then I formatted it into a SQL query and exported a CSV:
dspace=# \COPY (SELECT DISTINCT text_value AS author, COUNT(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (SELECT item_id FROM collection2item WHERE collection_id IN (SELECT resource_id FROM handle WHERE handle IN ('10568/100533', '10568/100653', '10568/101955', '10568/106580', '10568/108469', '10568/51671', '10568/53085', '10568/53086', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/53093', '10568/53094', '10568/64874', '10568/69069', '10568/70150', '10568/88229', '10568/89346', '10568/89347', '10568/99301', '10568/99302', '10568/99303', '10568/99304', '10568/99428'))) GROUP BY text_value ORDER BY count DESC) TO /tmp/cip-authors.csv WITH CSV;
     COPY 3917
     

    2020-06-15

    $ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&mailto=a.orth@cgiar.org'
     
    • Searching for “Bill and Melinda Gates” we can see the name literal and a list of alt-names literals
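• For example, to pull out just the name and alt-names literals from that response (a sketch, assuming jq is installed):

```console
$ curl -s 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&mailto=a.orth@cgiar.org' | jq -r '.message.items[0] | .name, .["alt-names"][]?'
```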
      • I made a pull request on CG Core v2 to recommend using persistent identifiers for DOIs and ORCID iDs (#26)
      • I exported sponsors/funders from CGSpace and wrote a script to query the CrossRef API for matches:
      dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29) TO /tmp/2020-06-29-sponsors.csv;
       COPY 682
       
• The script is crossref-funders-lookup.py and it is based on agrovoc-lookup.py
      • I tested the script on our funders:
      $ ./crossref-funders-lookup.py -i /tmp/2020-06-29-sponsors.csv -om /tmp/sponsors-matched.txt -or /tmp/sponsors-rejected.txt -d -e blah@blah.com
       $ wc -l /tmp/2020-06-29-sponsors.csv 
       682 /tmp/2020-06-29-sponsors.csv
       $ wc -l /tmp/sponsors-*
       
    • Gabriela from CIP sent me a list of erroneously added CIP subjects to remove from CGSpace:
    $ cat /tmp/2020-06-30-remove-cip-subjects.csv 
     cg.subject.cip
     INTEGRATED PEST MANAGEMENT
     ORANGE FLESH SWEET POTATOES
     
    • She also wants to change their SWEET POTATOES term to SWEETPOTATOES, both in the CIP subject list and existing items so I updated those too:
    $ cat /tmp/2020-06-30-fix-cip-subjects.csv 
     cg.subject.cip,correct
     SWEET POTATOES,SWEETPOTATOES
     $ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -t correct -m 127 -d
     
  • I ran the fixes and deletes on CGSpace, but not on DSpace Test yet because those scripts need updating for DSpace 6 UUIDs
  • I spent about two hours manually checking our sponsors that were rejected from CrossRef and found about fifty-five corrections that I ran on CGSpace:
    $ cat 2020-06-29-fix-sponsors.csv
     dc.description.sponsorship,correct
     "Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil","Conselho Nacional de Desenvolvimento Científico e Tecnológico"
     "Claussen Simon Stiftung","Claussen-Simon-Stiftung"
     
    • Then I started a full re-index at batch CPU priority:
    $ time chrt --batch 0 dspace index-discovery -b
     
     real    99m16.230s
     user    11m23.245s
sys     2m56.635s
     
    $ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
     $ dspace metadata-export -i 10568/1 -f /tmp/ilri.cs
     $ csvcut -c 'id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]' /tmp/ilri.csv > /tmp/ilri-covid19.csv
     
      diff --git a/docs/2020-07/index.html b/docs/2020-07/index.html index f10701d9a..0389b6eac 100644 --- a/docs/2020-07/index.html +++ b/docs/2020-07/index.html @@ -38,7 +38,7 @@ I restarted Tomcat and PostgreSQL and the issue was gone Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request "/> - + @@ -139,7 +139,7 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
    • Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning
    • First looking at the traffic in the morning:
    -
    # cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E "01/Jul/2020:(00|01|02|03|04)" | goaccess --log-format=COMBINED -
    +
    # cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E "01/Jul/2020:(00|01|02|03|04)" | goaccess --log-format=COMBINED -
     ...
     9659 33.56%    1  0.08% 340.94 MiB 64.39.99.13
     3317 11.53%    1  0.08% 871.71 MiB 199.47.87.140
    @@ -148,23 +148,23 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
     
    • 64.39.99.13 belongs to Qualys, but I see they are using a normal desktop user agent:
    -
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
    +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
     
    • I will purge hits from that IP from Solr
    • The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have 40,000 hits from them in 2020 statistics alone:
    $ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Turnitin.*/&rows=0" | grep -oE 'numFound="[0-9]+"'
     numFound="41694"
     
    • They used to be “TurnitinBot”… hhmmmm, seems they use both: https://turnitin.com/robot/crawlerinfo.html
• I will add Turnitin to the DSpace bot user agent list, but I see they are requesting robots.txt and only requesting item pages, so that’s impressive! I don’t need to add them to the “bad bot” rate limit list in nginx
• While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x making a small number of requests each with this user agent:
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
     
    • The IPs all belong to HostRoyale:
    -
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
    +
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
     81
     # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
     185.152.250.1
    @@ -269,7 +269,7 @@ numFound="41694"
     
  • I purged 20,000 hits from IPs and 45,000 hits from user agents
  • I will revert the default “example” agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven’t merged yet:
  • -
    $ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
    +
    $ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
     Citoid
     ecointernet
     GigablastOpenSource
    @@ -285,7 +285,7 @@ Typhoeus
     
    • Just a note that I still can’t deploy the 6_x-dev-atmire-modules branch as it fails at ant update:
    -
         [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
    +
         [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
     
    • I had told Atmire about this several weeks ago… but I reminded them again in the ticket
        @@ -308,7 +308,7 @@ Typhoeus
    -
    $ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&rows=0&wt=json&indent=true'
    +
    $ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&rows=0&wt=json&indent=true'
     {
       "responseHeader":{
         "status":0,
    @@ -324,12 +324,12 @@ Typhoeus
     
    • But not in solr-import-export-json… hmmm… seems we need to URL encode only the date range itself, but not the brackets:
    -
    $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
    +
    $ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
     $ zstd /tmp/statistics-2019-1.json
     
    • Then import it on my local dev environment:
    -
    $ zstd -d statistics-2019-1.json.zst
    +
    $ zstd -d statistics-2019-1.json.zst
     $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-1.json -k uid
     

    2020-07-05

      @@ -358,11 +358,11 @@ $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/sta
    • I noticed that we have 20,000 distinct values for dc.subject, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:
    -
    dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
    +
    dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
     
    • DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:
    -
    dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
    +
    dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
     
    • Note the use of the POSIX character class :)
    • I suggest that we generate a list of the top 5,000 values that don’t match AGROVOC so that Sisay can correct them @@ -371,14 +371,14 @@ $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/sta
    -
    dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
    +
    dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
     COPY 19640
     dspace=# \q
     $ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 > 2020-07-05-cgspace-subjects.txt
     
    • Then start looking them up using agrovoc-lookup.py:
    $ ./agrovoc-lookup.py -i 2020-07-05-cgspace-subjects.txt -om 2020-07-05-cgspace-subjects-matched.txt -or 2020-07-05-cgspace-subjects-rejected.txt -d
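• For reference, a minimal sketch of that kind of lookup (my illustration, not the real agrovoc-lookup.py; the Skosmos base URL is from memory and may differ):

```python
# Check each subject term against the AGROVOC Skosmos search API and print
# whether it matched or was rejected. The endpoint URL is an assumption.
import requests

def subject_in_agrovoc(subject: str, lang: str = "en") -> bool:
    url = "https://agrovoc.fao.org/browse/rest/v1/search"
    params = {"query": subject, "lang": lang}
    r = requests.get(url, params=params)
    r.raise_for_status()
    # Skosmos returns a "results" list; empty means no matching concept
    return len(r.json().get("results", [])) > 0

with open("2020-07-05-cgspace-subjects.txt") as f:
    for line in f:
        subject = line.strip()
        print(f"{'matched' if subject_in_agrovoc(subject) else 'rejected'}: {subject}")
```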
     

    2020-07-06

    • I made some optimizations to the suite of Python utility scripts in our DSpace directory as well as the csv-metadata-quality script @@ -399,12 +399,12 @@ $ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 > 2020-07-05-c
      • Peter asked me to send him a list of sponsors on CGSpace
      -
      dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
      +
      dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
       COPY 707
       
      • I ran it quickly through my csv-metadata-quality tool and found two issues that I will correct with fix-metadata-values.py on CGSpace immediately:
      $ cat 2020-07-07-fix-sponsors.csv
       dc.description.sponsorship,correct
       "Ministe`re des Affaires Etrange`res et Européennes, France","Ministère des Affaires Étrangères et Européennes, France"
       "Global Food Security Programme,  United Kingdom","Global Food Security Programme, United Kingdom"
      @@ -432,7 +432,7 @@ $ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -
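• For anyone curious, the update that fix-metadata-values.py applies boils down to something like this (a simplification of mine, not the real script; 29 is the metadata_field_id for dc.description.sponsorship):

```python
# For each CSV row, replace the old sponsorship value with the corrected one.
import csv
import psycopg2

conn = psycopg2.connect("dbname=dspace user=dspace password=fuuu host=localhost")

with open("2020-07-07-fix-sponsors.csv") as f, conn, conn.cursor() as cursor:
    for row in csv.DictReader(f):
        cursor.execute(
            "UPDATE metadatavalue SET text_value=%s "
            "WHERE resource_type_id=2 AND metadata_field_id=%s AND text_value=%s",
            (row["correct"], 29, row["dc.description.sponsorship"]),
        )
        print(f"Fixed {cursor.rowcount} occurrences of: {row['dc.description.sponsorship']}")

conn.close()
```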
       
      • Generate a CSV of all the AGROVOC subjects that didn’t match from the top 6500 I exported earlier this week:
      -
      $ csvgrep -c 'number of matches' -r "^0$" 2020-07-05-cgspace-subjects.csv | csvcut -c 1 > 2020-07-05-cgspace-invalid-subjects.csv
      +
      $ csvgrep -c 'number of matches' -r "^0$" 2020-07-05-cgspace-subjects.csv | csvcut -c 1 > 2020-07-05-cgspace-invalid-subjects.csv
       
      • Yesterday Gabriela from CIP emailed to say that she was removing the accents from her authors' names because of “funny character” issues with reports generated from CGSpace
          @@ -442,7 +442,7 @@ $ ./fix-metadata-values.py -i 2020-07-07-fix-sponsors.csv -db dspace -u dspace -
      -
      $ csvgrep -c 2 -r "^.+$" ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r "^.*[À-ú].*$" | csvgrep -c 2 -r "^.*[À-ú].*$" -i | csvcut -c 1,2
      +
      $ csvgrep -c 2 -r "^.+$" ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r "^.*[À-ú].*$" | csvgrep -c 2 -r "^.*[À-ú].*$" -i | csvcut -c 1,2
       dc.contributor.author,correction
       "López, G.","Lopez, G."
       "Gómez, R.","Gomez, R."
      @@ -475,11 +475,11 @@ dc.contributor.author,correction
       
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
     
    • Then I stripped the CSV header and quotes to make it a plain text file and ran ror-lookup.py:
    -
    $ ./ror-lookup.py -i /tmp/2020-07-08-affiliations.txt -r ror.json -o 2020-07-08-affiliations-ror.csv -d
    +
    $ ./ror-lookup.py -i /tmp/2020-07-08-affiliations.txt -r ror.json -o 2020-07-08-affiliations-ror.csv -d
     $ wc -l /tmp/2020-07-08-affiliations.txt 
     5866 /tmp/2020-07-08-affiliations.txt
     $ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l 
    @@ -500,7 +500,7 @@ $ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
     
     
  • I updated ror-lookup.py to check aliases and acronyms as well and now the results are better for CGSpace’s affiliation list:
    $ wc -l /tmp/2020-07-08-affiliations.txt 
     5866 /tmp/2020-07-08-affiliations.txt
     $ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l 
     1516
    @@ -510,16 +510,16 @@ $ csvgrep -c matched -m false 2020-07-08-affiliations-ror.csv | wc -l
     
  • So now our matching improves to 1515 out of 5866 (25.8%)
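• The alias and acronym matching I added is roughly this (my approximation, assuming ror.json is the ROR data dump with name, aliases, acronyms and labels fields):

```python
# Build one lowercased set of all ROR names, aliases, acronyms and labels,
# then check each CGSpace affiliation against it.
import json

with open("ror.json") as f:
    ror = json.load(f)

ror_names = set()
for org in ror:
    ror_names.add(org["name"].lower())
    ror_names.update(alias.lower() for alias in org.get("aliases", []))
    ror_names.update(acronym.lower() for acronym in org.get("acronyms", []))
    ror_names.update(label["label"].lower() for label in org.get("labels", []))

with open("/tmp/2020-07-08-affiliations.txt") as f:
    for line in f:
        affiliation = line.strip()
        print(f"{affiliation},{affiliation.lower() in ror_names}")
```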
  • Gabriela from CIP said that I should run the author corrections minus those that remove accent characters so I will run it on CGSpace:
  • -
    $ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
    +
    $ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
     
    • Apply 110 fixes and 90 deletions to sponsorships that Peter sent me a few days ago:
    -
    $ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
    +
    $ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
     $ ./delete-metadata-values.py -i /tmp/2020-07-07-delete-90-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
     
    • Start a full Discovery re-index on CGSpace:
    -
    $ time chrt -b 0 dspace index-discovery -b
    +
    $ time chrt -b 0 dspace index-discovery -b
     
     real    94m21.413s
     user    9m40.364s
    @@ -527,7 +527,7 @@ sys     2m37.246s
     
    • I modified crossref-funders-lookup.py to be case insensitive and now CGSpace’s sponsors match 173 out of 534 (32.4%):
    -
    $ ./crossref-funders-lookup.py -i 2020-07-09-cgspace-sponsors.txt -o 2020-07-09-cgspace-sponsors-crossref.csv -d -e a.orth@cgiar.org
    +
    $ ./crossref-funders-lookup.py -i 2020-07-09-cgspace-sponsors.txt -o 2020-07-09-cgspace-sponsors-crossref.csv -d -e a.orth@cgiar.org
     $ wc -l 2020-07-09-cgspace-sponsors.txt
     534 2020-07-09-cgspace-sponsors.txt
     $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l 
    @@ -552,7 +552,7 @@ $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
     
     
     
    -
    # grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
    +
    # grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     2815
     
    • So I need to add this alternative user-agent to the Tomcat Crawler Session Manager valve to force it to re-use a common bot session
    • @@ -563,11 +563,11 @@ $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
    -
    Mozilla/5.0 (compatible; heritrix/3.4.0-SNAPSHOT-2019-02-07T13:53:20Z +http://ifm.uni-mannheim.de)
    +
    Mozilla/5.0 (compatible; heritrix/3.4.0-SNAPSHOT-2019-02-07T13:53:20Z +http://ifm.uni-mannheim.de)
     
    • Generate a list of sponsors to update our controlled vocabulary:
    -
    dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
     COPY 125
     dspace=# \q
     $ csvcut -c 1 --tabs /tmp/2020-07-12-sponsors.csv > dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
    @@ -590,12 +590,12 @@ $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-descripti
     
    • I ran the dspace cleanup -v process on CGSpace and got an error:
    -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(189618) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);'
    +
    $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);'
     UPDATE 1
     
    • Udana from WLE asked me about some items that didn’t show Altmetric donuts @@ -616,12 +616,12 @@ UPDATE 1
    • All four IWMI items that I tweeted yesterday have Altmetric donuts with a score of 1 now…
    • Export CGSpace countries to check them against ISO 3166-1 and ISO 3166-3 (historic countries):
    -
    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-07-15-countries.csv;
    +
    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-07-15-countries.csv;
     COPY 194
     
    • I wrote a script iso3166-lookup.py to check them:
    $ ./iso3166-1-lookup.py -i /tmp/2020-07-15-countries.csv -o /tmp/2020-07-15-countries-resolved.csv
     $ csvgrep -c matched -m false /tmp/2020-07-15-countries-resolved.csv       
     country,match type,matched
     CAPE VERDE,,false
    @@ -642,16 +642,16 @@ IRAN,,false
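• A minimal sketch of that check (my approximation of iso3166-lookup.py), using Debian's iso-codes JSON file and only the ISO 3166-1 part; the real check also consults ISO 3166-3 for historic names:

```python
# Compare CGSpace's upper-case country names against ISO 3166-1 names.
# Assumption: the iso-codes JSON has a top-level "3166-1" key.
import csv
import json

with open("/usr/share/iso-codes/json/iso_3166-1.json") as f:
    countries = json.load(f)["3166-1"]

names = {c["name"].upper() for c in countries}
names |= {c["official_name"].upper() for c in countries if "official_name" in c}

with open("/tmp/2020-07-15-countries.csv") as f:
    for row in csv.reader(f):
        print(f"{row[0]},{row[0] in names}")
```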
     
    • Check the database for DOIs that are not in the preferred “https://doi.org/" format:
    -
    dspace=# \COPY (SELECT text_value as "cg.identifier.doi" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT text_value as "cg.identifier.doi" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
     COPY 186
     
    • Then I imported them into OpenRefine and replaced them in a new “correct” column using this GREL transform:
    value.replace("dx.doi.org", "doi.org").replace("http://", "https://").replace("https://dx,doi,org", "https://doi.org").replace("https://doi.dx.org", "https://doi.org").replace("https://dx.doi:", "https://doi.org").replace("DOI: ", "https://doi.org/").replace("doi: ", "https://doi.org/").replace("http:/​/​dx.​doi.​org", "https://doi.org").replace("https://dx. doi.org. ", "https://doi.org").replace("https://dx.doi", "https://doi.org").replace("https://dx.doi:", "https://doi.org/").replace("hdl.handle.net", "doi.org")
     
    • Then I fixed the DOIs on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.doi -t 'correct' -m 220
    +
    $ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.doi -t 'correct' -m 220
     
    • I filed an issue on Debian’s iso-codes project to ask why “Swaziland” does not appear in the ISO 3166-3 list of historical country names despite it being changed to “Eswatini” in 2018.
    • Atmire responded about the Solr issue @@ -666,7 +666,7 @@ COPY 186
      • Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:
      -
      217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
      +
      217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
       
      • I still see 12,000 records in Solr from this user agent, though.
          @@ -683,7 +683,7 @@ COPY 186
  • I re-ran the check-spider-hits.sh script with the new lists and purged another 14,000 stats hits from each of several years (2020, 2019, 2018, 2017, 2016), around 70,000 total
        • I looked at the CLARISA institutions list again, since I hadn’t looked at it in over six months:
        -
        $ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
        +
        $ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
         
  • The API still needs a key unless you query it from the Swagger web interface
            @@ -700,7 +700,7 @@ COPY 186
        -
        $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv
        +
        $ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv
         Removing excessive whitespace (name): Comitato Internazionale per lo Sviluppo dei Popoli /  International Committee for the Development of Peoples
         Removing excessive whitespace (name): Deutsche Landwirtschaftsgesellschaft /  German agriculture society
         Removing excessive whitespace (name): Institute of Arid Regions  of Medenine
        @@ -732,7 +732,7 @@ Removing unnecessary Unicode (U+200B): Agencia de Servicios a la Comercializaci
         
      • I started processing the 2019 stats in a batch of 1 million on DSpace Test:
      -
      $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
      +
      $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
       $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
       ...
               *** Statistics Records with Legacy Id ***
      @@ -749,7 +749,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
       
      • The statistics-2019 finished processing after about 9 hours so I started the 2018 ones:
      -
      $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
      +
      $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
       $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
               *** Statistics Records with Legacy Id ***
       
      @@ -765,7 +765,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
       
      • Moayad finally made OpenRXV use a unique user agent:
      -
      OpenRXV harvesting bot; https://github.com/ilri/OpenRXV
      +
      OpenRXV harvesting bot; https://github.com/ilri/OpenRXV
       
      • I see nearly 200,000 hits in Solr from the IP address, though, so I need to make sure those are old ones from before today
          @@ -793,12 +793,12 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
      -
      Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
      +
      Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
       org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
       
      • There were four records so I deleted them:
      -
      $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:10</query></delete>'
      +
      $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:10</query></delete>'
       
      • Meeting with Moayad and Peter and Abenet to discuss the latest AReS changes
      @@ -826,7 +826,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
    -
    Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
    +
    Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
     
    • Also, in the same month with the same exact user agent, I see 300,000 from 192.157.89.x
        @@ -842,7 +842,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
    -
    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
    +
    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
     
    • In statistics-2018 I see more weird IPs
        @@ -860,7 +860,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
    -
    Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
    +
    Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
     
    • Then there is 213.139.53.62 in 2018, which is on Orange Telecom Jordan, so it’s definitely CodeObia / ICARDA and I will purge them
• Jesus, and then there are 100,000 from the ILRI harvester on Linode on 2a01:7e00::f03c:91ff:fe0a:d645
    • @@ -869,7 +869,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
    • Jesus fuck there is 104.198.9.108 on Google Cloud that was making 30,000 requests with no user agent
    • I will purge the hits from all the following IPs:
    -
    192.157.89.4
    +
    192.157.89.4
     192.157.89.5
     192.157.89.6
     192.157.89.7
    @@ -898,7 +898,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
     
     
  • I noticed a few other user agents that should be purged too:
    ^Java\/\d{1,2}.\d
     FlipboardProxy\/\d
     API scraper
     RebelMouse\/\d
    @@ -932,7 +932,7 @@ mailto\:team@impactstory\.org
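• For reference, the purge is roughly equivalent to this sketch (mine, not the actual check-spider-hits.sh): count the hits per user-agent regex and then delete them from the statistics core:

```python
# Count and delete Solr statistics hits for each user-agent pattern.
import requests

solr = "http://localhost:8081/solr/statistics"
patterns = [r"^Java\/\d{1,2}.\d", r"FlipboardProxy\/\d", "API scraper", r"RebelMouse\/\d"]

for pattern in patterns:
    query = f"userAgent:/{pattern}/"  # Solr regex query, like userAgent:/Turnitin.*/ above
    params = {"q": query, "rows": 0, "wt": "json"}
    hits = requests.get(f"{solr}/select", params=params).json()["response"]["numFound"]
    print(f"{pattern}: purging {hits} hits")
    requests.post(
        f"{solr}/update?softCommit=true",
        headers={"Content-Type": "text/xml"},
        data=f"<delete><query>{query}</query></delete>",
    )
```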
     
     
  • Export some of the CGSpace Solr stats minus the Atmire CUA schema additions for Salem to play with:
  • -
    $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
    +
    $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
     
    • Run system updates on DSpace Test (linode26) and reboot it

      @@ -1036,11 +1036,11 @@ mailto\:team@impactstory\.org

      I started processing Solr stats with the Atmire tool now:

    -
    $ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -c statistics -f -t 12
    +
    $ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -c statistics -f -t 12
     
    • This one failed after a few hours:
    -
    Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
    +
    Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
     com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
    @@ -1063,7 +1063,7 @@ If run the update again with the resume option (-r) they will be reattempted
     
  • I started the same script for the statistics-2019 core (12 million records…)
  • Update an ILRI author’s name on CGSpace:
  • -
    $ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
    +
    $ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
     Fixed 13 occurences of: Muloi, D.
     Fixed 4 occurences of: Muloi, D.M.
     

    2020-07-28

    @@ -1110,7 +1110,7 @@ Fixed 4 occurences of: Muloi, D.M. -
    # grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
    +
    # grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
     249
     # grep -c -E '"name":' /usr/share/iso-codes/json/iso_3166-1.json
     249
    diff --git a/docs/2020-08/index.html b/docs/2020-08/index.html
    index dd922abda..0359f240c 100644
    --- a/docs/2020-08/index.html
    +++ b/docs/2020-08/index.html
    @@ -36,7 +36,7 @@ It is class based so I can easily add support for other vocabularies, and the te
     
     
     "/>
    -
    +
     
     
         
    @@ -150,7 +150,7 @@ It is class based so I can easily add support for other vocabularies, and the te
     
     
  • I purged all unmigrated stats in a few cores and then restarted processing:
  • -
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
     $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
     $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
     
      @@ -192,14 +192,14 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
    -
    $ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
    +
    $ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
        "numberItems" : 63,
     $ http 'http://localhost:8080/rest/collections/1445/items' jq '. | length'
     61
     
    • Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:
    -
    $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
    +
    $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
        "numberItems" : 61,
     $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
     59
    @@ -210,7 +210,7 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
     
     
     
    -
    dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
    +
    dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
        id   | collection_id | item_id
     --------+---------------+---------
      133698 |           966 |  107687
    @@ -220,12 +220,12 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
     
    • So for each id you can delete one duplicate mapping:
    dspace=# DELETE FROM collection2item WHERE id='134686';
     dspace=# DELETE FROM collection2item WHERE id='128819';
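• A quick way to find such duplicate mappings (a sketch of mine, not an existing script) is to group collection2item by item and collection:

```python
# List item/collection pairs that are mapped more than once (DSpace 5 integer IDs).
import psycopg2

conn = psycopg2.connect("dbname=dspace user=dspace password=fuuu host=localhost")

with conn, conn.cursor() as cursor:
    cursor.execute(
        "SELECT item_id, collection_id, COUNT(*) FROM collection2item "
        "GROUP BY item_id, collection_id HAVING COUNT(*) > 1"
    )
    for item_id, collection_id, count in cursor.fetchall():
        print(f"item {item_id} is mapped to collection {collection_id} {count} times")

conn.close()
```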
     
    • Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter’s preferred display names
    -
    $ cat 2020-08-04-PB-new-countries.csv
    +
    $ cat 2020-08-04-PB-new-countries.csv
     cg.coverage.country,correct
     CAPE VERDE,CABO VERDE
     COCOS ISLANDS,COCOS (KEELING) ISLANDS
    @@ -267,7 +267,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
     
     
  • I checked the nginx logs around 5PM yesterday to see who was accessing the server:
  • -
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
    +
    # cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
     
    • I see the Macaroni Bros are using their new user agent for harvesting: RTB website BOT
        @@ -276,7 +276,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
    -
    $ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
    +
    $ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     5693
     
    • DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don’t misuse the resources @@ -291,18 +291,18 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
    • A few more IPs causing lots of Tomcat sessions yesterday:
    -
    $ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
    +
    $ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     1585
     $ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     5691
     
    • 38.128.66.10 isn’t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:
    -
    Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
    +
    Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
     
    • 64.62.202.71 is using a user agent I’ve never seen before:
    Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
     
    • So now our “bot” regex can’t even match that…
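• A quick illustration of why (my example; the real patterns live in the DSpace spider agent lists and our nginx config):

```python
# The agent spells "b.o.t" with dots, so a typical "bot" match never fires.
import re

agent = "Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)"

print(bool(re.search(r"bot", agent, re.IGNORECASE)))  # False
print(bool(re.search(r"centuryb\.o\.t9", agent)))     # True, needs its own literal pattern
```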
        @@ -310,7 +310,7 @@ $ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_i
    RTB website BOT
     Altmetribot
     Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
     Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
    @@ -318,7 +318,7 @@ Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
     
    • And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):
$ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     2777
     
      @@ -377,7 +377,7 @@ on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
      • The Atmire stats processing for the statistics-2018 Solr core keeps stopping with this error:
      -
      Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
      +
      Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
       java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
               at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.storeOnServer(SourceFile:317)
               at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:177)
      @@ -398,13 +398,13 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
       
    -
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/[0-9]+/</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/[0-9]+/</query></delete>'
     

    2020-08-09

    • The Atmire script did something to the server and created 132GB of log files so the root partition ran out of space…
    • I removed the log file and tried to re-run the process but it seems to be looping over 11,000 records and failing, creating millions of lines in the logs again:
    -
    # grep -oE "Record uid: ([a-f0-9\\-]*){1} couldn't be processed" /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 > /tmp/not-processed-errors.txt
    +
    # grep -oE "Record uid: ([a-f0-9\\-]*){1} couldn't be processed" /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 > /tmp/not-processed-errors.txt
     # wc -l /tmp/not-processed-errors.txt
     2202973 /tmp/not-processed-errors.txt
     # sort /tmp/not-processed-errors.txt | uniq -c | tail -n 10
    @@ -421,7 +421,7 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
     
    • I looked at some of those records and saw strange objects in their containerCommunity, containerCollection, etc…
    -
    {
    +
    {
       "responseHeader": {
         "status": 0,
         "QTime": 0,
    @@ -470,7 +470,7 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
     
    • I deleted those 11,724 records with the strange “set” object in the collections and communities, as well as 360,000 records with id: -1
    -
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
     $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:\-1</query></delete>'
     
    • I was going to compare the CUA stats for 2018 and 2019 on CGSpace and DSpace Test, but after Linode rebooted CGSpace (linode18) for maintenance yesterday the solr cores didn’t all come back up OK @@ -485,7 +485,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=tru
    -
    $ cat 2020-08-09-add-ILRI-orcids.csv
    +
    $ cat 2020-08-09-add-ILRI-orcids.csv
     dc.contributor.author,cg.creator.id
     "Grace, Delia","Delia Grace: 0000-0002-0195-9489"
     "Delia Grace","Delia Grace: 0000-0002-0195-9489"
    @@ -501,7 +501,7 @@ dc.contributor.author,cg.creator.id
     
    • That got me curious, so I generated a list of all the unique ORCID identifiers we have in the database:
    -
    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
    +
    dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
     COPY 2095
     dspace=# \q
     $ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq > /tmp/2020-08-09-orcid-identifiers-uniq.csv
    @@ -517,7 +517,7 @@ $ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
     
     
     
    -
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
     ...
     $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
     
      @@ -534,7 +534,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
    -
    dspace=# SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = 243 AND resource_type_id = 2;
    +
    dspace=# SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = 243 AND resource_type_id = 2;
      count
     -------
      50812
    @@ -573,7 +573,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
     
    • Last night I started the processing of the statistics-2016 core with the Atmire stats util and I see some errors like this:
    -
    Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
    +
    Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
     com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: f6b288d7-d60d-4df9-b311-1696b88552a0, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
             at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
    @@ -598,7 +598,7 @@ Caused by: java.lang.NullPointerException
     
     
  • I purged the unmigrated docs and continued processing:
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
     $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
     $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
     
      @@ -608,7 +608,7 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
    $ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' > /tmp/0.xml
     $ for num in {100..1300..100}; do http "https://cgspace.cgiar.org/oai/request?verb=ListSets&resumptionToken=////$num" > /tmp/$num.xml; sleep 2; done
     $ for num in {0..1300..100}; do cat /tmp/$num.xml >> /tmp/cgspace-oai-sets.xml; done
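• As an illustration, the same harvest could follow the OAI resumptionToken directly instead of hard-coding offsets (my sketch, equivalent to the loop above):

```python
# Page through ListSets by following resumptionToken until it is empty.
import requests
import xml.etree.ElementTree as ET

OAI = "https://cgspace.cgiar.org/oai/request"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

params = {"verb": "ListSets"}
while True:
    root = ET.fromstring(requests.get(OAI, params=params).content)
    for spec in root.findall(".//oai:setSpec", NS):
        print(spec.text)
    token = root.find(".//oai:resumptionToken", NS)
    if token is None or not (token.text or "").strip():
        break
    params = {"verb": "ListSets", "resumptionToken": token.text}
```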
     
      @@ -620,7 +620,7 @@ $ for num in {0..1300..100}; do cat /tmp/$num.xml >> /tmp/cgspace-oai-sets
    • The com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script that was processing 2015 records last night started spitting shit tons of errors and created 120GB of logs…
    • I looked at a few of the UIDs that it was having problems with and they were unmigrated ones… so I purged them in 2015 and all the rest of the statistics cores
    -
    $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
    +
    $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
     ...
     $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
     

    2020-08-19

    @@ -715,13 +715,13 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru -
    $ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=0' User-Agent:'curl' > /tmp/wle-trade-off-page1.xml
    +
    $ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=0' User-Agent:'curl' > /tmp/wle-trade-off-page1.xml
     $ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=100' User-Agent:'curl' > /tmp/wle-trade-off-page2.xml
     $ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=200' User-Agent:'curl' > /tmp/wle-trade-off-page3.xml
     
    -
    $ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page1.xml >> /tmp/ids.txt
    +
    $ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page1.xml >> /tmp/ids.txt
     $ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page2.xml >> /tmp/ids.txt
     $ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page3.xml >> /tmp/ids.txt
     $ sort -u /tmp/ids.txt > /tmp/ids-sorted.txt
    @@ -764,7 +764,7 @@ $ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt > /tmp/handles.txt
     
    • I ran the CountryCodeTagger on CGSpace and it was very fast:
    -
    $ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-08-27-countrycodetagger.log
    +
    $ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-08-27-countrycodetagger.log
     real    2m7.643s
     user    1m48.740s
     sys     0m14.518s
    diff --git a/docs/2020-09/index.html b/docs/2020-09/index.html
    index 860d028ff..91dec6e24 100644
    --- a/docs/2020-09/index.html
    +++ b/docs/2020-09/index.html
    @@ -48,7 +48,7 @@ I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
     
     I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
     "/>
    -
    +
     
     
         
    @@ -153,7 +153,7 @@ I filed an issue on OpenRXV to make some minor edits to the admin UI: https://gi
     
    • I ran the country code tagger on CGSpace:
    -
    $ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-09-02-countrycodetagger.log
    +
    $ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-09-02-countrycodetagger.log
     ...
     real    2m10.516s
     user    1m43.953s
    @@ -169,11 +169,11 @@ $ grep -c added /tmp/2020-09-02-countrycodetagger.log
     
     
     
    -
    2020-09-02 12:03:10,666 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
    +
    2020-09-02 12:03:10,666 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
     
    • I tried to query LDAP directly using the application credentials with ldapsearch and it works:
    -
    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "applicationaccount@cgiarad.org" -W "(sAMAccountName=me)"
    +
    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "applicationaccount@cgiarad.org" -W "(sAMAccountName=me)"
     
    • According to the DSpace 6 docs we need to escape commas in our LDAP parameters due to the new configuration system
        @@ -191,7 +191,7 @@ $ grep -c added /tmp/2020-09-02-countrycodetagger.log
    -
    $ cat 2020-09-03-fix-review-status.csv
    +
    $ cat 2020-09-03-fix-review-status.csv
     dc.description.version,correct
     Externally Peer Reviewed,Peer Review
     Peer Reviewed,Peer Review
    @@ -225,7 +225,7 @@ $ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dsp
     
     
     
    -
    Thu Sep 03 12:26:33 CEST 2020 | Query:containerItem:ea7a2648-180d-4fce-bdc5-c3aa2304fc58
    +
    Thu Sep 03 12:26:33 CEST 2020 | Query:containerItem:ea7a2648-180d-4fce-bdc5-c3aa2304fc58
     Error while updating
     java.lang.NullPointerException
             at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
    @@ -259,7 +259,7 @@ java.lang.NullPointerException
     
     
  • I will update our nearly 6,000 metadata values for CIFOR in the database accordingly:
  • -
    dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
    +
    dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
     dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
     dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
     
      @@ -285,7 +285,7 @@ dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^http
    -
    https://cgspace.cgiar.org/bitstream/handle/10568/82745/Characteristics-Silage.JPG
    +
    https://cgspace.cgiar.org/bitstream/handle/10568/82745/Characteristics-Silage.JPG
     
    • So they end up getting rate limited due to the XMLUI rate limits
        @@ -308,7 +308,7 @@ dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^http
    -
    $ ~/dspace63/bin/dspace curate -t countrycodetagger -i all -s object
    +
    $ ~/dspace63/bin/dspace curate -t countrycodetagger -i all -s object
     

    2020-09-10

    • I checked the country code tagger on CGSpace and DSpace Test and it ran fine from the systemd timer last night… w00t
    • @@ -318,7 +318,7 @@ dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^http
    -
    $ cat 2020-09-10-fix-cgspace-regions.csv
    +
    $ cat 2020-09-10-fix-cgspace-regions.csv
     cg.coverage.region,correct
     EAST AFRICA,EASTERN AFRICA
     WEST AFRICA,WESTERN AFRICA
    @@ -417,15 +417,15 @@ Would fix 3 occurences of: SOUTHWEST ASIA
     
     
     
    -
    value + "__description:" + cells["dc.type"].value
    +
    value + "__description:" + cells["dc.type"].value
     
    • Then I created a SAF bundle with SAFBuilder:
    -
    $ ./safbuilder.sh -c ~/Downloads/cip-annual-reports/cip-reports.csv
    +
    $ ./safbuilder.sh -c ~/Downloads/cip-annual-reports/cip-reports.csv
     
    • And imported them into my local test instance of CGSpace:
    -
    $ ~/dspace/bin/dspace import -a -e y.arrr@cgiar.org -m /tmp/2020-09-15-cip-annual-reports.map -s ~/Downloads/cip-annual-reports/SimpleArchiveFormat
    +
    $ ~/dspace/bin/dspace import -a -e y.arrr@cgiar.org -m /tmp/2020-09-15-cip-annual-reports.map -s ~/Downloads/cip-annual-reports/SimpleArchiveFormat
     
    • Then I uploaded them to CGSpace
    @@ -475,7 +475,7 @@ Would fix 3 occurences of: SOUTHWEST ASIA -
    $ cat 2020-09-17-add-bioversity-orcids.csv
    +
    $ cat 2020-09-17-add-bioversity-orcids.csv
     dc.contributor.author,cg.creator.id
     "Etten, Jacob van","Jacob van Etten: 0000-0001-7554-2558"
     "van Etten, Jacob","Jacob van Etten: 0000-0001-7554-2558"
    @@ -496,7 +496,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dsp
     
     
     
    -
    https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:"Open Access" AND crpsubject:"Water, Land and Ecosystems" AND "tradeoffs"&rpp=100
    +
    https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:"Open Access" AND crpsubject:"Water, Land and Ecosystems" AND "tradeoffs"&rpp=100
     
    • I noticed that my move-collections.sh script didn’t work on DSpace 6 because of the change from IDs to UUIDs, so I modified it to quote the collection resource_id parameters in the PostgreSQL query
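• The fix is just quoting the IDs, since they are UUID strings rather than integers in DSpace 6; a minimal sketch of the kind of UPDATE the script builds (assuming it still goes through the community2collection table; the real script's tables and variables may differ):

```console
$ psql -d dspace -U dspace -c "UPDATE community2collection SET community_id='$community' WHERE collection_id='$collection';" # the UUIDs must be quoted as strings
```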
    @@ -522,7 +522,7 @@ $ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dsp -
    dspacestatistics=# SELECT SUM(views) FROM items;
    +
    dspacestatistics=# SELECT SUM(views) FROM items;
        sum
     ----------
      15714024
    @@ -536,7 +536,7 @@ dspacestatistics=# SELECT SUM(downloads) FROM items;
     
    • I deleted “Report” from twelve items that had it in their peer review field:
    -
    dspace=# BEGIN;
    +
    dspace=# BEGIN;
     BEGIN
     dspace=# DELETE FROM metadatavalue WHERE text_value='Report' AND resource_type_id=2 AND metadata_field_id=68;
     DELETE 12
    @@ -572,7 +572,7 @@ dspace=# COMMIT;
     
     
     
    -
    ...
    +
    ...
     item_ids = ['0079470a-87a1-4373-beb1-b16e3f0c4d81', '007a9df1-0871-4612-8b28-5335982198cb']
     item_ids_str = ' OR '.join(item_ids).replace('-', '\-')
     ...
    @@ -598,7 +598,7 @@ solr_query_params = {
     
    • I did some more work on the dspace-statistics-api and finalized the support for sending a POST to /items:
    -
    $ curl -s -d @request.json https://dspacetest.cgiar.org/rest/statistics/items | json_pp
    +
    $ curl -s -d @request.json https://dspacetest.cgiar.org/rest/statistics/items | json_pp
     {
        "currentPage" : 0,
        "limit" : 10,
    diff --git a/docs/2020-10/index.html b/docs/2020-10/index.html
    index 29e48d153..17a662613 100644
    --- a/docs/2020-10/index.html
    +++ b/docs/2020-10/index.html
    @@ -44,7 +44,7 @@ During the FlywayDB migration I got an error:
     
     
     "/>
    -
    +
     
     
         
    @@ -144,7 +144,7 @@ During the FlywayDB migration I got an error:
     
     
     
    -
    2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
    +
    2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Batch entry 0 update public.bitstreamformatregistry set description='Electronic publishing', internal='FALSE', mimetype='application/epub+zip', short_description='EPUB', support_level=1 where bitstream_format_id=78 was aborted: ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
       Detail: Key (short_description)=(EPUB) already exists.  Call getNextException to see other errors in the batch.
     2020-10-06 21:36:04,138 WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: 23505
     2020-10-06 21:36:04,138 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ ERROR: duplicate key value violates unique constraint "bitstreamformatregistry_short_description_key"
    @@ -212,7 +212,7 @@ org.hibernate.exception.ConstraintViolationException: could not execute batch
     
     
     
    -
    $ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
    +
    $ dspace metadata-import -f /tmp/2020-10-06-import-test.csv -e aorth@mjanja.ch
     Loading @mire database changes for module MQM
     Changes have been processed
     -----------------------------------------------------------
    @@ -259,7 +259,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
     
    • Also, I tested Listings and Reports and there are still no hits for “Orth, Alan” as a contributor, despite there being dozens of items in the repository and the Solr query generated by Listings and Reports actually returning hits:
    -
    2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=search.resourcetype:2&fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&fq=dateIssued.year:[2013+TO+2021]&rows=500&wt=javabin&version=2} hits=18 status=0 QTime=10 
    +
    2020-10-06 22:23:44,116 INFO org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=*:*&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=search.resourcetype:2&fq=author_keyword:Orth,\+A.+OR+author_keyword:Orth,\+Alan&fq=dateIssued.year:[2013+TO+2021]&rows=500&wt=javabin&version=2} hits=18 status=0 QTime=10 
     
• Solr returns hits=18 for the L&R query, but there are no results shown in the L&R UI
    • I sent all this feedback to Atmire…
    • @@ -278,16 +278,16 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected m
    -
    $ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
    +
    $ http -f POST https://dspacetest.cgiar.org/rest/login email=aorth@fuuu.com 'password=fuuuu'
     $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE
     
    • Then we post an item in JSON format to /rest/collections/{uuid}/items:
    -
    $ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE < item-object.json
    +
    $ http POST https://dspacetest.cgiar.org/rest/collections/f10ad667-2746-4705-8b16-4439abe61d22/items Cookie:JSESSIONID=EABAC9EFF942028AA52DFDA16DBCAFDE < item-object.json
     
    • Format of JSON is:
    -
    { "metadata": [
    +
    { "metadata": [
         {
           "key": "dc.title",
           "value": "Testing REST API post",
    @@ -362,7 +362,7 @@ $ http https://dspacetest.cgiar.org/rest/status Cookie:JSESSIONID=EABAC9EFF94202
     
     
     
    -
    $ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
    +
    $ http POST http://localhost:8080/rest/login email=aorth@fuuu.com 'password=ddddd'
     $ http http://localhost:8080/rest/status rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099
     $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:d846f138-75d3-47ba-9180-b88789a28099 < item-object.json
     
      @@ -408,7 +408,7 @@ $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:
    -
    $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    +
    $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
     $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
     $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
     $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    @@ -438,7 +438,7 @@ $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-438
     
  • I added [Ss]pider to the Tomcat Crawler Session Manager Valve regex because this can catch a few more generic bots and force them to use the same Tomcat JSESSIONID
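• For reference, that valve lives in Tomcat's server.xml; after the change it should look roughly like this (the default crawlerUserAgents pattern plus the new [Ss]pider alternative, regex from memory):

```console
$ grep -A1 CrawlerSessionManagerValve /etc/tomcat7/server.xml
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*[Ss]pider.*" />
```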
  • I added a few of the patterns from above to our local agents list and ran the check-spider-hits.sh on CGSpace:
  • -
    $ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
    +
    $ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
     Purging 228916 hits from RTB website BOT in statistics
     Purging 18707 hits from ILRI Livestock Website Publications importer BOT in statistics
     Purging 2661 hits from ^Java\/[0-9]{1,2}.[0-9] in statistics
    @@ -472,7 +472,7 @@ Total number of bot hits purged: 3684
     
     
  • I can update the country metadata in PostgreSQL like this:
  • -
    dspace=> BEGIN;
    +
    dspace=> BEGIN;
     dspace=> UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE resource_type_id=2 AND metadata_field_id=228;
     UPDATE 51756
     dspace=> COMMIT;
    @@ -483,7 +483,7 @@ dspace=> COMMIT;
     
     
     
    -
    dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
    +
    dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.country" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
     COPY 195
     
    • Then use OpenRefine and make a new column for corrections, then use this GREL to convert to title case: value.toTitlecase() @@ -493,7 +493,7 @@ COPY 195
    • For the input forms I found out how to do a complicated search and replace in vim:
    -
    :'<,'>s/\<\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\>/\u\2\L\3/g
    +
    :'<,'>s/\<\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\>/\u\2\L\3/g
     
• It uses a negative lookahead (aka “lookaround” in PCRE?) to match words that are not “pair”, “displayed”, etc., because we don’t want to edit the XML tags themselves…
        @@ -509,18 +509,18 @@ COPY 195
    -
    dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
    +
    dspace=> \COPY (SELECT DISTINCT(text_value) as "cg.coverage.region" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
     COPY 34
     
    • I did the same as the countries in OpenRefine for the database values and in vim for the input forms
    • After testing the replacements locally I ran them on CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
    +
    $ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
     $ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
     
    • Then I started a full re-indexing:
    -
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    88m21.678s
     user    7m59.182s
    @@ -579,7 +579,7 @@ sys     2m22.713s
     
  • I posted a message on Yammer to inform all our users about the changes to countries, regions, and AGROVOC subjects
  • I modified all AGROVOC subjects to be lower case in PostgreSQL and then exported a list of the top 1500 to update the controlled vocabulary in our submission form:
  • -
    dspace=> BEGIN;
    +
    dspace=> BEGIN;
     dspace=> UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57;
     UPDATE 335063
     dspace=> COMMIT;
    @@ -588,7 +588,7 @@ COPY 1500
     
    • Use my agrovoc-lookup.py script to validate subject terms against the AGROVOC REST API, extract matches with csvgrep, and then update and format the controlled vocabulary:
    -
    $ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 > /tmp/subjects.txt
    +
    $ csvcut -c 1 /tmp/2020-10-15-top-1500-agrovoc-subject.csv | tail -n 1500 > /tmp/subjects.txt
     $ ./agrovoc-lookup.py -i /tmp/subjects.txt -o /tmp/subjects.csv -d
     $ csvgrep -c 4 -m 0 -i /tmp/subjects.csv | csvcut -c 1 | sed '1d' > dspace/config/controlled-vocabularies/dc-subject.xml
     # apply formatting in XML file
    @@ -596,7 +596,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
     
    • Then I started a full re-indexing on CGSpace:
    -
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    88m21.678s
     user    7m59.182s
    @@ -614,7 +614,7 @@ sys     2m22.713s
     
  • They are using the user agent “CCAFS Website Publications importer BOT” so they are getting rate limited by nginx
  • Ideally they would use the REST find-by-metadata-field endpoint, but it is really slow for large result sets (like twenty minutes!):
  • -
    $ curl -f -H "CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
    +
    $ curl -f -H "CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
     
    • For now I will whitelist their user agent so that they can continue scraping /browse
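• A quick way to verify the whitelist would be to hammer the browse UI a few times with their user agent and check that the responses stay HTTP 200 instead of getting rate limited (hypothetical check, the exact URL doesn't matter):

```console
$ for i in $(seq 1 20); do http --print h 'https://dspacetest.cgiar.org/browse?type=author' User-Agent:'CCAFS Website Publications importer BOT' | head -n 1; done
```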
    • I figured out that the mappings for AReS are stored in Elasticsearch @@ -624,7 +624,7 @@ sys 2m22.713s
    -
    $ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
    +
    $ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
     {
       "query": {
         "match": {
    @@ -635,7 +635,7 @@ sys     2m22.713s
     
    • I added a new find/replace:
    -
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
    +
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
     {
       "find": "ALAN1",
       "replace": "ALAN2",
    @@ -645,11 +645,11 @@ sys     2m22.713s
     
  • I see it in Kibana, and I can search it in Elasticsearch, but I don’t see it in OpenRXV’s mapping values dashboard
  • Now I deleted everything in the openrxv-values index:
  • -
    $ curl -XDELETE http://localhost:9200/openrxv-values
    +
    $ curl -XDELETE http://localhost:9200/openrxv-values
     
    • Then I tried posting it again:
    -
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
    +
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
     {
       "find": "ALAN1",
       "replace": "ALAN2",
    @@ -682,12 +682,12 @@ sys     2m22.713s
     
    • Last night I learned how to POST mappings to Elasticsearch for AReS:
    -
    $ curl -XDELETE http://localhost:9200/openrxv-values
    +
    $ curl -XDELETE http://localhost:9200/openrxv-values
     $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
     
    • The JSON file looks like this, with one instruction on each line:
    -
    {"index":{}}
    +
    {"index":{}}
     { "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
     {"index":{}}
     { "find": "FISH", "replace": "Fish" }
    @@ -737,7 +737,7 @@ f.close()
     
  • It filters all upper and lower case strings as well as any replacements that end in an acronym like “- ILRI”, reducing the number of mappings from around 4,000 to about 900
  • I deleted the existing openrxv-values Elasticsearch core and then POSTed it:
  • -
    $ ./convert-mapping.py > /tmp/elastic-mappings.txt
    +
    $ ./convert-mapping.py > /tmp/elastic-mappings.txt
     $ curl -XDELETE http://localhost:9200/openrxv-values
     $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
     
      @@ -762,17 +762,17 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-T
    • I ran the dspace cleanup -v process on CGSpace and got an error:
    -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
       Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
     
    • The solution is, as always:
    -
    $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
    +
    $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
     UPDATE 1
     
    • After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:
    -
    $ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
    +
    $ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
     
     Purging 2474 hits from ShortLinkTranslate in statistics
     Purging 2568 hits from RI\/1\.0 in statistics
    @@ -794,7 +794,7 @@ Total number of bot hits purged: 8174
     
     
     
    -
    $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
    +
    $ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
     $ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
     
    • And I saw three hits in Solr with isBot: true!!! @@ -817,7 +817,7 @@ $ curl -s 'http://localhost:8083/solr/statistics/update?softCommit=true'
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
     $ dspace metadata-export -f /tmp/cgspace.csv
     $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US],cg.subject.alliancebiovciat[],cg.subject.alliancebiovciat[en_US],cg.subject.bioversity[en_US],cg.subject.ccafs[],cg.subject.ccafs[en_US],cg.subject.ciat[],cg.subject.ciat[en_US],cg.subject.cip[],cg.subject.cip[en_US],cg.subject.cpwf[en_US],cg.subject.iita,cg.subject.iita[en_US],cg.subject.iwmi[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
     
      @@ -833,7 +833,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
      • Bosede was getting this error on CGSpace yesterday:
      -
      Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1072 by user 1759
      +
      Authorization denied for action WORKFLOW_STEP_1 on COLLECTION:1072 by user 1759
       
      • Collection 1072 appears to be IITA Miscellaneous
          @@ -848,7 +848,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
      -
      $ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&size=10000&q=*:*' > /tmp/affiliations.json
      +
      $ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affiliation&size=10000&q=*:*' > /tmp/affiliations.json
       
      • Then I decided to try a different approach and I adjusted my convert-mapping.py script to re-consider some replacement patterns with acronyms from the original AReS mapping.json file to hopefully address some MEL to CGSpace mappings
          @@ -893,7 +893,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
          • I re-installed DSpace Test with a fresh snapshot of CGSpace’s to test the DSpace 6 upgrade (the last time was in 2020-05, and we’ve fixed a lot of issues since then):
          -
          $ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
          +
          $ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
           $ git checkout origin/6_x-dev-atmire-modules
           $ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
           $ sudo su - postgres
          @@ -911,7 +911,7 @@ $ sudo systemctl start tomcat7
           
          • Then I started processing the Solr stats one core and 1 million records at a time:
          -
          $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
          +
          $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
           $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
           $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
           $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
          @@ -920,7 +920,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
           
          • After the fifth or so run I got this error:
          -
          Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
          +
          Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
           org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
                   at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
                   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
          @@ -945,7 +945,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
           
      -
      $ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
      +
      $ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
       
      • Then I restarted the solr-upgrade-statistics-6x process, which apparently had no records left to process
      • I started processing the statistics-2019 core… @@ -958,7 +958,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
        • The statistics processing on the statistics-2018 core errored after 1.8 million records:
        -
        Exception: Java heap space
        +
        Exception: Java heap space
         java.lang.OutOfMemoryError: Java heap space
         
        • I had the same problem when I processed the statistics-2018 core in 2020-07 and 2020-08 @@ -967,7 +967,7 @@ java.lang.OutOfMemoryError: Java heap space
      -
      $ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
      +
      $ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
       
      • I restarted the process and it crashed again a few minutes later
          @@ -976,7 +976,7 @@ java.lang.OutOfMemoryError: Java heap space
      -
      $ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
      +
      $ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
       
      • Then I started processing the statistics-2017 core…
          @@ -984,7 +984,7 @@ java.lang.OutOfMemoryError: Java heap space
      -
      $ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
      +
      $ curl -s "http://localhost:8083/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
       
      • Also I purged 2.7 million unmigrated records from the statistics-2019 core
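• Presumably with the same delete query as on the other cores, something like:

```console
$ curl -s "http://localhost:8083/solr/statistics-2019/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
```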
      • I filed an issue with Atmire about the duplicate values in the owningComm and containerCommunity fields in Solr: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839
      • @@ -1002,7 +1002,7 @@ java.lang.OutOfMemoryError: Java heap space
    -
    $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
    +
    $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
     
    • Peter asked me to add the new preferred AGROVOC subject “covid-19” to all items we had previously added “coronavirus disease”, and to make sure all items with ILRI subject “ZOONOTIC DISEASES” have the AGROVOC subject “zoonoses”
        @@ -1010,7 +1010,7 @@ java.lang.OutOfMemoryError: Java heap space
    -
    $ dspace metadata-export -f /tmp/cgspace.csv
    +
    $ dspace metadata-export -f /tmp/cgspace.csv
     $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri[en_US]' /tmp/cgspace.csv > /tmp/cgspace-subjects.csv
     
    • I sanity checked the CSV in csv-metadata-quality after exporting from OpenRefine, then applied the changes to 453 items on CGSpace
    • @@ -1040,7 +1040,7 @@ $ csvcut -c 'id,dc.subject[],dc.subject[en_US],cg.subject.ilri[],cg.subject.ilri
    -
    $ ./create-mappings.py > /tmp/elasticsearch-mappings.txt
    +
    $ ./create-mappings.py > /tmp/elasticsearch-mappings.txt
     $ ./convert-mapping.py >> /tmp/elasticsearch-mappings.txt
     $ curl -XDELETE http://localhost:9200/openrxv-values
     $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elasticsearch-mappings.txt
    @@ -1048,12 +1048,12 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-T
     
  • After that I had to manually create and delete a fake mapping in the AReS UI so that the mappings would show up
• I fixed a few strings in the OpenRXV admin dashboard and then re-built the frontend container:
  • -
    $ docker-compose up --build -d angular_nginx
    +
    $ docker-compose up --build -d angular_nginx
     

    2020-10-28

• Fix a handful more grammar and spelling issues in OpenRXV and then re-build the containers:
    -
    $ docker-compose up --build -d --force-recreate angular_nginx
    +
    $ docker-compose up --build -d --force-recreate angular_nginx
     
    • Also, I realized that the mysterious issue with countries getting changed to inconsistent lower case like “Burkina faso” is due to the country formatter (see: backend/src/harvester/consumers/fetch.consumer.ts)
        @@ -1079,7 +1079,7 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-T
    -
    $ cat 2020-10-28-update-regions.csv
    +
    $ cat 2020-10-28-update-regions.csv
     cg.coverage.region,correct
     East Africa,Eastern Africa
     West Africa,Western Africa
    @@ -1092,7 +1092,7 @@ $ ./fix-metadata-values.py -i 2020-10-28-update-regions.csv -db dspace -u dspace
     
    • Then I started a full Discovery re-indexing:
    -
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    92m14.294s
     user    7m59.840s
    @@ -1115,7 +1115,7 @@ sys     2m22.327s
     
     
  • Send Peter a list of affiliations, authors, journals, publishers, investors, and series for correction:
  • -
    dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
    +
    dspace=> \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-affiliations.csv WITH CSV HEADER;
     COPY 6357
     dspace=> \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-10-28-sponsors.csv WITH CSV HEADER;
     COPY 730
    @@ -1134,7 +1134,7 @@ COPY 5598
     
     
     
    -
    $ grep -c '"find"' /tmp/elasticsearch-mappings*
    +
    $ grep -c '"find"' /tmp/elasticsearch-mappings*
     /tmp/elasticsearch-mappings2.txt:350
     /tmp/elasticsearch-mappings.txt:1228
     $ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | wc -l
    @@ -1148,7 +1148,7 @@ $ cat /tmp/elasticsearch-mappings* | grep -v '{"index":{}}' | sort | u
     
     
     
    -
    $ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
    +
    $ cat /tmp/elasticsearch-mappings* > /tmp/new-elasticsearch-mappings.txt
     $ curl -XDELETE http://localhost:9200/openrxv-values
     $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/new-elasticsearch-mappings.txt
     
      @@ -1159,14 +1159,14 @@ $ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-T
    • Lower case some straggling AGROVOC subjects on CGSpace:
    -
    dspace=# BEGIN;
    +
    dspace=# BEGIN;
     dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
     UPDATE 123
     dspace=# COMMIT;
     
    • Move some top-level communities to the CGIAR System community for Peter:
    -
    $ dspace community-filiator --set --parent 10568/83389 --child 10568/1208
    +
    $ dspace community-filiator --set --parent 10568/83389 --child 10568/1208
     $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
     

    2020-10-30

      @@ -1187,7 +1187,7 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
    -
    or(
    +
    or(
       isNotNull(value.match(/.*\uFFFD.*/)),
       isNotNull(value.match(/.*\u00A0.*/)),
       isNotNull(value.match(/.*\u200A.*/)),
    @@ -1198,7 +1198,7 @@ $ dspace community-filiator --set --parent 10568/83389 --child 10568/56924
     
    • Then I did a test to apply the corrections and deletions on my local DSpace:
    -
    $ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
    +
    $ ./fix-metadata-values.py -i 2020-10-30-fix-854-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -t 'correct' -m 55
     $ ./delete-metadata-values.py -i 2020-10-30-delete-90-journals.csv -db dspace -u dspace -p 'fuuu' -f dc.source -m 55
     $ ./fix-metadata-values.py -i 2020-10-30-fix-386-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -t correct -m 39
     $ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace -u dspace -p 'fuuu' -f dc.publisher -m 39
    @@ -1214,12 +1214,12 @@ $ ./delete-metadata-values.py -i 2020-10-30-delete-10-publishers.csv -db dspace
     
     
  • Quickly process the sponsor corrections Peter sent me a few days ago and test them locally:
  • -
    $ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
    +
    $ ./fix-metadata-values.py -i 2020-10-31-fix-82-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct' -m 29
     $ ./delete-metadata-values.py -i 2020-10-31-delete-74-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
     
    • I applied all the fixes from today and yesterday on CGSpace and then started a full Discovery re-index:
    -
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
    diff --git a/docs/2020-11/index.html b/docs/2020-11/index.html index d9b2ae482..64aa012d6 100644 --- a/docs/2020-11/index.html +++ b/docs/2020-11/index.html @@ -32,7 +32,7 @@ So far we’ve spent at least fifty hours to process the statistics and stat "/> - + @@ -150,12 +150,12 @@ So far we’ve spent at least fifty hours to process the statistics and stat -
    $ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
    +
    $ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
     $ ./delete-metadata-values.py -i 2020-11-05-delete-29-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
     
    • Then I started a Discovery re-index on CGSpace:
    -
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    92m24.993s
     user    8m11.858s
    @@ -190,7 +190,7 @@ sys     2m26.931s
     
  • The statistics-2014 core finished processing after five hours, so I started processing the statistics-2013 core on DSpace Test
  • Since I was going to restart CGSpace and update the Discovery indexes anyways I decided to check for any straggling upper case AGROVOC entries and lower case them:
  • -
    dspace=# BEGIN;
    +
    dspace=# BEGIN;
     dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
     UPDATE 164
     dspace=# COMMIT;
    @@ -211,7 +211,7 @@ dspace=# COMMIT;
     
     
     
    -
    2020-11-10 08:43:59,634 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
    +
    2020-11-10 08:43:59,634 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
     2020-11-10 08:43:59,687 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
     2020-11-10 08:43:59,707 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
     2020-11-10 08:44:00,004 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
    @@ -227,7 +227,7 @@ dspace=# COMMIT;
     
     
     
    -
    2020-11-10 08:51:03,007 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2018
    +
    2020-11-10 08:51:03,007 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2018
     2020-11-10 08:51:03,008 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
     2020-11-10 08:51:03,137 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
     2020-11-10 08:51:03,153 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
    @@ -281,11 +281,11 @@ dspace=# COMMIT;
     
     
  • First we get the total number of communities with stats (using calcdistinct):
  • -
    facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=1&facet.offset=0&stats=true&stats.field=owningComm&stats.calcdistinct=true&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
    +
    facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=1&facet.offset=0&stats=true&stats.field=owningComm&stats.calcdistinct=true&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
     
    • Then get stats themselves, iterating 100 items at a time with limit and offset:
    -
    facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=100&facet.offset=0&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
    +
    facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=100&facet.offset=0&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
     
    • I was surprised to see 10,000,000 docs with isBot:true when I was testing on DSpace Test…
        @@ -309,7 +309,7 @@ dspace=# COMMIT;
    -
    $ dspace cleanup -v
    +
    $ dspace cleanup -v
     $ git checkout origin/6_x-dev-atmire-modules
     $ npm install -g yarn
     $ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2,\!dspace-jspui clean package
    @@ -329,7 +329,7 @@ $ sudo systemctl start tomcat7
     
     
     
    -
    # systemctl stop tomcat7
    +
    # systemctl stop tomcat7
     # pg_ctlcluster 9.6 main stop
     # tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
     # tar -cvzpf etc-postgresql-9.6.tar.gz /etc/postgresql/9.6
    @@ -345,7 +345,7 @@ $ sudo systemctl start tomcat7
     
• I disabled the dspace-statistics-api for now because it won’t work until I migrate all the Solr statistics anyways
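• Disabling it is presumably just a matter of stopping the systemd service until the migration is done (service name assumed from our Ansible setup):

```console
# systemctl stop dspace-statistics-api
# systemctl disable dspace-statistics-api
```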
  • Start a full Discovery re-indexing:
  • -
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    211m30.726s
     user    134m40.124s
    @@ -353,13 +353,13 @@ sys     2m17.979s
     
    • Towards the end of the indexing there were a few dozen of these messages:
    -
    2020-11-15 13:23:21,685 INFO  com.atmire.dspace.discovery.service.AtmireSolrService @ Removed Item: null from Index
    +
    2020-11-15 13:23:21,685 INFO  com.atmire.dspace.discovery.service.AtmireSolrService @ Removed Item: null from Index
     
    • I updated all the Ansible infrastructure and DSpace branches to be the DSpace 6 ones
    • I will wait until the Discovery indexing is finished to start doing the Solr statistics migration
    • I tested the email functionality and it seems to need more configuration:
    -
    $ dspace test-email
    +
    $ dspace test-email
     
     About to send test email:
      - To: blah@cgiar.org
    @@ -372,12 +372,12 @@ Error sending email:
     
  • I copied the mail.extraproperties = mail.smtp.starttls.enable=true setting from the old DSpace 5 dspace.cfg and now the emails are working
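• For reference, the relevant line in dspace.cfg is just:

```console
$ grep mail.extraproperties dspace/config/dspace.cfg
mail.extraproperties = mail.smtp.starttls.enable=true
```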
  • After the Discovery indexing finished I started processing the Solr stats one core and 2.5 million records at a time:
  • -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
     $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
     
    • After about 6,000,000 records I got the same error that I’ve gotten every time I test this migration process:
    -
    Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
    +
    Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
     org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
             at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    @@ -407,7 +407,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
     
    • There are almost 1,500 locks:
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     1494
     
    • I sent a mail to the dspace-tech mailing list to ask for help… @@ -417,7 +417,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
    • While processing the statistics-2018 Solr core I got the same memory error that I have gotten every time I processed this core in testing:
    -
    Exception: Java heap space
    +
    Exception: Java heap space
     java.lang.OutOfMemoryError: Java heap space
             at java.util.Arrays.copyOf(Arrays.java:3332)
             at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    @@ -454,7 +454,7 @@ java.lang.OutOfMemoryError: Java heap space
     
     
     
    -
    Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
    +
    Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
     org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
             at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
             at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    @@ -486,7 +486,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
     
    • There are over 2,000 locks:
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     2071
     

    2020-11-18

      @@ -534,7 +534,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
    • Peter got a strange message this evening when trying to update metadata:
    -
    2020-11-18 16:57:33,309 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1]
    +
    2020-11-18 16:57:33,309 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1]
     2020-11-18 16:57:33,316 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [13]; actual row count: 0; expected: 1]
     2020-11-18 16:57:33,385 INFO  org.hibernate.engine.jdbc.batch.internal.AbstractBatchImpl @ HHH000010: On release of batch it still contained JDBC statements
     
      @@ -603,25 +603,25 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
    -
    dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
     COPY 87411
     
    • Saving some notes I wrote down about faceting by community and collection in Solr, for potential use in the future in the DSpace Statistics API
    • Facet by owningComm to see total number of distinct communities (136):
    -
      facet=true&facet.mincount=1&facet.field=owningComm&facet.limit=1&facet.offset=0&stats=true&stats.field=id&stats.calcdistinct=true
    +
      facet=true&facet.mincount=1&facet.field=owningComm&facet.limit=1&facet.offset=0&stats=true&stats.field=id&stats.calcdistinct=true
     
    • Facet by owningComm and get the first 5 distinct:
    -
      facet=true&facet.mincount=1&facet.field=owningComm&facet.limit=5&facet.offset=0&facet.pivot=id,countryCode
    +
      facet=true&facet.mincount=1&facet.field=owningComm&facet.limit=5&facet.offset=0&facet.pivot=id,countryCode
     
    • Facet by owningComm and countryCode using facet.pivot and maybe I can just skip the normal facet params?
    -
    facet=true&f.owningComm.facet.limit=5&f.owningComm.facet.offset=5&facet.pivot=owningComm,countryCode
    +
    facet=true&f.owningComm.facet.limit=5&f.owningComm.facet.offset=5&facet.pivot=owningComm,countryCode
     
    • Facet by owningComm and countryCode using facet.pivot and limiting to top five countries… fuck it’s possible!
    -
    facet=true&f.owningComm.facet.limit=5&f.owningComm.facet.offset=5&f.countryCode.facet.limit=5&facet.pivot=owningComm,countryCode
    +
    facet=true&f.owningComm.facet.limit=5&f.owningComm.facet.offset=5&f.countryCode.facet.limit=5&facet.pivot=owningComm,countryCode
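• For reference, these parameter strings just get appended to a normal Solr select query, for example (a hypothetical request against the local statistics core):

```console
$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&rows=0&wt=json&indent=true&facet=true&f.owningComm.facet.limit=5&f.owningComm.facet.offset=5&f.countryCode.facet.limit=5&facet.pivot=owningComm,countryCode'
```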
     

    2020-11-23

    -
    $ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
    +
    $ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
     
    • IWMI sent me a few new ORCID identifiers so I combined them with our existing ones as well as another ILRI one that Tezira asked me to update, filtered the unique ones, and then resolved their names using my resolve-orcids.py script:
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-11-30-combined-orcids.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-11-30-combined-orcids.txt
     $ ./resolve-orcids.py -i /tmp/2020-11-30-combined-orcids.txt -o /tmp/2020-11-30-combined-orcids-names.txt -d
     # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
     $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
     
• I used my fix-metadata-values.py script to update the old occurrences of Hung’s ORCID and some others that I see have changed:
    -
    $ cat 2020-11-30-fix-hung-orcid.csv
    +
    $ cat 2020-11-30-fix-hung-orcid.csv
     cg.creator.id,correct
     "Hung Nguyen-Viet: 0000-0001-9877-0596","Hung Nguyen-Viet: 0000-0003-1549-2733"
     "Adriana Tofiño: 0000-0001-7115-7169","Adriana Tofiño Rivera: 0000-0001-7115-7169"
    diff --git a/docs/2020-12/index.html b/docs/2020-12/index.html
    index 89b1689d8..544a75edb 100644
    --- a/docs/2020-12/index.html
    +++ b/docs/2020-12/index.html
    @@ -36,7 +36,7 @@ I started processing those (about 411,000 records):
     
     
     "/>
    -
    +
     
     
         
    @@ -132,7 +132,7 @@ I started processing those (about 411,000 records):
     
     
     
    -
    $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
    +
    $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
     
    • AReS went down when the renew-letsencrypt service stopped the angular_nginx container in the pre-update hook and failed to bring it back up
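• Bringing it back up is just a matter of starting the container again from the OpenRXV directory, something like:

```console
$ docker-compose up -d angular_nginx
```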
        @@ -151,7 +151,7 @@ I started processing those (about 411,000 records):
      • Start testing export/import of yearly Solr statistics data into the main statistics core on DSpace Test, for example:
      -
      $ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
      +
      $ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
       $ ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
       $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
       
        @@ -179,13 +179,13 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
        • First the 2010 core:
        -
        $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
        +
        $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o statistics-2010.json -k uid
         $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2010.json -k uid
         $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
         
        • Judging by the DSpace logs all these cores had a problem starting up in the last month:
        -
        # grep -rsI "Unable to create core" [dspace]/log/dspace.log.2020-* | grep -o -E "statistics-[0-9]+" | sort | uniq -c
        +
        # grep -rsI "Unable to create core" [dspace]/log/dspace.log.2020-* | grep -o -E "statistics-[0-9]+" | sort | uniq -c
              24 statistics-2010
              24 statistics-2015
              18 statistics-2016
        @@ -193,7 +193,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
         
        • The message is always this:
        -
        org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
        +
        org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'statistics-2016': Unable to create core [statistics-2016] Caused by: Lock obtain timed out: NativeFSLock@/[dspace]/solr/statistics-2016/data/index/write.lock
         
        • I will migrate all these cores and see if it makes a difference, then probably end up migrating all of them
            @@ -223,7 +223,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
            • There are apparently 1,700 locks right now:
            -
            $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
            +
            $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
             1739
             

            2020-12-08

              @@ -233,7 +233,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
          -
          Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
          +
          Record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0 couldn't be processed
           com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: 64387815-d9a7-4605-8024-1c0a5c7520e0, an error occured in the com.atmire.statistics.util.update.atomic.processor.DeduplicateValuesProcessor
                   at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
                   at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
          @@ -270,7 +270,7 @@ Caused by: java.lang.UnsupportedOperationException
           
          • I was running the AtomicStatisticsUpdateCLI to remove duplicates on DSpace Test but it failed near the end of the statistics core (after 20 hours or so) with a memory error:
          -
          Successfully finished updating Solr Storage Reports | Wed Dec 09 15:25:11 CET 2020
          +
          Successfully finished updating Solr Storage Reports | Wed Dec 09 15:25:11 CET 2020
           Run 1 —  67% — 10,000/14,935 docs — 6m 6s — 6m 6s
           Exception: GC overhead limit exceeded
           java.lang.OutOfMemoryError: GC overhead limit exceeded
          @@ -279,7 +279,7 @@ java.lang.OutOfMemoryError: GC overhead limit exceeded
           
        • I increased the JVM heap to 2048m and tried again, but it failed with a memory error again…
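• For reference, the heap is passed to the CLI via JAVA_OPTS before re-running the updater, the same pattern as the solr-upgrade-statistics-6x runs:

```console
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
```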
        • I increased the JVM heap to 4096m and tried again, but it failed with another error:
        -
        Successfully finished updating Solr Storage Reports | Wed Dec 09 15:53:40 CET 2020
        +
        Successfully finished updating Solr Storage Reports | Wed Dec 09 15:53:40 CET 2020
         Exception: parsing error
         org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: parsing error
                 at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:530)
        @@ -341,7 +341,7 @@ Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected s
         
        • I can see it in the openrxv-items-final index:
        -
        $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
        +
        $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*' | json_pp
         {
            "_shards" : {
               "failed" : 0,
        @@ -355,14 +355,14 @@ Caused by: org.apache.http.TruncatedChunkException: Truncated chunk ( expected s
         
      • I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/64
      • For now I will try to delete the index and start a re-harvest in the Admin UI:
      -
      $ curl -XDELETE http://localhost:9200/openrxv-items-final
      +
      $ curl -XDELETE http://localhost:9200/openrxv-items-final
       {"acknowledged":true}%
       
      • Moayad said he’s working on the harvesting so I stopped it for now to re-deploy his latest changes
      • I updated Tomcat to version 7.0.107 on CGSpace (linode18), ran all updates, and restarted the server
      • I deleted both items indexes and restarted the harvesting:
      -
      $ curl -XDELETE http://localhost:9200/openrxv-items-final
      +
      $ curl -XDELETE http://localhost:9200/openrxv-items-final
       $ curl -XDELETE http://localhost:9200/openrxv-items-temp
       
      • Peter asked me for a list of all submitters and approvers that were active recently on CGSpace @@ -371,7 +371,7 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
    -
    localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
    +
    localhost/dspace63= > SELECT * FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-[0-9]{2}-*';
     

    2020-12-14

    • The re-harvesting finished last night on AReS but there are no records in the openrxv-items-final index @@ -380,7 +380,7 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
    -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*' | json_pp
     {
        "count" : 99992,
        "_shards" : {
    @@ -397,14 +397,14 @@ $ curl -XDELETE http://localhost:9200/openrxv-items-temp
     
     
     
    -
    $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
     {"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final"}
     $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
     
    • Now I see that the openrxv-items-final index has items, but there are still none in AReS Explorer UI!
    -
    $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
     {
       "count" : 99992,
       "_shards" : {
    @@ -417,7 +417,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H
     
    • The api logs show this from last night after the harvesting:
    -
    [Nest] 92   - 12/13/2020, 1:58:52 PM   [HarvesterService] Starting Harvest
    +
    [Nest] 92   - 12/13/2020, 1:58:52 PM   [HarvesterService] Starting Harvest
     [Nest] 92   - 12/13/2020, 10:50:20 PM   [FetchConsumer] OnGlobalQueueDrained
     [Nest] 92   - 12/13/2020, 11:00:20 PM   [PluginsConsumer] OnGlobalQueueDrained
     [Nest] 92   - 12/13/2020, 11:00:20 PM   [HarvesterService] reindex function is called
    @@ -432,7 +432,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H
     
  • I cloned the openrxv-items-final index to the openrxv-items index (see the sketch after this log excerpt) and now I see items in the Explorer UI
  • The PDF report was broken and I looked in the API logs and saw this:
  • -
    (node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
    +
    (node:94) UnhandledPromiseRejectionWarning: Error: Error: Could not find soffice binary
         at ExportService.downloadFile (/backend/dist/export/services/export/export.service.js:51:19)
         at processTicksAndRejections (internal/process/task_queues.js:97:5)
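A minimal sketch of that clone, following the usual Elasticsearch sequence used throughout these notes (block writes on the source, clone, unblock; this assumes the target index did not already exist):

```console
# sketch: clone openrxv-items-final to openrxv-items
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```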
     
      @@ -457,7 +457,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H
    -
    $ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
    +
    $ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=0' | json_pp > /tmp/policy1.json
     $ http --print b 'https://cgspace.cgiar.org/rest/collections/defee001-8cc8-4a6c-8ac8-21bb5adab2db?expand=all&limit=100&offset=100' | json_pp > /tmp/policy2.json
     $ query-json '.items | length' /tmp/policy1.json
     100
    @@ -487,7 +487,7 @@ $ query-json '.items | length' /tmp/policy2.json
     
     
     
    -
    $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-2020-12-14
     $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
     

    2020-12-15

    @@ -499,12 +499,12 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H
  • I checked the 1,534 fixes in Open Refine (had to fix a few UTF-8 errors, as always from Peter’s CSVs) and then applied them using the fix-metadata-values.py script:
  • -
    $ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
    +
    $ ./fix-metadata-values.py -i /tmp/2020-10-28-fix-1534-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
     $ ./delete-metadata-values.py -i /tmp/2020-10-28-delete-2-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
     
    • Since I was re-indexing Discovery anyways I decided to check for any uppercase AGROVOC and lowercase them:
    -
    dspace=# BEGIN;
    +
    dspace=# BEGIN;
     BEGIN
     dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
     UPDATE 406
    @@ -513,7 +513,7 @@ COMMIT
     
    • I also updated the Font Awesome icon classes for version 5 syntax:
    -
    dspace=# BEGIN;
    +
    dspace=# BEGIN;
     dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-rss','fas fa-rss', 'g') WHERE text_value LIKE '%fa fa-rss%';
     UPDATE 74
     dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'fa fa-at','fas fa-at', 'g') WHERE text_value LIKE '%fa fa-at%';
    @@ -522,7 +522,7 @@ dspace=# COMMIT;
     
    • Then I started a full Discovery re-index:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
     $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    265m11.224s
    @@ -544,7 +544,7 @@ sys     2m41.097s
     
    • After the Discovery re-indexing finished on CGSpace I prepared to start re-harvesting AReS by making sure the openrxv-items-temp index was empty and that the backup index I made yesterday was still there:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
     {
       "acknowledged" : true
     }
    @@ -576,7 +576,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-2020-12-14/_count?q=*&pretty'
     
     
     
    -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 100046,
       "_shards" : {
    @@ -611,7 +611,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp?pretty'
     
     
  • Generate a list of submitters and approvers active in the last months using the Provenance field on CGSpace:
  • -
    $ psql -h localhost -U postgres dspace -c "SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'" > /tmp/provenance.txt
    +
    $ psql -h localhost -U postgres dspace -c "SELECT text_value FROM metadatavalue WHERE metadata_field_id=28 AND text_value ~ '^.*on 2020-(06|07|08|09|10|11|12)-*'" > /tmp/provenance.txt
     $ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E "( on |checksum)" | sed -e 's/by //' -e 's/ (/,/' -e 's/)//' | sort | uniq > /tmp/recent-submitters-approvers.csv
     
    • Peter wanted it to send some mail to the users…
    • @@ -620,7 +620,7 @@ $ grep -o -E 'by .*)' /tmp/provenance.txt | grep -v -E "( on |checksum)&quo
      • I see some errors from CUA in our Tomcat logs:
      -
      Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
      +
      Thu Dec 17 07:35:27 CET 2020 | Query:containerItem:b049326a-0e76-45a8-ac0c-d8ec043a50c6
       Error while updating
       java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp
               at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1155)
      @@ -636,7 +636,7 @@ java.lang.UnsupportedOperationException: Multiple update components target the s
       
       
    • I was trying to export the ILRI community on CGSpace so I could update one of the ILRI author’s names, but it throws an error…
    -
    $ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
    +
    $ dspace metadata-export -i 10568/1 -f /tmp/2020-12-17-ILRI.csv
     Loading @mire database changes for module MQM
     Changes have been processed
     Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
    @@ -657,7 +657,7 @@ java.lang.NullPointerException
     
    • I did it via CSV with fix-metadata-values.py instead:
    -
    $ cat 2020-12-17-update-ILRI-author.csv
    +
    $ cat 2020-12-17-update-ILRI-author.csv
     dc.contributor.author,correct
     "Padmakumar, V.P.","Varijakshapanicker, Padmakumar"
     $ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
    @@ -668,7 +668,7 @@ $ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u ds
     
     
     
    -
    $ csvcut -c 'dc.identifier.citation[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dc.date.issued,dc.date.issued[],dc.date.issued[en_US],cg.identifier.status[en_US]' ~/Downloads/10568-80099.csv | csvgrep -c 'cg.identifier.status[en_US]' -m 'Limited Access' | csvgrep -c 'dc.date.issued' -m 2020 -c 'dc.date.issued[]' -m 2020 -c 'dc.date.issued[en_US]' -m 2020 > /tmp/limited-2020.csv
    +
    $ csvcut -c 'dc.identifier.citation[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dc.date.issued,dc.date.issued[],dc.date.issued[en_US],cg.identifier.status[en_US]' ~/Downloads/10568-80099.csv | csvgrep -c 'cg.identifier.status[en_US]' -m 'Limited Access' | csvgrep -c 'dc.date.issued' -m 2020 -c 'dc.date.issued[]' -m 2020 -c 'dc.date.issued[en_US]' -m 2020 > /tmp/limited-2020.csv
     

    2020-12-18

    • I added support for indexing community views and downloads to dspace-statistics-api @@ -689,7 +689,7 @@ $ ./fix-metadata-values.py -i 2020-12-17-update-ILRI-author.csv -db dspace -u ds
    • The DeduplicateValuesProcessor has been running on DSpace Test for the past two days and almost completed its second twelve-hour run, but it crashed near the end:
      -
      ...
      +
      ...
       Run 1 — 100% — 8,230,000/8,239,228 docs — 39s — 9h 8m 31s
       Exception: Java heap space
       java.lang.OutOfMemoryError: Java heap space
      @@ -744,7 +744,7 @@ java.lang.OutOfMemoryError: Java heap space
       
    • The AReS harvest finished this morning and I moved the Elasticsearch index manually
    • First, check the number of records in the temp index to make sure it seems complete and not with double data:
    -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 100135,
       "_shards" : {
    @@ -757,13 +757,13 @@ java.lang.OutOfMemoryError: Java heap space
     
    • Then delete the old backup and clone the current items index as a backup:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-2020-12-14?pretty'
     $ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2020-12-21
     
    • Then delete the current items index and clone it from temp:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
     $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
     $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
    @@ -806,11 +806,11 @@ $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H
     
     
     
    -
    statistics-2012: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
    +
    statistics-2012: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error opening new searcher
     
    • I exported the 2012 stats from the year core and imported them to the main statistics core with solr-import-export-json:
    -
    $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
    +
    $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o statistics-2012.json -k uid
 $ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics -a import -o statistics-2012.json -k uid
     $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
     
      @@ -824,7 +824,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
    -
    $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
     {
       "count" : 100135,
       "_shards" : {
    @@ -842,7 +842,7 @@ $ curl -X PUT "localhost:9200/openrxv-items/_settings?pretty" -H 'Cont
     
    • The indexing on AReS finished so I cloned the openrxv-items-temp index to openrxv-items and deleted the backup index:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items?pretty'
     $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
     $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings?pretty" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
    diff --git a/docs/2021-01/index.html b/docs/2021-01/index.html
    index a9df20b62..7607a467f 100644
    --- a/docs/2021-01/index.html
    +++ b/docs/2021-01/index.html
    @@ -50,7 +50,7 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
     
     
     "/>
    -
    +
     
     
         
    @@ -160,12 +160,12 @@ For example, this item has 51 views on CGSpace, but 0 on AReS
     
     
     
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     # start indexing in AReS
     
    • Then, the next morning when it’s done, check the results of the harvesting, backup the current openrxv-items index, and clone the openrxv-items-temp index to openrxv-items:
    -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 100278,
       "_shards" : {
    @@ -214,7 +214,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
     
     
     
    -
    $ ./doi-to-handle.py -db dspace -u dspace -p 'fuuu' -i /tmp/dois.txt -o /tmp/out.csv
    +
    $ ./doi-to-handle.py -db dspace -u dspace -p 'fuuu' -i /tmp/dois.txt -o /tmp/out.csv
     
    • Help Udana export IWMI records from AReS
        @@ -261,12 +261,12 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-04'
    -
    2021-01-10 10:03:27,692 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID="TX35636856957739531161091194485578658698")
    +
    2021-01-10 10:03:27,692 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=1e8fb96c-b994-4fe2-8f0c-0a98ab138be0, ObjectType=(Unknown), ObjectID=null, TimeStamp=1610269383279, dispatcher=1544803905, detail=[null], transactionID="TX35636856957739531161091194485578658698")
     
    • I filed a bug on Atmire’s issue tracker
    • Peter asked me to move the CGIAR Gender Platform community to the top level of CGSpace, but I get an error when I use the community-filiator command:
    -
    $ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
    +
    $ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
     Loading @mire database changes for module MQM
     Changes have been processed
     Exception: null
    @@ -301,7 +301,7 @@ java.lang.UnsupportedOperationException
     
     
     
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     # start indexing in AReS
     ... after ten hours
     $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    @@ -331,7 +331,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     
     
     
    -
    $ cat log/dspace.log.2020-12-2* | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71' | sort | uniq | wc -l
    +
    $ cat log/dspace.log.2020-12-2* | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=64.62.202.71' | sort | uniq | wc -l
     0
     
    • So now I should really add it to the DSpace spider agent list so it doesn’t create Solr hits @@ -341,7 +341,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    • I purged the existing hits using my check-spider-ip-hits.sh script:
    -
    $ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
    +
    $ ./check-spider-ip-hits.sh -d -f /tmp/ips -s http://localhost:8081/solr -s statistics -p
     

    2021-01-11

    • The AReS indexing finished this morning and I moved the openrxv-items-temp core to openrxv-items (see above) @@ -351,7 +351,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    • I deployed the community-filiator fix on CGSpace and moved the Gender Platform community to the top level of CGSpace:
    -
    $ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
    +
    $ dspace community-filiator --remove --parent=10568/66598 --child=10568/106605
     

    2021-01-12

    • IWMI is really pressuring us to have a periodic CSV export of their community @@ -393,12 +393,12 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
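A periodic export like the one IWMI is asking for could presumably be a cron job wrapped around the same metadata-export command used elsewhere in these notes; the handle and output path below are placeholders:

```console
# sketch: export the IWMI community metadata to CSV (10568/XXXXX is a placeholder handle)
$ dspace metadata-export -i 10568/XXXXX -f /tmp/iwmi-community.csv
```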
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     # start indexing in AReS
     
    • Then, the next morning when it’s done, check the results of the harvesting, backup the current openrxv-items index, and clone the openrxv-items-temp index to openrxv-items:
    -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 100540,
       "_shards" : {
    @@ -445,7 +445,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-18'
     
     
     
    -
    localhost/dspace63= > BEGIN;
    +
    localhost/dspace63= > BEGIN;
     localhost/dspace63= > DELETE FROM metadatavalue WHERE metadata_field_id IN (115, 116, 117, 118);
     DELETE 27
     localhost/dspace63= > COMMIT;
    @@ -462,7 +462,7 @@ localhost/dspace63= > COMMIT;
     
     
     
    -
    $ docker exec -it api /bin/bash
    +
    $ docker exec -it api /bin/bash
     # apt update && apt install unoconv
     
    • Help Peter get a list of titles and DOIs for CGSpace items that Altmetric does not have an attention score for @@ -512,12 +512,12 @@ localhost/dspace63= > COMMIT;
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     # start indexing in AReS
     
    • Then, the next morning when it’s done, check the results of the harvesting, backup the current openrxv-items index, and clone the openrxv-items-temp index to openrxv-items:
    -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 100699,
       "_shards" : {
    @@ -579,7 +579,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-01-25'
     
     
     
    -
    Jan 26, 2021 10:47:23 AM org.apache.coyote.http11.AbstractHttp11Processor process
    +
    Jan 26, 2021 10:47:23 AM org.apache.coyote.http11.AbstractHttp11Processor process
     INFO: Error parsing HTTP request header
      Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
     java.lang.IllegalArgumentException: Invalid character found in the request target [/discover/search/csv?query=*&scope=~&filters=author:(Alan\%20Orth)]. The valid characters are defined in RFC 7230 and RFC 3986
    @@ -601,12 +601,12 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
     
  • I filed a bug on DSpace’s issue tracker (though I accidentally hit Enter and submitted it before I finished, and there is no edit function)
  • Looking into Linode report that the load outbound traffic rate was high this morning:
  • -
    # grep -E '26/Jan/2021:(08|09|10|11|12)' /var/log/nginx/rest.log | goaccess --log-format=COMBINED -
    +
    # grep -E '26/Jan/2021:(08|09|10|11|12)' /var/log/nginx/rest.log | goaccess --log-format=COMBINED -
     
    • The culprit seems to be the ILRI publications importer, so that’s OK
    • But I also see an IP in Jordan hitting the REST API 1,100 times today:
    -
    80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] "GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0" 302 138 "http://wp.local/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
    +
    80.10.12.54 - - [26/Jan/2021:09:43:42 +0100] "GET /rest/rest/bitstreams/98309f17-a831-48ed-8f0a-2d3244cc5a1c/retrieve HTTP/2.0" 302 138 "http://wp.local/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
     
    • Seems to be someone from CodeObia working on WordPress
        @@ -615,7 +615,7 @@ java.lang.IllegalArgumentException: Invalid character found in the request targe
     • I purged all ~3,000 statistics hits that have the “http://wp.local/” referrer:
      -
      $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>referrer:http\:\/\/wp\.local\/</query></delete>"
      +
      $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>referrer:http\:\/\/wp\.local\/</query></delete>"
       
      -
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
      +
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
       # start indexing in AReS
       
      • Sent out emails about CG Core v2 to Macaroni Bros, Fabio, Hector at CCAFS, Dani and Tariku
      • diff --git a/docs/2021-02/index.html b/docs/2021-02/index.html index a5853861f..a2e9050be 100644 --- a/docs/2021-02/index.html +++ b/docs/2021-02/index.html @@ -60,7 +60,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty } } "/> - + @@ -157,7 +157,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty
      • I had a call with CodeObia to discuss the work on OpenRXV
      • Check the results of the AReS harvesting from last night:
      -
      $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
      +
      $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
       {
         "count" : 100875,
         "_shards" : {
      @@ -170,18 +170,18 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty
       
      • Set the current items index to read only and make a backup:
      -
      $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
      +
      $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
       $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01
       
      • Delete the current items index and clone the temp one to it:
      -
      $ curl -XDELETE 'http://localhost:9200/openrxv-items'
      +
      $ curl -XDELETE 'http://localhost:9200/openrxv-items'
       $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
       $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
       
      • Then delete the temp and backup:
      -
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
      +
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
       {"acknowledged":true}%
       $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
       
        @@ -196,7 +196,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
      • I tried to export the ILRI community from CGSpace but I got an error:
      -
      $ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
      +
      $ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
       Loading @mire database changes for module MQM
       Changes have been processed
       Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
      @@ -234,16 +234,16 @@ java.lang.NullPointerException
       
    • Maria Garruccio sent me some new ORCID iDs for Bioversity authors, as well as a correction for Stefan Burkart’s iD
    • I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt
     $ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
     
    • I sorted the names and added the XML formatting in vim, then ran it through tidy:
    -
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
    +
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
     
    • Then I added all the changed names plus Stefan’s incorrect ones to a CSV and processed them with fix-metadata-values.py:
    -
    $ cat 2021-02-02-fix-orcid-ids.csv 
    +
    $ cat 2021-02-02-fix-orcid-ids.csv 
     cg.creator.id,correct
     Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
     Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
    @@ -263,7 +263,7 @@ $ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u
     
    • Tag forty-three items from Bioversity’s new authors with ORCID iDs using add-orcid-identifiers-csv.py:
    -
    $ cat /tmp/2021-02-02-add-orcid-ids.csv
    +
    $ cat /tmp/2021-02-02-add-orcid-ids.csv
     dc.contributor.author,cg.creator.id
     "Nchanji, E.",Eileen Bogweh Nchanji: 0000-0002-6859-0962
     "Nchanji, Eileen",Eileen Bogweh Nchanji: 0000-0002-6859-0962
    @@ -300,7 +300,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db d
     
     
     
    -
    $ time chrt -b 0 dspace index-discovery -b
    +
    $ time chrt -b 0 dspace index-discovery -b
     $ dspace oai import -c
     
    • Attend Accenture meeting for repository managers @@ -333,7 +333,7 @@ $ dspace oai import -c
    -
    $ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
    +
    $ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
     
    • The corrected versions have a lot of encoding issues so I asked Peter to give me the correct ones so I can search/replace them:
        @@ -358,7 +358,7 @@ $ dspace oai import -c
     • I ended up using python-ftfy to fix those very easily (see the sketch below), then replaced them in the CSV
      • Then I trimmed whitespace at the beginning, end, and around the “;”, and applied the 1,600 fixes using fix-metadata-values.py:
      -
      $ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
      +
      $ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
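A minimal sketch of the python-ftfy step mentioned above, assuming the mojibake was in a plain-text CSV (file names are illustrative):

```console
# sketch: fix encoding errors with ftfy (file names are placeholders)
$ pip install ftfy
$ python3 -c 'import sys, ftfy; sys.stdout.write(ftfy.fix_text(sys.stdin.read()))' < series-broken.csv > series-fixed.csv
```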
       
      • Help Peter debug an issue with one of Alan Duncan’s new FEAST Data reports on CGSpace
          @@ -372,7 +372,7 @@ $ dspace oai import -c
        • Run system updates on CGSpace (linode18), deploy latest 6_x-prod branch, and reboot the server
        • After the server came back up I started a full Discovery re-indexing:
        -
        $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
        +
        $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
         
         real    247m30.850s
         user    160m36.657s
        @@ -385,13 +385,13 @@ sys     2m26.050s
         
      • Delete the old Elasticsearch temp index to prepare for starting an AReS re-harvest:
      -
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
      +
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
       # start indexing in AReS
       

      2021-02-08

      • Finish rotating the AReS indexes after the harvesting last night:
      -
      $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
      +
      $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
       {
         "count" : 100983,
         "_shards" : {
      @@ -429,7 +429,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'
       
    -
    $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
    +
    $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
     30354
     $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l
     18555
    @@ -452,15 +452,15 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h |
     
     
     
    -
    $ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]'
    +
    $ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -r '^.+[^(Journal Item|Journal Article|Book|Book Chapter)]'
     
    • I imported the CSV into OpenRefine and converted the date text values to date types so I could facet by dates before 2010:
    -
    if(diff(value,"01/01/2010".toDate(),"days")<0, true, false)
    +
    if(diff(value,"01/01/2010".toDate(),"days")<0, true, false)
     
    • Then I filtered by publisher to make sure they were only ours:
    -
    or(
    +
    or(
       value.contains("International Livestock Research Institute"),
       value.contains("ILRI"),
       value.contains("International Livestock Centre for Africa"),
    @@ -488,7 +488,7 @@ $ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h |
     
  • Run system updates, deploy latest 6_x-prod branch, and reboot CGSpace (linode18)
  • Normalize text_lang of DSpace item metadata on CGSpace:
  • -
    dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
    +
    dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
      text_lang |  count  
     -----------+---------
      en_US     | 2567413
    @@ -504,7 +504,7 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
     
    • Clear the OpenRXV temp items index:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     
    • Then start a full harvesting of CGSpace in the AReS Explorer admin dashboard
    • Peter asked me about a few other recently submitted FEAST items that are restricted @@ -521,12 +521,12 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
    -
    $ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
    +
    $ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
     

    2021-02-15

    • Check the results of the AReS Harvesting from last night:
    -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 101126,
       "_shards" : {
    @@ -539,12 +539,12 @@ dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (S
     
    • Set the current items index to read only and make a backup:
    -
    $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-15
     
    • Delete the current items index and clone the temp one:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items'
     $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
     $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    @@ -563,18 +563,18 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-15'
     
     
  • They are definitely bots posing as users, as I see they have created six thousand DSpace sessions today:
  • -
    $ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l
    +
    $ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=45.146.165.203' | sort | uniq | wc -l
     4007
     $ cat dspace.log.2021-02-16 | grep -E 'session_id=[A-Z0-9]{32}:ip_addr=130.255.161.231' | sort | uniq | wc -l
     2128
     
    • Ah, actually 45.146.165.203 is making requests like this:
    -
    "http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO"
    +
    "http://cgspace.cgiar.org:80/bitstream/handle/10568/238/Res_report_no3.pdf;jsessionid=7311DD88B30EEF9A8F526FF89378C2C5%' AND 4313=CONCAT(CHAR(113)+CHAR(98)+CHAR(106)+CHAR(112)+CHAR(113),(SELECT (CASE WHEN (4313=4313) THEN CHAR(49) ELSE CHAR(48) END)),CHAR(113)+CHAR(106)+CHAR(98)+CHAR(112)+CHAR(113)) AND 'XzQO%'='XzQO"
     
    • I purged the hits from these two using my check-spider-ip-hits.sh:
    -
    $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
    +
    $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
     Purging 4005 hits from 45.146.165.203 in statistics
     Purging 3493 hits from 130.255.161.231 in statistics
     
    @@ -582,7 +582,7 @@ Total number of bot hits purged: 7498
     
    • Ugh, I looked in Solr for the top IPs in 2021-01 and found a few more of these Russian IPs so I purged them too:
    -
    $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
    +
    $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
     Purging 27163 hits from 45.146.164.176 in statistics
     Purging 19556 hits from 45.146.165.105 in statistics
     Purging 15927 hits from 45.146.165.83 in statistics
    @@ -596,7 +596,7 @@ Total number of bot hits purged: 70731
     
     
     
    -
    $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
    +
    $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips -p
     Purging 3 hits from 130.255.161.231 in statistics
     Purging 16773 hits from 64.39.99.15 in statistics
     Purging 6976 hits from 64.39.99.13 in statistics
    @@ -627,7 +627,7 @@ Total number of bot hits purged: 23789
     
  • Abenet asked me to add Tom Randolph’s ORCID identifier to CGSpace
  • I also tagged all his 247 existing items on CGSpace:
  • -
    $ cat 2021-02-17-add-tom-orcid.csv 
    +
    $ cat 2021-02-17-add-tom-orcid.csv 
     dc.contributor.author,cg.creator.id
     "Randolph, Thomas F.","Thomas Fitz Randolph: 0000-0003-1849-9877"
     $ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace -u dspace -p 'fuuu'
    @@ -640,7 +640,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-02-17-add-tom-orcid.csv -db dspace
     
  • Start the CG Core v2 migration on CGSpace (linode18)
  • After deploying the latest 6_x-prod branch and running migrate-fields.sh I started a full Discovery reindex:
  • -
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    311m12.617s
     user    217m3.102s
    @@ -648,7 +648,7 @@ sys     2m37.363s
     
    • Then update OAI:
    -
    $ dspace oai import -c
    +
    $ dspace oai import -c
     $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
     
    • Ben Hack was asking if there is a REST API query that will give him all ILRI outputs for their new Sharepoint intranet @@ -668,14 +668,14 @@ $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
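One possible way to answer Ben's question with the DSpace 6 REST API would be to resolve the ILRI community handle and then page through each of its collections' items (the UUIDs below are placeholders):

```console
# sketch: resolve the ILRI community, list its collections, then page through a collection's items
$ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/1' | json_pp
$ curl -s 'https://cgspace.cgiar.org/rest/communities/COMMUNITY-UUID/collections' | json_pp
$ curl -s 'https://cgspace.cgiar.org/rest/collections/COLLECTION-UUID/items?limit=100&offset=0' | json_pp
```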
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx1024m'
     $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
     
    • The process took an hour or so!
    • I added colorized output to the csv-metadata-quality tool and tagged version 0.4.4 on GitHub
    • I updated the fields in AReS Explorer and then removed the old temp index so I can start a fresh re-harvest of CGSpace:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     # start indexing in AReS
     

    2021-02-22

      @@ -687,7 +687,7 @@ $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
    -
    localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
    +
    localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
     UPDATE 104
     
    • As for splitting the other values, I think I can export the dspace_object_id and text_value and then upload it as a CSV rather than writing a Python script to create the new metadata values
    • @@ -696,7 +696,7 @@ UPDATE 104
      • Check the results of the AReS harvesting from last night:
      -
      $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
      +
      $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
       {
         "count" : 101380,
         "_shards" : {
      @@ -709,18 +709,18 @@ UPDATE 104
       
      • Set the current items index to read only and make a backup:
      -
      $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
      +
      $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
       $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
       
      • Delete the current items index and clone the temp one to it:
      -
      $ curl -XDELETE 'http://localhost:9200/openrxv-items'
      +
      $ curl -XDELETE 'http://localhost:9200/openrxv-items'
       $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
       $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
       
      • Then delete the temp and backup:
      -
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
      +
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
       {"acknowledged":true}%
       $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
       

      2021-02-23

      @@ -732,21 +732,21 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
    • Remove semicolons from series names without numbers:
    -
    dspace=# BEGIN;
    +
    dspace=# BEGIN;
     dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
     UPDATE 104
     dspace=# COMMIT;
     
    • Set all text_lang values on CGSpace to en_US to make the series replacements easier (this didn’t work, read below):
    -
    dspace=# BEGIN;
    +
    dspace=# BEGIN;
     dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item);
     UPDATE 911
     cgspace=# COMMIT;
     
    • Then export all series with their IDs to CSV:
    -
    dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
    +
    dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
     
    • In OpenRefine I trimmed and consolidated whitespace, then made some quick cleanups to normalize the fields based on a sanity check
        @@ -761,22 +761,22 @@ cgspace=# COMMIT;
    -
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
    +
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
     UPDATE 1
     
    • This also seems to work, using the id for just that one item:
    -
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
    +
    dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
     UPDATE 37
     
    • This seems to work better for some reason:
    -
    dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
    +
    dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
     UPDATE 18659
     
    • I split the CSV file into batches of 5,000 using xsv (see the sketch below), then imported them one by one into CGSpace:
    -
    $ dspace metadata-import -f /tmp/0.csv
    +
    $ dspace metadata-import -f /tmp/0.csv
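The xsv split step mentioned above was presumably along these lines; it writes chunks named by their starting record number (0.csv, 5000.csv, and so on) into the output directory, and the input file name is illustrative:

```console
# sketch: split the cleaned series CSV into 5,000-row chunks in /tmp
$ xsv split -s 5000 /tmp /tmp/2021-02-23-series.csv
```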
     
    • It took FOREVER to import each file… like several hours each. MY GOD DSpace 6 is slow.
    • Help Dominique Perera debug some issues with the WordPress DSpace importer plugin from Macaroni Bros @@ -785,7 +785,7 @@ UPDATE 18659
    -
    104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT"
    +
    104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT"
     104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT"
     
    • The first request is OK, but the second one is malformed for sure
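If the RTB bot intended to fetch sub-communities, the well-formed request would presumably include a community UUID (placeholder below):

```console
# sketch: the corrected request, with a placeholder community UUID
$ curl -s 'https://cgspace.cgiar.org/rest/communities/COMMUNITY-UUID/communities' | json_pp
```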
    • @@ -794,12 +794,12 @@ UPDATE 18659
      • Export a list of journals for Peter to look through:
      -
      localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.journal", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
      +
      localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.journal", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
       COPY 3345
       
      • Start a fresh harvesting on AReS because Udana mapped some items today and wants to include them in his report:
      -
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
      +
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
       # start indexing in AReS
       
      • Also, I want to include the new series name/number cleanups so it’s not a total waste of time
      • @@ -808,7 +808,7 @@ COPY 3345
        • Hmm the AReS harvest last night seems to have finished successfully, but the number of items is less than I was expecting:
        -
        $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
        +
        $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
         {
           "count" : 99546,
           "_shards" : {
        @@ -843,7 +843,7 @@ COPY 3345
         
    -
    value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,"")
    +
    value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/\(.*\)/,"")
     value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1")
     
    • This value.partition was new to me… and it took me a bit of time to figure out whether I needed to escape the parentheses in the issue number or not (no) and how to reference a capture group with value.replace
    • @@ -857,7 +857,7 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1")
    • Niroshini from IWMI is still having issues adding WLE subjects to items during the metadata review step in the workflow
    • It seems the BatchEditConsumer log spam is gone since I applied Atmire’s patch
    -
    $ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]*
    +
    $ grep -c 'BatchEditConsumer should not have been given' dspace.log.2021-02-[12]*
     dspace.log.2021-02-10:5067
     dspace.log.2021-02-11:2647
     dspace.log.2021-02-12:4231
    diff --git a/docs/2021-03/index.html b/docs/2021-03/index.html
    index 866ecf994..1c15f279e 100644
    --- a/docs/2021-03/index.html
    +++ b/docs/2021-03/index.html
    @@ -34,7 +34,7 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
     
     
     "/>
    -
    +
     
     
         
    @@ -163,14 +163,14 @@ Also, we found some issues building and running OpenRXV currently due to ecosyst
     
    • I looked at the number of connections in PostgreSQL and it’s definitely high again:
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     1020
     
    • I reported it to Atmire to take a look, on the same issue we had been tracking this before
    • Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
    • I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my add-orcid-identifier.py script:
    -
    $ cat 2021-03-04-add-zoe-campbell-orcid.csv 
    +
    $ cat 2021-03-04-add-zoe-campbell-orcid.csv 
     dc.contributor.author,cg.creator.identifier
     "Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
     "Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
    @@ -183,7 +183,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -
     
     
     
    -
    localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
    +
    localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
     COPY 32087
     
    • I used OpenRefine to remove all journal values that didn’t contain one of these characters: ; ( ) @@ -193,7 +193,7 @@ COPY 32087
    -
    value.partition(';')[0].trim() # to get journal names
    +
    value.partition(';')[0].trim() # to get journal names
     value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
     value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
     
      @@ -233,7 +233,7 @@ value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") #
      • I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:
      -
      $ docker-compose -f docker/docker-compose.yml down
      +
      $ docker-compose -f docker/docker-compose.yml down
       $ docker volume create docker_esData_7
       $ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
       $ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
      @@ -249,12 +249,12 @@ $ docker-compose -f docker/docker-compose.yml up -d
       
    • I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
    • Delete the openrxv-items-temp index to test a fresh harvesting:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     

    2021-03-05

    • Check the results of the AReS harvesting from last night:
    -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 101761,
       "_shards" : {
    @@ -267,18 +267,18 @@ $ docker-compose -f docker/docker-compose.yml up -d
     
    • Set the current items index to read only and make a backup:
    -
    $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
     
    • Delete the current items index and clone the temp one to it:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items'
     $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
     
    • Then delete the temp and backup:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     {"acknowledged":true}%
     $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
     
      @@ -298,7 +298,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
    -
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
    +
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
     ...
         "openrxv-items-final": {
             "aliases": {
    @@ -308,7 +308,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
     
    • But on AReS production openrxv-items has somehow become a concrete index:
    -
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
    +
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
     ...
         "openrxv-items": {
             "aliases": {}
    @@ -322,7 +322,7 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
     
    • I fixed the issue on production by cloning the openrxv-items index to openrxv-items-final, deleting openrxv-items, and then re-creating it as an alias:
    -
    $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
     $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
     $ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
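The alias re-creation step (the command is truncated in the hunk header below) presumably used the Elasticsearch _aliases API with an add action, something like:

```console
# sketch: re-create openrxv-items as an alias of openrxv-items-final
# (the concrete openrxv-items index must be deleted before an alias with the same name can be added)
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions": [{"add": {"index": "openrxv-items-final", "alias": "openrxv-items"}}]}'
```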
    @@ -331,7 +331,7 @@ $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application
     
    • Delete backups and remove read-only mode on openrxv-items:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
     $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
     
    • Linode sent alerts about the CPU usage on CGSpace yesterday and the day before @@ -340,11 +340,11 @@ $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Typ
    -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
     
    • I see the usual IPs for CCAFS and ILRI importer bots, but also 143.233.242.132 which appears to be for GARDIAN:
    -
    # zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
    +
    # zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
     6237
     # zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
     6418
    @@ -375,7 +375,7 @@ $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Typ
     
     
     
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     13
     
    • On 2021-03-03 the PostgreSQL transactions started rising:
    • @@ -409,7 +409,7 @@ $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Typ
    -
    $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
     # start harvesting on AReS
     
      @@ -434,7 +434,7 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
    -
    $ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
    +
    $ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
     

    2021-03-10

• Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses cg.isijournal and MELSpace uses mel.impact-factor
@@ -444,7 +444,7 @@ $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items
    • Peter said he doesn’t see “Source Code” or “Software” in the output type facet on the ILRI community, but I see it on the home page, so I will try to do a full Discovery re-index:
    -
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    318m20.485s
     user    215m15.196s
    @@ -467,7 +467,7 @@ sys     2m51.529s
     
    • Switch to linux-kvm kernel on linode20 and linode18:
    -
    # apt update && apt full-upgrade
    +
    # apt update && apt full-upgrade
     # apt install linux-kvm
     # apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
     # apt autoremove && apt autoclean
    @@ -478,13 +478,13 @@ sys     2m51.529s
     
  • Last week Peter added OpenRXV to CGSpace: https://hdl.handle.net/10568/112982
  • Back up the current openrxv-items-final index on AReS to start a new harvest:
  • -
    $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
     $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
     
    • After the harvesting finished it seems the indexes got messed up again, as openrxv-items is an alias of openrxv-items-temp instead of openrxv-items-final:
    -
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
    +
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
     ...
         "openrxv-items-final": {
             "aliases": {}
    @@ -535,7 +535,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
     
     
  • Back up the current openrxv-items-final index to start a fresh AReS Harvest:
  • -
    $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
     $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
     
      @@ -545,7 +545,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
      • The harvesting on AReS yesterday completed, but somehow I have twice the number of items:
      -
      $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
      +
      $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
       {
         "count" : 206204,
         "_shards" : {
      @@ -558,7 +558,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
       
      • Hmmm and even my backup index has a strange number of items:
      -
      $ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
      +
      $ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
       {
         "count" : 844,
         "_shards" : {
      @@ -571,7 +571,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
       
      • I deleted all indexes and re-created the openrxv-items alias:
      -
      $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
      +
      $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
       $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
       ...
           "openrxv-items-temp": {
      @@ -591,7 +591,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
       
       
• The AReS harvest finally finished, with 1047 pages of items, but the openrxv-items-final index is empty and the openrxv-items-temp index has 103,000 items:
    -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 103162,
       "_shards" : {
    @@ -604,12 +604,12 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
     
    • I tried to clone the temp index to the final, but got an error:
    -
    $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
    +
    $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
     {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}% 
     
    • I looked in the Docker logs for Elasticsearch and saw a few memory errors:
    -
    java.lang.OutOfMemoryError: Java heap space
    +
    java.lang.OutOfMemoryError: Java heap space
     
    • According to /usr/share/elasticsearch/config/jvm.options in the Elasticsearch container the default JVM heap is 1g
        @@ -622,7 +622,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
    -
        "openrxv-items-final": {
    +
        "openrxv-items-final": {
             "aliases": {}
         },
         "openrxv-items-temp": {
    @@ -634,7 +634,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
     
    • For reference you can also get the Elasticsearch JVM stats from the API:
    -
    $ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
    +
    $ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
     
    • I re-deployed AReS with 1.5GB of heap using the ES_JAVA_OPTS environment variable
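• After the containers come back up, a quick way to confirm the new heap limit actually took effect is the cat nodes API (a sketch; heap.max is the configured maximum):

```console
$ curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent'
```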
        @@ -644,7 +644,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
      • Then I fixed the aliases to make sure openrxv-items was an alias of openrxv-items-final, similar to how I did a few weeks ago
      • I re-created the temp index:
      -
      $ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
      +
      $ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
       

      2021-03-24

    -
    # du -s /home/dspacetest.cgiar.org/solr/statistics
    +
    # du -s /home/dspacetest.cgiar.org/solr/statistics
     57861236        /home/dspacetest.cgiar.org/solr/statistics
     
    • I applied their changes to config/spring/api/atmire-cua-update.xml and started the duplicate processor:
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
     $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
     
    • The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)
    • Hah, I still got a memory error after only a few minutes:
    -
    ...
    +
    ...
     Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s                                      
     Exception: GC overhead limit exceeded                                                                          
     java.lang.OutOfMemoryError: GC overhead limit exceeded 
    @@ -678,7 +678,7 @@ java.lang.OutOfMemoryError: GC overhead limit exceeded
     
  • I guess we really do have to use -r 100
  • Now the thing runs for a few minutes and “finishes”:
  • -
    $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
    +
    $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
     Loading @mire database changes for module MQM
     Changes have been processed
     
    @@ -796,7 +796,7 @@ Run 1 took 5m 53s
     
     
     
    -
    2021-03-29 08:55:40,073 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&wt=javabin&version=2} hits=143 status=0 QTime=0
    +
    2021-03-29 08:55:40,073 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&wt=javabin&version=2} hits=143 status=0 QTime=0
     
    • But the item mapper only displays ten items, with no pagination
        @@ -845,7 +845,7 @@ r = requests.
    • I exported a list of all our ISSNs from CGSpace:
    -
    localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
    +
    localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
     COPY 3081
     
• I wrote a script to check the ISSNs against Crossref’s API: crossref-issn-lookup.py

diff --git a/docs/2021-04/index.html b/docs/2021-04/index.html
index 21b26ebd4..ec0e85636 100644
--- a/docs/2021-04/index.html
+++ b/docs/2021-04/index.html
@@ -44,7 +44,7 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
@@ -153,16 +153,16 @@ Perhaps one of the containers crashed, I should have looked closer but I was in
    -
    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"
    +
    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"
     

    2021-04-04

    • Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:
    -
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
    +
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
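• A more compact check that shows the alias-to-index mapping at a glance is the cat aliases API (a sketch):

```console
$ curl -s 'http://localhost:9200/_cat/aliases?v'
```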
     
    • Then set the openrxv-items-final index to read-only so we can make a backup:
    -
    $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' 
    +
    $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' 
     {"acknowledged":true}%
     $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
     {"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
    @@ -181,7 +181,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
     
     
     
    -
    $ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
    +
    $ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
     
    • For now I only fixed obvious errors like “1234-5678.” and “e-ISSN: 1234-5678” etc, but there are still lots of invalid ones which need more manual work:
        @@ -196,7 +196,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
        • The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again:
        -
        $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
        +
        $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
         {
             "openrxv-items-final": {
                 "aliases": {}
        @@ -218,7 +218,7 @@ $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Conte
         
    -
    $ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
    +
    $ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
     $ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
       sed '1d' | \
       csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
    @@ -257,13 +257,13 @@ $ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.
     
    • Then I submitted the file three times (changing the page parameter):
    -
    $ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
    +
    $ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
     $ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
     $ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json
     
    • Then I extracted the views and downloads in the most ridiculous way:
    -
    $ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
    +
    $ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
     30364
     $ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
     9100
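• A slightly less ridiculous way to sum them would be jq; this sketch assumes the response wraps the per-item numbers in a "statistics" array (that key name is my guess from the output above, so adjust if it differs):

```console
$ # "statistics" is an assumed key; views/downloads are the keys grepped above
$ jq -s '[.[].statistics[].views] | add' /tmp/page*.json
$ jq -s '[.[].statistics[].downloads] | add' /tmp/page*.json
```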
    @@ -290,16 +290,16 @@ $ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs |
     
     
     
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     12413
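• When the lock count blows up like this it helps to see which connections are actually holding them, for example by grouping pg_stat_activity by application and state (a sketch):

```console
$ psql -c "SELECT application_name, state, COUNT(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY 3 DESC;"
```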
     
    • The system journal shows thousands of these messages in the system journal, this is the first one:
    -
    Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
    +
    Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
     
    • Around that time in the dspace log I see nothing unusual, but maybe these?
    -
    2021-04-06 07:52:29,409 INFO  com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
    +
    2021-04-06 07:52:29,409 INFO  com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
     
    • (BTW what is the deal with the “200/127”? I should send a comment to Atmire)
        @@ -308,7 +308,7 @@ $ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs |
      • I restarted the PostgreSQL and Tomcat services and now I see less connections, but still WAY high:
      -
      $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
      +
      $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
       3640
       $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
       2968
      @@ -318,7 +318,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
       
    • After ten minutes or so it went back down…
    • And now it’s back up in the thousands… I am seeing a lot of stuff in dspace log like this:
    -
    2021-04-06 11:59:34,364 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
    +
    2021-04-06 11:59:34,364 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
     2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
     2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
     2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
    @@ -354,17 +354,17 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
     
  • I had a meeting with Peter and Abenet about CGSpace TODOs
  • CGSpace went down again and the PostgreSQL locks are through the roof:
  • -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     12154
     
    • I don’t see any activity on REST API, but in the last four hours there have been 3,500 DSpace sessions:
    -
    # grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
    +
    # grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
     3547
     
    • I looked at the same time of day for the past few weeks and it seems to be a normal number of sessions:
    -
    # for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
    +
    # for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
     ...
     3572
     4085
    @@ -390,7 +390,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
     
    • What about total number of sessions per day?
    -
    # for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
    +
    # for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
     ...
     /home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
     11784
    @@ -421,7 +421,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
     
     
  • The locks in PostgreSQL shot up again…
  • -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     3447
     $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     3527
    @@ -440,7 +440,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
     
    • While looking at the nginx logs I see that MEL is trying to log into CGSpace’s REST API and delete items:
    -
    34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
    +
    34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
     34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] "DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1" 401 704 "-" "-"
     
• I see a few of these per day going back several months
@@ -450,7 +450,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
    • Also annoying, I see tons of what look like penetration testing requests from Qualys:
    -
    2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
    +
    2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
     2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
     2021-04-04 06:35:17,890 INFO  org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
     2021-04-04 06:35:18,145 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
    @@ -464,19 +464,19 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
     
     
  • 10PM and the server is down again, with locks through the roof:
  • -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     12198
     
    • I see that there are tons of PostgreSQL connections getting abandoned today, compared to very few in the past few weeks:
    -
    $ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
    +
    $ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
     1838
     $ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
     3
     
    • I even restarted the server and connections were low for a few minutes until they shot back up:
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     13
     $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     8651
    @@ -488,12 +488,12 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
     
  • I had to go to bed and I bet it will crash and be down for hours until I wake up…
  • What the hell is this user agent?
  • -
    54.197.119.143 - - [06/Apr/2021:19:18:11 +0200] "GET /handle/10568/16499 HTTP/1.1" 499 0 "-" "GetUrl/1.0 wdestiny@umich.edu (Linux)"
    +
    54.197.119.143 - - [06/Apr/2021:19:18:11 +0200] "GET /handle/10568/16499 HTTP/1.1" 499 0 "-" "GetUrl/1.0 wdestiny@umich.edu (Linux)"
     

    2021-04-07

    • CGSpace was still down from last night of course, with tons of database locks:
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     12168
     
    • I restarted the server again and the locks came back
@@ -504,7 +504,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
    -
    2021-04-01 12:45:11,414 WARN  org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon;  Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address
    +
    2021-04-01 12:45:11,414 WARN  org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon;  Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address
     
    • The issue is not the named user above, but a member of the group…
    • And the group does have users with invalid email addresses (probably accounts created automatically after authenticating with LDAP):
@@ -513,7 +513,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
      • I extracted all the group IDs from recent logs that had users with invalid email addresses:
      -
      $ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
      +
      $ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
       0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6
       1769137c-36d4-42b2-8fec-60585e110db7
       203c8614-8a97-4ac8-9686-d9d62cb52acc
      @@ -565,12 +565,12 @@ fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
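• To see which accounts are actually in one of those groups I could query the database directly; a sketch assuming the DSpace 6 schema (epersongroup2eperson joining eperson), using the first group UUID from the list above:

```console
dspace=# -- table and column names assumed from the DSpace 6 schema
dspace=# SELECT e.email FROM eperson e JOIN epersongroup2eperson g2e ON e.uuid = g2e.eperson_id WHERE g2e.eperson_group_id = '0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6';
```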
       
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     12070
     
    • I restarted PostgreSQL and Tomcat and the locks go straight back up!
    -
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     13
     $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     986
    @@ -608,7 +608,7 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
     
     
     
    -
    $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup
     $ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
     $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
    @@ -616,18 +616,18 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
     
    • Then I updated all Docker containers and rebooted the server (linode20) so that the correct indexes would be created again:
    -
    $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
    +
    $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
     
    • Then I realized I have to clone the backup index directly to openrxv-items-final, and re-create the openrxv-items alias:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
     $ curl -X PUT "localhost:9200/openrxv-items-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-backup/_clone/openrxv-items-final
     $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
     
    • Now I see both openrxv-items-final and openrxv-items have the current number of items:
    -
    $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
    +
    $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
     {
       "count" : 103373,
       "_shards" : {
    @@ -672,24 +672,24 @@ $ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
     
    • 13,000 requests in the last two months from a user with user agent SomeRandomText, for example:
    -
    84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] "GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"
    +
    84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] "GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"
     
    • I purged them:
    -
    $ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
    +
    $ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
     Purging 13159 hits from SomeRandomText in statistics
     
     Total number of bot hits purged: 13159
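• To double-check a single agent directly in Solr before or after purging, a count-only query like this should work (a sketch; I'm assuming the statistics core's userAgent field here):

```console
$ # userAgent field name assumed from the Solr statistics core
$ curl -s 'http://localhost:8081/solr/statistics/select?q=userAgent:SomeRandomText&rows=0&wt=json'
```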
     
    • I noticed there were 78 items submitted in the hour before CGSpace crashed:
    -
    # grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item 
    +
    # grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item 
     78
     
    • Of those 78, 77 of them were from Udana
    • Compared to other mornings (0 to 9 AM) this month that seems to be pretty high:
    -
    # for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
    +
    # for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
      add_item; done
     32
     0
    @@ -723,7 +723,7 @@ Total number of bot hits purged: 13159
     
     
  • Create a test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:
  • -
    $ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
    +
    $ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'
     
• I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
        @@ -735,12 +735,12 @@ Total number of bot hits purged: 13159
        • Update all containers on AReS (linode20):
        -
        $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
        +
        $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
         
        • Then run all system updates and reboot the server
        • I learned a new command for Elasticsearch:
        -
        $ curl http://localhost:9200/_cat/indices
        +
        $ curl http://localhost:9200/_cat/indices
         yellow open openrxv-values           ChyhGwMDQpevJtlNWO1vcw 1 1   1579      0 537.6kb 537.6kb
         yellow open openrxv-items-temp       PhV5ieuxQsyftByvCxzSIw 1 1 103585 104372 482.7mb 482.7mb
         yellow open openrxv-shared           J_8cxIz6QL6XTRZct7UBBQ 1 1    127      0 115.7kb 115.7kb
        @@ -754,7 +754,7 @@ yellow open users                    M0t2LaZhSm2NrF5xb64dnw 1 1      2      0  1
         
        • Somehow the openrxv-items-final index only has a few items and the majority are in openrxv-items-temp, via the openrxv-items alias (which is in the temp index):
        -
        $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty' 
        +
        $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty' 
         {
           "count" : 103585,
           "_shards" : {
        @@ -767,7 +767,7 @@ yellow open users                    M0t2LaZhSm2NrF5xb64dnw 1 1      2      0  1
         
        • I found a cool tool to help with exporting and restoring Elasticsearch indexes:
        -
        $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
        +
        $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
         $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
         ...
         Sun, 18 Apr 2021 06:27:07 GMT | Total Writes: 103585
        @@ -776,20 +776,20 @@ Sun, 18 Apr 2021 06:27:07 GMT | dump complete
         
      • It took only two or three minutes to export everything…
      • I did a test to restore the index:
      -
      $ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
      +
      $ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
       $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-test --limit 1000 --type=data
       
      • So that’s pretty cool!
      • I deleted the openrxv-items-final index and openrxv-items-temp indexes and then restored the mappings to openrxv-items-final, added the openrxv-items alias, and started restoring the data to openrxv-items with elasticdump:
      -
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
      +
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
       $ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
       $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
       $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data
       
• AReS seems to be working fine after that, so I created the openrxv-items-temp index and then started a fresh harvest on AReS Explorer:
      -
      $ curl -X PUT "localhost:9200/openrxv-items-temp"
      +
      $ curl -X PUT "localhost:9200/openrxv-items-temp"
       
      • Run system updates on CGSpace (linode18) and run the latest Ansible infrastructure playbook to update the DSpace Statistics API, PostgreSQL JDBC driver, etc, and then reboot the system
      • I wasted a bit of time trying to get TSLint and then ESLint running for OpenRXV on GitHub Actions
@@ -798,13 +798,13 @@ $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localh
        • The AReS harvesting last night seems to have completed successfully, but the number of results is strange:
        -
        $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
        +
        $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
         yellow open openrxv-items-temp       kNUlupUyS_i7vlBGiuVxwg 1 1 103741 105553 483.6mb 483.6mb
         yellow open openrxv-items-final      HFc3uytTRq2GPpn13vkbmg 1 1    970      0   2.3mb   2.3mb
         
        • The indices endpoint doesn’t include the openrxv-items alias, but it is currently in the openrxv-items-temp index so the number of items is the same:
        -
        $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
        +
        $ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
         {
           "count" : 103741,
           "_shards" : {
        @@ -821,7 +821,7 @@ yellow open openrxv-items-final      HFc3uytTRq2GPpn13vkbmg 1 1    970      0
         
    -
    $ dspace test-email
    +
    $ dspace test-email
     ...
     Error sending email:
      - Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 550 5.2.1 Mailbox cannot be accessed [PR0P264CA0280.FRAP264.PROD.OUTLOOK.COM]
    @@ -850,7 +850,7 @@ Error sending email:
     
     
     
    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
     $ cp atmire-cua-update.xml-20210124-132112.old /home/dspacetest.cgiar.org/config/spring/api/atmire-cua-update.xml
     $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12 -g
     
      @@ -869,7 +869,7 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
    -
    $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
    +
    $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
     $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
     $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
     $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
    @@ -883,13 +883,13 @@ $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localh
     
    • The AReS harvest last night seems to have finished successfully and the number of items looks good:
    -
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
    +
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
     yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1      0 0    283b    283b
     yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb   254mb
     
    • And the aliases seem correct for once:
    -
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
    +
    $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
     ...
         "openrxv-items-final": {
             "aliases": {
    @@ -904,7 +904,7 @@ yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb
     
  • That’s 250 new items in the index since the last harvest!
  • Re-create my local Artifactory container because I’m getting errors starting it and it has been a few months since it was updated:
  • -
    $ podman rm artifactory
    +
    $ podman rm artifactory
     $ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
     $ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
     $ podman start artifactory
    @@ -925,11 +925,11 @@ $ podman start artifactory
     
     
  • I tried to delete all the Atmire SQL migrations:
  • -
    localhost/dspace7b5= > DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
    +
    localhost/dspace7b5= > DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
     
    • But I got an error when running dspace database migrate:
    -
    $ ~/dspace7b5/bin/dspace database migrate
    +
    $ ~/dspace7b5/bin/dspace database migrate
     
     Database URL: jdbc:postgresql://localhost:5432/dspace7b5
     Migrating database to latest version... (Check dspace logs for details)
    @@ -961,11 +961,11 @@ Detected applied migration not resolved locally: 6.0.2017.09.25
     
    • I deleted those migrations:
    -
    localhost/dspace7b5= > DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
    +
    localhost/dspace7b5= > DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
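• Before and after deleting rows like this it's worth peeking at what Flyway thinks is applied; a sketch against the same schema_version table (column names per Flyway's standard schema history table):

```console
localhost/dspace7b5= > SELECT installed_rank, version, description, success FROM schema_version ORDER BY installed_rank DESC LIMIT 10;
```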
     
    • Then when I ran the migration again it failed for a new reason, related to the configurable workflow:
    -
    Database URL: jdbc:postgresql://localhost:5432/dspace7b5
    +
    Database URL: jdbc:postgresql://localhost:5432/dspace7b5
     Migrating database to latest version... (Check dspace logs for details)
     Migration exception:
     java.sql.SQLException: Flyway migration error occurred
    @@ -993,12 +993,12 @@ Statement  : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflo
     
    -
    $ ~/dspace7b5/bin/dspace database migrate ignored
    +
    $ ~/dspace7b5/bin/dspace database migrate ignored
     
    • Now I see all migrations have completed and DSpace actually starts up fine!
    • I will try to do a full re-index to see how long it takes:
    -
    $ time ~/dspace7b5/bin/dspace index-discovery -b
    +
    $ time ~/dspace7b5/bin/dspace index-discovery -b
     ...
     ~/dspace7b5/bin/dspace index-discovery -b  25156.71s user 64.22s system 97% cpu 7:11:09.94 total
     
      @@ -1012,7 +1012,7 @@ Statement : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflo
    -
    $ csvgrep -e 'windows-1252' -c 'Handle.net IDs' -i -m '10568/' ~/Downloads/Altmetric\ -\ Research\ Outputs\ -\ CGSpace\ -\ 2021-04-26.csv | csvcut -c DOI | sed '1d' > /tmp/dois.txt
    +
    $ csvgrep -e 'windows-1252' -c 'Handle.net IDs' -i -m '10568/' ~/Downloads/Altmetric\ -\ Research\ Outputs\ -\ CGSpace\ -\ 2021-04-26.csv | csvcut -c DOI | sed '1d' > /tmp/dois.txt
     $ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.csv -db dspace63 -u dspace -p 'fuuu' -d
     
    • He will Tweet them…
diff --git a/docs/2021-05/index.html b/docs/2021-05/index.html
index 2ea78126d..9d8283c92 100644
--- a/docs/2021-05/index.html
+++ b/docs/2021-05/index.html
@@ -36,7 +36,7 @@ I looked at the top user agents and IPs in the Solr statistics for last month an
I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one… as that’s an actual user…
@@ -147,7 +147,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
    -
    193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
    +
    193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
     193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata-21%2B21*01 HTTP/1.1" 200 458201 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
     193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'||lower('')||' HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
     193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'%2Brtrim('')%2B' HTTP/1.1" 200 458209 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
    @@ -155,7 +155,7 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
     
  • I will report the IP on abuseipdb.com and purge their hits from Solr
  • The second IP is in Colombia and is making thousands of requests for what looks like some test site:
  • -
    181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
    +
    181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
     181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
     
• But this site does not exist (yet?)
@@ -165,11 +165,11 @@ I will add the RI/1.0 pattern to our DSpace agents overload and purge them from
    • The third IP is in Russia apparently, and the user agent has the pl-PL locale with thousands of requests like this:
    -
    45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] "GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&isAllowed=y HTTP/1.1" 200 918998 "http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15"
    +
    45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] "GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&isAllowed=y HTTP/1.1" 200 918998 "http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15"
     
    • I will purge these all with my check-spider-ip-hits.sh script:
    -
    $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
    +
    $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
     Purging 21648 hits from 193.169.254.178 in statistics
     Purging 20323 hits from 181.62.166.177 in statistics
     Purging 19376 hits from 45.146.166.180 in statistics
    @@ -179,7 +179,7 @@ Total number of bot hits purged: 61347
     
    • Check the AReS Harvester indexes:
    -
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
    +
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
     yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1      0 0    283b    283b
     yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb   254mb
     $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
    @@ -195,13 +195,13 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
     
    • I think they look OK (openrxv-items is an alias of openrxv-items-final), but I took a backup just in case:
    -
    $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
    +
    $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
     $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
     
    • Then I started an indexing in the AReS Explorer admin dashboard
    • The indexing finished, but it looks like the aliases are messed up again:
    -
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
    +
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
     yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
     yellow open openrxv-items-final      d0tbMM_SRWimirxr_gm9YA 1 1    937      0   2.2mb   2.2mb
     

    2021-05-05

@@ -229,7 +229,7 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
-
    $ time ~/dspace64/bin/dspace index-discovery -b
    +
    $ time ~/dspace64/bin/dspace index-discovery -b
     ~/dspace64/bin/dspace index-discovery -b  4053.24s user 53.17s system 38% cpu 2:58:53.83 total
     
• Nope! Still slow, and still no mapped item…
@@ -244,7 +244,7 @@ yellow open openrxv-items-final d0tbMM_SRWimirxr_gm9YA 1 1 937 0
    • The indexes on AReS Explorer are messed up after last week’s harvesting:
    -
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
    +
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
     yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
     yellow open openrxv-items-final      d0tbMM_SRWimirxr_gm9YA 1 1    937      0   2.2mb   2.2mb
     
    @@ -262,21 +262,21 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
     
  • openrxv-items should be an alias of openrxv-items-final
  • I made a backup of the temp index and then started indexing on the AReS Explorer admin dashboard:
  • -
    $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
    +
    $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
     $ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
     

    2021-05-10

    • Amazing, the harvesting on AReS finished but it messed up all the indexes and now there are no items in any index!
    -
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
    +
    $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
     yellow open openrxv-items-temp        8thRX0WVRUeAzmd2hkG6TA 1 1      0     0    283b    283b
     yellow open openrxv-items-temp-backup _0tyvctBTg2pjOlcoVP1LA 1 1 104165 20134 305.5mb 305.5mb
     yellow open openrxv-items-final       BtvV9kwVQ3yBYCZvJS1QyQ 1 1      0     0    283b    283b
     
    • I fixed the indexes manually by re-creating them and cloning from the backup:
    -
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
    +
    $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
     $ curl -X PUT "localhost:9200/openrxv-items-temp-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
     $ curl -s -X POST http://localhost:9200/openrxv-items-temp-backup/_clone/openrxv-items-final
     $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
    @@ -284,11 +284,11 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp-backup'
     
• Also I ran all updates on the server and updated all Docker images, then rebooted the server (linode20):
    -
    $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
    +
    $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
     
    • I backed up the AReS Elasticsearch data using elasticdump, then started a new harvest:
    -
    $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
    +
    $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
     $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
     
• Discuss CGSpace statistics with the CIP team
@@ -329,7 +329,7 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
    • I checked the CLARISA list against ROR’s April, 2020 release (“Version 9”, on figshare, though it is version 8 in the dump):
    -
    $ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
    +
    $ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
     $ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
     1770
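• For spot-checking individual names that didn't match, the ROR API can also be queried directly (a sketch; the query parameter does a fuzzy search over organization names and aliases):

```console
$ curl -s 'https://api.ror.org/organizations?query=International+Livestock+Research+Institute' | python -m json.tool | less
```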
     
      @@ -341,7 +341,7 @@ $ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
      • Fix a few thousand IWMI URLs that are using HTTP instead of HTTPS on CGSpace:
      -
      localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
      +
      localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
       UPDATE 1132
       localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://publications.iwmi.org','https://publications.iwmi.org', 'g') WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;
       UPDATE 1803
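• For updates like this I usually preview the number of affected rows first with a plain SELECT that mirrors the same WHERE clause (a sketch):

```console
localhost/dspace63= > SELECT COUNT(*) FROM metadatavalue WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
```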
      @@ -367,7 +367,7 @@ UPDATE 1803
       
      • I have to fix the Elasticsearch indexes on AReS after last week’s harvesting because, as always, the openrxv-items index should be an alias of openrxv-items-final instead of openrxv-items-temp:
      -
      $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
      +
      $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
           "openrxv-items-final": {
               "aliases": {}
           },
      @@ -380,13 +380,13 @@ UPDATE 1803
       
      • I took a backup of the openrxv-items index with elasticdump so I can re-create them manually before starting a new harvest tomorrow:
      -
      $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
      +
      $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
       $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
       

      2021-05-16

      • I deleted and re-created the Elasticsearch indexes on AReS:
      -
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
      +
      $ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
       $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
       $ curl -XPUT 'http://localhost:9200/openrxv-items-final'
       $ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
      @@ -394,7 +394,7 @@ $ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application
       
      • Then I re-imported the backup that I created with elasticdump yesterday:
      -
      $ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
      +
      $ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
       $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000 
       
      • Then I started a new harvest on AReS
@@ -403,7 +403,7 @@ $ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localh
        • The AReS harvest finished and the Elasticsearch indexes seem OK so I shouldn’t have to fix them next time…
        -
        $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
        +
        $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
         yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1      0 0    283b    283b
         yellow open openrxv-items-final      TrJ1Ict3QZ-vFkj-4VcAzw 1 1 104317 0 259.4mb 259.4mb
         $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
        @@ -423,7 +423,7 @@ $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
         
    -
    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap@cgiarad.org" -W "(sAMAccountName=aorth)"
    +
    $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap@cgiarad.org" -W "(sAMAccountName=aorth)"
     Enter LDAP Password: 
     ldap_bind: Invalid credentials (49)
             additional info: 80090308: LdapErr: DSID-0C090453, comment: AcceptSecurityContext error, data 532, v3839
    @@ -446,11 +446,11 @@ ldap_bind: Invalid credentials (49)
     
     
     
    -
    $ xmllint --xpath '//value-pairs[@value-pairs-name="ccafsprojectpii"]/pair/stored-value/node()' dspace/config/input-forms.xml
    +
    $ xmllint --xpath '//value-pairs[@value-pairs-name="ccafsprojectpii"]/pair/stored-value/node()' dspace/config/input-forms.xml
     
    • I formatted the input file with tidy, especially because one of the new project tags has an ampersand character… grrr:
    -
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml      
    +
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml      
     line 3658 column 26 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
     line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
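• Those warnings are because the raw & in the new project tag needs to be escaped as &amp;; a quick sed sketch, assuming the literal string from the warning:

```console
$ sed -i 's/&WA_EU-IFAD/\&amp;WA_EU-IFAD/g' dspace/config/input-forms.xml
```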
     
      @@ -461,16 +461,16 @@ line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_E
    • Paola from the Alliance emailed me some new ORCID identifiers to add to CGSpace
    • I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
     $ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
     
    • I sorted the names and added the XML formatting in vim, then ran it through tidy:
    -
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
    +
    $ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml
     
    • Tag fifty-five items from the Alliance’s new authors with ORCID iDs using add-orcid-identifiers-csv.py:
    -
    $ cat 2021-05-18-add-orcids.csv 
    +
    $ cat 2021-05-18-add-orcids.csv 
     dc.contributor.author,cg.creator.identifier
     "Urioste Daza, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
     "Urioste, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
    @@ -504,7 +504,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspa
     
     
     
    -
    dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
    +
    dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
     UPDATE 47405
     
• That’s interesting because we lowercased them all a few months ago, so these must all be new… wow
@@ -518,7 +518,7 @@ UPDATE 47405
      • Export the top 5,000 AGROVOC terms to validate them:
      -
      localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
      +
      localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
       COPY 5000
       $ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
       $ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
      @@ -545,7 +545,7 @@ $ csvgrep -c "number of matches" -r '^0$' /tmp/2021-05-20-agrovoc-resu
       
      • Add ORCID identifiers for missing ILRI authors and tag 550 others based on a few authors I noticed that were missing them:
      -
      $ cat 2021-05-24-add-orcids.csv 
      +
      $ cat 2021-05-24-add-orcids.csv 
       dc.contributor.author,cg.creator.identifier
       "Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
       "Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
      @@ -562,7 +562,7 @@ $ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u
       
      • A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:
      -
      $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
      +
      $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
       $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
       
      • The indexes look OK so I started a harvesting on AReS
@@ -571,13 +571,13 @@ $ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/o
        • The AReS harvest got messed up somehow, as I see the number of items in the indexes are the same as before the harvesting:
        -
        $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items                                                
        +
        $ curl -s http://localhost:9200/_cat/indices | grep openrxv-items                                                
         yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
         yellow open openrxv-items-final      soEzAnp3TDClIGZbmVyEIw 1 1    953      0   2.3mb   2.3mb
         
        • Update all docker images on the AReS server (linode20):
        -
        $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
        +
        $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
         $ docker-compose -f docker/docker-compose.yml down
         $ docker-compose -f docker/docker-compose.yml build
         
          @@ -585,7 +585,7 @@ $ docker-compose -f docker/docker-compose.yml build
        • Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317… so it was actually correct before!
        • For reference, this is how I re-created everything:
        -
        curl -XDELETE 'http://localhost:9200/openrxv-items-final'
        +
        curl -XDELETE 'http://localhost:9200/openrxv-items-final'
         curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
         curl -XPUT 'http://localhost:9200/openrxv-items-final'
         curl -XPUT 'http://localhost:9200/openrxv-items-temp'
        @@ -605,7 +605,7 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
         
         
      • Looking in the DSpace log for this morning I see a big hole in the logs at that time (UTC+2 server time):
      -
      2021-05-26 02:17:52,808 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/70659 with status: 2. Result: '10568/70659: item has country codes, skipping'
      +
      2021-05-26 02:17:52,808 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/70659 with status: 2. Result: '10568/70659: item has country codes, skipping'
       2021-05-26 02:17:52,853 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/66761 with status: 2. Result: '10568/66761: item has country codes, skipping'
       2021-05-26 03:00:05,772 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.spidersfile:null
       2021-05-26 03:00:05,773 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.server:http://localhost:8081/solr/statistics
      @@ -613,7 +613,7 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
       
    • There are no logs between 02:17 and 03:00… hmmm.
    • I see a similar gap in the Solr log, though it starts at 02:15:
    -
    2021-05-26 02:15:07,968 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={f.location.coll.facet.sort=count&facet.field=location.comm&facet.field=location.coll&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=search.resourcetype:2&fq=NOT(discoverable:false)&rows=0&version=2&q=*:*&f.location.coll.facet.limit=-1&facet.mincount=1&facet=true&f.location.comm.facet.sort=count&wt=javabin&facet.offset=0&f.location.comm.facet.limit=-1} hits=90792 status=0 QTime=6 
    +
    2021-05-26 02:15:07,968 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={f.location.coll.facet.sort=count&facet.field=location.comm&facet.field=location.coll&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=search.resourcetype:2&fq=NOT(discoverable:false)&rows=0&version=2&q=*:*&f.location.coll.facet.limit=-1&facet.mincount=1&facet=true&f.location.comm.facet.sort=count&wt=javabin&facet.offset=0&f.location.comm.facet.limit=-1} hits=90792 status=0 QTime=6 
     2021-05-26 02:15:09,446 INFO  org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=1 
     2021-05-26 02:28:03,602 INFO  org.apache.solr.update.UpdateHandler @ start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
     2021-05-26 02:28:03,630 INFO  org.apache.solr.core.SolrCore @ SolrDeletionPolicy.onCommit: commits: num=2
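• A quick way to confirm such gaps is to count log lines per minute; a sketch against that day’s DSpace log (minutes with no lines simply won’t appear):

```console
$ grep -E '^2021-05-26' dspace.log.2021-05-26 | awk '{print substr($2, 1, 5)}' | sort | uniq -c | less
```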
    @@ -626,19 +626,19 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
     
    -
    May 26, 2021
    +
    May 26, 2021
     Connectivity Issue - Frankfurt
     Resolved - We haven’t observed any additional connectivity issues in our Frankfurt data center, and will now consider this incident resolved. If you continue to experience problems, please open a Support ticket for assistance.
     May 26, 02:57 UTC 
     
    • While looking in the logs I noticed an error about SMTP:
    -
    2021-05-26 02:00:18,015 ERROR org.dspace.eperson.SubscribeCLITool @ Failed to send subscription to eperson_id=934cb92f-2e77-4881-89e2-6f13ad4b1378
    +
    2021-05-26 02:00:18,015 ERROR org.dspace.eperson.SubscribeCLITool @ Failed to send subscription to eperson_id=934cb92f-2e77-4881-89e2-6f13ad4b1378
     2021-05-26 02:00:18,015 ERROR org.dspace.eperson.SubscribeCLITool @ javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))
     
    • And indeed the email seems to be broken:
    -
    $ dspace test-email
    +
    $ dspace test-email
     
     About to send test email:
      - To: fuuuuuu
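• The “No appropriate protocol” part usually means the JVM and the SMTP server cannot agree on a TLS version; a quick check of what the mail server offers (the hostname here is a placeholder for the mail.server value in dspace.cfg):

```console
$ openssl s_client -connect smtp.example.org:587 -starttls smtp < /dev/null 2>/dev/null | grep -E 'Protocol|Cipher'
```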
    diff --git a/docs/2021-06/index.html b/docs/2021-06/index.html
    index c9033c7bf..c7e7f4a48 100644
    --- a/docs/2021-06/index.html
    +++ b/docs/2021-06/index.html
    @@ -36,7 +36,7 @@ I simply started it and AReS was running again:
     
     
     "/>
    -
    +
     
     
         
    @@ -132,7 +132,7 @@ I simply started it and AReS was running again:
     
     
     
    -
    $ docker-compose -f docker/docker-compose.yml start angular_nginx
    +
    $ docker-compose -f docker/docker-compose.yml start angular_nginx
     
    • Margarita from CCAFS emailed me to say that workflow alerts haven’t been working lately
        @@ -152,7 +152,7 @@ I simply started it and AReS was running again:
    -
    https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1
    +
    https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1
     
    • That will sort by date issued (see: webui.itemlist.sort-option.2 in dspace.cfg), give 100 results per page, and start on item 1
    • Otherwise, another alternative would be to use the IWMI CSV that we are already exporting every week
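• For reference, a quick way to preview the first page of that open-search query from the command line (just a usage sketch):

```console
$ curl -s 'https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1' | xmllint --format - | head -n 30
```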
@@ -162,7 +162,7 @@ I simply started it and AReS was running again:
      • The Elasticsearch indexes are messed up so I dumped and re-created them correctly:
      -
      curl -XDELETE 'http://localhost:9200/openrxv-items-final'
      +
      curl -XDELETE 'http://localhost:9200/openrxv-items-final'
       curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
       curl -XPUT 'http://localhost:9200/openrxv-items-final'
       curl -XPUT 'http://localhost:9200/openrxv-items-temp'
      @@ -208,7 +208,7 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
       
    -
    $ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
    +
    $ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
     
    • The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it’s much faster
        @@ -231,7 +231,7 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
    -
    $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
    +
    $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
     90459
     $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
     90380
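• To list which handles are actually duplicated, rather than just the counts, the same pipeline works with uniq -d:

```console
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -d
```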
    @@ -255,11 +255,11 @@ $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-it
     
     
     
    -
    $ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
    +
    $ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
     
    • Then I used csvcut to extract just the columns I needed and do the replacement into a new CSV:
    -
    $ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
    +
    $ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
     
    • Then I uploaded the resulting CSV to CGSpace, updating 161 items
    • Start a harvest on AReS
@@ -278,7 +278,7 @@ $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep
    -
    $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
    +
    $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
     90937
     $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
     85709
    @@ -289,7 +289,7 @@ $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep
     
     
     
    -
    $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
    +
    $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
     
    • Unfortunately I found no pattern:
        @@ -312,7 +312,7 @@ $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep
    -
    $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
    +
    $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
     5
     $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
     "10673/4"
    @@ -355,7 +355,7 @@ $ curl -s -H "Accept: application/json" "https://demo.dspace.org/
     
     
     
    -
    $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
    +
    $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
     90327
     $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
     90317
    @@ -368,7 +368,7 @@ $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-it
     
     
     
    -
    $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p   
    +
    $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p   
     Purging 1339 hits from RI\/1\.0 in statistics
     Purging 447 hits from crusty in statistics
     Purging 3736 hits from newspaper in statistics
    @@ -397,7 +397,7 @@ Total number of bot hits purged: 5522
     
     
     
    -
    # journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
    +
    # journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
     978
     $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     10100
    @@ -412,16 +412,16 @@ $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid =
     
     
  • After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
-
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     63
     
    • Looking in the DSpace log, the first “pool empty” message I saw this morning was at 4AM:
    -
    2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
    +
    2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
     
    • Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
    -
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
    +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
     
    -
    $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
    +
    $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
     104797
     $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
     99186
    @@ -456,7 +456,7 @@ $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspa
     
  • This number is probably unique for that particular harvest, but I don’t think it represents the true number of items…
  • The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
-
    $ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
    +
    $ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
     90990
     
• So the harvest on the live site is missing items, then why didn’t the add missing items plugin find them?!
@@ -469,7 +469,7 @@ $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspa
    -
    172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
    +
    172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
     
    • I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins… now it’s checking 180,000+ handles to see if they are collections or items…
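• To verify the nginx change, a request with the harvester’s user agent should now return 200 for the sitemap instead of the 503 seen above (verification sketch):

```console
$ curl -s -o /dev/null -w '%{http_code}\n' -A 'OpenRXV harvesting bot; https://github.com/ilri/OpenRXV' 'https://cgspace.cgiar.org/sitemap'
```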
        @@ -478,7 +478,7 @@ $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspa
      • According to the api logs we will be adding 5,697 items:
      -
      $ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
      +
      $ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
       5697
       
• Spent a few hours with Moayad troubleshooting and improving OpenRXV
@@ -496,7 +496,7 @@ $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspa
    -
    $ redis-cli
    +
    $ redis-cli
     127.0.0.1:6379> SCAN 0 COUNT 5
     1) "49152"
     2) 1) "bull:plugins:476595"
    @@ -507,14 +507,14 @@ $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspa
     
    • We can apparently get the names of the jobs in each hash using hget:
    -
    127.0.0.1:6379> TYPE bull:plugins:401827
    +
    127.0.0.1:6379> TYPE bull:plugins:401827
     hash
     127.0.0.1:6379> HGET bull:plugins:401827 name
     "dspace_add_missing_items"
     
• I whipped up a one-liner to get the keys for all plugin jobs, convert them to redis HGET commands to extract the value of the name field, and then sort them by their counts:
    -
    $ redis-cli KEYS "bull:plugins:*" \
    +
    $ redis-cli KEYS "bull:plugins:*" \
       | sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
       | ncat -w 3 localhost 6379 \
       | grep -v -E '^\$' | sort | uniq -c | sort -h
    @@ -544,7 +544,7 @@ hash
     
    • Looking at the DSpace log I see there was definitely a higher number of sessions that day, perhaps twice the normal:
    -
    $ for file in dspace.log.2021-06-[12]*; do echo "$file"; grep -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
    +
    $ for file in dspace.log.2021-06-[12]*; do echo "$file"; grep -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
     dspace.log.2021-06-10
     19072
     dspace.log.2021-06-11
    @@ -584,7 +584,7 @@ dspace.log.2021-06-27
     
    • I see 15,000 unique IPs in the XMLUI logs alone on that day:
    -
    # zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
    +
    # zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l
     15835
     
• Annoyingly I found 37,000 more hits from Bing using dns:*msnbot* AND dns:*.msn.com. as a Solr filter
@@ -628,7 +628,7 @@ dspace.log.2021-06-27
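• To see how many hits match that dns filter, something like this against the statistics core works (a sketch, using the Solr URL from the logs above):

```console
$ curl -s 'http://localhost:8081/solr/statistics/select?q=dns:*msnbot*+AND+dns:*.msn.com.&rows=0&wt=json'
```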
    • The DSpace log shows:
    -
    2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
    +
    2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
     
    • The first one of these I see is from last night at 2021-06-29 at 10:47 PM
    • I restarted Tomcat 7 and CGSpace came back up…
@@ -641,12 +641,12 @@ dspace.log.2021-06-27
    • Export a list of all CGSpace’s AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:
    -
    localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
    +
    localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
     COPY 20780
     
    • Actually Enrico wanted NON AGROVOC, so I extracted all the center and CRP subjects (ignoring system office and themes):
    -
    localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
    +
    localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
     COPY 1710
     
• Fix an issue in the Ansible infrastructure playbooks for the DSpace role
@@ -657,12 +657,12 @@ COPY 1710
    • I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26):
    -
    Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
    +
    Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
     
    • What’s even crazier is that it is twice that on CGSpace (linode18)!
    • Apparently OpenJDK defaults to using /dev/random (see /etc/java-8-openjdk/security/java.security):
    -
    securerandom.source=file:/dev/urandom
    +
    securerandom.source=file:/dev/urandom
     
    • /dev/random blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator
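• If the slow SecureRandom seeding keeps being a problem, the usual workaround is to point the JVM at the non-blocking device explicitly; a sketch (the defaults file path is an assumption for this Tomcat 7 setup):

```console
# e.g. append to JAVA_OPTS in /etc/default/tomcat7 (path assumed)
JAVA_OPTS="$JAVA_OPTS -Djava.security.egd=file:/dev/./urandom"
```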
diff --git a/docs/2021-07/index.html b/docs/2021-07/index.html
index 0ac9baf47..0f4db0776 100644
--- a/docs/2021-07/index.html
+++ b/docs/2021-07/index.html
@@ -30,7 +30,7 @@ Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVO
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
"/>
-
+
@@ -120,13 +120,13 @@ COPY 20994
        • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
        -
        localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
        +
        localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
         COPY 20994
         

        2021-07-04

        • Update all Docker containers on the AReS server (linode20) and rebuild OpenRXV:
        -
        $ cd OpenRXV
        +
        $ cd OpenRXV
         $ docker-compose -f docker/docker-compose.yml down
         $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
         $ docker-compose -f docker/docker-compose.yml build
        @@ -172,7 +172,7 @@ $ docker-compose -f docker/docker-compose.yml build
         
    -
    $ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
    +
    $ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
     Purging 95 hits from Drupal in statistics
     Purging 38 hits from DTS Agent in statistics
     Purging 601 hits from Microsoft Office Existence Discovery in statistics
    @@ -189,7 +189,7 @@ Total number of bot hits purged: 15030
     
  • Meet with the CGIAR–AGROVOC task group to discuss how we want to do the workflow for submitting new terms to AGROVOC
  • I extracted another list of all subjects to check against AGROVOC:
-
    \COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
    +
    \COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
     $ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d > /tmp/2021-07-06-all-subjects.txt
     $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
     
      @@ -205,7 +205,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
    -
    # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
    +
    # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
     2021-06-10
     10693
     2021-06-11
    @@ -243,7 +243,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
     
    • Similarly, the number of connections to the REST API was around the average for the recent weeks before:
    -
    # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/rest.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
    +
    # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/rest.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
     2021-06-10
     1183
     2021-06-11
    @@ -281,7 +281,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
     
    • According to goaccess, the traffic spike started at 2AM (remember that the first “Pool empty” error in dspace.log was at 4:01AM):
    -
    # zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
    +
    # zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
     
    • Moayad sent a fix for the add missing items plugins issue (#107)
        @@ -311,7 +311,7 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0
    -
    postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +
    postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     2302
     postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
     2564
    @@ -320,7 +320,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
     
    • The locks are held by XMLUI, not REST API or OAI:
    -
    postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
    +
    postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
          57 dspaceApi
        2671 dspaceWeb
     
      @@ -338,7 +338,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
    -
    # grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
    +
    # grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
          32 91.243.191.124
          33 91.243.191.129
          33 91.243.191.200
    @@ -392,7 +392,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
     
     
     
    -
    $ ./asn -n 45.80.217.235  
    +
    $ ./asn -n 45.80.217.235  
     
     ╭──────────────────────────────╮
     │ ASN lookup for 45.80.217.235 │
    @@ -410,7 +410,7 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
     
    • Slowly slowly I manually built up a list of the IPs, ISP names, and network blocks, for example:
    -
    IP, Organization, Website, Network
    +
    IP, Organization, Website, Network
     45.148.126.246, TrafficTransitSolution LLC, traffictransitsolution.us, 45.148.126.0/24 (Net-traffictransitsolution-15)
     45.138.102.253, TrafficTransitSolution LLC, traffictransitsolution.us, 45.138.102.0/24 (Net-traffictransitsolution-11)
     45.140.205.104, Bulgakov Alexey Yurievich, finegroupservers.com, 45.140.204.0/23 (CHINA_NETWORK)
    @@ -496,17 +496,17 @@ postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activi
     
     
     
    -
    # grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq  > /tmp/ips-sorted.txt
    +
    # grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq  > /tmp/ips-sorted.txt
     # wc -l /tmp/ips-sorted.txt 
     10776 /tmp/ips-sorted.txt
     
    • Then resolve them all:
    -
    $ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
    +
    $ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
     
    • Then get the top 10 organizations and top ten ASNs:
    -
    $ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
    +
    $ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
         213 AMAZON-AES
         218 ASN-QUADRANET-GLOBAL
         246 Silverstar Invest Limited
    @@ -531,7 +531,7 @@ $ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
     
    • I will download blocklists for all these except Ethiopian Telecom, Quadranet, and Amazon, though I’m concerned about Global Layer because it’s a huge ASN that seems to have legit hosts too…?
    -
    $ wget https://asn.ipinfo.app/api/text/nginx/AS49453
    +
    $ wget https://asn.ipinfo.app/api/text/nginx/AS49453
     $ wget https://asn.ipinfo.app/api/text/nginx/AS46844
     $ wget https://asn.ipinfo.app/api/text/nginx/AS206485
     $ wget https://asn.ipinfo.app/api/text/nginx/AS62282
    @@ -543,7 +543,7 @@ $ wc -l /tmp/abusive-networks.txt
     
    • Combining with my existing rules and filtering uniques:
    -
    $ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
    +
    $ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
     2298
     
    -
    $ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E " (200|499) " | awk '{print $1}' | sort | uniq > /tmp/all-ips.txt
    +
    $ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E " (200|499) " | awk '{print $1}' | sort | uniq > /tmp/all-ips.txt
     $ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips.txt -o /tmp/all-ips-out.csv
     $ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/all-ips-to-block.txt
     $ wc -l /tmp/all-ips-to-block.txt 
    @@ -571,7 +571,7 @@ $ wc -l /tmp/all-ips-to-block.txt
     
     
  • I decided to extract the networks from the GeoIP database with resolve-addresses-geoip2.py so I can block them more efficiently than using the 5,000 IPs in an ipset:
-
    $ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/all-networks-to-block.txt
    +
    $ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/all-networks-to-block.txt
     $ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq | wc -l
     2354
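• Turning that list of networks into nginx deny rules is then just a matter of wrapping each prefix; a sketch (the real file is templated by Ansible):

```console
$ sed -e 's/^/deny /' -e 's/$/;/' /tmp/all-networks-to-block.txt > /tmp/abusive-networks-deny.conf
```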
     
      @@ -582,7 +582,7 @@ $ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq
    • Then I got a list of all the 5,095 IPs from above and used check-spider-ip-hits.sh to purge them from Solr:
    -
    $ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
    +
    $ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
     ...
     Total number of bot hits purged: 197116
     
      @@ -592,13 +592,13 @@ Total number of bot hits purged: 197116
      • Looking again at the IPs making connections to CGSpace over the last few days from these seven ASNs, it’s much higher than I noticed yesterday:
      -
      $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
      +
      $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
       5643
       
      • I purged 27,000 more hits from the Solr stats using this new list of IPs with my check-spider-ip-hits.sh script
      • Surprise surprise, I checked the nginx logs from 2021-06-23 when we last had issues with thousands of XMLUI sessions and PostgreSQL connections and I see IPs from the same ASNs!
      -
      $ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/all-ips-june-23.txt
      +
      $ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/all-ips-june-23.txt
       $ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips-june-23.txt -o /tmp/out.csv
       $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
           265 GOOGLE,15169
      @@ -619,12 +619,12 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
       
      • Again it was over 5,000 IPs:
      -
      $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l         
      +
      $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l         
       5228
       
• Interestingly, these seem to be a different five thousand IP addresses than the ones from the attack last weekend, as there are over 10,000 unique ones if I combine the two lists!
      -
      $ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
      +
      $ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
       10458
       
      • I purged all the (26,000) hits from these new IP addresses from Solr as well
@@ -636,7 +636,7 @@ $ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
      • Adding QuadraNet brings the total networks seen during these two attacks to 262, and the number of unique IPs to 10900:
      -
      # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/ddos-ips.txt
      +
      # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/ddos-ips.txt
       # wc -l /tmp/ddos-ips.txt 
       54002 /tmp/ddos-ips.txt
       $ ./ilri/resolve-addresses-geoip2.py -i /tmp/ddos-ips.txt -o /tmp/ddos-ips.csv
      @@ -649,7 +649,7 @@ $ wc -l /tmp/ddos-networks-to-block.txt
       
      • The new total number of networks to block, including the network prefixes for these ASNs downloaded from asn.ipinfo.app, is 4,007:
      -
      $ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
      +
      $ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
       https://asn.ipinfo.app/api/text/nginx/AS46844 \
       https://asn.ipinfo.app/api/text/nginx/AS206485 \
       https://asn.ipinfo.app/api/text/nginx/AS62282 \
      diff --git a/docs/2021-08/index.html b/docs/2021-08/index.html
      index eae20a094..c4f7b6bd3 100644
      --- a/docs/2021-08/index.html
      +++ b/docs/2021-08/index.html
      @@ -32,7 +32,7 @@ Update Docker images on AReS server (linode20) and reboot the server:
       
       I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
       "/>
      -
      +
       
       
           
      @@ -122,14 +122,14 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
       
      • Update Docker images on AReS server (linode20) and reboot the server:
      -
      # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
      +
      # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
       
      • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
      • First running all existing updates, taking some backups, checking for broken packages, and then rebooting:
      -
      # apt update && apt dist-upgrade
      +
      # apt update && apt dist-upgrade
       # apt autoremove && apt autoclean
       # check for any packages with residual configs we can purge
       # dpkg -l | grep -E '^rc' | awk '{print $2}'
      @@ -144,13 +144,13 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
       
    • … but of course it hit the libxcrypt bug
    • I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually
    -
    # apt install -f
    +
    # apt install -f
     # apt dist-upgrade
     # reboot
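• The libcrypt workaround boils down to copying the library into place and refreshing the linker cache; a sketch with assumed paths, not the exact session:

```console
# library copied over from a healthy Ubuntu 20.04 host; paths are assumptions
$ sudo cp libcrypt.so.1.1.0 /lib/x86_64-linux-gnu/
$ sudo ln -sf /lib/x86_64-linux-gnu/libcrypt.so.1.1.0 /lib/x86_64-linux-gnu/libcrypt.so.1
$ sudo ldconfig
```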
     
    • After rebooting I purged all packages with residual configs and cleaned up again:
    -
    # dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
    +
    # dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
     # apt autoremove && apt autoclean
     
    • Then I cleared my local Ansible fact cache and re-ran the infrastructure playbooks
@@ -190,7 +190,7 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
    -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
     # wc -l /tmp/2021-08-05-all-ips.txt
     43428 /tmp/2021-08-05-all-ips.txt
     
      @@ -200,7 +200,7 @@ I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
    -
    $ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
    +
    $ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
     $ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
     $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
     0 /tmp/2021-08-05-all-ips-to-purge.csv
    @@ -220,7 +220,7 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
     
     
     
    -
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
    +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
     
    • That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP
• I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, so I will purge their hits too
@@ -232,13 +232,13 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
    • 3.225.28.105 uses a normal-looking user agent but makes thousands of request to the REST API a few seconds apart
    • 61.143.40.50 is in China and uses this hilarious user agent:
    -
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
    +
    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
     
    • 47.252.80.214 is owned by Alibaba in the US and has the same user agent
    • 159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours
    • 95.87.154.12 seems to be a new bot with the following user agent:
    -
    Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
    +
    Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
     
    • They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
        @@ -247,14 +247,14 @@ $ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
      • I see a new bot using this user agent:
      -
      nettle (+https://www.nettle.sk)
      +
      nettle (+https://www.nettle.sk)
       
      • 129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.
      • 217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day
      • 103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human
• There are probably more, but these cover most of the ones with over 1,000 hits last month, so I will purge them:
      -
      $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
      +
      $ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
       Purging 10796 hits from 35.174.144.154 in statistics
       Purging 9993 hits from 93.158.90.30 in statistics
       Purging 6092 hits from 130.255.162.173 in statistics
      @@ -272,7 +272,7 @@ Total number of bot hits purged: 90485
       
      • Then I purged a few thousand more by user agent:
      -
      $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri 
      +
      $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri 
       Found 2707 hits from MaCoCu in statistics
       Found 1785 hits from nettle in statistics
       
      @@ -289,7 +289,7 @@ Total number of hits from bots: 4492
       
    • I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:
    -
    $ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
    +
    $ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
     
    • Then in OpenRefine I merged all null, blank, and en fields into the en_US one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
        @@ -303,19 +303,19 @@ Total number of hits from bots: 4492
        • Extract all unique ISSNs to look up on Sherpa Romeo and Crossref
        -
        $ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
        +
        $ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
         $ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
         $ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
         
        • Then I updated the CSV headers for each and joined the CSVs on the issn column:
        -
        $ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
        +
        $ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
         $ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
         $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
         
        • In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:
        -
        if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
        +
        if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
         
        • Then I exported the list of journals that differ and sent it to Peter for comments and corrections
            @@ -332,7 +332,7 @@ $ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-jour
          • I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
          -
          $ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
          +
          $ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
           39004:0.08
           $ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg 
           40932:0.53
          @@ -359,7 +359,7 @@ $ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409
           
      -
      $ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
      +
      $ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
       $ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
       $ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
       1911
      @@ -367,7 +367,7 @@ $ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
       
    • Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
    • I exported a list of all the journal titles we have in the cg.journal field:
    -
    localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
    +
    localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
     COPY 3245
     
• I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don’t match, so I’d have to go check many of them manually before selecting a match or fixing them…
@@ -421,7 +421,7 @@ COPY 3245
    -
    $ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
    +
    $ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
     $ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
     $ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
     
      @@ -446,17 +446,17 @@ $ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
    • Lower case all AGROVOC metadata, as I had noticed a few in sentence case:
    -
    dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
    +
    dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
     UPDATE 484
     
    • Also update some DOIs using the dx.doi.org format, just to keep things uniform:
    -
    dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
    +
    dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
     UPDATE 469
     
    • Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:
    -
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    $ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
     
     real    322m16.917s
     user    226m43.121s
    @@ -464,7 +464,7 @@ sys     3m17.469s
     
    • I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:
    -
    $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
    +
    $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
         -H 'Content-Type: application/json' \
         -d '{
         "size": 10,
    @@ -525,17 +525,17 @@ $ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5
     
     
     
    -
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
    +
    $ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
     $ wc -l /tmp/2021-08-25-combined-orcids.txt
     1331
     
    • After I combined them and removed duplicates, I resolved all the names using my resolve-orcids.py script:
    -
    $ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
    +
    $ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
     
    • Tag existing items from the Alliance’s new authors with ORCID iDs using add-orcid-identifiers-csv.py (181 new metadata fields added):
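• A sketch of the invocation that consumes the CSV shown in the next block; the flags mirror the other ilri/ scripts in these notes and the password is the same dummy placeholder used elsewhere here, so treat them as assumptions:

$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'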
    -
    $ cat 2021-08-25-add-orcids.csv 
    +
    $ cat 2021-08-25-add-orcids.csv 
     dc.contributor.author,cg.creator.identifier
     "Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
     "Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
    diff --git a/docs/2021-09/index.html b/docs/2021-09/index.html
    index 2a4fce2f5..8ffd9c0a1 100644
    --- a/docs/2021-09/index.html
    +++ b/docs/2021-09/index.html
    @@ -26,7 +26,7 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
     
     
     
    -
    +
     
     
     
    @@ -48,7 +48,7 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
     
     
     "/>
    -
    +
     
     
         
    @@ -58,9 +58,9 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
       "@type": "BlogPosting",
       "headline": "September, 2021",
       "url": "https://alanorth.github.io/cgspace-notes/2021-09/",
    -  "wordCount": "176",
    +  "wordCount": "637",
       "datePublished": "2021-09-01T09:14:07+03:00",
    -  "dateModified": "2021-09-04T21:16:03+03:00",
    +  "dateModified": "2021-09-06T12:31:11+03:00",
       "author": {
         "@type": "Person",
         "name": "Alan Orth"
    @@ -154,7 +154,7 @@ The syntax Moayad showed me last month doesn’t seem to honor the search qu
     
    • Update Docker images on AReS server (linode20) and rebuild OpenRXV:
    -
    $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
    +
    $ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
     $ docker-compose build
     
• Then run system updates and reboot the server
@@ -163,6 +163,61 @@ $ docker-compose build
2021-09-07

• Checking last month’s Solr statistics to see if there are any new bots that I need to purge and add to the list
  • 78.203.225.68 made 50,000 requests on one day in August, and it is using this user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36
  • It’s a fixed line ISP in Montpellier according to AbuseIPDB.com, and has not been flagged as abusive, so it must be some CGIAR SMO person doing some web application harvesting from the browser
  • 130.255.162.154 is in Sweden and made 46,000 requests in August and it is using this user agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0
  • 35.174.144.154 is on Amazon and made 28,000 requests with this user agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
  • 192.121.135.6 is in Sweden and made 9,000 requests with this user agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0
  • 185.38.40.66 is in Germany and made 6,000 requests with this user agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/1.10.2.4
  • 3.225.28.105 is on Amazon and made 3,000 requests with this user agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
  • I also noticed that we still have tons (25,000) of requests by MSNbot using this normal-looking user agent: Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko
  • I can identify them by their reverse DNS: msnbot-40-77-167-105.search.msn.com.
  • I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again
  • While looking at the MSN requests I noticed tons of requests from another strange host using reverse DNS: malta2095.startdedicated.com., astra5139.startdedicated.com., and many others
  • They must be related, because I see them all using the exact same user agent: Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko
  • So this startdedicated.com DNS is some Bing bot also…
• I extracted all the IPs and purged them using my check-spider-ip-hits.sh script (a sketch of the invocation follows below)
  • In total I purged 225,000 hits…
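• A minimal sketch of that purge, assuming check-spider-ip-hits.sh takes a text file of IPs with -f and only deletes the matching hits from Solr when -p is passed (the file name is illustrative):

$ ./ilri/check-spider-ip-hits.sh -f /tmp/2021-09-06-bot-ips.txt -p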

2021-09-12

• Start a harvest on AReS

2021-09-13

• Mishell Portilla asked me about thumbnails on CGSpace being small
  • For example, 10568/114576 has a lot of white space on the left side
  • I created a new thumbnail with vipsthumbnail:

$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'

• Looking at the PDF’s metadata I see:
  • Producer: iLovePDF
  • Creator: Adobe InDesign 15.0 (Windows)
  • Format: PDF-1.7
• Eventually I should do more tests on this and perhaps file a bug with DSpace…
• Some Alliance people contacted me about getting access to the CGSpace API to deposit with their TIP tool
  • I told them I can give them access to DSpace Test and that we should have a meeting soon
  • We need to figure out what controlled vocabularies they should use
diff --git a/docs/404.html b/docs/404.html
index d476e8c84..8f2b84da8 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -17,7 +17,7 @@
diff --git a/docs/categories/index.html b/docs/categories/index.html
index 87262658f..888d4452b 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -10,14 +10,14 @@
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index d0342af65..af28d1169 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -10,14 +10,14 @@
@@ -127,7 +127,7 @@
    • Update Docker images on AReS server (linode20) and reboot the server:
    -
    # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
    +
    # docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
     
    • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
    @@ -152,7 +152,7 @@
    • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
    -
    localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
    +
    localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
     COPY 20994
     
Read more →
@@ -315,7 +315,7 @@ COPY 20994
  • I had a call with CodeObia to discuss the work on OpenRXV
  • Check the results of the AReS harvesting from last night:
  • -
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
    +
    $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 100875,
       "_shards" : {
    diff --git a/docs/categories/notes/index.xml b/docs/categories/notes/index.xml
    index 44148ff62..de2d1329a 100644
    --- a/docs/categories/notes/index.xml
    +++ b/docs/categories/notes/index.xml
    @@ -41,7 +41,7 @@
     <ul>
     <li>Update Docker images on AReS server (linode20) and reboot the server:</li>
     </ul>
    -<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
    +<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
     </code></pre><ul>
     <li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
     </ul>
    @@ -57,7 +57,7 @@
     <ul>
     <li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
     </ul>
    -<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
    +<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
     COPY 20994
     </code></pre>
         
    @@ -164,7 +164,7 @@ COPY 20994
     <li>I had a call with CodeObia to discuss the work on OpenRXV</li>
     <li>Check the results of the AReS harvesting from last night:</li>
     </ul>
    -<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
    +<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
     {
       &quot;count&quot; : 100875,
       &quot;_shards&quot; : {
    @@ -471,7 +471,7 @@ COPY 20994
     </ul>
     </li>
     </ul>
    -<pre><code># apt update &amp;&amp; apt full-upgrade
    +<pre tabindex="0"><code># apt update &amp;&amp; apt full-upgrade
     # apt-get autoremove &amp;&amp; apt-get autoclean
     # dpkg -C
     # reboot
    @@ -492,7 +492,7 @@ COPY 20994
     </ul>
     </li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
     4671942
     # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
     1277694
    @@ -500,7 +500,7 @@ COPY 20994
     <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
     <li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
     1183456 
     # zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
     106781
    @@ -527,7 +527,7 @@ COPY 20994
     <li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
     <li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         440 17.58.101.255
         441 157.55.39.101
         485 207.46.13.43
    @@ -628,7 +628,7 @@ COPY 20994
     </li>
     <li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li>
     </ul>
    -<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
    +<pre tabindex="0"><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
     </code></pre><ul>
     <li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li>
    @@ -654,13 +654,13 @@ DELETE 1
     </ul>
     </li>
     </ul>
    -<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
    +<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
        4432 200
     </code></pre><ul>
     <li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
     <li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
     </ul>
    -<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    @@ -701,7 +701,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
     <li>The top IPs before, during, and after this latest alert tonight were:</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         245 207.46.13.5
         332 54.70.40.11
         385 5.143.231.38
    @@ -717,7 +717,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     <li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
     <li>There were just over 3 million accesses in the nginx logs last month:</li>
     </ul>
    -<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
    +<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
     3018243
     
     real    0m19.873s
    @@ -737,7 +737,7 @@ sys     0m1.979s
     <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
     <li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          92 40.77.167.4
          99 210.7.29.100
         120 38.126.157.45
    @@ -825,7 +825,7 @@ sys     0m1.979s
     <ul>
     <li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
     </ul>
    -<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
    +<pre tabindex="0"><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
     </code></pre><ul>
    @@ -848,11 +848,11 @@ sys     0m1.979s
     <ul>
     <li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li>
     </ul>
    -<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    +<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
     </code></pre><ul>
     <li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li>
     </ul>
    -<pre><code>There is insufficient memory for the Java Runtime Environment to continue.
    +<pre tabindex="0"><code>There is insufficient memory for the Java Runtime Environment to continue.
     </code></pre>
         
         
    @@ -872,12 +872,12 @@ sys     0m1.979s
     <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
     <li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
     </ul>
    -<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
     </code></pre><ul>
     <li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li>
     <li>Time to index ~70,000 items on CGSpace:</li>
     </ul>
    -<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
    +<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
    @@ -958,19 +958,19 @@ sys     2m7.289s
     <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li>
     <li>And just before that I see this:</li>
     </ul>
    -<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    +<pre tabindex="0"><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
     </code></pre><ul>
     <li>Ah hah! So the pool was actually empty!</li>
     <li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li>
     <li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li>
     <li>I notice this error quite a few times in dspace.log:</li>
     </ul>
    -<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
    +<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
     </code></pre><ul>
     <li>And there are many of these errors every day for the past month:</li>
     </ul>
    -<pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
    +<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
     dspace.log.2017-11-23:4
    @@ -1048,12 +1048,12 @@ dspace.log.2018-01-02:34
     <ul>
     <li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
     </ul>
    -<pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
    +<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
     0
     </code></pre><ul>
     <li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
     </ul>
    -<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
    +<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
     </code></pre>
         
    @@ -1068,7 +1068,7 @@ COPY 54701
     <ul>
     <li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
     </ul>
    -<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    +<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
     </code></pre><ul>
     <li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
     <li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
    diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
    index 76fb3b85a..4f27fac36 100644
    --- a/docs/categories/notes/page/2/index.html
    +++ b/docs/categories/notes/page/2/index.html
    @@ -10,14 +10,14 @@
     
     
     
    -
    +
     
     
     
     
     
     
    -
    +
     
     
         
    diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
    index 869a0417d..0e63ca8bd 100644
    --- a/docs/categories/notes/page/3/index.html
    +++ b/docs/categories/notes/page/3/index.html
    @@ -10,14 +10,14 @@
     
     
     
    -
    +
     
     
     
     
     
     
    -
    +
     
     
         
    @@ -195,7 +195,7 @@
     
     
     
    -
    # apt update && apt full-upgrade
    +
    # apt update && apt full-upgrade
     # apt-get autoremove && apt-get autoclean
     # dpkg -C
     # reboot
    @@ -225,7 +225,7 @@
     
     
     
    -
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
    +
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     4671942
     # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     1277694
    @@ -233,7 +233,7 @@
     
  • So 4.6 million from XMLUI and another 1.2 million from API requests
  • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
  • -
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    +
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
     1183456 
     # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
     106781
    @@ -278,7 +278,7 @@
     
  • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
  • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
  • -
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         440 17.58.101.255
         441 157.55.39.101
         485 207.46.13.43
    diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
    index 7905ad991..f3a9cb197 100644
    --- a/docs/categories/notes/page/4/index.html
    +++ b/docs/categories/notes/page/4/index.html
    @@ -10,14 +10,14 @@
     
     
     
    -
    +
     
     
     
     
     
     
    -
    +
     
     
         
    @@ -101,7 +101,7 @@
     
     
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:
  • -
    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
    +
    dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
     
    • But after this I tried to delete the item from the XMLUI and it is still present…
@@ -136,13 +136,13 @@ DELETE 1
    -
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
    +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
        4432 200
     
    • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
    • Apply country and region corrections and deletions on DSpace Test and CGSpace:
    -
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    +
    $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    @@ -201,7 +201,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
  • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
  • The top IPs before, during, and after this latest alert tonight were:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         245 207.46.13.5
         332 54.70.40.11
         385 5.143.231.38
    @@ -217,7 +217,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
  • There were just over 3 million accesses in the nginx logs last month:
  • -
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
    +
    # time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
    @@ -246,7 +246,7 @@ sys     0m1.979s
     
  • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
  • I don’t see anything interesting in the web server logs around that time though:
  • -
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +
    # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          92 40.77.167.4
          99 210.7.29.100
         120 38.126.157.45
    @@ -379,7 +379,7 @@ sys     0m1.979s
     
    • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:
    -
    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
    +
    [Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
     
diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html
index e4a1086b9..8943b6abf 100644
--- a/docs/categories/notes/page/5/index.html
+++ b/docs/categories/notes/page/5/index.html
@@ -10,14 +10,14 @@
@@ -94,11 +94,11 @@
      • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:
      -
      $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
      +
      $ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
       
      • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:
      -
      There is insufficient memory for the Java Runtime Environment to continue.
      +
      There is insufficient memory for the Java Runtime Environment to continue.
       
Read more →
@@ -127,12 +127,12 @@
    • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
    • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
    -
    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    +
    $ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
     
    • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
    • Time to index ~70,000 items on CGSpace:
    -
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
    +
    $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
    @@ -258,19 +258,19 @@ sys     2m7.289s
     
  • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
  • And just before that I see this:
  • -
    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    +
    Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
     
    • Ah hah! So the pool was actually empty!
    • I need to increase that, let’s try to bump it up from 50 to 75
    • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
    • I notice this error quite a few times in dspace.log:
    -
    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
    +
    2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
     
    • And there are many of these errors every day for the past month:
    -
    $ grep -c "Error while searching for sidebar facets" dspace.log.*
    +
    $ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
     dspace.log.2017-11-23:4
    @@ -366,12 +366,12 @@ dspace.log.2018-01-02:34
     
    • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
    -
    # grep -c "CORE" /var/log/nginx/access.log
    +
    # grep -c "CORE" /var/log/nginx/access.log
     0
     
    • Generate list of authors on CGSpace for Peter to go through and correct:
    -
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
    +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
     
Read more →
@@ -395,7 +395,7 @@ COPY 54701
-
    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    +
    http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
     
    • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
    • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html
index 0aa02b6ea..66101459d 100644
--- a/docs/categories/notes/page/6/index.html
+++ b/docs/categories/notes/page/6/index.html
@@ -10,14 +10,14 @@
diff --git a/docs/cgiar-library-migration/index.html b/docs/cgiar-library-migration/index.html
index 0b0a9c25d..46881b1c8 100644
--- a/docs/cgiar-library-migration/index.html
+++ b/docs/cgiar-library-migration/index.html
@@ -18,7 +18,7 @@
@@ -132,7 +132,7 @@
    • Temporarily disable nightly index-discovery cron job because the import process will be taking place during some of this time and I don’t want them to be competing to update the Solr index
    • Copy HTTPS certificate key pair from CGIAR Library server’s Tomcat keystore:
    -
    $ keytool -list -keystore tomcat.keystore
    +
    $ keytool -list -keystore tomcat.keystore
     $ keytool -importkeystore -srckeystore tomcat.keystore -destkeystore library.cgiar.org.p12 -deststoretype PKCS12 -srcalias tomcat
     $ openssl pkcs12 -in library.cgiar.org.p12 -nokeys -out library.cgiar.org.crt.pem
     $ openssl pkcs12 -in library.cgiar.org.p12 -nodes -nocerts -out library.cgiar.org.key.pem
    @@ -140,7 +140,7 @@ $ wget https://certs.godaddy.com/repository/gdroot-g2.crt https://certs.godaddy.
     $ cat library.cgiar.org.crt.pem gdig2.crt.pem > library.cgiar.org-chained.pem
     

    Migration Process

    Export all top-level communities and collections from DSpace Test:

    -
    $ export PATH=$PATH:/home/dspacetest.cgiar.org/bin
    +
    $ export PATH=$PATH:/home/dspacetest.cgiar.org/bin
     $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2515 10947-2515/10947-2515.zip
     $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2516 10947-2516/10947-2516.zip
     $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/2517 10947-2517/10947-2517.zip
    @@ -158,12 +158,12 @@ $ dspace packager -d -a -t AIP -e aorth@mjanja.ch -i 10947/1 10947-1/10947-1.zip
     
  • Copy all exports from DSpace Test
  • Add ingestion overrides to dspace.cfg before import:
  • -
    mets.dspaceAIP.ingest.crosswalk.METSRIGHTS = NIL
    +
    mets.dspaceAIP.ingest.crosswalk.METSRIGHTS = NIL
     mets.dspaceAIP.ingest.crosswalk.DSPACE-ROLES = NIL
     
    • Import communities and collections, paying attention to options to skip missing parents and ignore handles:
    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit -XX:+TieredCompilation -XX:TieredStopAtLevel=1"
     $ export PATH=$PATH:/home/cgspace.cgiar.org/bin
     $ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2515/10947-2515.zip
     $ dspace packager -r -u -a -t AIP -o skipIfParentMissing=true -e aorth@mjanja.ch -p 10568/83389 10947-2516/10947-2516.zip
    @@ -189,7 +189,7 @@ $ for item in 10947-1/ITEM@10947-*; do dspace packager -r -f -u -t AIP -e aorth@
     
     
     
    -
    $ dspace packager -r -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83536 10568-93760/COLLECTION@10947-4651.zip
    +
    $ dspace packager -r -t AIP -o ignoreHandle=false -e aorth@mjanja.ch -p 10568/83536 10568-93760/COLLECTION@10947-4651.zip
     $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e aorth@mjanja.ch $item; done
     
• Create CGIAR System Management Office sub-community: 10568/83537
@@ -199,17 +199,17 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
    -
    $ for item in 10568-93759/ITEM@10947-46*; do dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/83538 $item; done
    +
    $ for item in 10568-93759/ITEM@10947-46*; do dspace packager -r -t AIP -o ignoreHandle=false -o ignoreParent=true -e aorth@mjanja.ch -p 10568/83538 $item; done
     

    Get the handles for the last few items from CGIAR Library that were created since we did the migration to DSpace Test in May:

    -
    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
    +
    dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id in (select item_id from metadatavalue where metadata_field_id=11 and date(text_value) > '2017-05-01T00:00:00Z');
     
    • Export them from the CGIAR Library:
    -
    # for handle in 10947/4658 10947/4659 10947/4660 10947/4661 10947/4665 10947/4664 10947/4666 10947/4669; do /usr/local/dspace/bin/dspace packager -d -a -t AIP -e m.marus@cgiar.org -i $handle ${handle}.zip; done
    +
    # for handle in 10947/4658 10947/4659 10947/4660 10947/4661 10947/4665 10947/4664 10947/4666 10947/4669; do /usr/local/dspace/bin/dspace packager -d -a -t AIP -e m.marus@cgiar.org -i $handle ${handle}.zip; done
     
    • Import on CGSpace:
    -
    $ for item in 10947-latest/*.zip; do dspace packager -r -u -t AIP -e aorth@mjanja.ch $item; done
    +
    $ for item in 10947-latest/*.zip; do dspace packager -r -u -t AIP -e aorth@mjanja.ch $item; done
     

    Post Migration

    • Shut down Tomcat and run update-sequences.sql as the system’s postgres user
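  • A minimal sketch of that step, assuming the stock update-sequences.sql that ships in the DSpace source tree (the exact path on the server is an assumption):

$ sudo -u postgres psql dspace < /home/cgspace.cgiar.org/src/git/DSpace/dspace/etc/postgres/update-sequences.sql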
@@ -218,7 +218,7 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
    • Enable nightly index-discovery cron job
    • Adjust CGSpace’s handle-server/config.dct to add the new prefix alongside our existing 10568, ie:
    -
    "server_admins" = (
    +
    "server_admins" = (
     "300:0.NA/10568"
     "300:0.NA/10947"
     )
    @@ -244,22 +244,22 @@ $ for item in 10568-93760/ITEM@10947-465*; do dspace packager -r -f -u -t AIP -e
     
  • Run system updates and reboot server
  • Switch to Let’s Encrypt HTTPS certificates (after DNS is updated and server isn’t busy):
  • -
    $ sudo systemctl stop nginx
    +
    $ sudo systemctl stop nginx
     $ /opt/certbot-auto certonly --standalone -d library.cgiar.org
     $ sudo systemctl start nginx
     

    Troubleshooting

    Foreign Key Error in dspace cleanup

    The cleanup script is sometimes used during import processes to clean the database and assetstore after failed AIP imports. If you see the following error with dspace cleanup -v:

    -
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"                                                                                                                       
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"                                                                                                                       
       Detail: Key (bitstream_id)=(119841) is still referenced from table "bundle".
     

    The solution is to set the primary_bitstream_id to NULL in PostgreSQL:

    -
    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (119841);
    +
    dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (119841);
     

    PSQLException During AIP Ingest

    After a few rounds of ingesting—possibly with failures—you might end up with inconsistent IDs in the database. In this case, during AIP ingest of a single collection in submit mode (-s):

    -
    org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"                                    
    +
    org.dspace.content.packager.PackageValidationException: Exception while ingesting 10947-2527/10947-2527.zip, Reason: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"                                    
       Detail: Key (handle_id)=(86227) already exists.
     

    The normal solution is to run the update-sequences.sql script (with Tomcat shut down) but it doesn’t seem to work in this case. Finding the maximum handle_id and manually updating the sequence seems to work:

    -
    dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
    +
    dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
     dspace=# select setval('handle_seq',86873);
     
diff --git a/docs/cgspace-cgcorev2-migration/index.html b/docs/cgspace-cgcorev2-migration/index.html
index 8bb50f61c..9287ab005 100644
--- a/docs/cgspace-cgcorev2-migration/index.html
+++ b/docs/cgspace-cgcorev2-migration/index.html
@@ -18,7 +18,7 @@
@@ -440,7 +440,7 @@

    ¹ Not committed yet because I don’t want to have to make minor adjustments in multiple commits. Re-apply the gauntlet of fixes with the sed script:

    -
    $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
    +
    $ find dspace/modules/xmlui-mirage2/src/main/webapp/themes -iname "*.xsl" -exec sed -i -f ./cgcore-xsl-replacements.sed {} \;
     
diff --git a/docs/cgspace-dspace6-upgrade/index.html b/docs/cgspace-dspace6-upgrade/index.html
index 890f92d9a..161a123a7 100644
--- a/docs/cgspace-dspace6-upgrade/index.html
+++ b/docs/cgspace-dspace6-upgrade/index.html
@@ -18,7 +18,7 @@
@@ -129,14 +129,14 @@

    Re-import OAI with clean index

    After the upgrade is complete, re-index all items into OAI with a clean index:

    -
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m"
     $ dspace oai -c import
     

    The process ran out of memory several times so I had to keep trying again with more JVM heap memory.
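A retry with a larger heap looks like the following, reusing the same import command as above (the 4096m value is an assumption, borrowed from the heap size used for the statistics processing later on this page):

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx4096m"
$ dspace oai -c import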

    Processing Solr Statistics With solr-upgrade-statistics-6x

    After the main upgrade process was finished and DSpace was running I started processing the Solr statistics with solr-upgrade-statistics-6x to migrate all IDs to UUIDs.

    statistics

    First process the current year’s statistics core:

    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
     $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
     ...
     =================================================================
    @@ -159,10 +159,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
     
  • 698,000: *:* NOT id:/.{36}/
  • Majority are type: 5 (aka SITE, according to Constants.java) so we can purge them:
  • -
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2019

    Processing the statistics-2019 core:

    -
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
    +
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -184,10 +184,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
     
  • 4,184,896: *:* NOT id:/.{36}/
  • 4,172,929 are type: 5 (aka SITE) so we can purge them:
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2019/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2019/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2018

    Processing the statistics-2018 core:

    -
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
    +
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -203,7 +203,7 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
                5,561,166    TOTAL
     =================================================================
     

    After some time I got an error about Java heap space so I increased the JVM memory and restarted processing:

    -
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
    +
    $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
     $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     

    Eventually the processing finished. Here are some statistics about unmigrated documents:

      @@ -212,10 +212,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
    • 923,158: *:* NOT id:/.{36}/
    • 823,293: are type: 5 so we can purge them:
    -
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2017

    Processing the statistics-2017 core:

    -
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
    +
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2017
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -237,10 +237,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     
  • 1,702,177: *:* NOT id:/.{36}/
  • 1,660,524 are type: 5 (SITE) so we can purge them:
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2016

    Processing the statistics-2016 core:

    -
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
    +
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2016
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -261,10 +261,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     
  • 1,477,155: *:* NOT id:/.{36}/
  • 1,469,706 are type: 5 (SITE) so we can purge them:
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2015

    Processing the statistics-2015 core:

    -
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
    +
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2015
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -286,10 +286,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     
  • 262,439: *:* NOT id:/.{36}/
  • 247,400 are type: 5 (SITE) so we can purge them:
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2014

    Processing the statistics-2014 core:

    -
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
    +
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2014
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -312,10 +312,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     
  • 222,078: *:* NOT id:/.{36}/
  • 188,791 are type: 5 (SITE) so we can purge them:
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2013

    Processing the statistics-2013 core:

    -
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
    +
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2013
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -338,10 +338,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     
  • 32,320: *:* NOT id:/.{36}/
  • 15,691 are type: 5 (SITE) so we can purge them:
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2013/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2013/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2012

    Processing the statistics-2012 core:

    -
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
    +
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2012
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -360,10 +360,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     
  • 33,161: *:* NOT id:/.{36}/
  • 33,161 are type: 3 (COLLECTION), which is different than I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
  • -
    $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
    +
    $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2011

    Processing the statistics-2011 core:

    -
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
    +
    $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2011
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -382,10 +382,10 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     
  • 17,551: *:* NOT id:/.{36}/
  • 12,116 are type: 3 (COLLECTION), which is different than I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
     

    statistics-2010

    Processing the statistics-2010 core:

$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2010
     ...
     =================================================================
             *** Statistics Records with Legacy Id ***
    @@ -404,52 +404,52 @@ $ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics-2018
     
  • 1,012: *:* NOT id:/.{36}/
  • 654 are type: 3 (COLLECTION), which is different than I’ve seen previously… but I suppose I still have to purge them because there will be errors in the Atmire modules otherwise:
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
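The purge query is identical for every legacy core, so the cores could also be handled in one loop; a sketch, assuming the same core names and query as above:

```console
$ for core in statistics-2010 statistics-2011 statistics-2012 statistics-2013 statistics-2014; do
    curl -s "http://localhost:8081/solr/$core/update?softCommit=true" \
      -H "Content-Type: text/xml" \
      --data-binary "<delete><query>*:* NOT id:/.{36}/</query></delete>"
  done
```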
     

    Processing Solr statistics with AtomicStatisticsUpdateCLI

    On 2020-11-18 I finished processing the Solr statistics with solr-upgrade-statistics-6x and I started processing them with AtomicStatisticsUpdateCLI.

    statistics

    First the current year’s statistics core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
     

    It took ~38 hours to finish processing this core.
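Runs this long are easier to babysit from a detached session with the output logged; a sketch, assuming tmux is installed (the session name and log path are mine):

```console
$ tmux new -s atomic-statistics
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics 2>&1 | tee /tmp/atomic-statistics.log
```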

    statistics-2019

    The statistics-2019 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2019
     

    It took ~32 hours to finish processing this core.

    statistics-2018

    The statistics-2018 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2018
     

    It took ~28 hours to finish processing this core.

    statistics-2017

    The statistics-2017 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2017
     

    It took ~24 hours to finish processing this core.

    statistics-2016

    The statistics-2016 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
     

    It took ~20 hours to finish processing this core.

    statistics-2015

    The statistics-2015 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2015
     

    It took ~21 hours to finish processing this core.

    statistics-2014

    The statistics-2014 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2014
     

    It took ~12 hours to finish processing this core.

    statistics-2013

    The statistics-2013 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2013
     

    It took ~3 hours to finish processing this core.

    statistics-2012

    The statistics-2012 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2012
     

    It took ~2 hours to finish processing this core.

    statistics-2011

    The statistics-2011 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2011
     

    It took 1 hour to finish processing this core.

    statistics-2010

    The statistics-2010 core, in 12-hour batches:

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2010
     

    It took five minutes to finish processing this core.
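Since only the core name changes between runs, the yearly cores could also be processed back to back with a loop instead of starting each one by hand; a sketch, assuming the same cores as above:

```console
$ for year in $(seq 2010 2019); do
    chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c "statistics-$year"
  done
```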

diff --git a/docs/index.html b/docs/index.html
index 0e45919e7..c3ffdd23d 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -10,14 +10,14 @@
@@ -142,7 +142,7 @@
    • Update Docker images on AReS server (linode20) and reboot the server:
# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
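After pulling, the superseded layers remain as dangling images; assuming nothing else still references them, they can be cleaned up with:

```console
# docker image prune -f
```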
     
    • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
    @@ -167,7 +167,7 @@
    • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
     COPY 20994
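A quick way to confirm the export is complete, using the same output path; it should report 20,995 lines (20,994 subjects plus the CSV header):

```console
$ wc -l /tmp/2021-07-01-all-subjects.csv
```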
     
    Read more → @@ -330,7 +330,7 @@ COPY 20994
  • I had a call with CodeObia to discuss the work on OpenRXV
  • Check the results of the AReS harvesting from last night:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 100875,
       "_shards" : {
    diff --git a/docs/index.xml b/docs/index.xml
    index 2dedb947d..f7863eb4f 100644
    --- a/docs/index.xml
    +++ b/docs/index.xml
    @@ -41,7 +41,7 @@
     <ul>
     <li>Update Docker images on AReS server (linode20) and reboot the server:</li>
     </ul>
    -<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
    +<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
     </code></pre><ul>
     <li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
     </ul>
    @@ -57,7 +57,7 @@
     <ul>
     <li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
     </ul>
    -<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
    +<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
     COPY 20994
     </code></pre>
         
    @@ -164,7 +164,7 @@ COPY 20994
     <li>I had a call with CodeObia to discuss the work on OpenRXV</li>
     <li>Check the results of the AReS harvesting from last night:</li>
     </ul>
    -<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
    +<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
     {
       &quot;count&quot; : 100875,
       &quot;_shards&quot; : {
    @@ -471,7 +471,7 @@ COPY 20994
     </ul>
     </li>
     </ul>
    -<pre><code># apt update &amp;&amp; apt full-upgrade
    +<pre tabindex="0"><code># apt update &amp;&amp; apt full-upgrade
     # apt-get autoremove &amp;&amp; apt-get autoclean
     # dpkg -C
     # reboot
    @@ -492,7 +492,7 @@ COPY 20994
     </ul>
     </li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
     4671942
     # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
     1277694
    @@ -500,7 +500,7 @@ COPY 20994
     <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
     <li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
     1183456 
     # zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
     106781
    @@ -527,7 +527,7 @@ COPY 20994
     <li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
     <li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         440 17.58.101.255
         441 157.55.39.101
         485 207.46.13.43
    @@ -628,7 +628,7 @@ COPY 20994
     </li>
     <li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li>
     </ul>
    -<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
    +<pre tabindex="0"><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
     </code></pre><ul>
     <li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li>
    @@ -654,13 +654,13 @@ DELETE 1
     </ul>
     </li>
     </ul>
    -<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
    +<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
        4432 200
     </code></pre><ul>
     <li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
     <li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
     </ul>
    -<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    @@ -701,7 +701,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
     <li>The top IPs before, during, and after this latest alert tonight were:</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         245 207.46.13.5
         332 54.70.40.11
         385 5.143.231.38
    @@ -717,7 +717,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     <li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
     <li>There were just over 3 million accesses in the nginx logs last month:</li>
     </ul>
    -<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
    +<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
     3018243
     
     real    0m19.873s
    @@ -737,7 +737,7 @@ sys     0m1.979s
     <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
     <li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          92 40.77.167.4
          99 210.7.29.100
         120 38.126.157.45
    @@ -825,7 +825,7 @@ sys     0m1.979s
     <ul>
     <li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
     </ul>
    -<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
    +<pre tabindex="0"><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
     </code></pre><ul>
    @@ -848,11 +848,11 @@ sys     0m1.979s
     <ul>
     <li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li>
     </ul>
    -<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    +<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
     </code></pre><ul>
     <li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li>
     </ul>
    -<pre><code>There is insufficient memory for the Java Runtime Environment to continue.
    +<pre tabindex="0"><code>There is insufficient memory for the Java Runtime Environment to continue.
     </code></pre>
         
         
    @@ -872,12 +872,12 @@ sys     0m1.979s
     <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
     <li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
     </ul>
    -<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
     </code></pre><ul>
     <li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li>
     <li>Time to index ~70,000 items on CGSpace:</li>
     </ul>
    -<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
    +<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
    @@ -958,19 +958,19 @@ sys     2m7.289s
     <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li>
     <li>And just before that I see this:</li>
     </ul>
    -<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    +<pre tabindex="0"><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
     </code></pre><ul>
     <li>Ah hah! So the pool was actually empty!</li>
     <li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li>
     <li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li>
     <li>I notice this error quite a few times in dspace.log:</li>
     </ul>
    -<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
    +<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
     </code></pre><ul>
     <li>And there are many of these errors every day for the past month:</li>
     </ul>
    -<pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
    +<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
     dspace.log.2017-11-23:4
    @@ -1048,12 +1048,12 @@ dspace.log.2018-01-02:34
     <ul>
     <li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
     </ul>
    -<pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
    +<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
     0
     </code></pre><ul>
     <li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
     </ul>
    -<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
    +<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
     </code></pre>
         
    @@ -1068,7 +1068,7 @@ COPY 54701
     <ul>
     <li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
     </ul>
    -<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    +<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
     </code></pre><ul>
     <li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
     <li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
    @@ -1182,7 +1182,7 @@ COPY 54701
     <li>Remove redundant/duplicate text in the DSpace submission license</li>
     <li>Testing the CMYK patch on a collection with 650 items:</li>
     </ul>
    -<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
    +<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
     </code></pre>
         
         
    @@ -1208,7 +1208,7 @@ COPY 54701
     <li>Discovered that the ImageMagic <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
     <li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
     </ul>
    -<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
    +<pre tabindex="0"><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
     </code></pre>
         
    @@ -1223,7 +1223,7 @@ COPY 54701
     <ul>
     <li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
     </ul>
    -<pre><code>dspace=# select * from collection2item where item_id = '80278';
    +<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
       id   | collection_id | item_id
     -------+---------------+---------
      92551 |           313 |   80278
    @@ -1263,7 +1263,7 @@ DELETE 1
     <li>CGSpace was down for five hours in the morning while I was sleeping</li>
     <li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
     </ul>
    -<pre><code>2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
    +<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
    @@ -1305,7 +1305,7 @@ DELETE 1
     </li>
     <li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new coloum called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
     </ul>
    -<pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    +<pre tabindex="0"><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
     </code></pre>
         
         
    @@ -1322,7 +1322,7 @@ DELETE 1
     <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
     <li>It looks like we might be able to use OUs now, instead of DCs:</li>
     </ul>
    -<pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
    +<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
     </code></pre>
         
         
    @@ -1341,7 +1341,7 @@ DELETE 1
     <li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li>
     <li>Start working on DSpace 5.1 → 5.5 port:</li>
     </ul>
    -<pre><code>$ git checkout -b 55new 5_x-prod
    +<pre tabindex="0"><code>$ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
     </code></pre>
    @@ -1358,7 +1358,7 @@ $ git rebase -i dspace-5.5
     <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
     <li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
     </ul>
    -<pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    +<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
      text_value
    @@ -1398,7 +1398,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     <li>I have blocked access to the API now</li>
     <li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
     </ul>
    -<pre><code># awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
    +<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
     </code></pre>
         
    @@ -1476,7 +1476,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     <ul>
     <li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li>
     </ul>
    -<pre><code># cd /home/dspacetest.cgiar.org/log
    +<pre tabindex="0"><code># cd /home/dspacetest.cgiar.org/log
     # ls -lh dspace.log.2015-11-18*
     -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
    @@ -1496,7 +1496,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     <li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
     <li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
     </ul>
    -<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    +<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
     </code></pre>
         
    diff --git a/docs/page/2/index.html b/docs/page/2/index.html
    index f46f9218d..dc4249635 100644
    --- a/docs/page/2/index.html
    +++ b/docs/page/2/index.html
    @@ -10,14 +10,14 @@
     
     
     
    -
    +
     
     
     
     
     
     
    -
    +
     
     
         
    diff --git a/docs/page/3/index.html b/docs/page/3/index.html
    index 888b91a0f..39c30debe 100644
    --- a/docs/page/3/index.html
    +++ b/docs/page/3/index.html
    @@ -10,14 +10,14 @@
     
     
     
    -
    +
     
     
     
     
     
     
    -
    +
     
     
         
    @@ -210,7 +210,7 @@
     
     
     
# apt update && apt full-upgrade
     # apt-get autoremove && apt-get autoclean
     # dpkg -C
     # reboot
    @@ -240,7 +240,7 @@
     
     
     
# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     4671942
     # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
     1277694
    @@ -248,7 +248,7 @@
     
  • So 4.6 million from XMLUI and another 1.2 million from API requests
  • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):
# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
     1183456 
     # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
     106781
    @@ -293,7 +293,7 @@
     
  • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
  • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         440 17.58.101.255
         441 157.55.39.101
         485 207.46.13.43
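A reverse DNS lookup is usually enough to tell whether these top IPs are crawlers or real users; a sketch using one of the addresses above:

```console
$ dig +short -x 207.46.13.43
```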
    diff --git a/docs/page/4/index.html b/docs/page/4/index.html
    index b4c76e022..ddb3aa5df 100644
    --- a/docs/page/4/index.html
    +++ b/docs/page/4/index.html
    @@ -10,14 +10,14 @@
     
     
     
    -
    +
     
     
     
     
     
     
    -
    +
     
     
         
    @@ -116,7 +116,7 @@
     
     
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
     
    • But after this I tried to delete the item from the XMLUI and it is still present…
@@ -151,13 +151,13 @@ DELETE 1
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
        4432 200
     
    • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
    • Apply country and region corrections and deletions on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
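Before applying corrections like these it is worth previewing them; the scripts are run elsewhere in these notes with -n, which I assume is a dry-run mode that does not touch the database:

```console
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -n
```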
    @@ -216,7 +216,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
  • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!
  • The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         245 207.46.13.5
         332 54.70.40.11
         385 5.143.231.38
    @@ -232,7 +232,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
  • There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
     3018243
     
     real    0m19.873s
    @@ -261,7 +261,7 @@ sys     0m1.979s
     
  • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
  • I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          92 40.77.167.4
          99 210.7.29.100
         120 38.126.157.45
    @@ -394,7 +394,7 @@ sys     0m1.979s
     
    • DSpace Test had crashed at some point yesterday morning and I see the following in dmesg:
[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
     
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 4e061a742..76de759ce 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -10,14 +10,14 @@
@@ -109,11 +109,11 @@
      • I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:
$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
       
      • During the mvn package stage on the 5.8 branch I kept getting issues with java running out of memory:
There is insufficient memory for the Java Runtime Environment to continue.
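The usual workaround is to give the Maven JVM a larger heap before running the package step; a sketch, with a heap size that is just my guess:

```console
$ export MAVEN_OPTS="-Xmx2048m"
$ mvn clean package
```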
       
      Read more → @@ -142,12 +142,12 @@
    • I added the new CCAFS Phase II Project Tag PII-FP1_PACCA2 and merged it into the 5_x-prod branch (#379)
    • I proofed and tested the ILRI author corrections that Peter sent back to me this week:
$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
     
    • I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in March, 2018
    • Time to index ~70,000 items on CGSpace:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
     
     real    74m42.646s
     user    8m5.056s
    @@ -273,19 +273,19 @@ sys     2m7.289s
     
  • In dspace.log around that time I see many errors like “Client closed the connection before file download was complete”
  • And just before that I see this:
Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
     
    • Ah hah! So the pool was actually empty!
    • I need to increase that, let’s try to bump it up from 50 to 75
    • After that one client got an HTTP 499 but then the rest were HTTP 200, so I don’t know what the hell Uptime Robot saw
    • I notice this error quite a few times in dspace.log:
2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered " "]" "] "" at line 1, column 32.
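The parse error comes from the literal + characters inside the range (presumably URL-encoded spaces that were never decoded); a valid Solr range query uses plain spaces:

```console
dateIssued_keyword:[1976 TO 1979]
```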
     
    • And there are many of these errors every day for the past month:
$ grep -c "Error while searching for sidebar facets" dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
     dspace.log.2017-11-23:4
    @@ -381,12 +381,12 @@ dspace.log.2018-01-02:34
     
    • Today there have been no hits by CORE and no alerts from Linode (coincidence?)
# grep -c "CORE" /var/log/nginx/access.log
     0
     
    • Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
     
Read more →
@@ -410,7 +410,7 @@ COPY 54701
http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
     
    • There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
    • Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index b1376370e..af800f96d 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -10,14 +10,14 @@
@@ -262,7 +262,7 @@
    • Remove redundant/duplicate text in the DSpace submission license
    • Testing the CMYK patch on a collection with 650 items:
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
     
    Read more → @@ -297,7 +297,7 @@
• Discovered that the ImageMagick filter-media plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK
  • Interestingly, it seems DSpace 4.x’s thumbnails were sRGB, but forcing regeneration using DSpace 5.x’s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see 10568/51999):
$ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
     
    Read more → @@ -321,7 +321,7 @@
    • An item was mapped twice erroneously again, so I had to remove one of the mappings manually:
dspace=# select * from collection2item where item_id = '80278';
       id   | collection_id | item_id
     -------+---------------+---------
      92551 |           313 |   80278
    diff --git a/docs/page/7/index.html b/docs/page/7/index.html
    index 8eddfe601..8e4d4431d 100644
    --- a/docs/page/7/index.html
    +++ b/docs/page/7/index.html
    @@ -10,14 +10,14 @@
     
     
     
    -
    +
     
     
     
     
     
     
    -
    +
     
     
         
    @@ -110,7 +110,7 @@
     
  • CGSpace was down for five hours in the morning while I was sleeping
  • While looking in the logs for errors, I see tons of warnings about Atmire MQM:
2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail="dc.title", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail="THUMBNAIL", transactionID="TX157907838689377964651674089851855413607")
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail="-1", transactionID="TX157907838689377964651674089851855413607")
    @@ -170,7 +170,7 @@
     
     
• I exported a random item’s metadata as CSV, deleted all columns except id and collection, and made a new column called ORCID:dc.contributor.author with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
     
    Read more → @@ -196,7 +196,7 @@
  • We had been using DC=ILRI to determine whether a user was ILRI or not
  • It looks like we might be able to use OUs now, instead of DCs:
$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b "dc=cgiarad,dc=org" -D "admigration1@cgiarad.org" -W "(sAMAccountName=admigration1)"
     
    Read more → @@ -224,7 +224,7 @@
  • Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of fonts)
  • Start working on DSpace 5.1 → 5.5 port:
$ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
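When the interactive rebase stops on a conflict, the usual cycle is to fix the file, stage it, and continue; a sketch (the path is a placeholder):

```console
$ git add path/to/conflicted-file
$ git rebase --continue
```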
     
    @@ -250,7 +250,7 @@ $ git rebase -i dspace-5.5
  • Add dc.description.sponsorship to Discovery sidebar facets and make investors clickable in item view (#232)
  • I think this query should find and replace all authors that have “,” at the end of their names:
dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
      text_value
    @@ -308,7 +308,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     
  • I have blocked access to the API now
  • There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
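One caveat with that number: uniq only collapses adjacent duplicates, so without sorting first it overcounts distinct IPs; sorting gives the true figure:

```console
# awk '{print $1}' /var/log/nginx/rest.log | sort -u | wc -l
```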
     
Read more →
diff --git a/docs/page/8/index.html b/docs/page/8/index.html
index 78ae8d615..de35dc154 100644
--- a/docs/page/8/index.html
+++ b/docs/page/8/index.html
@@ -10,14 +10,14 @@
@@ -160,7 +160,7 @@
    • Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less space:
# cd /home/dspacetest.cgiar.org/log
     # ls -lh dspace.log.2015-11-18*
     -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
    @@ -189,7 +189,7 @@
     
  • Looks like DSpace exhausted its PostgreSQL connection pool
  • Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:
$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
     
Read more →
diff --git a/docs/posts/index.html b/docs/posts/index.html
index c90d6ec9b..011f8c0bf 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -10,14 +10,14 @@
@@ -142,7 +142,7 @@
    • Update Docker images on AReS server (linode20) and reboot the server:
# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
     
    • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
    @@ -167,7 +167,7 @@
    • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
     COPY 20994
     
    Read more → @@ -330,7 +330,7 @@ COPY 20994
  • I had a call with CodeObia to discuss the work on OpenRXV
  • Check the results of the AReS harvesting from last night:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
     {
       "count" : 100875,
       "_shards" : {
    diff --git a/docs/posts/index.xml b/docs/posts/index.xml
    index c5e5e1205..8fcfa5201 100644
    --- a/docs/posts/index.xml
    +++ b/docs/posts/index.xml
    @@ -41,7 +41,7 @@
     <ul>
     <li>Update Docker images on AReS server (linode20) and reboot the server:</li>
     </ul>
    -<pre><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
    +<pre tabindex="0"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
     </code></pre><ul>
     <li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
     </ul>
    @@ -57,7 +57,7 @@
     <ul>
     <li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
     </ul>
    -<pre><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
    +<pre tabindex="0"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
     COPY 20994
     </code></pre>
         
    @@ -164,7 +164,7 @@ COPY 20994
     <li>I had a call with CodeObia to discuss the work on OpenRXV</li>
     <li>Check the results of the AReS harvesting from last night:</li>
     </ul>
    -<pre><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
    +<pre tabindex="0"><code class="language-console" data-lang="console">$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty'
     {
       &quot;count&quot; : 100875,
       &quot;_shards&quot; : {
    @@ -471,7 +471,7 @@ COPY 20994
     </ul>
     </li>
     </ul>
    -<pre><code># apt update &amp;&amp; apt full-upgrade
    +<pre tabindex="0"><code># apt update &amp;&amp; apt full-upgrade
     # apt-get autoremove &amp;&amp; apt-get autoclean
     # dpkg -C
     # reboot
    @@ -492,7 +492,7 @@ COPY 20994
     </ul>
     </li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
     4671942
     # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
     1277694
    @@ -500,7 +500,7 @@ COPY 20994
     <li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
     <li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
     1183456 
     # zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
     106781
    @@ -527,7 +527,7 @@ COPY 20994
     <li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
     <li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         440 17.58.101.255
         441 157.55.39.101
         485 207.46.13.43
    @@ -628,7 +628,7 @@ COPY 20994
     </li>
     <li>The item seems to be in a pre-submitted state, so I tried to delete it from there:</li>
     </ul>
    -<pre><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
    +<pre tabindex="0"><code>dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
     DELETE 1
     </code></pre><ul>
     <li>But after this I tried to delete the item from the XMLUI and it is <em>still</em> present&hellip;</li>
    @@ -654,13 +654,13 @@ DELETE 1
     </ul>
     </li>
     </ul>
    -<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
    +<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
        4432 200
     </code></pre><ul>
     <li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
     <li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
     </ul>
    -<pre><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
    +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
     $ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
     $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
    @@ -701,7 +701,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     <li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
     <li>The top IPs before, during, and after this latest alert tonight were:</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
         245 207.46.13.5
         332 54.70.40.11
         385 5.143.231.38
    @@ -717,7 +717,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
     <li>The Solr statistics for the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
     <li>There were just over 3 million accesses in the nginx logs last month:</li>
     </ul>
    -<pre><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
    +<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
     3018243
     
     real    0m19.873s
    @@ -737,7 +737,7 @@ sys     0m1.979s
     <li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
     <li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
     </ul>
    -<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    +<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
          92 40.77.167.4
          99 210.7.29.100
         120 38.126.157.45
    @@ -825,7 +825,7 @@ sys     0m1.979s
     <ul>
     <li>DSpace Test had crashed at some point yesterday morning and I see the following in <code>dmesg</code>:</li>
     </ul>
    -<pre><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
    +<pre tabindex="0"><code>[Tue Jul 31 00:00:41 2018] Out of memory: Kill process 1394 (java) score 668 or sacrifice child
     [Tue Jul 31 00:00:41 2018] Killed process 1394 (java) total-vm:15601860kB, anon-rss:5355528kB, file-rss:0kB, shmem-rss:0kB
     [Tue Jul 31 00:00:41 2018] oom_reaper: reaped process 1394 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
     </code></pre><ul>
    @@ -848,11 +848,11 @@ sys     0m1.979s
     <ul>
     <li>I want to upgrade DSpace Test to DSpace 5.8 so I took a backup of its current database just in case:</li>
     </ul>
    -<pre><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
    +<pre tabindex="0"><code>$ pg_dump -b -v -o --format=custom -U dspace -f dspace-2018-07-01.backup dspace
     </code></pre><ul>
     <li>During the <code>mvn package</code> stage on the 5.8 branch I kept getting issues with java running out of memory:</li>
     </ul>
    -<pre><code>There is insufficient memory for the Java Runtime Environment to continue.
    +<pre tabindex="0"><code>There is insufficient memory for the Java Runtime Environment to continue.
     </code></pre>
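The usual remedy for that is to give the Maven JVM a larger heap via MAVEN_OPTS before building; a minimal sketch (the 1024m value is just an example):

```console
$ export MAVEN_OPTS="-Xmx1024m"
$ mvn package
```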
         
         
    @@ -872,12 +872,12 @@ sys     0m1.979s
     <li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
     <li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
     </ul>
    -<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
    +<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
     </code></pre><ul>
     <li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/">March, 2018</a></li>
     <li>Time to index ~70,000 items on CGSpace:</li>
     </ul>
    -<pre><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
    +<pre tabindex="0"><code>$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b                                  
     
     real    74m42.646s
     user    8m5.056s
    @@ -958,19 +958,19 @@ sys     2m7.289s
     <li>In dspace.log around that time I see many errors like &ldquo;Client closed the connection before file download was complete&rdquo;</li>
     <li>And just before that I see this:</li>
     </ul>
    -<pre><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
    +<pre tabindex="0"><code>Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-980] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:50; busy:50; idle:0; lastwait:5000].
     </code></pre><ul>
     <li>Ah hah! So the pool was actually empty!</li>
     <li>I need to increase that, let&rsquo;s try to bump it up from 50 to 75</li>
     <li>After that one client got an HTTP 499 but then the rest were HTTP 200, so I don&rsquo;t know what the hell Uptime Robot saw</li>
     <li>I notice this error quite a few times in dspace.log:</li>
     </ul>
    -<pre><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
    +<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
     org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
     </code></pre><ul>
     <li>And there are many of these errors every day for the past month:</li>
     </ul>
    -<pre><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
    +<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
     dspace.log.2017-11-21:4
     dspace.log.2017-11-22:1
     dspace.log.2017-11-23:4
    @@ -1048,12 +1048,12 @@ dspace.log.2018-01-02:34
     <ul>
     <li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
     </ul>
    -<pre><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
    +<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
     0
     </code></pre><ul>
     <li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
     </ul>
    -<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
    +<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
     COPY 54701
     </code></pre>
         
    @@ -1068,7 +1068,7 @@ COPY 54701
     <ul>
     <li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
     </ul>
    -<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
    +<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
     </code></pre><ul>
     <li>There appears to be a pattern but I&rsquo;ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
     <li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
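Regarding the items with multiple handles above, one way to find them in SQL is to look for items with more than one dc.identifier.uri value; a sketch, following the subselect style used in the author query above:

```console
dspace=# SELECT resource_id, COUNT(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'identifier' AND qualifier = 'uri') AND resource_type_id = 2 GROUP BY resource_id HAVING COUNT(*) > 1;
```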
    @@ -1182,7 +1182,7 @@ COPY 54701
     <li>Remove redundant/duplicate text in the DSpace submission license</li>
     <li>Testing the CMYK patch on a collection with 650 items:</li>
     </ul>
    -<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
    +<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
     </code></pre>
         
         
    @@ -1208,7 +1208,7 @@ COPY 54701
     <li>Discovered that the ImageMagick <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
     <li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
     </ul>
    -<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
    +<pre tabindex="0"><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
     </code></pre>
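To confirm that the colorspace really is the culprit, one can regenerate a single thumbnail by hand and force sRGB with plain ImageMagick; a sketch (file names are only illustrative):

```console
$ convert -density 144 'alc_contrastes_desafios.pdf[0]' -colorspace sRGB -thumbnail x600 test-srgb.jpg
$ identify test-srgb.jpg
```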
         
    @@ -1223,7 +1223,7 @@ COPY 54701
     <ul>
     <li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
     </ul>
    -<pre><code>dspace=# select * from collection2item where item_id = '80278';
    +<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
       id   | collection_id | item_id
     -------+---------------+---------
      92551 |           313 |   80278
    @@ -1263,7 +1263,7 @@ DELETE 1
     <li>CGSpace was down for five hours in the morning while I was sleeping</li>
     <li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
     </ul>
    -<pre><code>2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
    +<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
    @@ -1305,7 +1305,7 @@ DELETE 1
     </li>
     <li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new column called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
     </ul>
    -<pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    +<pre tabindex="0"><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
     </code></pre>
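For reference, the resulting CSV for that metadata import test looks something like this (the id and collection values below are made up):

```
id,collection,ORCID:dc.contributor.author
84211,10568/3030,0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
```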
         
         
    @@ -1322,7 +1322,7 @@ DELETE 1
     <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
     <li>It looks like we might be able to use OUs now, instead of DCs:</li>
     </ul>
    -<pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
    +<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
     </code></pre>
         
         
    @@ -1341,7 +1341,7 @@ DELETE 1
     <li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li>
     <li>Start working on DSpace 5.1 → 5.5 port:</li>
     </ul>
    -<pre><code>$ git checkout -b 55new 5_x-prod
    +<pre tabindex="0"><code>$ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
     </code></pre>
    @@ -1358,7 +1358,7 @@ $ git rebase -i dspace-5.5
     <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
     <li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
     </ul>
    -<pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    +<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
      text_value
    @@ -1398,7 +1398,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     <li>I have blocked access to the API now</li>
     <li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
     </ul>
    -<pre><code># awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
    +<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
     </code></pre>
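One caveat with that one-liner: uniq only collapses adjacent duplicates, so on an unsorted log it overcounts; sorting first gives the true number of distinct IPs:

```console
# awk '{print $1}' /var/log/nginx/rest.log | sort -u | wc -l
```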
         
    @@ -1476,7 +1476,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     <ul>
     <li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li>
     </ul>
    -<pre><code># cd /home/dspacetest.cgiar.org/log
    +<pre tabindex="0"><code># cd /home/dspacetest.cgiar.org/log
     # ls -lh dspace.log.2015-11-18*
     -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
    @@ -1496,7 +1496,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     <li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
     <li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
     </ul>
    -<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    +<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
     </code></pre>
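To see where those idle connections are coming from, it can help to group pg_stat_activity by user and state; a sketch (the state column assumes PostgreSQL 9.2 or newer):

```console
$ psql -c 'SELECT usename, state, count(*) FROM pg_stat_activity GROUP BY usename, state ORDER BY count DESC;'
```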
         
    diff --git a/docs/tags/notes/index.xml b/docs/tags/notes/index.xml
    index abf500f89..6e5bd9dba 100644
    --- a/docs/tags/notes/index.xml
    +++ b/docs/tags/notes/index.xml
    @@ -105,7 +105,7 @@
     <li>Remove redundant/duplicate text in the DSpace submission license</li>
     <li>Testing the CMYK patch on a collection with 650 items:</li>
     </ul>
    -<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
    +<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
     </code></pre>
         
         
    @@ -131,7 +131,7 @@
     <li>Discovered that the ImageMagick <code>filter-media</code> plugin creates JPG thumbnails with the CMYK colorspace when the source PDF is using CMYK</li>
     <li>Interestingly, it seems DSpace 4.x&rsquo;s thumbnails were sRGB, but forcing regeneration using DSpace 5.x&rsquo;s ImageMagick plugin creates CMYK JPGs if the source PDF was CMYK (see <a href="https://cgspace.cgiar.org/handle/10568/51999">10568/51999</a>):</li>
     </ul>
    -<pre><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
    +<pre tabindex="0"><code>$ identify ~/Desktop/alc_contrastes_desafios.jpg
     /Users/aorth/Desktop/alc_contrastes_desafios.jpg JPEG 464x600 464x600+0+0 8-bit CMYK 168KB 0.000u 0:00.000
     </code></pre>
         
    @@ -146,7 +146,7 @@
     <ul>
     <li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
     </ul>
    -<pre><code>dspace=# select * from collection2item where item_id = '80278';
    +<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
       id   | collection_id | item_id
     -------+---------------+---------
      92551 |           313 |   80278
    @@ -186,7 +186,7 @@ DELETE 1
     <li>CGSpace was down for five hours in the morning while I was sleeping</li>
     <li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
     </ul>
    -<pre><code>2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
    +<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID =70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, Object Type=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
     2016-12-02 03:00:32,353 WARN  com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, Obje ctType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
    @@ -228,7 +228,7 @@ DELETE 1
     </li>
     <li>I exported a random item&rsquo;s metadata as CSV, deleted <em>all columns</em> except id and collection, and made a new column called <code>ORCID:dc.contributor.author</code> with the following random ORCIDs from the ORCID registry:</li>
     </ul>
    -<pre><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
    +<pre tabindex="0"><code>0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
     </code></pre>
         
         
    @@ -245,7 +245,7 @@ DELETE 1
     <li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
     <li>It looks like we might be able to use OUs now, instead of DCs:</li>
     </ul>
    -<pre><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
    +<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
     </code></pre>
         
         
    @@ -264,7 +264,7 @@ DELETE 1
     <li>Anything after Bootstrap 3.3.1 makes glyphicons disappear (HTTP 404 trying to access from incorrect path of <code>fonts</code>)</li>
     <li>Start working on DSpace 5.1 → 5.5 port:</li>
     </ul>
    -<pre><code>$ git checkout -b 55new 5_x-prod
    +<pre tabindex="0"><code>$ git checkout -b 55new 5_x-prod
     $ git reset --hard ilri/5_x-prod
     $ git rebase -i dspace-5.5
     </code></pre>
    @@ -281,7 +281,7 @@ $ git rebase -i dspace-5.5
     <li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
     <li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
     </ul>
    -<pre><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
    +<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
     UPDATE 95
     dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
      text_value
    @@ -321,7 +321,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     <li>I have blocked access to the API now</li>
     <li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
     </ul>
    -<pre><code># awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
    +<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
     3168
     </code></pre>
         
    @@ -399,7 +399,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     <ul>
     <li>Replace <code>lzop</code> with <code>xz</code> in log compression cron jobs on DSpace Test—it uses less space:</li>
     </ul>
    -<pre><code># cd /home/dspacetest.cgiar.org/log
    +<pre tabindex="0"><code># cd /home/dspacetest.cgiar.org/log
     # ls -lh dspace.log.2015-11-18*
     -rw-rw-r-- 1 tomcat7 tomcat7 2.0M Nov 18 23:59 dspace.log.2015-11-18
     -rw-rw-r-- 1 tomcat7 tomcat7 387K Nov 18 23:59 dspace.log.2015-11-18.lzo
    @@ -419,7 +419,7 @@ dspacetest=# select text_value from  metadatavalue where metadata_field_id=3 and
     <li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
     <li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
     </ul>
    -<pre><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
    +<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
     78
     </code></pre>
         